Using Word-level Features to Better Predict Student Emotions during Spoken Tutoring Dialogues

Mihai Rotaru
Diane J. Litman
DoD Group Meeting Presentation
Introduction
• Why is it important to detect/handle emotions?
• Emotion annotation
• Classification task
• Previous work
(Spoken) Tutoring dialogues
• Education
  • Classroom setting
  • Human (one-on-one) tutoring
  • Computer tutoring (ITS – Intelligent Tutoring Systems)
• Addressing the learning gap between human and computer tutoring
  • Dialogue-based ITS (Ex: Why2)
  • Improve the language understanding module of ITS
  • Incorporate affective reasoning
• Connection between learning and student emotional state
  • Adding human-provided emotional scaffolding to a reading tutor increases student persistence (Aist et al., 2002)
Human-Computer Excerpt
Tutor26: Let me ask my question in a more general way. What happens to the velocity of a body when there is no force acting on it?
Student27: dammit (ASR: it is)
Tutor28: Could you please repeat that?
Student29: same (ASR: i same)
Tutor30: Right. And what happens to the HORIZONTAL velocity of a body when there is no HORIZONTAL force acting on it?
Student31: zero (ASR: the zero)
Tutor32: Well... OK, let's consider how forces in the vertical direction affect the motion of a body in a horizontal direction in a different scenario <…omitted…>
Student33: oh gravity you already said this (ASR: beats gravity you're exert less)
Tutor34: Fine. Are there any other forces acting on the apple as it falls?
Student35: no why are you doing this again (ASR: no y and to it yes)
Tutor36: Correct. The only force acting on the apple as it falls to the earth is the earth's gravitational force. In which direction does gravitational force act?
Student37: downward you computer (ASR: downward you computer)
Affective reasoning
• Prerequisites
  • Dialogue-based ITS: Why2
  • Interaction via speech: ITSPOKE (Intelligent Tutoring SPOKEn dialogue system)
• Affective reasoning
  • Detect student emotions
  • Handle student emotions
[ITSPOKE screenshots]
• Back-end is Why2-Atlas system (VanLehn et al., 2002)
• Sphinx2 speech recognition and Cepstral text-to-speech
Student emotions
• Emotion annotation
  • Perceived, intuitive expressions of emotion
  • Relative to other turns in context and tutoring task
• 3 Main emotion classes
  • Negative – e.g. uncertain, bored, irritated, confused, sad (question turns)
  • Positive – e.g. confident, enthusiastic
  • Neutral – no strong expression of negative or positive emotion (grounding turns)
• Corpora
  • Human-Human (453 student turns from 10 dialogues)
  • Human-Computer (333 student turns from 15 dialogues)
Annotation example
Tutor: Uh let us talk of one car first.
Student: ok. (EMOTION = NEUTRAL)
Tutor: If there is a car, what is it that exerts force on the car such that it accelerates forward?
Student: The engine. (EMOTION = POSITIVE)
Tutor: Uh well engine is part of the car, so how can it exert force on itself?
Student: um… (EMOTION = NEGATIVE)
Classification task
• 3 Levels of annotation granularity
  • NPN – Negative, Positive, Neutral
  • NnN – Negative, Non-Negative
    • positives and neutrals are conflated as Non-Negative
  • EnE – Emotional, Non-Emotional
    • negatives and positives are conflated as Emotional; neutrals are Non-Emotional
    • useful for triggering system adaptation (HH corpus analysis)
• Agreed subset
• Predict the class of each student turn
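Since the two coarser schemes are simple conflations of the NPN labels, the mapping can be made explicit. A minimal sketch following the definitions above (the function names are mine):

```python
# Collapse NPN labels into the coarser NnN and EnE schemes defined above.
def to_nnn(label: str) -> str:
    # positives and neutrals are conflated as Non-Negative
    return "Negative" if label == "Negative" else "Non-Negative"

def to_ene(label: str) -> str:
    # negatives and positives are conflated as Emotional; neutrals are Non-Emotional
    return "Non-Emotional" if label == "Neutral" else "Emotional"

turns = ["Neutral", "Positive", "Negative"]
print([to_nnn(t) for t in turns])  # ['Non-Negative', 'Non-Negative', 'Negative']
print([to_ene(t) for t in turns])  # ['Non-Emotional', 'Emotional', 'Emotional']
```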
Previous work - Features
• Human-Human: 5 feature types
  • Acoustic-prosodic (amplitude, pitch, duration)
  • Lexical
  • Other automatic
  • Manual
  • Identifiers
  • Combinations of the above
• Human-Computer: 3 feature types
  • Acoustic-prosodic (amplitude, pitch, duration)
  • Lexical
  • Identifiers
  • Combinations of the above
• Features computed over the current turn and in context
  • Local – previous two turns
  • Global – all turns so far
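The slide only names the contextual feature classes; a hypothetical illustration of how local and global context might be attached to per-turn feature vectors (all names here are mine, not the authors' code):

```python
def add_context(feature_seq):
    """Attach context to per-turn features: 'local' copies the previous two
    turns' feature values, 'global' keeps a running mean over all turns so far.
    feature_seq: one dict of numeric features per student turn, in dialogue order."""
    keys = list(feature_seq[0])
    out = []
    for i, feats in enumerate(feature_seq):
        row = dict(feats)                                        # current turn
        for back in (1, 2):                                      # local: previous two turns
            prev = feature_seq[i - back] if i >= back else {}
            row.update({f"prev{back}_{k}": prev.get(k, 0.0) for k in keys})
        for k in keys:                                           # global: all turns so far
            row[f"avg_{k}"] = sum(f[k] for f in feature_seq[:i + 1]) / (i + 1)
        out.append(row)
    return out

print(add_context([{"pitch_max": 220.0}, {"pitch_max": 260.0}]))
```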
Previous work - Results

               HH EnE    HC EnE
Kappa           0.55      0.30
Baseline       51.71%    58.64%
Accuracy       88.86%    72.91%
Rel. improv.   76.93%    34.50%

Relative improvement is measured over the majority baseline: (accuracy − baseline) / (100% − baseline).

Litman and Forbes, ACL 2004
How to improve?
• Use word-level features instead of turn-level features
  • Extend the pitch feature set
  • Simplified word-level emotion model
Why word-level features?
• Emotion might not be expressed over the entire turn
  • "This is great" can sound angry or happy depending on how its individual words are spoken
Why word-level features? (2)
• Can approximate the pitch contour better at sub-turn levels
  • Especially for longer turns
[Figure: pitch contour (50–350 Hz) plotted over the words "This is great"]
Extended pitch features set
• Previous work
  • Min, Max
  • Avg, Stdev
• Extended with (from Batliner et al., 2003)
  • Start, End
  • Regression coefficient and regression error
  • Quadratic regression coefficient
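A minimal sketch of how this extended set might be computed from one word's F0 samples, assuming numpy and a per-word pitch track (the slides do not show the authors' actual extraction code):

```python
import numpy as np

def pitch_features(f0: np.ndarray) -> dict:
    """Pitch features named on the slide, over one word's voiced F0 samples (Hz)."""
    t = np.arange(len(f0))                      # frame index as the time axis
    slope, intercept = np.polyfit(t, f0, 1)     # linear regression coefficient
    fit = slope * t + intercept
    reg_error = float(np.sqrt(np.mean((f0 - fit) ** 2)))  # regression error (RMSE)
    quad_coef = np.polyfit(t, f0, 2)[0]         # quadratic regression coefficient
    return {
        "min": float(f0.min()), "max": float(f0.max()),
        "avg": float(f0.mean()), "stdev": float(f0.std()),
        "start": float(f0[0]), "end": float(f0[-1]),
        "reg_coef": float(slope), "reg_error": reg_error,
        "quad_coef": float(quad_coef),
    }

print(pitch_features(np.array([210.0, 230.0, 260.0, 240.0, 205.0])))
```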
But wait…
[Diagram: at the turn level, one feature vector per student turn feeds a machine learner that outputs the turn emotional class; at the word level there is one feature vector per word (word 1 … word n), so it is unclear how to obtain a turn emotional class (cf. Sönmez et al., 1998)]
Word-level emotion model
[Diagram: each word's feature vector is mapped by the learned model to a word-level emotion; the word-level emotions are then combined into the turn emotional class]
Word-level emotion model
• Training phase
  • Each word labeled with the turn class
  • Extra features identify the position of the word in the turn (distance in words from the beginning and end of the turn)
  • Learn an emotion model at the word level
• Test phase
  • Predict each word's class based on the learned model
  • Use majority/weighted voting over the word classes to label the turn
    • Ties are broken randomly
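A minimal sketch of the two phases, assuming per-word predictions are already available (the function and field names are mine):

```python
import random
from collections import Counter

def word_instances(turn_words, turn_label):
    """Training phase: each word inherits the turn's label, plus positional
    features (distance in words from the start and end of the turn)."""
    n = len(turn_words)
    return [{"word": w, "from_start": i, "from_end": n - 1 - i, "label": turn_label}
            for i, w in enumerate(turn_words)]

def label_turn(word_predictions):
    """Test phase: majority vote over per-word predictions; ties broken randomly."""
    counts = Counter(word_predictions)
    top = max(counts.values())
    return random.choice([lab for lab, c in counts.items() if c == top])

print(word_instances(["this", "is", "great"], "Emotional"))
print(label_turn(["Emotional", "Non-Emotional", "Emotional"]))  # -> Emotional
```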
Questions to answer
• Will word-level features work better than turn-level features for emotion prediction?
  • Yes
• If yes, where does the advantage come from?
  • Better prediction of longer turns
• Is there a feature set that offers robust performance?
  • Yes: the combination of pitch and lexical features at the word level
Experiments
• EnE classification, agreed turns
• Two contrasting corpora
• Two contrasting learners (WEKA)
  • IB1 – nearest-neighbor classifier
  • ADA – boosted decision trees
Feature sets
• Only pitch and lexical features
• 6 sets of features
  • Turn level:
    • Lex-Turn – only lexical
    • Pitch-Turn – only pitch
    • PitchLex-Turn – lexical and prosodic
  • Word level:
    • Lex-Word – only lexical + positional
    • Pitch-Word – only pitch + positional
    • PitchLex-Word – lexical and prosodic + positional
• Baseline: majority class
• 10 x 10 cross-validation
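Concretely, each feature set is paired with both learners under 10 x 10 cross-validation. A rough scikit-learn analogue of the setup, with a hypothetical feature matrix X and labels y (the original experiments used WEKA's IB1 and boosted decision trees, not scikit-learn):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier      # rough stand-in for WEKA IB1
from sklearn.ensemble import AdaBoostClassifier         # boosted decision trees (ADA)
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))      # hypothetical feature vectors for one feature set
y = rng.integers(0, 2, size=300)    # hypothetical EnE labels

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)  # 10 x 10 CV
for name, clf in [("IB1 ~ kNN(k=1)", KNeighborsClassifier(n_neighbors=1)),
                  ("ADA", AdaBoostClassifier())]:
    scores = cross_val_score(clf, X, y, cv=cv)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```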
Results – IB1 on HH
• Word-level features significantly outperform turn-level features
• Word-level better than turn-level on longer turns
• Best performers: Lex-Word, PitchLex-Word
[Figures: accuracy by feature set (Lex, Pitch, PitchLex; turn-level vs. word-level bars, with majority-class baseline) and accuracy by turn length (single, short, medium, long)]
Results – ADA on HH
• Turn-level performance increases a lot
  • Word-level significantly better than turn-level on feature sets with pitch
  • Word-level better than turn-level on longer turns, but the difference is smaller
• Best performers: Lex-Turn, Lex-Word, PitchLex-Word
[Figures: same layout as for IB1 on HH]
Results – IB1 on HC
• Word-level features significantly outperform turn-level features
• Lexical information less helpful than on the HH corpus
• Word-level better than turn-level on longer turns
• Best performers: Pitch-Word, PitchLex-Word
[Figures: accuracy by feature set (turn-level vs. word-level bars, with majority-class baseline) and accuracy by turn length (1, 2, 3, >3 words)]
Results – ADA on HC
• The word-level advantage is no longer significant
  • IB1 better than ADA on word-level features
  • ADA has bigger variance on this corpus
• Word-level better than turn-level on longer turns, but the difference is smaller
• Best performers: Pitch-Turn, Pitch-Word, PitchLex-Turn, PitchLex-Word
[Figures: same layout as for IB1 on HC]
Discussion
• Lexical features perform similarly at turn level and word level
• Pitch features differ significantly
  • Performance depends on corpus and learner
  • Word-level better than turn-level (4 of 6 cases)
• PitchLex-Word is a consistent best performer
• Our best accuracies are comparable with previous work
Conclusions & Future work
• Word-level better than turn-level for emotion prediction
  • Even under a very simple word-level emotion model
  • Word-level better at predicting longer turns
  • PitchLex-Word a consistent best performer
• Future work:
  • More refined word-level emotion models
    • HMMs
    • Co-training
  • Filter irrelevant words
  • Use the prosodic information left out
  • See if our conclusions generalize to detecting student uncertainty
  • Experiment with other sub-turn units (breath groups)