Emotion Recognition in Conversation
I Problem:
Given the transcript of a conversation along with speaker information for each constituent utterance, the ERC task aims to identify the emotion of each utterance from several pre-defined emotions.
Formally, given the input sequence of N utterances [(u_1, p_1), (u_2, p_2), ..., (u_N, p_N)], where each utterance u_i = [u_{i,1}, u_{i,2}, ..., u_{i,T}] consists of T words u_{i,j} and is spoken by party p_i, the task is to predict the emotion label e_i of each utterance u_i.
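To make the input/output format concrete, here is a minimal, hypothetical sketch of the task interface: the predictor, label set, and example conversation below are placeholders, not any specific model or dataset.

from typing import List, Tuple

# Placeholder label set (Ekman's six basic emotions plus neutral).
EMOTIONS = ["neutral", "joy", "sadness", "anger", "fear", "surprise", "disgust"]

# A conversation is a sequence of (utterance, speaker) pairs.
conversation: List[Tuple[str, str]] = [
    ("I got the job!", "A"),
    ("That's amazing, congratulations!", "B"),
    ("I can't believe it either.", "A"),
]

def predict_emotion(context: List[Tuple[str, str]], current: Tuple[str, str]) -> str:
    """Placeholder classifier: a real ERC model would encode the current
    utterance together with its conversational context and speaker identity."""
    return "neutral"

# Predict e_i for each utterance u_i given the preceding context.
for i, (utt, spk) in enumerate(conversation):
    label = predict_emotion(conversation[:i], (utt, spk))
    print(f"u_{i+1} ({spk}): {utt!r} -> {label}")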
II Task type: deep learning, lexical feature extraction, neural feature extraction.
III Data and SOTA methods.
MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversation
The data includes video, audio, and text for conversations involving two or more speakers.
SOTA: 66.52
CoMPM: Context Modeling with Speaker's Pre-trained Memory Tracking for Emotion Recognition in Conversation
Related work:
Many recent studies use external knowledge to improve ERC performance. KET (Zhong et al., 2019) uses ConceptNet (Speer et al., 2017) and the emotion lexicon NRC_VAD (Mohammad, 2018) as external commonsense knowledge. ConceptNet is a knowledge graph that connects words and phrases in natural language using labeled edges. The NRC_VAD lexicon has human ratings of valence, arousal, and dominance for more than 20,000 English words. COSMIC (Ghosal et al., 2020) and Psychological (Li et al., 2021) improve emotion recognition performance by extracting commonsense knowledge from the previous utterances. The commonsense knowledge features are extracted with COMET (Bosselut et al., 2019) trained on ATOMIC (The Atlas of Machine Commonsense) (Sap et al., 2019). ATOMIC has nine sentence relation types with inferential if-then commonsense knowledge expressed in text. ToDKAT (Zhu et al., 2021) improves performance by adding commonsense knowledge from COMET and topic discovery using VHRED (Serban et al., 2017) to the model.
Ekman (Ekman, 1992) constructs a taxonomy of six common emotions (Joy, Sadness, Fear, Anger, Surprise, and Disgust) from human facial expressions. In addition, Ekman explains that a multimodal view is important for recognizing multiple emotions. Multimodal datasets such as MELD and IEMOCAP are among the available standard datasets for emotion recognition, and they are composed of text, speech, and vision-based data. Datcu and Rothkrantz (2014) use speech and visual information to recognize emotions, and Alm et al. (2005) attempt to recognize emotions based on text information. MELD and ICON (Hazarika et al., 2018a) show that the more multimodal information is used, the better the performance, and that text information plays the most important role. However, multimodal information is not always available on social media, especially in chatbot systems, which are mainly text based. In this work, we design and introduce a text-based emotion recognition system using neural networks.
Zhou et al. (2018) and Zhang et al. (2018a) show that commonsense knowledge is important for understanding conversations and generating appropriate responses. Liu et al. (2020) report that the lack of external knowledge makes it difficult to classify implicit emotions from the conversation history. EDA (Bothe et al., 2020) expands the multimodal emotion datasets by extracting dialog acts from MELD and IEMOCAP and finds that there is a correlation between dialogue acts and emotion labels.
https://conceptnet.io/
PM: Pre-trained Memory Module
External knowledge is known to play an important role in understanding conversation. Pre-trained language models can be trained on numerous corpora and used as an external knowledge base. Inspired by previous studies showing that the speaker's knowledge helps to judge emotions, we extract and track pre-trained memory from the speaker's previous utterances and use it when predicting the emotion of the current utterance u_t. If the speaker has never appeared before the current turn, the output of the pre-trained memory is taken to be a zero vector.
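The exact CoMPM architecture is not reproduced here; the following is a minimal sketch, assuming a generic sentence encoder (the `encode` function is a stand-in) and mean pooling, of how pre-trained memory could be tracked per speaker and fall back to a zero vector for speakers not yet seen.

from collections import defaultdict

DIM = 4  # toy embedding size; a real PM would use the LM hidden size

def encode(text: str) -> list:
    """Placeholder for a frozen pre-trained LM encoder returning a sentence vector."""
    return [float(len(text) % 7)] * DIM

speaker_memory = defaultdict(list)  # speaker -> list of past utterance vectors

def pretrained_memory(speaker: str) -> list:
    """Mean of the speaker's previous utterance vectors; a zero vector if the
    speaker has not appeared before the current turn."""
    vecs = speaker_memory[speaker]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]

dialogue = [("A", "Hi, how are you?"), ("B", "I'm fine."), ("A", "Great to hear!")]
for speaker, utterance in dialogue:
    memory = pretrained_memory(speaker)                 # used alongside the context model
    speaker_memory[speaker].append(encode(utterance))   # update memory after the turn
    print(speaker, memory)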
IEMOCAP: The Interactive Emotional Dyadic Motion Capture database
It contains approximately 12 hours of audiovisual data, including video, speech, facial motion capture, and text transcriptions. The IEMOCAP database is annotated by multiple annotators into categorical labels, such as anger, happiness, sadness, and neutrality, as well as dimensional labels such as valence, activation, and dominance.
SOTA: 68.73
Hybrid Curriculum Learning for Emotion Recognition in Conversation
Related works:
Emotion recognition in conversation (ERC) aims to detect the emotion label of each utterance. Motivated by recent studies which have proven that feeding training examples in a meaningful order rather than considering them randomly can boost the performance of models, we propose an ERC-oriented hybrid curriculum learning framework. Our framework consists of two curricula: (1) conversation-level curriculum (CC); and (2) utterance-level curriculum (UC). In CC, we construct a difficulty measurer based on "emotion shift" frequency within a conversation; the conversations are then scheduled in an "easy to hard" schema according to the difficulty score returned by the difficulty measurer. UC is implemented from an emotion-similarity perspective, which progressively strengthens the model's ability to identify confusing emotions. With the proposed model-agnostic hybrid curriculum learning strategy, we observe significant performance boosts over a wide range of existing ERC models, and we are able to achieve new state-of-the-art results on four public ERC datasets.
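The paper's exact scoring formula is not given here; the sketch below is a hedged illustration of the conversation-level curriculum idea, measuring difficulty as the fraction of consecutive utterance pairs whose gold emotion labels differ and ordering conversations easy to hard.

from typing import List

def emotion_shift_ratio(labels: List[str]) -> float:
    """Fraction of consecutive utterance pairs whose emotion labels differ."""
    if len(labels) < 2:
        return 0.0
    shifts = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return shifts / (len(labels) - 1)

# Each conversation is represented here only by its gold emotion sequence.
conversations = [
    {"id": "c1", "emotions": ["joy", "joy", "joy"]},
    {"id": "c2", "emotions": ["neutral", "anger", "joy", "sadness"]},
    {"id": "c3", "emotions": ["neutral", "neutral", "surprise"]},
]

# "Easy to hard": train first on conversations with few emotion shifts.
curriculum = sorted(conversations, key=lambda c: emotion_shift_ratio(c["emotions"]))
print([c["id"] for c in curriculum])  # ['c1', 'c3', 'c2']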
Emotion recognition in conversations (ERC) has been widely studied due to its potential application prospects. The key point of ERC is how to effectively model the context of each utterance and the corresponding speaker. Existing works generally resort to deep learning methods to capture contextual characteristics, which can be divided into sequence-based and graph-based methods. Another direction is to improve the performance of existing models by incorporating various kinds of external knowledge, which we classify as knowledge-based methods.
Sequence-based Methods Many previous works consider contextual information as utterance sequences. ICON (Hazarika et al. 2018a) and CMN (Hazarika et al. 2018b) both utilize gated recurrent units (GRUs) to model the utterance sequences. DialogueRNN (Majumder et al. 2019) employs a GRU to capture the global context, which is updated by the speaker-state GRUs. Jiao et al. (2019) propose a hierarchical neural network model that comprises two GRUs for the modelling of tokens and utterances respectively. Hu, Wei, and Huai (2021) introduce multi-turn reasoning modules on a bidirectional LSTM to model the ERC task from a cognitive perspective.
Graph-based Methods In this category, some existing works (Ghosal et al. 2019; Ishiwatari et al. 2020; Zhang et al. 2019) utilize various graph neural networks to capture multiple dependencies in the conversation. DialogXL (Shen et al. 2021a) modifies the memory block in XLNet (Yang et al. 2019) to store historical context and leverages the self-attention mechanism in XLNet to deal with the multi-turn multi-party structure in conversation. Shen et al. (2021b) design a directed acyclic graph (DAG) to model the intrinsic structure within a conversation, which achieves state-of-the-art performance without introducing external knowledge.
Knowledge-based Methods KET (Zhong, Wang, and Miao 2019) employs hierarchical transformers with concept representations extracted from ConceptNet (Speer and Lowry-Duda 2017) for emotion detection, and is the first ERC model to integrate commonsense knowledge. COSMIC (Ghosal et al. 2020) adopts a network structure very close to DialogueRNN and adds external commonsense knowledge from ATOMIC (Sap et al. 2019) to improve its performance. TODKAT (Zhu et al. 2021) leverages an encoder-decoder architecture which incorporates topic representation with commonsense knowledge from ATOMIC for ERC.
DailyDialog
SOTA: 64.07
S+PAGE: A Speaker and Position-Aware Graph Neural Network Model for Emotion Recognition in Conversation
Emotion recognition in conversation (ERC) has attracted much attention in recent years for its necessity in widespread applications. Existing ERC methods mostly model the self and inter-speaker context separately, posing a major issue of insufficient interaction between the two. In this paper, we propose a novel Speaker and Position-Aware Graph neural network model for ERC (S+PAGE), which contains three stages to combine the benefits of both Transformer and relational graph convolution network (R-GCN) for better contextual modeling. Firstly, a two-stream conversational Transformer is presented to extract the coarse self and inter-speaker contextual features for each utterance. Then, a speaker- and position-aware conversation graph is constructed, and we propose an enhanced R-GCN model, called PAG, to refine the coarse features guided by a relative positional encoding. Finally, the features from both of the former two stages are fed into a conditional random field layer to model the emotion transfer. Extensive experiments demonstrate that our model achieves state-of-the-art performance on three ERC datasets.
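S+PAGE's PAG module itself is not reproduced here; as a rough illustration of the graph-construction stage only, the sketch below assigns each edge between two utterances a relation type based on whether they share a speaker and on their relative position. The relation taxonomy (self/inter x past/future) and the context window are assumptions for this sketch, not the paper's exact design.

from typing import List, Tuple

def build_graph(speakers: List[str], window: int = 2) -> List[Tuple[int, int, str, int]]:
    """Return edges (src, dst, relation, relative_position) linking each
    utterance to its neighbours within `window` turns."""
    edges = []
    for i in range(len(speakers)):
        for j in range(max(0, i - window), min(len(speakers), i + window + 1)):
            if i == j:
                continue
            who = "self" if speakers[i] == speakers[j] else "inter"
            when = "past" if j < i else "future"
            edges.append((j, i, f"{who}-{when}", j - i))
    return edges

# Example: a four-turn conversation among speakers A, B, A, C.
for edge in build_graph(["A", "B", "A", "C"]):
    print(edge)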
EmoryNLP
Emotion Detection aims to classify a fine-grained emotion for each utterance in multiparty dialogue.
Our annotation is based on the primary emotions in the Feeling Wheel (Willcox, 1982). An example of the data format:
{
  "utterance_id": "s01_e02_c01_u002",
  "speakers": ["Joey Tribbiani"],
  "transcript": "Yeah, right!.......Y'serious?",
  "tokens": [
    ["Yeah", ",", "right", "!"],
    ["......."],
    ["Y'serious", "?"]
  ],
  "emotion": "Neutral"
},
{
  "utterance_id": "s01_e02_c01_u003",
  "speakers": ["Phoebe Buffay"],
  "transcript": "Oh, yeah!",
  "tokens": [
    ["Oh", ",", "yeah", "!"]
  ],
  "emotion": "Joyful"
},
{
  "utterance_id": "s01_e02_c01_u004",
  "speakers": ["Rachel Green"],
  "transcript": "Everything you need to know is in that first kiss.",
  "tokens": [
    ["Everything", "you", "need", "to", "know", "is", "in", "that", "first", "kiss", "."]
  ],
  "emotion": "Powerful"
}
SOTA: 46.11
Hybrid Curriculum Learning for Emotion Recognition in Conversation
EmoContext
In this task, you are given a textual dialogue, i.e., a user utterance along with two turns of context, and you have to classify the emotion of the user utterance as one of four emotion classes: happy, sad, angry, or others.
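For concreteness, a single instance can be pictured as three turns plus a label for the last user turn; the field names below are illustrative placeholders, not the official file schema.

# Illustrative EmoContext-style instance: classify the emotion of turn3
# given turn1 and turn2 as context. Field names are placeholders.
example = {
    "turn1": "I failed my exam",        # user
    "turn2": "Oh no, what happened?",   # system
    "turn3": "I just feel terrible",    # user utterance to classify
    "label": "sad",                     # one of: happy, sad, angry, others
}
print(example["turn3"], "->", example["label"])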
SOTA: 0.777
NELEC at SemEval-2019 Task 3: Think Twice Before Going Deep
Existing machine learning techniques yield close to human performance on text-based classification tasks. However, the presence of multi-modal noise in chat data, such as emoticons, slang, spelling mistakes, code-mixed data, etc., makes existing deep-learning solutions perform poorly. The inability of deep-learning systems to robustly capture these covariates puts a cap on their performance. We propose NELEC: Neural and Lexical Combiner, a system which elegantly combines textual and deep-learning based methods for sentiment classification. We evaluate our system as part of the third task of 'Contextual Emotion Detection in Text' as part of SemEval-2019 (Chatterjee et al., 2019b). Our system performs significantly better than the baseline, as well as our deep-learning model benchmarks. It achieved a micro-averaged F1 score of 0.7765, ranking 3rd on the test-set leaderboard. Our code is available at https://github.com/iamgroot42/nelec
2.2 NELEC: Neural and Lexical Combiner
Since neural features have a lot of shortcomings, we shift our focus to lexical features. Using a combination of both lexical features (n-gram features, etc.) and neural features (scores from neural classifiers), we trained a standard LightGBM (Ke et al., 2017) model for 100 iterations, with feature sub-sampling of 0.7 and data sub-sampling of 0.7 using bagging with a frequency of 1.0. We use 10^{-2} * ||weights||_2 as regularization. We also experimented with a logistic regression model, but it had a significant drop in performance for the 'happy' and 'angry' classes (Table 2). The total number of features used is 9270, out of which 9189 are sparse. The features we use in our model are listed below; a hedged sketch of this LightGBM configuration follows the feature list:
Turn Wise Word n-Grams
Turn Wise Char n-Grams
Valence Arousal Dominance
Emotion Intensity
https://lightgbm.readthedocs.io/en/latest/
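A sketch of such a LightGBM setup is below. NELEC's real feature pipeline (9270 lexical and neural features) and objective are not reproduced; the parameter names simply map the description above onto LightGBM's standard options (feature_fraction, bagging_fraction, bagging_freq, lambda_l2), and the data is random placeholder input.

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 20))            # stand-in for the lexical + neural features
y = rng.integers(0, 4, size=200)     # 4 classes: happy, sad, angry, others

params = {
    "objective": "multiclass",
    "num_class": 4,
    "feature_fraction": 0.7,   # feature sub-sampling of 0.7
    "bagging_fraction": 0.7,   # data sub-sampling of 0.7
    "bagging_freq": 1,         # bagging with a frequency of 1
    "lambda_l2": 1e-2,         # 10^-2 * ||weights||_2 regularization
    "verbose": -1,
}

train_set = lgb.Dataset(X, label=y)
booster = lgb.train(params, train_set, num_boost_round=100)  # 100 iterations
print(booster.predict(X[:3]).argmax(axis=1))                 # predicted class ids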
SEMAINE (more suitable for AGI)
SEMAINE has created a large audiovisual database as part of an iterative approach to building
Sensitive Artificial Listener (SAL) agents that can engage a person in a sustained, emotionally
coloured conversation. Data used to build the agents came from interactions between users and an
'operator' simulating a SAL agent, in different configurations: Solid SAL (designed so that operators
displayed appropriate non-verbal behaviour) and Semi-automatic SAL (designed so that users'
experience approximated interacting with a machine). We then recorded user interactions with the
developed system, Automatic SAL, comparing the most communicatively competent version to
versions with reduced nonverbal skills. High quality recording was provided by 5 high-resolution,
high framerate cameras, and 4 microphones, recorded synchronously. Recordings total 150
participants, for a total of 959 conversations with individual SAL characters, lasting approximately 5
minutes each. Solid SAL recordings are transcribed and extensively annotated: 6-8 raters per clip
traced five affective dimensions and 27 associated categories. Other scenarios are labelled on the
same pattern, but less fully. Additional information includes FACS annotation on selected extracts,
identification of laughs, nods and shakes, and measures of user engagement with the automatic
system. The material is available through a web-accessible database.
SOTA: 0.189 (MAE)
EmotionLines
We introduce EmotionLines, the first dataset with emotion labels on all utterances in each dialogue, based only on their textual content. Dialogues in EmotionLines are collected from Friends TV scripts and private Facebook Messenger dialogues. Then one of seven emotions, the six Ekman basic emotions plus the neutral emotion, is labeled on each utterance by 5 Amazon MTurkers. A total of 29,245 utterances from 2,000 dialogues are labeled in EmotionLines.
SOTA: 63.03
Hierarchical Transformer Network for Utterance-level Emotion Recognition
While there have been significant advances in detecting emotions in text, in the field of utterance-level emotion recognition (ULER) there are still many problems to be solved. In this paper, we address some challenges in ULER in dialog systems. (1) The same utterance can deliver different emotions when it is in different contexts or from different speakers. (2) Long-range contextual information is hard to capture effectively. (3) Unlike the traditional text classification problem, this task is supported by a limited number of datasets, most of which contain inadequate conversations or speech. To address these problems, we propose a hierarchical transformer framework (apart from the description of other studies, the "transformer" in this paper usually refers to the encoder part of the transformer) with a lower-level transformer to model the word-level input and an upper-level transformer to capture the context of utterance-level embeddings. We use a pretrained language model, bidirectional encoder representations from transformers (BERT), as the lower-level transformer, which is equivalent to introducing external data into the model and solves the problem of data shortage to some extent. In addition, we add speaker embeddings to the model for the first time, which enables our model to capture the interaction between speakers. Experiments on three dialog emotion datasets, Friends, EmotionPush, and EmoryNLP, demonstrate that our proposed hierarchical transformer network models achieve 1.98%, 2.83%, and 3.94% improvements, respectively, over the state-of-the-art methods on each dataset in terms of macro-F1.
HiTransformer-s: Hierarchical Transformer with speaker embeddings
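The authors' code is not reproduced here; the following is a minimal PyTorch sketch of the HiTransformer-s idea under stated assumptions: bert-base-uncased as the lower-level encoder, [CLS] pooling per utterance, additive speaker embeddings, and a small 2-layer upper-level Transformer encoder over the utterance sequence.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class HiTransformerS(nn.Module):
    def __init__(self, num_emotions: int, max_speakers: int = 10, hidden: int = 768):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")   # lower-level transformer
        self.speaker_emb = nn.Embedding(max_speakers, hidden)        # speaker embeddings
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.upper = nn.TransformerEncoder(layer, num_layers=2)      # upper-level transformer
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, input_ids, attention_mask, speaker_ids):
        # input_ids: (num_utterances, seq_len); speaker_ids: (num_utterances,)
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]  # [CLS] per utterance
        utt = cls + self.speaker_emb(speaker_ids)            # inject speaker identity
        ctx = self.upper(utt.unsqueeze(0)).squeeze(0)        # contextualize over utterances
        return self.classifier(ctx)                          # (num_utterances, num_emotions)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
utterances = ["I got the job!", "That's amazing!", "Thanks, I'm so happy."]
speakers = torch.tensor([0, 1, 0])
enc = tokenizer(utterances, padding=True, return_tensors="pt")
model = HiTransformerS(num_emotions=7)
logits = model(enc["input_ids"], enc["attention_mask"], speakers)
print(logits.shape)  # torch.Size([3, 7])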