Emotion Recognition in Conversation

I Problem: Given the transcript of a conversation along with speaker information for each constituent utterance, the ERC task aims to identify the emotion of each utterance from a set of pre-defined emotions. Formally, given an input sequence of N utterances [(u_1, p_1), (u_2, p_2), ..., (u_N, p_N)], where each utterance u_i = [u_{i,1}, u_{i,2}, ..., u_{i,T}] consists of T words u_{i,j} and is spoken by party p_i, the task is to predict the emotion label e_i of each utterance u_i.

II Task type: deep learning, lexical feature extraction, neural feature extraction.

III Data and SOTA methods.

MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversation. The data includes video, audio, and text of conversations with two or more participants.
SOTA: 66.52, CoMPM: Context Modeling with Speaker's Pre-trained Memory Tracking for Emotion Recognition in Conversation.

Related work: Many recent studies use external knowledge to improve ERC performance. KET (Zhong et al., 2019) uses ConceptNet (Speer et al., 2017; https://conceptnet.io/) and the emotion lexicon NRC_VAD (Mohammad, 2018) as external commonsense knowledge. ConceptNet is a knowledge graph that connects words and phrases in natural language using labeled edges. The NRC_VAD lexicon has human ratings of valence, arousal, and dominance for more than 20,000 English words. COSMIC (Ghosal et al., 2020) and Psychological (Li et al., 2021) improve emotion recognition by extracting commonsense knowledge from the previous utterances. The commonsense knowledge features are extracted with COMET (Bosselut et al., 2019), which is trained on ATOMIC (The Atlas of Machine Commonsense) (Sap et al., 2019). ATOMIC covers nine if-then sentence relation types of inferential commonsense knowledge expressed in text. ToDKAT (Zhu et al., 2021) improves performance by adding commonsense knowledge from COMET and topic discovery with VHRED (Serban et al., 2017) to the model. Ekman (1992) constructs a taxonomy of six basic emotions (Joy, Sadness, Fear, Anger, Surprise, and Disgust) from human facial expressions, and explains that a multimodal view is important for recognizing multiple emotions. Multimodal datasets such as MELD and IEMOCAP are standard benchmarks for emotion recognition and are composed of text, speech, and vision data. Datcu and Rothkrantz (2014) use speech and visual information to recognize emotions, and Alm et al. (2005) attempt to recognize emotions based on text information alone. MELD and ICON (Hazarika et al., 2018a) show that the more multimodal information is used, the better the performance, and that text plays the most important role. However, multimodal information is not always available in social media, especially in chatbot systems, which are mainly text-based; in this work, we design and introduce a text-based emotion recognition system using neural networks. Zhou et al. (2018) and Zhang et al. (2018a) show that commonsense knowledge is important for understanding conversations and generating appropriate responses. Liu et al. (2020) report that the lack of external knowledge makes it difficult to classify implicit emotions from the conversation history. EDA (Bothe et al., 2020) expands the multimodal emotion datasets by extracting dialog acts from MELD and IEMOCAP and finds a correlation between dialogue acts and emotion labels.
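To make the problem formulation above concrete, here is a minimal sketch in Python of how one ERC instance could be represented (the class and field names are illustrative only and are not taken from any of the cited papers or dataset loaders):

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Utterance:
    tokens: List[str]              # u_i = [u_{i,1}, ..., u_{i,T}]
    speaker: str                   # party p_i
    emotion: Optional[str] = None  # gold label e_i (None at prediction time)

# A conversation is an ordered sequence of (utterance, speaker) pairs;
# an ERC model maps the whole sequence to one emotion label per utterance.
conversation: List[Utterance] = [
    Utterance(["I", "got", "the", "job", "!"], speaker="A", emotion="joy"),
    Utterance(["That", "is", "great", "news", "."], speaker="B", emotion="joy"),
]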
PM: Pre-trained Memory Module (CoMPM). External knowledge is known to play an important role in understanding conversation. Pre-trained language models can be trained on numerous corpora and used as an external knowledge base. Inspired by previous studies showing that the speaker's knowledge helps in judging emotions, we extract and track pre-trained memory from the speaker's previous utterances to help recognize the emotion of the current utterance u_t. If the speaker has never appeared before the current turn, the output of the pre-trained memory is treated as a zero vector.

IEMOCAP: The Interactive Emotional Dyadic Motion Capture database. It contains approximately 12 hours of audiovisual data, including video, speech, motion capture of the face, and text transcriptions. IEMOCAP is annotated by multiple annotators with categorical labels, such as anger, happiness, sadness, and neutrality, as well as dimensional labels such as valence, activation, and dominance.
SOTA: 68.73, Hybrid Curriculum Learning for Emotion Recognition in Conversation.

Related work: Emotion recognition in conversation (ERC) aims to detect the emotion label of each utterance. Motivated by recent studies which have proven that feeding training examples in a meaningful order rather than considering them randomly can boost the performance of models, we propose an ERC-oriented hybrid curriculum learning framework. Our framework consists of two curricula: (1) a conversation-level curriculum (CC) and (2) an utterance-level curriculum (UC). In CC, we construct a difficulty measurer based on the frequency of "emotion shifts" within a conversation, and the conversations are then scheduled in an easy-to-hard order according to the difficulty score returned by the measurer. UC is implemented from an emotion-similarity perspective and progressively strengthens the model's ability to identify confusing emotions. With the proposed model-agnostic hybrid curriculum learning strategy, we observe significant performance boosts over a wide range of existing ERC models and are able to achieve new state-of-the-art results on four public ERC datasets.
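As an illustration of the conversation-level curriculum, the emotion-shift difficulty measurer described above could be sketched as follows in Python (this is only an assumed implementation of the stated idea; the paper's exact normalization and scheduling may differ):

from typing import List

def emotion_shift_difficulty(labels: List[str]) -> float:
    # Difficulty of a conversation = frequency of emotion shifts between
    # consecutive utterances (0.0 means no shifts, 1.0 means every turn shifts).
    if len(labels) < 2:
        return 0.0
    shifts = sum(a != b for a, b in zip(labels, labels[1:]))
    return shifts / (len(labels) - 1)

# Easy-to-hard schedule: order conversations by their difficulty score.
conversations = [["joy", "joy", "neutral"], ["anger", "joy", "sadness", "joy"]]
ordered = sorted(conversations, key=emotion_shift_difficulty)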
Emotion recognition in conversation has been widely studied due to its potential applications. The key point of ERC is how to effectively model the context of each utterance and the corresponding speaker. Existing works generally resort to deep learning methods to capture contextual characteristics and can be divided into sequence-based and graph-based methods. Another direction is to improve the performance of existing models by incorporating various kinds of external knowledge, classified here as knowledge-based methods.

Sequence-based methods: Many previous works consider contextual information as utterance sequences. ICON (Hazarika et al. 2018a) and CMN (Hazarika et al. 2018b) both utilize gated recurrent units (GRUs) to model the utterance sequences. DialogueRNN (Majumder et al. 2019) employs a GRU to capture the global context, which is updated by the speaker-state GRUs. Jiao et al. (2019) propose a hierarchical neural network model that comprises two GRUs for modelling tokens and utterances, respectively. Hu, Wei, and Huai (2021) introduce multi-turn reasoning modules on a bi-directional LSTM to model the ERC task from a cognitive perspective.

Graph-based methods: In this category, some existing works (Ghosal et al. 2019; Ishiwatari et al. 2020; Zhang et al. 2019) utilize various graph neural networks to capture multiple dependencies in the conversation. DialogXL (Shen et al. 2021a) modifies the memory block in XLNet (Yang et al. 2019) to store historical context and leverages the self-attention mechanism in XLNet to deal with the multi-turn, multi-party structure of conversation. Shen et al. (2021b) design a directed acyclic graph (DAG) to model the intrinsic structure within a conversation, which achieves state-of-the-art performance without introducing external knowledge.

Knowledge-based methods: KET (Zhong, Wang, and Miao 2019) employs hierarchical transformers with concept representations extracted from ConceptNet (Speer and Lowry-Duda 2017) for emotion detection and is the first ERC model to integrate commonsense knowledge. COSMIC (Ghosal et al. 2020) adopts a network structure very close to DialogueRNN and adds external commonsense knowledge from ATOMIC (Sap et al. 2019) to improve its performance. TODKAT (Zhu et al. 2021) leverages an encoder-decoder architecture that incorporates topic representations together with commonsense knowledge from ATOMIC for ERC.

DailyDialog
SOTA: 64.07, S+PAGE: A Speaker and Position-Aware Graph Neural Network Model for Emotion Recognition in Conversation.
Emotion recognition in conversation (ERC) has attracted much attention in recent years for its necessity in widespread applications. Existing ERC methods mostly model the self and inter-speaker context separately, posing the major issue of lacking enough interaction between the two. In this paper, we propose a novel Speaker and Position-Aware Graph neural network model for ERC (S+PAGE), which contains three stages to combine the benefits of both the Transformer and the relational graph convolutional network (R-GCN) for better contextual modeling. First, a two-stream conversational Transformer is presented to extract coarse self and inter-speaker contextual features for each utterance. Then, a speaker- and position-aware conversation graph is constructed, and we propose an enhanced R-GCN model, called PAG, to refine the coarse features guided by a relative positional encoding. Finally, the features from the former two stages are input into a conditional random field layer to model the emotion transfer. Extensive experiments demonstrate that our model achieves state-of-the-art performance on three ERC datasets.

EmoryNLP
Emotion detection aims to classify a fine-grained emotion for each utterance in multiparty dialogue. Our annotation is based on the primary emotions in the Feeling Wheel (Willcox, 1982). Example entries:

{
  "utterance_id": "s01_e02_c01_u002",
  "speakers": ["Joey Tribbiani"],
  "transcript": "Yeah, right!.......Y'serious?",
  "tokens": [
    ["Yeah", ",", "right", "!"],
    ["......."],
    ["Y'serious", "?"]
  ],
  "emotion": "Neutral"
},
{
  "utterance_id": "s01_e02_c01_u003",
  "speakers": ["Phoebe Buffay"],
  "transcript": "Oh, yeah!",
  "tokens": [
    ["Oh", ",", "yeah", "!"]
  ],
  "emotion": "Joyful"
},
{
  "utterance_id": "s01_e02_c01_u004",
  "speakers": ["Rachel Green"],
  "transcript": "Everything you need to know is in that first kiss.",
  "tokens": [
    ["Everything", "you", "need", "to", "know", "is", "in", "that", "first", "kiss", "."]
  ],
  "emotion": "Powerful"
}

SOTA: 46.11, Hybrid Curriculum Learning for Emotion Recognition in Conversation.

EmoContext
In this task, given a textual dialogue, i.e., a user utterance along with two turns of context, the goal is to classify the emotion of the user utterance as one of the emotion classes: happy, sad, angry, or others.
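For reference, one EmoContext-style instance can be pictured roughly as below: three textual turns plus one label for the final user turn (the field names are illustrative and not the official file schema):

# One EmoContext-style instance: a three-turn exchange where the emotion
# of the last user turn must be classified (illustrative field names).
example = {
    "turn1": "I just failed my driving test again",  # user
    "turn2": "Oh no, that's unfortunate",            # other party
    "turn3": "I feel like giving up",                # user turn to classify
    "label": "sad",  # one of: happy, sad, angry, others
}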
SOTA: 0.777, NELEC at SemEval-2019 Task 3: Think Twice Before Going Deep.
Existing machine learning techniques yield close-to-human performance on text-based classification tasks. However, the presence of multi-modal noise in chat data, such as emoticons, slang, spelling mistakes, and code-mixed data, makes existing deep-learning solutions perform poorly. The inability of deep-learning systems to robustly capture these covariates puts a cap on their performance. We propose NELEC: Neural and Lexical Combiner, a system which elegantly combines textual and deep-learning based methods for sentiment classification. We evaluate our system on the third task of SemEval-2019, 'Contextual Emotion Detection in Text' (Chatterjee et al., 2019b). Our system performs significantly better than the baseline, as well as our deep-learning model benchmarks. It achieved a micro-averaged F1 score of 0.7765, ranking 3rd on the test-set leaderboard. Our code is available at https://github.com/iamgroot42/nelec.

NELEC: Neural and Lexical Combiner. Since neural features have a lot of shortcomings, we shift our focus to lexical features. Using a combination of both lexical features (n-gram features, etc.) and neural features (scores from neural classifiers), we trained a standard LightGBM (Ke et al., 2017; https://lightgbm.readthedocs.io/en/latest/) model for 100 iterations, with feature sub-sampling of 0.7 and data sub-sampling of 0.7 using bagging with a frequency of 1. We use 10^-2 * ||weights||_2 as regularization. We also experimented with a logistic regression model, but it had a significant drop in performance for the 'happy' and 'angry' classes (Table 2). The total number of features used is 9270, out of which 9189 are sparse. The features used in the model are: turn-wise word n-grams; turn-wise character n-grams; valence, arousal, and dominance; and emotion intensity.
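A LightGBM configuration consistent with the settings reported above might look like the following sketch (the random feature matrix, the multiclass objective, and the class count are assumptions for illustration, not the authors' released pipeline):

import lightgbm as lgb
import numpy as np

# Hypothetical stand-in for the combined feature matrix (lexical n-gram /
# VAD / emotion-intensity features concatenated with neural classifier
# scores); 9270 columns matches the feature count reported above.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 9270))
y_train = rng.integers(0, 4, 1000)        # happy / sad / angry / others

params = {
    "objective": "multiclass",  # assumed objective for the 4-class task
    "num_class": 4,
    "feature_fraction": 0.7,    # feature sub-sampling of 0.7
    "bagging_fraction": 0.7,    # data sub-sampling of 0.7
    "bagging_freq": 1,          # bagging at every iteration
    "lambda_l2": 1e-2,          # L2 regularization weight
}
booster = lgb.train(params, lgb.Dataset(X_train, label=y_train),
                    num_boost_round=100)  # 100 boosting iterations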
SEMAINE (more suitable for AGI)
SEMAINE has created a large audiovisual database as part of an iterative approach to building Sensitive Artificial Listener (SAL) agents that can engage a person in a sustained, emotionally coloured conversation. Data used to build the agents came from interactions between users and an 'operator' simulating a SAL agent, in different configurations: Solid SAL (designed so that operators displayed appropriate non-verbal behaviour) and Semi-automatic SAL (designed so that users' experience approximated interacting with a machine). We then recorded user interactions with the developed system, Automatic SAL, comparing the most communicatively competent version to versions with reduced non-verbal skills. High-quality recording was provided by 5 high-resolution, high-framerate cameras and 4 microphones, recorded synchronously. The recordings cover 150 participants, for a total of 959 conversations with individual SAL characters, each lasting approximately 5 minutes. Solid SAL recordings are transcribed and extensively annotated: 6-8 raters per clip traced five affective dimensions and 27 associated categories. Other scenarios are labelled on the same pattern, but less fully. Additional information includes FACS annotation on selected extracts, identification of laughs, nods, and shakes, and measures of user engagement with the automatic system. The material is available through a web-accessible database.
SOTA: 0.189 (MAE)

EmotionLines
We introduce EmotionLines, the first dataset with emotion labels on all utterances in each dialogue, based only on their textual content. Dialogues in EmotionLines are collected from Friends TV scripts and private Facebook Messenger dialogues. One of seven emotions, the six Ekman basic emotions plus neutral, is then labeled on each utterance by 5 Amazon MTurk workers. A total of 29,245 utterances from 2,000 dialogues are labeled in EmotionLines.
SOTA: 63.03, Hierarchical Transformer Network for Utterance-level Emotion Recognition.

While there have been significant advances in detecting emotions in text, in the field of utterance-level emotion recognition (ULER) there are still many problems to be solved. In this paper, we address some challenges of ULER in dialog systems: (1) the same utterance can deliver different emotions when it appears in different contexts or comes from different speakers; (2) long-range contextual information is hard to capture effectively; and (3) unlike the traditional text classification problem, this task is supported by only a limited number of datasets, most of which contain inadequate conversations or speech. To address these problems, we propose a hierarchical transformer framework (apart from descriptions of other studies, "transformer" in this paper usually refers to the encoder part of the Transformer) with a lower-level transformer to model the word-level input and an upper-level transformer to capture the context of utterance-level embeddings. We use the pre-trained language model BERT (bidirectional encoder representations from transformers) as the lower-level transformer, which is equivalent to introducing external data into the model and solves the problem of data shortage to some extent. In addition, we add speaker embeddings to the model for the first time, which enables our model to capture the interaction between speakers. Experiments on three dialog emotion datasets, Friends, EmotionPush, and EmoryNLP, demonstrate that our proposed hierarchical transformer network models achieve 1.98%, 2.83%, and 3.94% improvements, respectively, over the state-of-the-art methods on each dataset in terms of macro-F1. HiTransformer-s: Hierarchical Transformer with speaker embeddings.
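A minimal sketch of the upper level of such a hierarchical model with speaker embeddings might look as follows (the dimensions, layer counts, and the assumption that a lower-level encoder such as BERT has already produced one vector per utterance are illustrative choices, not the paper's implementation):

import torch
import torch.nn as nn

class HiTransformerS(nn.Module):
    # Upper-level transformer over utterance embeddings, with an added
    # speaker embedding per utterance and a per-utterance emotion classifier.
    def __init__(self, utt_dim=768, n_speakers=10, n_emotions=7,
                 n_heads=8, n_layers=2):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, utt_dim)
        layer = nn.TransformerEncoderLayer(d_model=utt_dim, nhead=n_heads,
                                           batch_first=True)
        self.context_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(utt_dim, n_emotions)

    def forward(self, utt_vecs, speaker_ids):
        # utt_vecs: (batch, num_utterances, utt_dim) utterance embeddings
        #           from a lower-level encoder (e.g. BERT [CLS] vectors)
        # speaker_ids: (batch, num_utterances) integer speaker indices
        x = utt_vecs + self.speaker_emb(speaker_ids)
        x = self.context_encoder(x)
        return self.classifier(x)  # per-utterance emotion logits

# Example: one conversation of 3 utterances, each already encoded to 768-d.
model = HiTransformerS()
logits = model(torch.randn(1, 3, 768), torch.tensor([[0, 1, 0]]))
print(logits.shape)  # torch.Size([1, 3, 7])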