Speech Emotion Recognition
(By Björn W. Schuller)
Shivam Jain (2020CS50626)
Shivam Verma (2020CS50442)

Presentation Overview
• Introduction
• Past Efforts
• The Traditional Approach
• The Ongoing Trends
• Challenges Ahead

Introduction
• Speech Emotion Recognition (SER) is the task of recognizing the emotion in speech, irrespective of its semantic content.
• Even the most primitive listeners can recognize tones of love, fear, and anger, and this ability is shared by animals: the dog, the horse, and many other animals can understand emotion from speech.
• The time has come for computers to understand it as well.
Image source: https://www.projectpro.io/article/speech-emotion-recognition-project-using-machine-learning/573

Past Efforts
• The first-ever hardware product, the “Handy Truster”, claimed to be able to sense human stress levels.
• Approximately 10 years later, the first broad-consumer-market video game appeared: “Truth or Lies” (THQ) came with a disc and a microphone so that players could bring the popular “Spin the Bottle” game to the digital age.
• Today, Alexa, Cortana, Siri, and many more dialogue systems have hit the consumer market on a broader basis than ever.

The Traditional Approach: Modelling
• The first step towards automatic emotion recognition is choosing an appropriate emotion representation model.
• The model should represent emotions adequately, in line with the literature, while remaining simple enough for a machine to handle.
• Two models are widely used:
1) Discrete classes: such as the “big six” emotion categories (anger, disgust, fear, happiness, sadness, and surprise).
2) Dimensional approach: emotions are represented along dimensions such as valence and activation/arousal.
Figure: The dimensional approach. Image source: https://www.researchgate.net/figure/The-dimensional-approach_fig11_262409896

The Traditional Approach: Annotation
• Once a model is decided upon, the next crucial issue is usually acquiring labelled data for training and testing that suits the emotion representation model.
• The problem is that emotion labels are highly subjective and uncertain.
• To address this, observer ratings can be a more appropriate label for automatic emotion recognition.
• Several schemes combine the ratings of multiple observers into one label:
1) Majority/Average
2) Evaluator Weighted Estimator

The Traditional Approach: Audio Features
• It is important to design audio features that best reflect the emotional content of speech and are robust across different languages and different accents of the same language.
• Commonly used audio features include intonation, intensity, rhythm, and frequency-related descriptors.
• Usually, about 1 second of audio material per unit of analysis is recommended. This balances the tradeoff between more information and higher parameter variability.
• The current trend is to increase the number of features to several thousand, despite the sparse amount of training material available in this field.

The Traditional Approach: Textual Features
• Textual features, derived from the output of an automatic speech recognition engine, mostly look at individual words or sequences of words.
• N-grams: look at sequences of n words and estimate the probability that each sequence is associated with an emotion class.
• Bag of Words: a text is represented as a bag (multiset) of words. It is commonly used in document classification; here, the frequency of occurrence of each word serves as a feature for classifying the emotion.

Ongoing Trends
1. Holistic Speaker Modelling
● Looking at the broader picture.
● Analysing states and traits of the speaker other than the emotion of interest.
Problem: There are no richly annotated speech data resources that cover a wide variety of states and traits.
Solution: Use weakly supervised cross-task labelling to relabel databases of emotional speech in a richer way.

Ongoing Trends
2. Efficient Data Collection
● Emotion-labelled speech data: an ever-present bottleneck!
● Major efforts are under way to build data collections. But collecting data is not enough; the model also needs efficient labelling.

Ongoing Trends
3. Weakly Supervised Learning
1. The idea is to let the engine itself label new data whenever its confidence is high enough (semi-supervised learning).
2. Human labelling stays in the loop, but the effort can be reduced further by active learning.
3. The machine preselects: “which ones can I do, which ones do I need help with?”
4. The machine gradually learns whom to trust and when. Putting semi-supervised and active learning together gives cooperative learning.
5. What if no initial data exists, but related data does? It can be exploited through transfer learning.
6. Features, trained models, or representations can be transferred to the new data domain.

Ongoing Trends
3. Weakly Supervised Learning
[Figure: Active learning]

Challenges!
• Robustness across various cultures and languages.
• Another challenge lies in recognizing ‘sarcasm’ or ‘irony’ in human speech. These are complex emotional states; could information from other modalities help?
• Moonshot challenge: to make the speech analyser itself identify the true/inner emotion without relying on other human raters’ assessments.
Improve the robustness, accuracy, and cultural sensitivity of SER systems.

Potential Applications
Speech emotion recognition (SER) technology has numerous potential applications across various industries and domains.
Some of the most common uses of SER include:
• Healthcare: SER can be used to monitor patients' emotional states and responses to treatment, especially in mental health and neurological disorders. It can also help diagnose certain conditions, such as depression or anxiety.
• Security: SER can be used in security settings to detect stress and deception in speech signals, which can help identify potential threats or criminal activities.
Overall, the use of SER is still in its early stages, and there is considerable potential for further research and development in this field.

Conclusion
• The journey from the first patent in 1978 to the first end-to-end learning model has been quite eventful, and we are still learning.
• We are currently witnessing exciting changes, such as learning features directly from data and holistic models.
• All this research can soon lead to applications used by the masses across domains such as security and health.

Bibliography
1. https://dl.acm.org/doi/10.1145/3395035.3425255
2. https://cacm.acm.org/magazines/2018/5/227191-speech-emotion-recognition/abstract
3. https://ieeexplore.ieee.org/document/9774103
4. https://ieeexplore.ieee.org/document/8925524

THANK YOU.
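
Backup: Annotation Label Fusion (sketch)
The two fusion schemes named on the Annotation slide can be sketched in Python as follows. This is a minimal illustration, not the implementation from the cited work: the function names (majority_vote, pearson, evaluator_weighted_estimator) and the sample labels are hypothetical. The Evaluator Weighted Estimator variant here assumes that each rater is weighted by the Pearson correlation between their ratings and the plain average over all raters, with negative weights clipped to zero; other formulations exist.

```python
from collections import Counter
from math import sqrt


def majority_vote(labels):
    """Return the most frequent discrete label among the raters."""
    # On ties, Counter.most_common keeps the first-seen label.
    return Counter(labels).most_common(1)[0][0]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant sequence has no defined correlation
    return cov / sqrt(vx * vy)


def evaluator_weighted_estimator(ratings):
    """Fuse dimensional ratings; ratings[k][n] is rater k's score for utterance n."""
    n_raters = len(ratings)
    n_items = len(ratings[0])
    # Plain average over raters, used as the reference for the weights.
    mean = [sum(r[n] for r in ratings) / n_raters for n in range(n_items)]
    # Assumption of this sketch: negative correlations get weight zero.
    weights = [max(0.0, pearson(r, mean)) for r in ratings]
    total = sum(weights)
    if total == 0:
        return mean  # fall back to the plain average
    return [
        sum(w * r[n] for w, r in zip(weights, ratings)) / total
        for n in range(n_items)
    ]


# Discrete classes: three raters label one utterance.
print(majority_vote(["anger", "anger", "sadness"]))  # anger
```

Majority voting suits the discrete classes, while the weighted average suits dimensional ratings such as valence or arousal, where a rater who disagrees with the consensus contributes less to the final label.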