Speech Emotion Recognition
(By Björn W. Schuller)
Shivam Jain(2020CS50626)
Shivam Verma(2020CS50442)
Presentation Overview
• Introduction
• Past Efforts
• The Traditional Approach
• The Ongoing Trends
• Challenges Ahead
Introduction
• Speech Emotion Recognition (SER) is the task of
recognizing emotion from speech, irrespective of its
semantic content.
• Recognizing the tones of love, fear, and anger is a
primitive ability shared across species: the dog, the
horse, and many other animals can understand emotions
from speech.
• The time has come for computers to understand it as
well.
Image Source-https://www.projectpro.io/article/speech-emotion-recognitionproject-using-machine-learning/573
Past efforts
• The first-ever hardware product, the “Handy Truster”, claimed to be able to sense
human stress levels.
• Approximately 10 years later, the first video game for the broad consumer market appeared:
“Truth or Lies” (THQ) came with a disc and a microphone so that players could bring
the popular “Spin the Bottle” game into the digital age.
• Today Alexa, Cortana, Siri, and many more dialogue systems have hit the consumer
market on a broader basis than ever.
The Traditional Approach
Modelling
• The first step towards automatic emotion recognition is choosing an appropriate emotion
representation model.
• The model should represent emotions adequately, in line with the literature, while remaining
something a machine can handle.
• Two models are widely used:
1) Discrete Classes: the “big six” emotion categories, including anger, disgust, fear, happiness, and sadness.
2) Dimensional Approach: representing emotions with dimensions such as valence and activation/arousal.
Dimensional approach
Image source-https://www.researchgate.net/figure/Thedimensional-approach_fig11_262409896
The Traditional Approach
Annotation
• Once a model is decided upon, the next crucial issue is usually the acquisition of labelled
data for training and testing that suits the emotion representation model.
• The problem is the high subjectivity and uncertainty of emotion labels.
• To address this, observer ratings can be a more appropriate label in the case of automatic
emotion recognition.
• Various algorithms have been designed to merge several raters’ labels:
1) Majority/Average
2) Evaluator Weighted Estimator
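The two label-fusion strategies named above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the Evaluator Weighted Estimator is written in its commonly described form (each rater weighted by the correlation between their ratings and the average rating), and clipping negative weights to zero is an assumption made here to keep the weighted average well defined.

```python
from collections import Counter
from math import sqrt


def majority_label(labels):
    """Return the most frequent categorical label among the raters."""
    return Counter(labels).most_common(1)[0][0]


def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def evaluator_weighted_estimator(ratings):
    """ratings[k][n] is rater k's score for utterance n.

    Each rater is weighted by the correlation between their ratings and
    the plain average over all raters. Negative weights are clipped to 0
    (an assumption of this sketch).
    """
    n_utts = len(ratings[0])
    mean = [sum(r[n] for r in ratings) / len(ratings) for n in range(n_utts)]
    weights = [max(pearson(r, mean), 0.0) for r in ratings]
    total = sum(weights)
    if total == 0:
        return mean  # fall back to the unweighted average
    return [sum(w * r[n] for w, r in zip(weights, ratings)) / total
            for n in range(n_utts)]


# Toy arousal ratings, invented for illustration only.
arousal = [
    [0.8, 0.2, 0.5, 0.9],  # rater A
    [0.7, 0.3, 0.4, 0.8],  # rater B
    [0.1, 0.9, 0.6, 0.2],  # rater C, who disagrees with the others
]
print(majority_label(["anger", "anger", "neutral"]))  # anger
print(evaluator_weighted_estimator(arousal))
```

In this toy data rater C correlates negatively with the average, so the estimate ends up driven by raters A and B, which is the intended effect of weighting evaluators by reliability.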
The Traditional Approach
Audio Features
• It is important to design audio features that best reflect the emotional
content of the speech and are robust across different languages and
different accents of the same language.
• Commonly used audio features include intonation, intensity,
rhythm, and frequency classification.
• Usually, about 1 second of audio material is recommended as the unit of
analysis. This balances the trade-off between more information and
higher parameter variability.
• The current trend is to increase the number of features to several
thousand, despite the sparse amount of training material
available in this field.
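One common way to arrive at large feature sets is to compute a few frame-level descriptors and then summarise each one-second chunk with statistics over those frames. The sketch below shows the idea with two simple descriptors (RMS energy as a proxy for intensity, and zero-crossing rate). The function names, the 25 ms frame and 10 ms hop, and the synthetic tone are all assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev


def frames(signal, frame_len, hop):
    """Split a list of samples into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def rms_energy(frame):
    """Root-mean-square energy, a simple proxy for intensity."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)


def chunk_features(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Summarise one chunk by the mean and std of its frame-level descriptors."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    fr = frames(signal, frame_len, hop)
    energy = [rms_energy(f) for f in fr]
    zcr = [zero_crossing_rate(f) for f in fr]
    return {
        "energy_mean": mean(energy), "energy_std": pstdev(energy),
        "zcr_mean": mean(zcr), "zcr_std": pstdev(zcr),
    }


# One second of a synthetic 200 Hz tone at 8 kHz stands in for real speech.
sr = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * t / sr) for t in range(sr)]
features = chunk_features(tone, sr)
print(features)
```

Two descriptors times two statistics already give four features per chunk; adding more descriptors and more statistics multiplies quickly, which is how feature counts reach the thousands.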
The Traditional Approach
Textual Features
• Textual features, derived from the automatic speech recognition engine’s output,
mostly look at individual words or sequences of words.
• N-grams: look at sequences of n words and estimate the probability of each sequence
being associated with some emotion class.
• Bag of Words: a text is represented as a bag (multiset) of words. It is commonly used in
document classification, where the frequency of occurrence of each word is used as a
feature for the emotion classifier.
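Both textual representations can be illustrated in a few lines. This is a rough sketch under simple assumptions: whitespace tokenisation, relative-frequency estimates without smoothing, and toy transcripts invented for the example. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-word sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text):
    """Multiset of lower-cased words; word order is discarded."""
    return Counter(text.lower().split())


def ngram_class_probabilities(labelled_texts, n):
    """Estimate P(emotion | n-gram) by relative frequency in labelled transcripts."""
    counts = {}  # n-gram -> Counter of emotion labels
    for text, emotion in labelled_texts:
        for gram in ngrams(text.lower().split(), n):
            counts.setdefault(gram, Counter())[emotion] += 1
    return {gram: {emo: c / sum(by_emo.values()) for emo, c in by_emo.items()}
            for gram, by_emo in counts.items()}


# Toy transcripts, invented for illustration only.
data = [
    ("i am so happy today", "happiness"),
    ("i am so angry today", "anger"),
    ("so happy to see you", "happiness"),
]
probs = ngram_class_probabilities(data, 2)
print(bag_of_words("so happy so happy today"))
print(probs[("so", "happy")])  # {'happiness': 1.0}
print(probs[("i", "am")])      # {'happiness': 0.5, 'anger': 0.5}
```

The bigram "so happy" only ever occurs with happiness, while "i am" is uninformative; a classifier can use either the bag-of-words counts or such n-gram probabilities as input features.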
Ongoing Trends
1. Holistic Speaker Modelling
● Looking at the broader picture.
● Analysing speaker states and traits beyond the emotion of interest.
Problem: A lack of richly annotated speech data resources that
encompass a wide variety of states and traits.
Solution: Use weakly supervised cross-task labelling to relabel
databases of emotional speech in a richer way.
Ongoing Trends
2. Efficient Data Collection
● Emotion-labelled speech data: an ever-present bottleneck!
● Major efforts are under way to build data collections,
but data collection alone is not enough; efficient labelling is also
required for the model.
Ongoing Trends
3. Weakly Supervised Learning
1. The idea is to let the engine itself label new data whenever its confidence measure
is high enough (semi-supervised learning).
2. Human labelling stays in the loop, but active learning can reduce how much of it is needed.
3. The machine preselects data by asking “which ones can I label myself, and which do I need help with?”.
4. The machine gradually learns whom to trust and when. Putting semi-supervised and
active learning together gives cooperative learning.
5. What if no initial data exists? If related data does, it can be exploited through transfer
learning.
6. Features, trained models, and representations can be transferred to the new data domain.
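The cooperative-learning idea, where the engine keeps its confident labels and hands the rest to a human, can be sketched as a simple routing step. Everything here is an assumption for illustration: the function names, the 0.9 confidence threshold, and the toy predictor and annotator, which stand in for a trained model and a human rater.

```python
def cooperative_labelling(unlabelled, predict, ask_human, threshold=0.9):
    """Split new data into machine-labelled and human-labelled items.

    predict(item) -> (label, confidence in [0, 1])
    ask_human(item) -> label
    """
    machine_labelled, human_labelled = [], []
    for item in unlabelled:
        label, confidence = predict(item)
        if confidence >= threshold:
            # Semi-supervised step: trust the engine's own label.
            machine_labelled.append((item, label))
        else:
            # Active-learning step: spend human effort only where needed.
            human_labelled.append((item, ask_human(item)))
    return machine_labelled, human_labelled


# Hypothetical stand-ins for a trained model and a human annotator.
def toy_predict(utterance):
    scores = {
        "utt1": ("anger", 0.97),
        "utt2": ("neutral", 0.55),
        "utt3": ("sadness", 0.92),
    }
    return scores[utterance]


def toy_annotator(utterance):
    return "happiness"


auto, manual = cooperative_labelling(
    ["utt1", "utt2", "utt3"], toy_predict, toy_annotator
)
print(auto)    # [('utt1', 'anger'), ('utt3', 'sadness')]
print(manual)  # [('utt2', 'happiness')]
```

In a full loop, both lists would be added to the training set and the model retrained, so that fewer items fall below the threshold in the next round.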
Ongoing Trends
3. Weakly Supervised Learning
ACTIVE LEARNING
Challenges!
• Robustness across various cultures and languages.
• Recognizing ‘sarcasm’ or ‘irony’ in human speech is another challenge.
These are complex emotional states. Use information
from other modalities?
• Moonshot challenge: make the speech analyser itself identify
the true/inner emotion without relying on human raters’
assessments.
• Overall goal: improve the robustness, accuracy, and cultural sensitivity of SER systems.
Potential Applications
Speech emotion recognition (SER) technology has numerous potential
applications across various industries and domains. Some of the most
common uses of SER include:
• Healthcare: SER can be used to monitor patients' emotional states
and responses to treatment, especially in mental health and
neurological disorders. It can also be used to help diagnose certain
conditions, such as depression or anxiety.
• Security: SER can be used in security settings to detect stress and
deception in speech signals, which can help in identifying potential
threats or criminal activities.
Overall, the use of SER is still in its early stages, and there is a lot of potential for
further research and development in this field.
Conclusion
• The journey from the first patent in 1978 to the first end-to-end
learning model has been quite eventful, and we are still
learning.
• We are currently witnessing exciting changes, such as
learning features directly from data and holistic models.
• All this research may soon lead to applications
used by the masses across domains such as
security and health.
Bibliography
1. https://dl.acm.org/doi/10.1145/3395035.3425255
2. https://cacm.acm.org/magazines/2018/5/227191-speech-emotionrecognition/abstract
3. https://ieeexplore.ieee.org/document/9774103
4. https://ieeexplore.ieee.org/document/8925524
THANK YOU.