Speech Emotion Recognition using Multilayer
Perceptron Classifier
Chinmay Vaishnav and Hemant A. Patil
Dhirubhai Ambani Institute of Information and Communication Technology (DAIICT), Gujarat, India
202101157@daiict.ac.in and hemant_patil@daiict.ac.in
Abstract – Emotions play a vital role in human mental life; they are a medium for expressing one's perspective or mental state to others. Speech Emotion Recognition (SER) can be defined as the extraction of the emotional state of a speaker from his or her speech signal. SER as a Machine Learning (ML) problem continues to garner significant research interest, especially in the affective computing domain, due to its growing potential, algorithmic advancements and applications in real-world scenarios. A few universal emotions, such as Neutral, Anger, Happiness and Sadness, can be identified or synthesized as required by any intelligent system with finite computational resources, given suitable training. In this paper, a Speech Emotion Recognition method is used to recognize emotions from the RAVDESS dataset. The emotion extraction is based on speech features such as Mel-Frequency Cepstral Coefficients (MFCC), chroma and the mel spectrogram.
Index Terms – SER, Multilayer Perceptron Classifier, MFCC, Chroma, Mel Spectrogram.
I. INTRODUCTION
Speech emotion recognition (SER), a sub-discipline of
affective computing, has been around for over two
decades and has led to considerable research. It involves
recognising the emotional aspects of speech irrespective
of the semantic content. A typical SER system can be
considered a collection of methodologies that isolate,
extract and classify speech signals to detect emotions
embedded in them (Akçay & Oguz, 2020). SER has countless real-world use cases, and several have demonstrated that including emotional attributes in human-machine interaction can significantly improve the interaction experience of users.
For example, a SER system can evaluate call centre
agents’ performance by detecting customer emotions
such as anger or happiness. This information can help companies improve their service quality or provide targeted training, thereby improving customer satisfaction.
SER has become an important building block for
many smart service systems in areas such as
healthcare, smart homes, and smart entertainment.
Emergency call centres can use speech-emotion
analysis to identify hazardous circumstances. It
could also be used by an interactive voice response
system in a car to prevent accidents due to fatigued
drivers. In clinical settings, SER could promote
mental health or be used to support mental health
diagnosis. For online education services, SER is a
valuable tool, allowing teachers to assess the degree
to which students have mastered new skills by
analyzing the emotional content of their responses.
This can be used to fine-tune the teaching plan and
optimize the teaching experience.
One of the more daunting tasks of SER is to identify
and extract information from speech that is most
suitable for computational identification and
discrimination of emotion. Linguistic features refer to the qualitative patterns in human articulation, such as content and context, while paralinguistic features quantitatively describe variations in the pronunciation of those linguistic patterns. They include prosodic features, such as pitch, energy and duration, and spectral features, such as Linear Prediction Coefficients (LPC) and Mel-Frequency Cepstral Coefficients (MFCC). Moreover, speech signals can
also be represented by more visually direct forms,
such as time-frequency spectrograms.
Several SER studies have investigated the
connection between human emotions and
prosodic/spectral acoustic parameters in speech.
Fig 1. An overview of speech emotion recognition systems
More recently, advances in digital signal processing, improvements in human-machine interaction and rapid progress in Machine Learning (ML) (Zhang et al., 2020) have significantly increased the use of ML techniques for the identification of emotions. Given this emphasis on ML and the growing number of studies that have been conducted, ML-based approaches deserve particular attention. These studies mainly accomplish SER tasks using ML pipelines that include isolation of the speech signal, dimensionality reduction, speech feature extraction and emotion classification based on the extracted features.
II. PROPOSED METHODOLOGY
In this section, the dataset used, the features extracted, the classifier model and the workflow for obtaining emotions from the dataset are described.
A. Dataset Description
The dataset used in this paper is RAVDESS, the Ryerson Audio-Visual Database of Emotional Speech and Song. The full database consists of 7356 files; this work uses its speech audio portion, distributed as .wav files. It covers 24 actors (12 male, 12 female) and includes calm, happy, sad, angry, fearful, surprised and disgusted expressions, in addition to neutral. Each filename consists of a seven-part numerical identifier that encodes modality, vocal channel, emotion, emotional intensity, statement, repetition and actor.
The audio files were each rated about ten times by a total of 247 raters to verify the correctness of the emotion labels. The emotions are labelled from 01 to 08, and the actors are numbered from 01 to 24, with odd numbers for male actors and even numbers for female actors.
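As a concrete illustration, the emotion label can be read directly from the third field of a RAVDESS filename. A minimal sketch follows (the helper name is our own; the emotion codes match Table 1 below):

```python
# Sketch: parse the emotion from a RAVDESS filename. Filenames have
# seven two-digit fields separated by '-':
# modality-vocalchannel-emotion-intensity-statement-repetition-actor.wav
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(filename: str) -> str:
    """Return the emotion label encoded in a RAVDESS filename."""
    parts = filename.split(".")[0].split("-")
    return EMOTIONS[parts[2]]

print(emotion_from_filename("03-01-05-01-02-01-12.wav"))  # -> "angry"
```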
B. Feature Extraction
The source audio files may be speech, song or video files with sound. To recognize emotions from these files, certain features must first be extracted to enable analysis of the dataset. The features considered in this paper are MFCC, chroma and the mel spectrogram.
• MFCC
Mel-Frequency Cepstral Coefficients (MFCC) are among the most widely used features for speech analysis. They are obtained by splitting the audio signal into short overlapping frames, applying a fast Fourier transform to each frame, mapping the resulting spectrum onto the mel scale with a filter bank, taking the logarithm of the filter-bank energies and finally applying a discrete cosine transform.
• Chroma
Chroma features are based on the 12 pitch classes of music: the short-time spectrum is projected onto 12 bins representing these classes. Two related descriptors are the chroma vector and the chroma deviation.
• Mel
The mel scale relates the perceived pitch of a pure tone to its actual measured frequency. Humans are considerably better at perceiving small changes in pitch at low frequencies than at high frequencies. Incorporating this scale makes the extracted features correspond more closely to what humans actually hear.
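For reference, a widely used conversion from frequency f (in Hz) to mels is

m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)

so that equal steps in m correspond roughly to equal perceived steps in pitch.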
C. MLP Classifier
The Multilayer Perceptron (MLP) classifier is a class of feed-forward Artificial Neural Network (ANN) and is shown in the following figure. The term is applied loosely to any feed-forward ANN and strictly to networks composed of multiple layers of perceptrons. The input features are accepted by the input layer, passed through several hidden layers, and the classification output is finally produced at the output layer. The MLP classifier is trained with a supervised learning technique called backpropagation. It can discriminate data that are not linearly separable and can separate classes using more complex decision boundaries.
Fig 2. Multilayer Perceptron
D. Workflow
The workflow of the proposed research work is shown in Figure 3.
III. IMPLEMENTATION
A. Load Dataset
In this paper, the RAVDESS dataset is first downloaded from Kaggle and extracted. The extracted archive contains 24 actor folders with 1440 audio files in total.
B. Label Emotions
The RAVDESS dataset includes eight universal emotions: neutral, calm, happy, sad, angry, fearful, disgust and surprised, as listed in Table 1.
Label    Emotion
01       Neutral
02       Calm
03       Happy
04       Sad
05       Angry
06       Fearful
07       Disgust
08       Surprised
Table 1. Emotion Labels
From the RAVDESS dataset, only the sad, angry and happy emotions are considered in this work. The dataset contains 192 audio files for each of these three emotions, so features and emotion labels are extracted from 576 files in total. This filtering step is sketched below.
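A minimal sketch of the filtering step follows; the dataset path is an assumption for illustration, and emotion_from_filename is the parser sketched in Section II-A:

```python
import glob
import os

# Sketch: collect only the happy, sad and angry files from the
# extracted RAVDESS actor folders (the path is an assumption).
TARGET_EMOTIONS = {"happy", "sad", "angry"}

files = []
for path in glob.glob("ravdess/Actor_*/*.wav"):
    emotion = emotion_from_filename(os.path.basename(path))
    if emotion in TARGET_EMOTIONS:
        files.append((path, emotion))

print(len(files))  # 576 files expected (192 per emotion)
```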
C. Feature Extraction using MFCC, Chroma and Mel
In this paper, feature extraction is mainly based on Mel-Frequency Cepstral Coefficients (MFCC), chroma and the mel spectrogram. These features are extracted using the built-in functions available in librosa and stored in arrays for classification and analysis. For each file, 45 features and 1 emotion label are obtained.
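A sketch of such an extraction function is given below. It uses standard librosa calls (librosa.feature.mfcc, chroma_stft and melspectrogram); the split of coefficients is an illustrative assumption, since the paper only states the total of 45 features per file:

```python
import librosa
import numpy as np

def extract_features(path: str) -> np.ndarray:
    """Sketch: extract MFCC, chroma and mel-spectrogram features,
    averaged over time, from one audio file."""
    signal, sr = librosa.load(path, sr=None)
    mfcc = np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=signal, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=20), axis=1)
    # 13 MFCCs + 12 chroma bins + 20 mel bands = 45 features
    # (an assumed composition; the paper does not specify the split).
    return np.concatenate([mfcc, chroma, mel])
```

Applying this function to the file list gathered above would yield the 576-by-45 feature matrix used for classification.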
Fig 3. Workflow of SER using MLP Classifier
Fig 4. Waveplots and Spectrograms for sad, angry and happy emotions
D. Classification using MLP Classifier
The MLP classifier is well suited to datasets with complex structure when compared to other models such as SVM and KNN. It is available as a built-in classifier model in the scikit-learn library, which contains many built-in classifier models. The output of the feature extraction module is given as input to the MLP classifier. For classification, the dataset is first split into training and test sets: 20% of the original dataset is used for testing and the remainder for training.
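A minimal sketch of this split-and-train step with scikit-learn follows; X and y denote the feature matrix and labels assembled above, and the hyperparameters are assumptions, since the paper does not report them:

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 20% of the data is held out for testing, matching the paper's split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Hidden layer size, regularization and iteration count are assumptions.
model = MLPClassifier(hidden_layer_sizes=(300,), alpha=0.01,
                      batch_size=32, max_iter=500)
model.fit(X_train, y_train)
```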
E. Evaluation Metrics
The selection of appropriate evaluation metrics is very important for a proper understanding of a model. Accuracy, precision, recall and F1 score are the metrics used to evaluate the constructed speech emotion recognition model.
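These metrics can be computed directly with scikit-learn; a brief sketch, continuing the variables from the training sketch above:

```python
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
# Per-emotion precision, recall and F1 score:
print(classification_report(y_test, y_pred))
```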
IV. RESULT ANALYSIS
In this paper, the emotions considered are happy, sad and angry, and result analysis is performed for these three emotions using the evaluation metrics. The model obtained an accuracy of about 75.97% on the test dataset and 98.8% on the training dataset. The other evaluation metrics, namely precision, recall and F1 score, were computed for the three emotions separately on both the test and training datasets.
Among the metrics calculated for the three emotions on the test dataset, angry has the highest precision of 100%, sad has the highest recall of 100%, and angry has the highest F1 score of 98%. On the training dataset, happy and angry share the highest precision of 100%, sad has the highest recall of 100%, and happy and angry share the highest F1 score of 99%. Comparing across the evaluation metrics, the angry emotion scores highest on the test dataset, while the happy emotion scores highest on the training dataset.
Fig 5. Accuracy of Test and Train Data
Fig 6. MLP Model Loss Iteration
V. CONCLUSION
This MLP model obtained a high accuracy of 98.8% on the training dataset and 75.97% on the test dataset. The accuracy may vary with the number of iterations used in the MLP classifier.
REFERENCES
[1] Akçay, M. B., & Oguz, K. (2020). Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication, 116, 56-76.
[2] Zhang, J., Yin, Z., Chen, P., & Nichele, S. (2020). Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Information Fusion, 59, 103-126.
[3] Sheikhan, M., Bejani, M., & Gharavian, D. (2013). Modular neural-SVM scheme for speech emotion recognition using ANOVA feature selection method. Neural Computing and Applications, 23(1), 215-227.
[4] Palo, H. K., Mohanty, M. N., & Chandra, M. (2015). Use of different features for emotion recognition using MLP network. In Computational Vision and Robotics (pp. 7-15). Springer, New Delhi.
[5] Farooq, M., Hussain, F., Baloch, N. K., Raja, F. R., Yu, H., & Zikria, Y. B. (2020). Impact of feature selection algorithm on speech emotion recognition using deep convolutional neural network. Sensors, 20(21), 6008.
[6] Urbano Romeu, Á. (2016). Emotion recognition based on the speech, using a Naive Bayes classifier (Bachelor's thesis, Universitat Politècnica de Catalunya).
[7] Sun, L., Zou, B., Fu, S., Chen, J., & Wang, F. (2019). Speech emotion recognition based on DNN decision tree SVM model. Speech Communication, 115, 29-37.