Speech Emotion Recognition using Multilayer Perceptron Classifier

Chinmay Vaishnav and Hemant A. Patil
Dhirubhai Ambani Institute of Information and Communication Technology (DAIICT), Gujarat, India
202101157@daiict.ac.in and hemant_patil@daiict.ac.in

Abstract – Emotions play a vital role in human mental life; they are a medium for expressing one's perspective or mental state to others. Speech Emotion Recognition (SER) can be defined as the extraction of the emotional state of a speaker from his or her speech signal. As a Machine Learning (ML) problem, SER continues to attract significant research interest, especially in the affective computing domain, owing to its growing potential, algorithmic advances, and real-world applications. There are a few universal emotions, such as neutral, anger, happiness, and sadness, which any intelligent system with finite computational resources can be trained to identify or synthesize as required. In this paper, a speech emotion recognition method is applied to the RAVDESS dataset. Emotion recognition is based on speech features such as Mel-frequency cepstral coefficients (MFCC), chroma, and Mel spectrogram.

Index Terms – SER, Multilayer Perceptron Classifier, MFCC, Chroma, Mel.

I. INTRODUCTION

Speech emotion recognition (SER), a sub-discipline of affective computing, has been studied for over two decades and has generated considerable research. It involves recognising the emotional aspects of speech irrespective of the semantic content. A typical SER system can be considered a collection of methodologies that isolate, extract, and classify speech signals to detect the emotions embedded in them (Akçay & Oguz, 2020). The use cases of SER in real-world applications are countless, and several have demonstrated that including emotional attributes in human-machine interaction can significantly improve the user experience. For example, an SER system can evaluate call-centre agents' performance by detecting customer emotions such as anger or happiness; this information can help companies improve service quality or provide targeted training, leading to better customer satisfaction. SER has become an important building block of many smart service systems in areas such as healthcare, smart homes, and smart entertainment. Emergency call centres can use speech-emotion analysis to identify hazardous circumstances, and an in-car interactive voice response system could use it to prevent accidents caused by fatigued drivers. In clinical settings, SER could promote mental health or support mental-health diagnosis. For online education services, SER is a valuable tool, allowing teachers to assess how well students have mastered new skills by analysing the emotional content of their responses; this can be used to fine-tune the teaching plan and optimise the teaching experience.

One of the more daunting tasks in SER is to identify and extract the information in speech that is most suitable for computational identification and discrimination of emotion. Linguistic features refer to the qualitative patterns in human articulation, such as content and context, while para-linguistic features quantitatively describe variations in the pronunciation of the linguistic patterns. The latter include acoustic features such as Linear Predictor Coefficients (LPC) and Mel-Frequency Cepstral Coefficients (MFCC).
Moreover, speech signals can also be represented in more visually direct forms, such as time-frequency spectrograms. Several SER studies have investigated the connection between human emotions and prosodic/spectral acoustic parameters in speech.

Fig. 1. An overview of speech emotion recognition systems.

More recently, advances in digital signal processing, improvements in human-machine interaction, and rapid progress in Machine Learning (ML) (Zhang et al., 2020) have significantly increased the use of ML techniques for emotion identification. Given this emphasis on ML and the growing number of studies, this work focuses on ML-based approaches. These studies mainly accomplish SER tasks with ML pipelines that include isolation of the speech signal, dimensionality reduction, extraction of speech features, and emotion classification based on the underlying features.

II. PROPOSED METHODOLOGY

This section describes the dataset, the features used for extraction, the classifier model, and the workflow for obtaining emotions from the dataset.

A. Dataset Description

The dataset used in this paper is the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The full database contains 7356 files; the speech audio-only portion used here consists of 1440 .wav files. It includes 24 actors (12 male, 12 female) and covers calm, happy, sad, angry, fearful, surprised, and disgusted expressions in addition to neutral speech. Each filename is a seven-part numerical code describing modality, vocal channel, emotion, emotional intensity, statement, repetition, and actor. The recordings were evaluated by 247 raters, each file being rated about ten times, to verify the correctness of the emotion labels. The emotions are labelled from 01 to 08, and the actor number indicates gender: odd numbers denote male actors and even numbers denote female actors.

B. Feature Extraction

Audio files can be speech, song, or video files with sound. To extract emotions from these files, certain features must first be extracted to allow analysis of the dataset. The features considered in this paper are MFCC, chroma, and Mel spectrogram; a feature-extraction sketch using librosa follows the descriptions below.

• MFCC: Mel Frequency Cepstral Coefficients (MFCC) are audio features widely used for feature extraction. They are obtained by breaking the signal into short overlapping frames, applying a fast Fourier transform to each frame, mapping the resulting spectrum onto the Mel scale, and taking the discrete cosine transform of the log Mel energies.

• Chroma: Chroma features are based on the twelve pitch classes. They are commonly described by two quantities, the chroma vector (a 12-element representation of spectral energy per pitch class) and the chroma deviation.

• Mel: The Mel scale relates the perceived pitch of a pure tone to its actual measured frequency. Humans are much better at discerning small changes in pitch at low frequencies than at high frequencies. Incorporating this scale makes the extracted features correspond more closely to what humans hear.
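The sketch below illustrates how such features could be pulled from a single .wav file with librosa. The function name extract_features, the choice of 40 MFCCs, the default 128 Mel bands, and the time-averaging of each feature matrix are illustrative assumptions rather than the exact configuration used in this paper (which reports 45 features per file); the parameters would need to be adjusted to match that dimensionality.

import numpy as np
import librosa

def extract_features(path):
    # Illustrative sketch: load the file at its native sampling rate.
    y, sr = librosa.load(path, sr=None)
    # 40 MFCCs per frame, averaged over time -> 40 values (n_mfcc is an assumption).
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    # 12 chroma coefficients per frame, averaged over time -> 12 values.
    stft = np.abs(librosa.stft(y))
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    # 128-band Mel spectrogram (librosa default), averaged over time -> 128 values.
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    # One fixed-length feature vector per file, ready for the classifier.
    return np.concatenate([mfcc, chroma, mel])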
C. MLP Classifier

The Multilayer Perceptron (MLP) classifier is a class of feed-forward Artificial Neural Network (ANN), shown in Fig. 2. The term is used loosely for any feed-forward ANN and strictly for a network composed of multiple layers of perceptrons. The input layer accepts the input features, which are passed through several hidden layers, and the classification output is obtained at the output layer. The MLP classifier is trained with a supervised learning technique called backpropagation. It can distinguish data that are not linearly separable and can separate the data using more complex decision boundaries.

Fig. 2. Multilayer Perceptron.

D. Workflow

The workflow of the proposed research work is shown in Fig. 3.

Fig. 3. Workflow of SER using the MLP classifier.

III. IMPLEMENTATION

A. Load Dataset

The RAVDESS dataset is first downloaded from Kaggle and extracted. The extracted data comprise 24 folders (one per actor) with 1440 audio files in total.

B. Label Emotions

The RAVDESS dataset includes the universal emotions neutral, calm, happy, sad, angry, fearful, disgust, and surprised, labelled as shown in Table 1.

Table 1. Emotion labels
Label   Emotion
01      Neutral
02      Calm
03      Happy
04      Sad
05      Angry
06      Fearful
07      Disgust
08      Surprised

In this work, only the sad, angry, and happy emotions are considered. The dataset contains 192 audio files for each of these emotions, so features and emotion labels are extracted from 576 files in total.

C. Feature Extraction using MFCC, Chroma and Mel

Feature extraction is performed using the MFCC, chroma, and Mel-spectrogram features. These features are extracted with the built-in functions available in librosa and stored in arrays for classification and analysis. For each file, 45 features and one emotion label are obtained.

Fig. 4. Waveplots and spectrograms for the sad, angry, and happy emotions.

D. Classification using MLP Classifier

Compared with models such as SVM and KNN, the MLP classifier is better suited to datasets with complex structure. The MLP classifier is available as a built-in model in the scikit-learn library, which provides many classifier models. The output of the feature extraction module is given as input to the MLP classifier. For classification, the dataset is first split into training and test sets: 20% of the original dataset is used for testing and the remainder for training.

E. Evaluation Metrics

The selection of appropriate evaluation metrics is very important for a proper understanding of a model. Accuracy, precision, recall, and F1 score are the metrics used to evaluate the constructed speech emotion recognition model; a combined training-and-evaluation sketch is given below.
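The following is a minimal training-and-evaluation sketch using scikit-learn's MLPClassifier. Here X and y are assumed to be the feature matrix and emotion labels produced by the feature-extraction step; the hidden-layer size, alpha, batch size, and iteration limit are illustrative assumptions, since the paper does not state its hyperparameters.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report

def train_and_evaluate(X, y):
    # Hold out 20% of the data for testing, as described above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=0, stratify=y)

    # Hyperparameters below are assumptions, not values reported in the paper.
    clf = MLPClassifier(hidden_layer_sizes=(300,), alpha=0.01,
                        batch_size=64, max_iter=500)
    clf.fit(X_train, y_train)

    # Report accuracy plus per-emotion precision, recall, and F1 on both splits.
    for name, X_part, y_part in (("train", X_train, y_train),
                                 ("test", X_test, y_test)):
        y_pred = clf.predict(X_part)
        print(f"{name} accuracy: {accuracy_score(y_part, y_pred):.4f}")
        print(classification_report(y_part, y_pred))
    return clf

classification_report prints the per-class precision, recall, and F1 score that are summarised in the result analysis below.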
IV. RESULT ANALYSIS

In this paper, the emotions considered are happy, sad, and angry, and the result analysis is performed over these three emotions using the evaluation metrics described above. On the RAVDESS dataset, the model obtains an accuracy of about 75.97% on the test set and 98.8% on the training set. Precision, recall, and F1 score were also computed for each of the three emotions on both the test and training sets. On the test set, angry has the highest precision (100%), sad has the highest recall (100%), and angry has the highest F1 score (98%). On the training set, happy and angry share the highest precision (100%), sad has the highest recall (100%), and happy and angry share the highest F1 score (99%). Comparing the metrics, the angry emotion scores highest on the test set and the happy emotion scores highest on the training set.

Fig. 5. Accuracy on the test and training data.

Fig. 6. MLP model loss per iteration.

V. CONCLUSION

In this paper, the happy, sad, and angry emotions of the RAVDESS dataset were classified with an MLP classifier using MFCC, chroma, and Mel-spectrogram features. The MLP model obtained a high accuracy of 98.8% on the training set and 75.97% on the test set. This accuracy may vary with the number of iterations of the MLP classifier.

REFERENCES

[1] Akçay, M. B., & Oguz, K. (2020). Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication, 116, 56–76.
[2] Zhang, J., Yin, Z., Chen, P., & Nichele, S. (2020). Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Information Fusion, 59, 103–126.
[3] Sheikhan, M., Bejani, M., & Gharavian, D. (2013). Modular neural-SVM scheme for speech emotion recognition using ANOVA feature selection method. Neural Computing and Applications, 23(1), 215–227.
[4] Palo, H. K., Mohanty, M. N., & Chandra, M. (2015). Use of different features for emotion recognition using MLP network. In Computational Vision and Robotics (pp. 7–15). Springer, New Delhi.
[5] Farooq, M., Hussain, F., Baloch, N. K., Raja, F. R., Yu, H., & Zikria, Y. B. (2020). Impact of feature selection algorithm on speech emotion recognition using deep convolutional neural network. Sensors, 20(21), 6008.
[6] Urbano Romeu, Á. (2016). Emotion recognition based on the speech, using a Naive Bayes classifier (Bachelor's thesis, Universitat Politècnica de Catalunya).
[7] Sun, L., Zou, B., Fu, S., Chen, J., & Wang, F. (2019). Speech emotion recognition based on DNN decision tree SVM model. Speech Communication, 115, 29–37.