Emotion Recognition Using Voice

BSCS19032 – Muhammad Haider Mustafa
BSCS19049 – Ahmed Sheraz
BSCS19062 – Muhammad Ali
BSCS19077 – Syed Muhammad Hamza Gilani

Motivation
• Extract enough information from raw audio data to train models.
• The models will be trained to recognize a variety of emotions from speech.
• Such models can be used to detect abuse in online multiplayer games.
• They can also serve as a tool for administering psychiatric aid by detecting distress calls from patients.

Problem Statement
Every audio feature exhibits its own characteristic, and all of them are relevant to speech recognition systems. However, it is crucial that the goal the system intends to achieve dictate our choice of features, because an unbounded or poor choice of features results in problems such as longer training times or overfitting.

Introduction
• Features that are not significantly affected by emotion changes can dominate the features that are, leading to inaccurate predictions.
• Human emotions are generally classified through disputed definitions in psychology.
• Feature selection is therefore commonly done through experimental means.
• By choosing well-crafted features, geared towards distinguishing emotions, the system becomes more precise in its predictions.

Data Sets
• RAVDESS: 26 GB total (500 MB used)
• TESS: 280 MB (280 MB used)

Libraries
• Librosa: to extract speech features – chroma, Mel frequency, spectral centroid, p-th-order polynomial features, spectral bandwidth, spectral contrast, spectral flatness, roll-off frequency, etc.
• SoundFile: to read the audio files.
• Pickle: to save the model after training.
• scikit-learn: to split the data into training and test sets and to report the accuracy of the ML models used.
• NumPy: mathematical functions for creating and manipulating arrays and matrices.

Feature Selection
• Choosing a subset of audio features.
• There is no consensus on the definition of emotion.

Emotions and Speech Features
• There is no generally accepted set of features.
• The specific subset of features varies with the intention of the model.

Collecting Audio Features Digitally
• Time domain: captured using analogue-to-digital converters.
• Frequency domain: obtained by applying a Fourier transform to the time-domain signal.

Categories of Features
⬢ Prosodic features – intonation, rhythm
⬢ Voice quality features – shimmer, jitter
⬢ Spectral features – vocal tract
A combination of them provides better accuracy.

Prosodic and Voice Quality Features
• Captured through energy levels, pitch, or amplitude in the time domain.
• Voice quality is a physical characteristic of the vocal tract.

Spectral Features
• Usually computed in the Mel-frequency domain.
[Figure: Mel-frequency spectrum – example bands around 200–300 Hz and 1000–1200 Hz]
• Mel Frequency Cepstral Coefficients (MFCCs).

Features Used in Our Models
• Prosodic and voice quality: HNR, jitter, shimmer
• Spectral: chroma, zero-crossing rate, MFCC

Classifiers
Supervised model training with the following classifiers:
• Gaussian Naïve Bayes
• Decision Tree Classifier
• K-Neighbors Classifier
• Random Forest Classifier
• Multi-layer Perceptron Classifier
• Support Vector Machines

Gaussian Naïve Bayes Classifier
• A model based on naïve Bayes prediction.
• Uses the Gaussian normal distribution and supports continuous data.
• Since it uses a Gaussian methodology, it can be used in a probabilistic way.
• Each feature is assumed to make an independent and similar contribution to the problem.
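A minimal sketch of how such a Gaussian Naïve Bayes baseline can be wired up with scikit-learn. The feature matrix and labels below are random placeholders standing in for the extracted audio features:

```python
# Gaussian Naive Bayes baseline sketch (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))      # stand-in for the 6 extracted features
y = rng.integers(0, 4, size=200)   # stand-in for emotion labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = GaussianNB()               # fits one Gaussian per feature per class
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```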
Results
• Accuracy: 55.83%.
• The model did not perform well; however, it highlighted several potential problems.

Decision Tree Classifier
• A predictive modeling technique in which each leaf is a separate class.
• Multi-variability is dealt with easily, and classification is easy for multiple features of input data.
• Can be used for both classification and regression.

Parameters
• criterion = entropy: splits depend on information gain.
• splitter = best: the best split strategy is used at each node.

Accuracy: 83.13%

Observation: the validation score starts to increase right around 1,800 samples. However, the graph suggests that with more samples the validation score will start to decline, so overfitting may occur.

Observation: compared to the previous model there is a huge improvement; however, the sad emotion still has low precision compared to the other emotions, and the happy emotion is confused with the sad and angry emotions.

K-Neighbors Classifier
• A non-parametric algorithm whose output depends on whether k-NN is used for regression or classification.
• The classifier is distance-dependent, and weights are assigned according to each neighbor's contribution.
• Widely used for classification, with low calculation time.
• Gives better accuracy than the decision tree for our model.

Parameters
• n_neighbors = 5: number of neighbors; 5 gave the highest accuracy score.
• weights = distance: weight function used in prediction; the closest neighbors of a query point are the most influential.

Accuracy: 90.67%

Observation: more than roughly 1,500 samples are required to narrow the gap between training and validation scores and reach a well-fit model.

Observation: the precision for sadness has improved significantly compared to the decision tree classifier, but there is still a small chance that sadness is confused with the neutral emotion.

Random Forest Classifier
• An ensemble of multiple decision trees: each tree outputs a class, and the class with the most votes is picked.
• It is a computationally expensive but very effective algorithm.

Observation: according to the learning curve, the model performed perfectly on the training samples, while the validation accuracy only increased after about 1,200 samples; the underfitting of the model was therefore greatly reduced. From the trajectory of the validation score it can be inferred that more samples would lead to better scores.

Conclusion: the random forest classifier performed much better than the decision tree and KNN. It succeeded in increasing the precision and recall for the sad and happy emotions.

Multi-Layer Perceptron Classifier
• MLP is a class of feed-forward artificial neural networks.
• It uses a non-linear activation function and is trained with backpropagation: learning occurs by adjusting the weights after each pass.
• In all of the learning-curve graphs, there is a slight gap between the training score and the validation score, showing that our model is slightly overfitting.
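The regularization strength alpha is the usual knob against this kind of gap. A minimal sketch of the alpha grid search discussed on the next slide, with placeholder data and an illustrative layer size:

```python
# Grid search over the MLP's alpha (L2 penalty) sketch (placeholder data).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))      # stand-in feature matrix
y = rng.integers(0, 4, size=200)   # stand-in emotion labels

mlp = MLPClassifier(hidden_layer_sizes=(100,),  # illustrative layer size
                    activation="relu",          # non-linear activation
                    max_iter=500, random_state=0)

# Cross-validated search over the three candidate alpha values.
search = GridSearchCV(mlp, param_grid={"alpha": [0.1, 0.01, 0.001]}, cv=5)
search.fit(X, y)
print("best alpha:", search.best_params_["alpha"])
```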
We will choose three alpha values – 0.1, 0.01, and 0.001, one from each graph – and use them in a grid search to find the optimal one. Since the validation accuracy increases around each of these values, we can be confident that one of them is an appropriate choice.

Training: the error decreases over the course of training, so the model is learning correctly.

Results

Support Vector Machines
• SVM is a very robust learning model.
• It performs non-probabilistic binary linear classification (except when methods such as Platt scaling, which transforms a classifier's outputs into a probability distribution, are applied).
• Data are classified using linear separators, under the assumption that the problem at hand is a binary classification problem.
• Through kernels, the linear classifier can effectively operate in infinite-dimensional feature spaces.

Training and Result

Conclusion
• We have concluded that spectral features alone make for a weaker learning model.
• We have found that the combination of the best six features (three spectral and three prosodic) gives much better accuracy; a feature-extraction sketch follows below.
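To make the conclusion concrete, here is a minimal sketch of extracting the three spectral features used above with librosa, averaged over time into one fixed-length vector per clip. The file path is a placeholder, and note that jitter, shimmer, and HNR are not provided by librosa (they are typically computed with Praat-based tools):

```python
# Spectral feature extraction sketch with librosa (placeholder path).
import numpy as np
import librosa

def spectral_features(path):
    y, sr = librosa.load(path, sr=None)                 # keep native sample rate
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)    # (12, frames)
    zcr = librosa.feature.zero_crossing_rate(y)         # (1, frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, frames)
    # Mean over frames yields one fixed-length vector per recording.
    return np.hstack([m.mean(axis=1) for m in (chroma, zcr, mfcc)])

features = spectral_features("example.wav")  # placeholder path
print(features.shape)                        # (26,)
```

Thank You for watching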