
Emotion recognition using speech

BSCS19032 – Muhammad Haider Mustafa
BSCS19049 – Ahmed Sheraz
BSCS19062 – Muhammad Ali
BSCS19077 – Syed Muhammad Hamza Gilani
Problem Statement
Extract enough information from raw audio data to train models that can recognize a variety of emotions from speech.

Motivation
• Detecting abuse in online multiplayer games.
• A tool to help administer psychiatric aid by detecting distress in troubled patients' speech.
Introduction
Every audio feature exhibits its own characteristics, all of which are relevant in the field of speech recognition systems. However, it is crucial that the desired goal of the system dictate our choice of features, because an unbounded or poor choice of features results in problems such as longer training times or overfitting.
Features that are not significantly affected by emotion changes might dominate the ones that are, effectively leading to inaccurate predictions.
Human emotions are generally classified through disputed definitions in psychology, so the selection of features is commonly made through experimental means.
By choosing carefully crafted features, specifically those geared towards classifying different emotions, the system becomes more precise in its predictions.
Data Sets
• RAVDESS: 26 GB total (500 MB used)
• TESS: 280 MB (280 MB used)
Libraries
• Librosa: to extract speech features – chroma, Mel frequency, spectral centroid, p'th-order polynomial features, spectral bandwidth, spectral contrast, spectral flatness, roll-off frequency, etc.
• SoundFile: to read the audio files.
• Pickle: to save the model after training.
• scikit-learn: to split the data into training and test sets and report the accuracy of the ML models used.
• NumPy: mathematical functions for creating and manipulating arrays and matrices.
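A minimal sketch of how these libraries fit together; the file name and the particular features extracted here are illustrative assumptions, not the project's exact pipeline:

```python
import librosa
import numpy as np
import soundfile as sf

# Hypothetical path; any mono clip from RAVDESS or TESS would work here.
AUDIO_PATH = "speech_sample.wav"

# Read the raw waveform with SoundFile, then hand it to librosa.
signal, sample_rate = sf.read(AUDIO_PATH)

# A few of the spectral features listed above, each averaged over time
# so that every clip yields a fixed-length vector.
mfcc = np.mean(librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=40), axis=1)
chroma = np.mean(librosa.feature.chroma_stft(y=signal, sr=sample_rate), axis=1)
centroid = np.mean(librosa.feature.spectral_centroid(y=signal, sr=sample_rate), axis=1)

feature_vector = np.concatenate([mfcc, chroma, centroid])
print(feature_vector.shape)  # (40 + 12 + 1,) = (53,)
```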
Feature Selection
Choosing a subset of audio features.

Emotions and Speech Features
• No consensus on the definition of emotion.
• No generally accepted set of features.
• The specific subset of features varies with the intention of the model.
Collecting Audio Features Digitally
• Time domain: using analogue-to-digital converters.
• Frequency domain: applying a Fourier transform to time-domain signals.
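As a small illustration of the time-to-frequency step (a generic NumPy sketch, not code from the project):

```python
import numpy as np

sample_rate = 16000                       # samples per second from the ADC
t = np.arange(sample_rate) / sample_rate  # one second of time axis

# A synthetic "voice-like" signal: a 220 Hz tone plus a quieter 440 Hz overtone.
signal = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)

# Fourier transform: time-domain samples -> frequency-domain magnitudes.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

print(freqs[spectrum.argmax()])  # ~220.0 Hz, the dominant frequency
```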
Categories of Features
• Prosodic features: intonation, rhythm
• Voice quality features: shimmer, jitter
• Spectral features: vocal tract
A combination of them provides better accuracy.
Prosodic and Voice Quality Features
• Captured through energy levels, pitch, or amplitude in the time domain.
• Voice quality is a physical characteristic of the vocal tract.
Spectral Features
Usually in the Mel-frequency domain.

Mel-frequency Spectrum
[Figure: Mel-frequency spectrum, contrasting frequency bands around 200–300 Hz and 1000–1200 Hz]

Mel Frequency Cepstral Coefficients (MFCC)
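The Mel scale spaces frequencies by perceived pitch rather than raw Hz, and MFCCs summarize a clip's Mel-frequency spectrum. A small sketch using librosa's built-in conversion (the synthetic tone and coefficient count are our choices):

```python
import librosa
import numpy as np

# Standard HTK Mel conversion: mel = 2595 * log10(1 + f / 700)
for hz in [200, 300, 1000, 1200]:
    print(hz, "Hz ->", round(float(librosa.hz_to_mel(hz, htk=True)), 1), "mel")

# MFCCs on a synthetic 440 Hz tone, just to show the call shape.
sr = 22050
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
print(mfcc.shape)  # (40, n_frames)
```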
Features used in our models
Prosodic and Voice Quality:
• HNR
• Jitter
• Shimmer
Spectral Features:
• Chroma
• Zero Crossing Rate
• MFCC
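Librosa covers the spectral side, but it does not compute HNR, jitter, or shimmer; a common choice for those is Praat via the praat-parselmouth package. The sketch below assumes that package and a hypothetical file name, using the commonly cited Praat parameter values rather than values from the project:

```python
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("speech_sample.wav")  # hypothetical clip

# Harmonics-to-noise ratio (HNR), averaged over the clip.
harmonicity = snd.to_harmonicity()
hnr = call(harmonicity, "Get mean", 0, 0)

# Jitter and shimmer are computed from a pitch-based point process.
point_process = call(snd, "To PointProcess (periodic, cc)", 75, 500)
jitter = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
shimmer = call([snd, point_process], "Get shimmer (local)",
               0, 0, 0.0001, 0.02, 1.3, 1.6)

print(hnr, jitter, shimmer)
```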
Classifiers
Supervised model training using:
• Gaussian Naïve Bayes
• Decision Tree Classifier
• K-Neighbors Classifier
• Random Forest Classifier
• Multi-Layer Perceptron Classifier
• Support Vector Machines
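All six models follow the same scikit-learn train/evaluate pattern, with Pickle used to save the fitted model as noted under Libraries. A generic sketch (random data stands in for the real feature vectors and emotion labels):

```python
import pickle

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Stand-in data: 1000 clips x 53 features, 4 emotion classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 53))
y = rng.integers(0, 4, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = GaussianNB()  # any of the six classifiers slots in here
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))

# Save the trained model for later use.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```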
Gaussian Naïve Bayes Classifier
Uses a Gaussian normal distribution and supports continuous data. Since it uses a Gaussian methodology, it can be used in a probabilistic way. Each feature is assumed to make an independent, similar contribution to the problem.
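Concretely, for each class the classifier fits a per-feature Gaussian and combines them with Bayes' rule; the standard per-feature likelihood is:

```latex
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}
  \exp\!\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)
```

where \mu_y and \sigma_y^2 are the mean and variance of feature x_i over the training samples of class y.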
Gaussian Naïve Bayes
• A model based on naïve Bayes prediction. It did not perform well; however, it highlighted several potential problems.
Results: Accuracy 55.83%
Decision Tree Classifier
A predictive modeling technique in which each leaf is a separate class. Multi-variability is dealt with easily, and classification is straightforward for multiple features of data input. Used for both classification and regression.

Parameters
• entropy: the split depends upon information gain.
• best: the best-split strategy is used at each node.
Accuracy: 83.13%
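In scikit-learn these parameters correspond to criterion="entropy" and splitter="best". A self-contained sketch with synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the extracted feature vectors (4 emotion classes).
X, y = make_classification(n_samples=2000, n_features=53, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Information-gain splits with the best-split strategy, as described above.
tree = DecisionTreeClassifier(criterion="entropy", splitter="best",
                              random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))
```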
Observation:
The validation score starts to increase right around 1800 samples. However, from the graph it can be inferred that with more samples the validation score will start to decline, so it is quite possible that overfitting may occur.

Observation:
Compared to the previous model there is a huge improvement; however, the sad emotion still has low precision compared to the other emotions. It can also be seen that the happy emotion is confused with the sad and angry emotions.
K-Neighbors Classifier
A non-parametric classification algorithm whose output depends on whether k-NN is used for regression or classification. The classifier is distance-dependent, and weights are assigned according to the neighbors' contributions.
K-Neighbors Classifier
• Widely used for classification.
• Low calculation time.
• Results in better accuracy than the decision tree for our model.

Parameters
• n_neighbors: the number of neighbors; we used 5 because of its high accuracy score.
• weights: the weight function used in prediction; we used "distance", which makes the closest neighbors of a query point more influential.
Accuracy: 90.67%
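Those two parameters map directly onto scikit-learn's KNeighborsClassifier (again with synthetic stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=53, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 5 neighbors with distance-weighted votes, as described above.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```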
Observation:
More samples (greater than 1500) are required to close the gap between the training and validation scores and reach a best-fit model.

Observation:
The precision for sadness has significantly improved compared to the decision tree classifier, but we can still see a small chance that sadness is confused with the neutral emotion.
Random Forest Classifier
It is made of multiple decision trees. Each tree outputs a class, and the class with the most votes is picked. It is a high-cost, computationally intensive algorithm.
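A sketch of the corresponding scikit-learn call; the tree count of 100 is the library default, an assumption rather than a value from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=53, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An ensemble of decision trees; the prediction is the majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```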
Observation:
According to the learning curve for the random forest, our model performed perfectly on the training samples; however, the accuracy on the validation samples only increased after 1200 samples, so the underfitting of our model was greatly decreased. From the trajectory of the validation score it can be inferred that more samples would lead to better scores.
Conclusion:
The random forest classifier performed much better than the decision tree and KNN. It was successful in increasing the precision and recall values for the sad and happy emotions.
Multi-Layer Perceptron (MLP) Classifier
MLP is a class of feed-forward artificial neural networks. It uses a non-linear activation function and is trained with the backpropagation technique: learning occurs by adjusting the weights via backpropagation. This neural network has quite a few tunable properties and parameters.
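A sketch of the scikit-learn MLPClassifier; the hidden-layer size, activation, and alpha here are illustrative assumptions, not the slides' final values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=53, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A feed-forward network trained with backpropagation.
mlp = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                    alpha=0.01, max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))
```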
Multi-Layer Perceptron Classifier
• In all the above graphs, the training score and validation score have a slight gap between them, showing that our model is slightly overfitting. We will choose three alpha values (0.1, 0.01, 0.001), one from each graph, and use them in a grid search to find the optimal one. Since the validation-score accuracy increases around each of these values, we can be confident that one of them is appropriate.
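That grid search maps to scikit-learn's GridSearchCV over the three alpha values (the fold count is our assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=53, n_informative=10,
                           n_classes=4, random_state=0)

# The three candidate regularization strengths named above.
params = {"alpha": [0.1, 0.01, 0.001]}
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0),
                      params, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```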
Training:
According to the above graph, the error is decreasing, so our model is training correctly.
Results:
Support Vector Machines (SVM)
SVM is a very robust learning model. It performs non-probabilistic binary linear classification, except when methods like Platt scaling (the transformation of a classification model's outputs into a probability distribution) are used. Data are classified using linear separators, after assuming the problem at hand is a binary classification problem. The linear classifier can be run in infinite dimensions.
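In scikit-learn, SVC with probability=True enables exactly this Platt scaling; a sketch with synthetic data (the RBF kernel is the library default, our assumption here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=53, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# probability=True fits Platt scaling on top of the SVM decision values.
svm = SVC(kernel="rbf", probability=True, random_state=0)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
print(svm.predict_proba(X_test[:1]))  # Platt-scaled class probabilities
```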
Support Vector Machines: Training and Result
Conclusion
We have concluded that spectral features alone make for a weaker learning model. We have found that the combination of the best six features (three spectral and three prosodic) gives a much better accuracy.
Thank You for watching