PROJECT REPORT
ON
Interactive Speech Emotion Recognition
Submitted By
161070007
Vallabh Sunil Masal
vsmasal_b16@ce.vjti.ac.in
Fourth Year B.Tech Computer
Academic Year 2020-2021
Under The Guidance Of
Prof. V. K. Sambhe
DEPARTMENT OF COMPUTER ENGINEERING
VEERMATA JIJABAI TECHNOLOGICAL INSTITUTE (VJTI, MUMBAI)
(AN AUTONOMOUS INSTITUTE AFFILIATED TO UNIVERSITY OF MUMBAI)
1. INTRODUCTION
1.1 Background
Speech Emotion Recognition, abbreviated as SER, is the task of recognizing human emotions and affective states from speech. It capitalizes on the fact that the voice often reflects underlying emotion through tone and pitch; this is also the cue that animals such as dogs and horses use to understand human emotion.
SER is hard because emotions are subjective and annotating audio is challenging.
1.2 Motivation
- Speech Emotion Recognition (SER) is one of the emerging fields in human-computer interaction. The quality of a human-computer interface that mimics human speech emotions relies heavily on the types of features used and on the classifier employed for recognition.
- Understanding emotions is essential in human social interactions. Studies suggest that only 10% of human life is completely unemotional.
- The aim of SER is to enable a very natural interaction with the computer by speaking instead of using traditional input devices, and to have the machine understand not only the verbal content but also more subtle cues, such as affect, that any human listener would easily react to. Interaction becomes more meaningful if machines can recognize this emotional content.
- Over the last decade, SER has shifted from a side issue to a major topic in human-computer interaction and speech processing, and it has potentially wide applications. For example, human-computer interfaces could be made to respond differently according to the emotional state of the user.
1.3 Objective
To build a model to recognize emotion from speech using the librosa and sklearn
libraries and the RAVDESS dataset.
1.4 Scope
The global emotion detection and recognition market size is projected to grow from
USD 21.6 billion in 2019 to USD 56.0 billion by 2024, at a Compound Annual Growth
Rate (CAGR) of 21.0% during the forecast period. Factors such as the rising need for
socially intelligent artificial agents, increasing demand for speech-based biometric
systems to enable multifactor authentication, technological advancements across the
globe, and growing need for high operational excellence are expected to work in favor of
the market in the near future.
1.4.1 Applications of SER in HCI

Speech Emotion Recognition (SER) has potentially wide applications in Human-Computer Interaction (HCI), especially where speech is the primary mode of interaction with the machine. For example, human-computer interfaces could be made to respond differently according to the emotional state of the user.

Therefore, SER is an essential requirement to make the following applications more human-like and to increase their acceptance among potential users:

(1) Robots (e.g., pet robots),
(2) Smart call centers,
(3) Intelligent spoken tutoring systems,
(4) Smart aircraft cockpits,
(5) Prosody for dialog systems,
(6) Diagnostic tools for therapists,
(7) Information retrieval for medical analysis,
(8) In-car board systems,
(9) Speech synthesis,
(10) Computer games,
(11) Sorting voice mail,
(12) Intelligent toys,
(13) Lie detection, etc.

Some examples follow.

In-car board system
An emotional speech recognition system in the car's on-board unit helps the human perform various tasks such as steering, accelerating, braking, keeping speed, following a route, keeping distance to other vehicles, dimming, operating windscreen wipers, changing gears, and blinking. The ability of a car to understand natural speech provides a human-like driver-assistance system. Emotional factors are decisive for enhanced safety and comfort while driving: emotions such as aggressiveness, anger, confusion, sadness, and sleepiness have a major impact in causing road accidents. These risks can be reduced by recognizing the driver's emotions and responding to them by adapting to the current situation.
Robots
Robots can interact with people and assist them in their daily routines in common places such as homes, supermarkets, hospitals, or offices. To accomplish these tasks, robots should recognize the emotions of humans so as to provide a friendly environment; without recognizing emotion, a robot cannot interact with a human in a natural way.
Computer games
Computer games can be controlled through the emotions in human speech. The computer recognizes the player's emotion from their speech and sets the level of the game (easy, medium, hard) accordingly. For example, if the speech sounds aggressive, the level becomes hard; if the player sounds too relaxed, the level becomes easy; the remaining emotions map to the medium level.
1.5 Methodology
We use the librosa library to analyze the sound files and then apply a neural-network (NN) classifier to assign an emotion to each sound.
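A minimal sketch of this workflow is shown below, assuming the MFCC-style features and the MLP classifier that Section 3 describes in full; the helper names (features, train_ser_model) and the data path are illustrative only, not the actual implementation.

import glob
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def features(path):
    # load the waveform at its native sample rate and summarise it with 40 MFCCs
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    return np.mean(mfcc.T, axis=0)          # average the MFCCs over time

def train_ser_model(wav_paths, labels):
    # one feature vector per file, one emotion label per file
    X = np.array([features(p) for p in wav_paths])
    clf = MLPClassifier(hidden_layer_sizes=(300,), max_iter=500)
    clf.fit(X, labels)
    return clf                              # clf.predict(...) assigns an emotion to new clips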
2. Related Works (Literature Survey)
Predicting Emotions Perceived from Sounds
Sonification is the science of communication of data and events to users through
sounds. Auditory icons, earcons, and speech are the common auditory display schemes
utilized in sonification, or more specifically in the use of audio to convey information.
Once the captured data are perceived, their meanings, and more importantly, intentions
can be interpreted more easily and thus can be employed as a complement to
visualization techniques. Through auditory perception it is possible to convey
information related to temporal, spatial, or some other context-oriented information.
An important research question is whether the emotions perceived from these auditory
icons or earcons are predictable in order to build an automated sonification platform.
This paper conducts an experiment in which several mainstream, conventional machine learning algorithms are developed to study the prediction of emotions perceived from sounds. To do so, the key features of the sounds are captured and then modeled with machine learning algorithms combined with feature-reduction techniques. The authors observe that it is possible to predict perceived emotions with high accuracy; in particular, regression based on Random Forest demonstrated its superiority compared to the other machine learning algorithms.
Link: https://arxiv.org/abs/2012.02643
3. Implementation
3.1 Dataset information
We’ve used the RAVDESS dataset; this is the Ryerson Audio-Visual Database of
Emotional Speech and Song dataset, and is free to download. This dataset has 7356 files
rated by 247 individuals 10 times on emotional validity, intensity, and genuineness. The
entire dataset is 24.8GB from 24 actors, but we’ve lowered the sample rate on all the
files so it takes up less space.
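As a rough illustration of how the files can be downsampled (a sketch, not the exact script we used; the 16 kHz target rate and the paths are assumptions):

import librosa
import soundfile

def downsample(in_path, out_path, target_sr=16000):
    # librosa.load resamples the audio to target_sr while reading it
    audio, sr = librosa.load(in_path, sr=target_sr)
    # write the resampled audio back out as a smaller WAV file
    soundfile.write(out_path, audio, target_sr)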
3.2 Dataset preprocessing
Preprocessing the data for the model occurred in five steps:
1. Train, test split the data
# stratify so each split keeps the same mix of emotions, genders and actors
train, test = train_test_split(df_combined, test_size=0.2, random_state=0,
                               stratify=df_combined[['emotion','gender','actor']])
X_train = train.iloc[:, 3:]                             # feature columns
y_train = train.iloc[:, :2].drop(columns=['gender'])    # keep only the emotion label
X_test = test.iloc[:, 3:]
y_test = test.iloc[:, :2].drop(columns=['gender'])
2. Normalize the data to improve model stability and performance.
mean = np.mean(X_train, axis=0)
std = np.std(X_train, axis=0)
# standardize both splits using the training-set statistics
X_train = (X_train - mean)/std
X_test = (X_test - mean)/std
3. Transform into arrays for Keras
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)
4. One-hot encoding of target variable
# the network requires numeric inputs and one-hot encoded outputs
lb = LabelEncoder()
y_train = to_categorical(lb.fit_transform(y_train))
y_test = to_categorical(lb.transform(y_test))   # reuse the encoder fitted on the training labels
5. Reshape the data into a 3D tensor of shape (samples, features, 1)
X_train = X_train[:,:,np.newaxis]
X_test = X_test[:,:,np.newaxis]
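For reference, a 1D-convolutional Keras model expects exactly this (samples, features, 1) shape; the sketch below shows the kind of network the reshaped tensors could feed (the layer sizes are assumptions, not the model trained later in this report):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, Flatten, Dense

# X_train.shape is (n_samples, n_features, 1) after the reshape above
model = Sequential([
    Conv1D(64, kernel_size=5, activation='relu', input_shape=(X_train.shape[1], 1)),
    Flatten(),
    Dense(y_train.shape[1], activation='softmax')   # one output per emotion class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])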
3.3 Implementation
Prerequisites for building the project:
pip install librosa soundfile numpy sklearn pyaudio
Imports
import librosa
import soundfile
import os, glob, pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
Feature Extraction
def extract_feature(file_name, mfcc, chroma, mel):
    # extract MFCC, chroma and mel-spectrogram features from one sound file
    with soundfile.SoundFile(file_name) as sound_file:
        X = sound_file.read(dtype="float32")
        sample_rate = sound_file.samplerate
        if chroma:
            stft = np.abs(librosa.stft(X))
        result = np.array([])
        if mfcc:
            mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
            result = np.hstack((result, mfccs))
        if chroma:
            chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
            result = np.hstack((result, chroma))
        if mel:
            mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
            result = np.hstack((result, mel))
    return result
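For example, calling the function on a single RAVDESS recording (the path below is hypothetical) returns one flat vector of 180 values: 40 MFCCs, 12 chroma values, and 128 mel bands.

sample = "/home/sanjar/dmdw_lab_project/ravless_data/Actor_01/03-01-06-01-02-01-01.wav"  # hypothetical file
feature = extract_feature(sample, mfcc=True, chroma=True, mel=True)
print(feature.shape)   # (180,) -> 40 MFCC + 12 chroma + 128 mel values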
Define the emotions
emotions = {
    '01': 'neutral',
    '02': 'calm',
    '03': 'happy',
    '04': 'sad',
    '05': 'angry',
    '06': 'fearful',
    '07': 'disgust',
    '08': 'surprised'
}

# subset of emotions we actually observe in this project
observed_emotions = ['calm', 'happy', 'fearful', 'disgust']
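The emotion code is the third field of a RAVDESS file name, which is why load_data below splits the name on '-'. For example (a hypothetical file name):

file_name = "03-01-06-01-02-01-12.wav"         # modality-channel-emotion-intensity-statement-repetition-actor
emotion = emotions[file_name.split("-")[2]]    # third field '06' -> 'fearful'
print(emotion)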
Load Data
def load_data(test_size=0.2):
    x, y = [], []
    # each actor's files live in an Actor_* sub-folder of the dataset directory
    for file in glob.glob("/home/sanjar/dmdw_lab_project/ravless_data/Actor_*/*.wav"):
        file_name = os.path.basename(file)
        emotion = emotions[file_name.split("-")[2]]   # third field of the name is the emotion code
        if emotion not in observed_emotions:
            continue
        feature = extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)
Split data into the training and test set.
x_train,x_test,y_train,y_test=load_data(test_size=0.25)
Observe the shape of the training and testing datasets.
print((x_train.shape[0], x_test.shape[0]))
Get the number of features extracted.
print(f'Features extracted: {x_train.shape[1]}')
Initialize an MLPClassifier.
This is a Multi-Layer Perceptron classifier; it optimizes the log-loss function using stochastic gradient descent (or a variant such as Adam). Unlike SVM or Naive Bayes, the MLPClassifier has an internal neural network for classification: it is a feedforward ANN model.
model = MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(300,),
                      learning_rate='adaptive', max_iter=500)
Training the model
model.fit(x_train,y_train)
To calculate the accuracy of our model, we first predict on the test set and then use the accuracy_score() function we imported from sklearn.
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy: {:.2f}%".format(accuracy*100))
We have achieved an accuracy of 72%
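Accuracy alone hides per-class behaviour; a per-emotion breakdown can be obtained with sklearn's classification_report (an optional addition, reusing y_pred from above):

from sklearn.metrics import classification_report
# precision, recall and F1 for each of the four observed emotions
print(classification_report(y_true=y_test, y_pred=y_pred))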

Using this MLP classifier, we built an interactive SER application with the 'tkinter' Python library.

CODE (three steps; a sketch follows below):
1. Save the current MLP model to a file using 'joblib'.
2. Define a voice_rec function that records audio from the microphone and writes it to a WAV file with soundfile.
3. Use a 'tkinter' root window to build an app that records audio and displays the detected emotion.
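A minimal sketch of such an app, assuming the trained MLP was saved as 'mlp_model.joblib' (the file name is an assumption) and reusing the extract_feature function defined earlier. The sounddevice library is used here for recording only to keep the sketch short; the report lists pyaudio as the prerequisite.

import joblib
import sounddevice as sd
import soundfile as sf
import tkinter as tk

MODEL_FILE = "mlp_model.joblib"   # assumed name; step 1 would be joblib.dump(model, MODEL_FILE)
SAMPLE_RATE = 22050
DURATION = 3                      # seconds to record

model = joblib.load(MODEL_FILE)

def voice_rec():
    # record a short clip from the microphone and save it as a WAV file
    recording = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
    sd.wait()
    sf.write("recording.wav", recording, SAMPLE_RATE)
    # extract the same features used during training and predict the emotion
    feature = extract_feature("recording.wav", mfcc=True, chroma=True, mel=True)
    emotion = model.predict([feature])[0]
    result_label.config(text="Detected emotion: " + emotion)

root = tk.Tk()
root.title("Interactive SER")
tk.Button(root, text="Record and Recognize", command=voice_rec).pack(padx=20, pady=10)
result_label = tk.Label(root, text="Press the button and speak")
result_label.pack(padx=20, pady=10)
root.mainloop()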
Conclusion:
Speech Emotion Recognition (SER) was successfully implemented; the model obtained an accuracy of 72.40%, and we created an app that records audio live and judges its emotion.