PROJECT REPORT ON
Interactive Speech Emotion Recognition

Submitted By
161070007 Vallah Sunil Masal
vsmasal_b16@ce.vjti.ac.in
Fourth Year B.Tech Computer
Academic Year 2020-2021

Under The Guidance Of
Prof. V. K. Sambhe

DEPARTMENT OF COMPUTER ENGINEERING
VEERMATA JIJABAI TECHNOLOGICAL INSTITUTE (VJTI, MUMBAI)
(AN AUTONOMOUS INSTITUTE AFFILIATED TO UNIVERSITY OF MUMBAI)

1. INTRODUCTION

1.1 Background
Speech Emotion Recognition (SER) is the task of recognizing human emotions and affective states from speech. It capitalizes on the fact that the voice often reflects the speaker's underlying emotion through tone and pitch; the same cues allow animals such as dogs and horses to sense human emotion. SER is difficult because emotions are subjective and annotating audio reliably is challenging.

1.2 Motivation
Speech Emotion Recognition is one of the emerging fields in human-computer interaction. The quality of a human-computer interface that responds to speech emotions relies heavily on the features used and on the classifier employed for recognition. Understanding emotions is essential in human social interaction; studies suggest that only about 10% of human life is completely unemotional. The aim of SER is to enable natural interaction with the computer by speaking instead of using traditional input devices, so that the machine understands not only the verbal content but also more subtle cues such as affect that any human listener would easily react to. Interaction becomes more meaningful when machines can recognize emotional content. Over the last decade SER has shifted from a side issue to a major topic in human-computer interaction and speech processing, and it has potentially wide applications: for example, human-computer interfaces could be made to respond differently according to the emotional state of the user.

1.3 Objective
To build a model that recognizes emotion from speech using the librosa and sklearn libraries and the RAVDESS dataset.

1.4 Scope
The global emotion detection and recognition market size is projected to grow from USD 21.6 billion in 2019 to USD 56.0 billion by 2024, at a Compound Annual Growth Rate (CAGR) of 21.0% during the forecast period. Factors such as the rising need for socially intelligent artificial agents, increasing demand for speech-based biometric systems to enable multifactor authentication, technological advancements across the globe, and the growing need for high operational excellence are expected to work in favor of the market in the near future.

1.4.1 Applications of SER in HCI
Speech Emotion Recognition has potentially wide applications in Human-Computer Interaction (HCI), wherever speech is the primary mode of interaction with the machine; for example, human-computer interfaces could be made to respond differently according to the emotional state of the user. SER is therefore an essential requirement for making the following applications more human-like and increasing their acceptance among potential users:
(1) Robots (pet robots),
(2) Smart call centers,
(3) Intelligent spoken tutoring systems,
(4) Smart aircraft cockpits,
(5) Prosody for dialog systems,
(6) Diagnostic tools for therapists,
(7) Information retrieval for medical analysis,
(8) In-car board systems,
(9) Speech synthesis,
(10) Computer games,
(11) Sorting voice mail,
(12) Intelligent toys,
(13) Lie detection, etc.

Example: In-car board system
An emotional speech recognition system on a car's on-board computer assists the driver with tasks such as steering, accelerating, braking, controlling speed, following a route, keeping distance to other vehicles, dimming the lights, operating the windscreen wipers, changing gears, and signaling. A car that understands natural speech provides a human-like driver assistance system. Emotional factors are decisive for safety and comfort while driving: emotions such as aggressiveness, anger, confusion, sadness, and sleepiness have a major impact on road accidents. Accidents can be reduced by recognizing the driver's emotions and responding to them by adapting to the current situation.

Example: Robots
Robots can interact with people and assist them in their daily routines in common places such as homes, supermarkets, hospitals, and offices. To accomplish these tasks, a robot should recognize the emotions of the humans around it in order to provide a friendly environment; without recognizing emotion, the robot cannot interact with a human in a natural way.

Example: Computer games
Computer games can be controlled through the emotions in human speech. The computer recognizes the player's emotion from their speech and sets the level of the game (easy, medium, hard). For example, if the speech sounds aggressive, the level becomes hard; if the player sounds very relaxed, the level becomes easy; the remaining emotions map to the medium level.

1.5 Methodology
We use the librosa library to extract acoustic features from the sound files and then use a neural-network classifier to assign an emotion label to each sound. The workflow is: load the audio, extract features, train the classifier, and evaluate it on held-out data.

2. Related Works (Literature Survey)

Predicting Emotions Perceived from Sounds
Sonification is the science of communicating data and events to users through sound. Auditory icons, earcons, and speech are the common auditory display schemes used in sonification, or more specifically in the use of audio to convey information. Once the captured data are perceived, their meanings, and more importantly their intentions, can be interpreted more easily, so sonification can be employed as a complement to visualization techniques. Through auditory perception it is possible to convey temporal, spatial, or other context-oriented information. An important research question is whether the emotions perceived from these auditory icons or earcons are predictable, which would enable an automated sonification platform. The paper conducts an experiment in which several mainstream and conventional machine learning algorithms are developed to study the prediction of emotions perceived from sounds: the key features of the sounds are captured and then modeled using machine learning algorithms together with feature reduction techniques. The authors observe that it is possible to predict perceived emotions with high accuracy; in particular, regression based on Random Forest demonstrated its superiority over the other machine learning algorithms.
Link: https://arxiv.org/abs/2012.02643

3. Implementation

3.1 Dataset information
We used the RAVDESS dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song, which is free to download. The dataset has 7356 files rated by 247 individuals 10 times each on emotional validity, intensity, and genuineness. The entire dataset is 24.8 GB of recordings from 24 actors, but we lowered the sample rate on all the files so that it takes up less space.
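Before preprocessing, it can help to confirm that the audio files load correctly. The short sketch below is illustrative only and is not part of the report's pipeline: it loads one RAVDESS clip with librosa, prints its sample rate and duration, and shows how the emotion code is embedded in the filename (the path is a placeholder for wherever the dataset was extracted).

import librosa

# Placeholder path: point this at any clip inside an extracted Actor_* folder.
sample_path = "ravdess_data/Actor_01/03-01-03-01-01-01-01.wav"

# sr=None keeps the file's native sample rate instead of librosa's 22050 Hz default.
signal, sample_rate = librosa.load(sample_path, sr=None)
print(f"Sample rate: {sample_rate} Hz, duration: {len(signal) / sample_rate:.2f} s")

# RAVDESS encodes the emotion as the third dash-separated field of the filename
# ('03' maps to 'happy' in the emotions dictionary defined in section 3.3).
print("Emotion code:", sample_path.split("/")[-1].split("-")[2])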
3.2 Dataset preprocessing
Preprocessing the data for the model occurred in five steps:

1. Train/test split of the data
# df_combined holds one row per clip: the first columns are the emotion, gender,
# and actor labels, and the remaining columns are the extracted features.
train, test = train_test_split(df_combined, test_size=0.2, random_state=0,
                               stratify=df_combined[['emotion', 'gender', 'actor']])
X_train = train.iloc[:, 3:]
y_train = train.iloc[:, :2].drop(columns=['gender'])
X_test = test.iloc[:, 3:]
y_test = test.iloc[:, :2].drop(columns=['gender'])

2. Normalize the data, to improve model stability and performance.
mean = np.mean(X_train, axis=0)
std = np.std(X_train, axis=0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

3. Transform the data into arrays for Keras.
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)

4. One-hot encode the target variable.
# The network requires numeric inputs and outputs.
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
lb = LabelEncoder()
y_train = to_categorical(lb.fit_transform(y_train))
y_test = to_categorical(lb.transform(y_test))   # reuse the encoder fitted on the training labels

5. Reshape the data into a 3D tensor.
X_train = X_train[:, :, np.newaxis]
X_test = X_test[:, :, np.newaxis]

3.3 Implementation
Prerequisites for building the project:
pip install librosa soundfile numpy sklearn pyaudio

Imports
import librosa
import soundfile
import os, glob, pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

Feature extraction
def extract_feature(file_name, mfcc, chroma, mel):
    # Summarize a clip as the time-averaged MFCC, chroma, and mel-spectrogram features.
    with soundfile.SoundFile(file_name) as sound_file:
        X = sound_file.read(dtype="float32")
        sample_rate = sound_file.samplerate
        if chroma:
            stft = np.abs(librosa.stft(X))
        result = np.array([])
        if mfcc:
            mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
            result = np.hstack((result, mfccs))
        if chroma:
            chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
            result = np.hstack((result, chroma))
        if mel:
            mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
            result = np.hstack((result, mel))
    return result

Define the emotions
emotions = {
    '01': 'neutral',
    '02': 'calm',
    '03': 'happy',
    '04': 'sad',
    '05': 'angry',
    '06': 'fearful',
    '07': 'disgust',
    '08': 'surprised'
}
observed_emotions = ['calm', 'happy', 'fearful', 'disgust']

Load the data
def load_data(test_size=0.2):
    x, y = [], []
    for file in glob.glob("/home/sanjar/dmdw_lab_project/ravless_data/Actor_*.wav"):
        file_name = os.path.basename(file)
        emotion = emotions[file_name.split("-")[2]]
        if emotion not in observed_emotions:
            continue
        feature = extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)

Split the data into training and test sets.
x_train, x_test, y_train, y_test = load_data(test_size=0.25)

Observe the shape of the training and testing datasets.
print((x_train.shape[0], x_test.shape[0]))

Get the number of features extracted.
print(f'Features extracted: {x_train.shape[1]}')
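With the options used above (mfcc=True, chroma=True, mel=True), the 40 requested MFCCs plus librosa's defaults of 12 chroma bins and 128 mel bands should give a 180-dimensional vector per clip, so the print statement above is expected to report 180 features. A small sanity check, using a placeholder file path, could look like this:

# Hypothetical path; any RAVDESS .wav file will do.
feature = extract_feature("ravdess_data/Actor_01/03-01-06-01-02-01-01.wav",
                          mfcc=True, chroma=True, mel=True)
print(feature.shape)   # expected: (180,) -> 40 MFCC + 12 chroma + 128 mel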
Initialize an MLPClassifier. The Multi-Layer Perceptron classifier optimizes the log-loss function using LBFGS or stochastic gradient descent. Unlike SVM or Naive Bayes, the MLP classifier has an internal neural network for classification; it is a feedforward ANN model.
model = MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(300,),
                      learning_rate='adaptive', max_iter=500)

Train the model.
model.fit(x_train, y_train)

Predict for the test set, then calculate the accuracy of the model using the accuracy_score() function we imported from sklearn.
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

We achieved an accuracy of 72% on the test set.

Using this MLP classifier, I built an interactive SER application with the tkinter Python library. The code follows three steps (a minimal sketch is given after the conclusion):
1. Save the current MLP model to a file using joblib.
2. Define a voice_rec function to record audio and save it using soundfile.
3. Use tkinter (a "root" window) to create an app that records audio and displays the detected emotion.

Conclusion: Speech Emotion Recognition (SER) was successfully implemented; the model obtained an accuracy of 72.40%, and we created an app that records audio live and judges its emotion.
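The full source of the interactive app is not reproduced in this report, so the following is only a minimal sketch of the three steps listed in section 3.3. It assumes the trained model and the extract_feature function from section 3.3 are in scope; the sounddevice package is used here for microphone capture (an assumption, since the report's own recording code is not shown), soundfile writes the clip to disk, and the file names are placeholders.

import joblib
import sounddevice as sd
import soundfile as sf
import tkinter as tk

# Step 1: persist the trained classifier (assumes `model` from section 3.3).
joblib.dump(model, "mlp_ser_model.joblib")

# Step 2: record a short clip from the microphone and save it as a WAV file.
def voice_rec(filename="live_recording.wav", seconds=4, sample_rate=44100):
    recording = sd.rec(int(seconds * sample_rate), samplerate=sample_rate, channels=1)
    sd.wait()                                    # block until the recording finishes
    sf.write(filename, recording, sample_rate)   # write the captured samples to disk
    return filename

# Step 3: a tiny tkinter app; one button records, one label shows the emotion.
def record_and_predict():
    wav_path = voice_rec()
    clf = joblib.load("mlp_ser_model.joblib")
    # extract_feature is the function defined in section 3.3.
    features = extract_feature(wav_path, mfcc=True, chroma=True, mel=True)
    emotion = clf.predict(features.reshape(1, -1))[0]
    result_label.config(text=f"Detected emotion: {emotion}")

root = tk.Tk()
root.title("Interactive Speech Emotion Recognition")
tk.Button(root, text="Record 4 seconds", command=record_and_predict).pack(padx=20, pady=10)
result_label = tk.Label(root, text="Press the button and speak")
result_label.pack(padx=20, pady=10)
root.mainloop()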