EMOTION RECOGNITION BY SPEECH ANALYSIS USING FEATURE NORMALIZATION TECHNIQUE

Ms. Roshni Pandey
Dept. of Electronics and Telecommunication
Priyadarshini College of Engineering
Nagpur, Maharashtra, India
pandeyroshni13@gmail.com

Prof. Mrs. Y. A. Nafde
Dept. of Electronics and Telecommunication
Priyadarshini College of Engineering
Nagpur, Maharashtra, India
yogitanafde@gmail.com

Prof. Mrs. J. C. Kolte
Dept. of Electronics and Telecommunication
Priyadarshini College of Engineering
Nagpur, Maharashtra, India
Jyotiamolramteke@rediffmail.com

Abstract: - This paper reviews methods for recognizing the emotional state of a person by analysing the features of the speech signal using various techniques. Human emotion indicates the mental state of a person; recognizing it helps a human being interact properly with a machine and is useful in many other areas. Recognition is done by analysing the speech signal of a person in different environments and extracting various speech features, which are then normalized. These features are classified, and a majority voting technique applied over a Support Vector Machine (SVM) and a Neural Network (NN) selects the exact class of the emotion. Nowadays, emotion recognition in speech is an important topic for understanding how human beings react and interact with the environment and with each other, and it remains one of the major scientific challenges. Here we review some of the most common speech features and their classification techniques.

Key terms – Human Emotion, Speech Features, Neural Networks (NN), and Support Vector Machine (SVM).

INTRODUCTION

The requirement for emotion recognition is increasing rapidly with developments in the field of intelligent machines, so that a machine can behave like a human being and take decisions. Emotion recognition helps these machines communicate properly and eases the interfacing between human and machine. It has drawn effort in areas ranging from theoretical science to engineering, and it has been shown that emotion is very important for decision-making and social communication. Therefore it is very important for an intelligent machine to recognize human emotion in order to communicate socially and interact with the human environment. Human emotion can be defined as the complex psycho-physiological experience of the state of mind of an individual person in relation to the social environment. Human speech emotions are generally classified into six categories: sadness, fear, happiness, surprise, anger and disgust. Here, the speech features are analysed; the signals are then normalised and classified by a majority voting technique applied over a Neural Network (NN). Emotion recognition gives machines the potential to interact naturally. Domains where it can enhance working capabilities include tutoring systems, health care, games, call centres and ambient intelligent environments.

The normalization scheme is designed to reduce speaker variability and other barriers. The main idea of the approach is to build a database of speech signals and to estimate linear scaling parameters for each speaker from his/her speech. The normalization parameters are then applied to the speech. This unsupervised feature normalization approach extends the aforementioned ideas by iteratively detecting a neutral speech subset for each speaker, which is used to estimate his/her normalization parameters; a neural network is then used to classify the different speech signals.

Here we review some of the most common speech features used in conveying different emotional states, along with effective techniques for their classification. Some of these features are discussed below.

SPEECH FEATURES AND EXTRACTION METHODS

There are two main types of speech features: phonetic and prosodic. Phonetic features comprise the study of the sounds of human speech, such as vowels and consonants and their pronunciation; they are concerned with the physical properties of speech sounds, their acoustic properties, their physical production, etc. Prosodic features, on the other hand, comprise the musical aspects of speech, such as accents or stresses and rising or falling tones.

A). Pitch –

Pitch is one of the most popular features used in emotion recognition. It is a very sensitive factor that responds to the auditory sense, and it can be represented by the fundamental frequency. Kwon O. W. et al. [7] performed emotion recognition by analysing speech signals in which Mel-band energies, pitch, formants, Mel-Frequency Cepstral Coefficients (MFCCs) and log energy are used as base features, while the velocity/acceleration of pitch forms the feature streams. This research was conducted using the Speech Under Simulated and Actual Stress (SUSAS) and artificial intelligence robot (AIBO) databases. Frank Dellaert et al. [4] presented a new method for extracting prosodic features from speech based on a smoothing spline approximation of the pitch contour, and also introduced a novel pattern recognition technique based on majority voting of subspace specialists.
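As a concrete illustration of pitch extraction (not the method of [7] or [4], just a minimal autocorrelation sketch in Python; the frame length, sampling rate and F0 search range are assumptions):

    import numpy as np

    def estimate_pitch(frame, sr, fmin=50.0, fmax=400.0):
        # Autocorrelation-based F0 estimate for a single voiced frame.
        frame = frame - np.mean(frame)              # remove DC offset
        ac = np.correlate(frame, frame, mode="full")
        ac = ac[len(ac) // 2:]                      # keep non-negative lags
        lo, hi = int(sr / fmax), int(sr / fmin)     # admissible period range
        lag = lo + np.argmax(ac[lo:hi])             # strongest periodicity
        return sr / lag                             # fundamental frequency in Hz

    # Sanity check on a synthetic 200 Hz tone (one 30 ms frame at 16 kHz).
    sr = 16000
    t = np.arange(int(0.03 * sr)) / sr
    print(estimate_pitch(np.sin(2 * np.pi * 200 * t), sr))   # approx. 200.0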

B). Intensity –

The power or energy of a voice or speech signal is referred to as its intensity. The physical intensity of a sound can be perceived through the subjective level of noisiness and the pressure of sounds. Normally, the simple intensity is the sum of the absolute sample values in each data frame. Shami M. et al. [10] employed three measures of the intensity contour: mean, maximum and variance.
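A minimal sketch of this intensity contour and its three summary measures (the frame and hop sizes are illustrative assumptions):

    import numpy as np

    def intensity_stats(signal, frame_len=400, hop=160):
        # Simple intensity per frame: sum of absolute sample values.
        frames = [signal[i:i + frame_len]
                  for i in range(0, len(signal) - frame_len + 1, hop)]
        contour = np.array([np.sum(np.abs(f)) for f in frames])
        # The three contour measures used by Shami M. et al. [10].
        return contour.mean(), contour.max(), contour.var()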

C). Energy –

The volume or intensity of the speech signal is known as its energy. It is very important as it carries valuable information that helps in differentiating sets of emotions, although this measurement alone is not sufficient to differentiate the basic emotions. Scherer [37] stated that anger, joy and fear show an increased energy level, whereas sadness shows a low energy level. Ververidis D. et al. [13] used a finite impulse response (FIR) filter of 120 coefficients to find the energy content of a frequency band; the coefficients are calculated with the frequency sampling method using a Hamming window.
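A sketch of this band-energy computation, assuming SciPy's firwin2 as the frequency-sampling FIR design routine (an odd tap count, 121, is used here for a standard type I bandpass, whereas [13] reports 120 coefficients; band edges are caller-supplied):

    import numpy as np
    from scipy.signal import firwin2, lfilter

    def band_energy(signal, sr, f_lo, f_hi, numtaps=121):
        # FIR bandpass designed by frequency sampling with a Hamming window.
        nyq = sr / 2.0
        freqs = [0.0, 0.9 * f_lo, f_lo, f_hi, 1.1 * f_hi, nyq]
        gains = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
        taps = firwin2(numtaps, freqs, gains, window="hamming", fs=sr)
        band = lfilter(taps, 1.0, signal)
        return float(np.sum(band ** 2))   # energy inside [f_lo, f_hi]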

D). Duration –

Duration is the continuous time from the start to the end of each emotional sentence. Silent parts are included, because they also contribute to emotion detection. The duration ratio of neutral speech to emotional speech was used as a characteristic parameter for recognition by Cai L. et al. [23]. Lee C. M. et al. [24] chose features such as the ratio of the durations of the voiced and unvoiced regions, the speech rate, and the duration of the longest voiced speech.
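As an illustration, a crude version of such duration features; the energy-threshold voicing decision below is a simplistic stand-in for a real voiced/unvoiced detector, and the frame sizes are assumptions:

    import numpy as np

    def duration_features(signal, sr, frame_len=400, hop=160, thresh=0.1):
        # Frame energies; frames above a fraction of the peak count as voiced.
        energy = np.array([np.sum(signal[i:i + frame_len] ** 2)
                           for i in range(0, len(signal) - frame_len + 1, hop)])
        voiced = energy > thresh * energy.max()
        total_duration = len(signal) / sr                 # silence included
        voiced_ratio = voiced.sum() / max(1, (~voiced).sum())
        return total_duration, voiced_ratio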

E). Mel-Frequency Cepstral Coefficients (MFCC) –

MFCCs are among the most valuable and widely used features of a speech signal; the basic features (phase features, pitch, energy, etc.) are converted into 12 MFCC features. The success of the Least Squares SVM Bound (LSBOUND) algorithm suggests that MFCC features are more effective and informative than linear prediction coefficient (LPC) features. Neiberg D. et al. [20] proposed MFCC and MFCC-low, which show similar performance. The filter banks in MFCC-low are placed in the 20-300 Hz region and are calculated in the same way as MFCC; these low-frequency MFCCs capture low-frequency variations.
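A sketch of 12-coefficient MFCC extraction using the librosa library (the file path and sampling rate are placeholders; averaging over time is one common way to obtain a fixed-length vector for a classifier):

    import librosa

    y, sr = librosa.load("utterance.wav", sr=16000)       # placeholder path
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)    # shape (12, n_frames)
    features = mfcc.mean(axis=1)                          # fixed-length summary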

NORMALIZATION PROCESS

Ideal Feature Normalization

We have previously discussed the problem of inter-speaker variability; the feature normalization technique is proposed to overcome this problem along with varying recording conditions. This is done by estimating the normalization parameters separately on neutral speech, and then applying those estimated parameters to the whole speech signal.

Without normalization, the emotional states overlap, as shown in Fig. 1a: for example, samples from speaker 1 are mixed with sad samples of speaker 2. Fig. 1b describes the approach, which aims to normalize the corpus such that the neutral sets across speakers have similar properties. Fig. 1c shows the clustering of the emotional classes after the ideal normalization.

The feature normalization approach is general and can be implemented with various transformations (e.g., min-max normalization, z-standardization, and feature warping). It can also be applied to facial features; for example, Zeng et al. [30] proposed to use neutral facial poses for the normalization of facial features. In this way, the normalization technique may be employed to normalize speech features, after which the signals are classified using various techniques.
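A minimal sketch of one such transformation, per-speaker z-standardization estimated on the neutral subset only (the iterative neutral-detection step described above is omitted; the neutral_mask argument is assumed to be given):

    import numpy as np

    def normalize_by_neutral(features, speaker_ids, neutral_mask):
        # features: (n_samples, n_dims); one row per utterance or frame.
        out = np.empty_like(features, dtype=float)
        for spk in np.unique(speaker_ids):
            sel = speaker_ids == spk
            neutral = features[sel & neutral_mask]
            mu = neutral.mean(axis=0)                 # parameters estimated
            sigma = neutral.std(axis=0) + 1e-8        # on neutral speech only
            out[sel] = (features[sel] - mu) / sigma   # applied to all speech
        return out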

[Fig. 1d block diagram: Speech Data -> Segmentation unit -> Feature extraction -> Feature selection -> Classification]

Fig. 1. Schematic description of ideal feature normalization of a speech signal: (a) emotional classes in feature space before normalization, (b) the corpus normalized such that its neutral portions have similar statistics across speakers, (c) emotional classes in feature space after normalization, and (d) the speech emotion recognition system.

DIFFERENT TYPES OF CLASSIFIERS FOR THE CLASSIFICATION

In an emotion recognition system, different types of classifiers are used to classify the speech signal after the normalization process. They are as follows: neural networks, Support Vector Machine (SVM), k-nearest neighbour (k-NN), and Hidden Markov Models (HMM).


A). Neural Network –

Neural networks are computational models capable of pattern recognition and machine learning; they evolved from deep analysis of the animal central nervous system (the brain). A neural network is used for training on, and classification of, the normalized speech features.

V. Petrushin [5] implemented an artificial neural network for emotion recognition in call centres. J. Nicholson et al. [40] implemented a neural network based classifier for emotion classification, using two sets of features: prosodic features, which include speech power (P), energy and pitch, and phonetic features, which include the delta LPC parameter and 12 LPC parameters.

B). Support Vector Machine (SVM) –

Different tools are used for pattern recognition in speech based emotion recognition, among which SVM is one of the most popular and powerful, giving the best accuracy. Whenever SVM is used, variable-length data is first converted into fixed-length data, since SVM can only be applied to a linear data stream; a fitting function is used for this. Different classifiers use various processes to control system complexity; SVM uses VC dimensions. Shen P., Changjun Z. and Chen X. proposed an automatic speech based emotion recognition system using a Support Vector Machine (2011).
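The combination scheme mentioned in the abstract, majority voting over an SVM and a neural network, can be sketched with scikit-learn as follows (a minimal illustration; the feature dimensionality, class count, random placeholder data and hyperparameters are assumptions, not the configuration of any cited system):

    import numpy as np
    from sklearn.ensemble import VotingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 12))     # stand-in normalized feature vectors
    y = rng.integers(0, 6, size=120)   # six emotion classes

    svm = make_pipeline(StandardScaler(), SVC(probability=True))
    nn = make_pipeline(StandardScaler(),
                       MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000))
    vote = VotingClassifier([("svm", svm), ("nn", nn)], voting="soft")
    vote.fit(X, y)
    print(vote.predict(X[:5]))         # class chosen by the combined vote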

C). k-Nearest Neighbour (k-NN) –


k-NN is a type of classifier employed for the classification of different speech signals. Yi-hao Kao et al. [16] conducted an emotion recognition experiment using five emotional states obtained from a database: happiness, sadness, anger, neutral and bored. Linear Prediction Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC) were used for feature extraction. T. L. Pao, Y. Chen, J. Yeh and J. Burns [29], on the other hand, proposed a weighted discrete k-nearest neighbour classification algorithm for detecting and evaluating emotion in Mandarin speech.
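A sketch of a distance-weighted k-NN classifier in scikit-learn (a loose stand-in for the weighted discrete k-NN of [29]; the data below is random placeholder features, not a real corpus):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 12))    # stand-in LPCC/MFCC feature vectors
    y = rng.integers(0, 5, size=100)  # happiness, sadness, anger, neutral, bored

    knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
    knn.fit(X, y)
    print(knn.predict(X[:3]))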

Fig. 2. Supervised pre-training method: using a network with fewer hidden layers to initialize a deeper network.

D). Hidden Markov Models (HMMs) –

The Hidden Markov Model is another type of classifier widely used in the field of speech based emotion recognition. The internal processing of the model remains hidden, since the states of the first-order Markov chain in an HMM are hidden from the observer. The model represents events sequentially as statistical data, and thanks to the state transition matrix, temporal features can be captured. The HMM uses the features of the speech signals to separate anger, happiness, aggressiveness and sadness. In this way the HMM works for speech emotion recognition.

B. Schuller, G. Rigoll, and M. Lang proposed a speech emotion recognition system based on Hidden Markov Models.
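A sketch of HMM-based emotion classification, assuming the hmmlearn library (not the system of Schuller et al.): one Gaussian HMM is trained per emotion, and a test sequence is assigned to the model with the highest log-likelihood. The state count and the dictionary layout are assumptions:

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def train_emotion_hmms(sequences_by_emotion, n_states=3):
        # sequences_by_emotion: {"anger": [seq, ...], ...}; each seq is
        # an (n_frames, n_dims) array of features such as MFCCs.
        models = {}
        for emotion, seqs in sequences_by_emotion.items():
            X = np.vstack(seqs)                  # all frames stacked
            lengths = [len(s) for s in seqs]     # frame count per sequence
            models[emotion] = GaussianHMM(n_components=n_states).fit(X, lengths)
        return models

    def classify(models, seq):
        # Pick the emotion whose HMM scores the sequence highest.
        return max(models, key=lambda e: models[e].score(seq))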

DISCUSSION

This paper presents a review of speech based emotion recognition systems, consisting of a signal database, feature extraction from the speech signal, normalization applied to the features, and the various classifiers used for classifying the speech signals. The most interesting speech features encountered here are prosodic and phonetic, of which the prosodic features, based on pitch, MFCCs, energy and formants, are preferable because they give higher accuracy. The classification rates encountered here would be compared for different speech signals as well as for neutral speech, with different protocols employed to separate them properly.

Recently, speech features derived from the movement of the glottis have helped scientists and researchers considerably. As emotion recognition has become an important topic nowadays, much research is being devoted to it. The emotional state of a person enables better interaction between a human being and a machine; hence the aim is to build the best possible system. Airas and Alku [33] investigated a recently developed glottal flow parameter, the normalized amplitude quotient (NAQ) [33].

CONCLUSION

We have made a detailed study of speech features and their extraction process, using a normalization process to normalize them. Neural networks may be used as the classifier for the classification. The most interesting speech features encountered are prosodic and phonetic, of which the prosodic features, based on pitch, MFCCs, energy and formants, are preferable because they give higher accuracy. We will prepare a database of speech signals and calculate the values of the different speech features. The normalization technique may then be employed, classification may be done by a neural network, and the differences will be analysed. From the study, we see that the results of other classifiers are not very accurate, at only around 70%. We will try to achieve an accuracy of at least 85% by using the appropriate method.

REFERENCES

[1]. Carlos Busso, Soroosh Mariooryad, Angeliki Metallinou and Shrikanth Narayanan, "Iterative Feature Normalization Scheme for Automatic Emotion Detection from Speech," IEEE Transactions on Affective Computing, vol. 4, no. 4, October-December 2013.

[2]. Yixiong Pan, Peipei Shen and Liping Shen, "Speech Emotion Recognition Using Support Vector Machine," International Journal of Smart Home, vol. 6, no. 2, April 2012.

[3]. Shruti Aggarwal and Naveen Aggarwal, "Classification of Audio Data using Support Vector Machine," IJCST, vol. 2, issue 3, September 2011.

[4]. F. Dellaert, T. Polzin, and A. Waibel, "Recognizing emotion in speech," in Proc. ICSLP, Philadelphia, PA, 1996, pp. 1970-1973.

[5]. V. Petrushin, "Emotion in speech: Recognition and application to call centers," Artif. Neu. Net. Engr. (ANNIE), 1999.

[7]. V. Sethu, E. Ambikairajah, and J. Epps, "Speaker Normalisation for Speech Based Emotion Detection," Proc. 15th International Conference on Digital Signal Processing, pp. 611-614, July 2007.

[8]. M. N. Hasrul, M. Hariharan and Sazali Yaacob, "Human Affective (Emotion) Behaviour Analysis using Speech Signals: A Review," 2012 International Conference on Biomedical Engineering (ICoBE), 27-28 February 2012, Penang.

[9]. Md. Kamruzzaman Sarker, Kazi Md. Rokibul Alam and Md. Arifuzzaman, "Emotion Recognition from Speech based on Relevant Feature and Majority Voting," 3rd International Conference on Informatics, Electronics & Vision, 2014.
