
Chapter 7
FEATURE CALCULATION
7.1 Basic Prosodic Attributes
In the present section, the calculations and procedures employed to obtain the basic feature
contours are explained. These essential attributes (i.e. pitch and energy) will be the
starting point for obtaining more complex features, which contain valuable
information for our purposes. The software used in section 7.1 is part of the Verbmobil
long-term project of the Federal Ministry of Education, Science, Research and
Technology.
In order to achieve feasible estimations, and to avoid the difficulties caused by the
non-stationary nature of speech, it is assumed that the properties of the signal change relatively
slowly with time. This allows examination of a short-time window of speech to extract
relevant parameters that are presumed to be fixed within the duration of the window.
Most techniques yield parameters averaged over the course of the time window. Thus, if
dynamic parameters are to be modelled, the signal must be divided into successive
windows or analysis frames so that the parameters can be calculated often enough to
follow relevant changes. Consequently, in order to obtain F0 and energy contours, smaller
fragments of speech, called frames, are considered.
For each frame, the F0 and energy values are computed. There will be one single
value per frame, and for its calculation a longer analysis window is employed. Inside the
analysis window all the speech signal values are considered, so that consecutive analysis
windows always overlap. Frame durations of 10 ms and 20 ms are commonly used in
speech processing, while window lengths for F0 and energy calculations are usually
established between 25 ms and 40 ms. The analysis performed in the present work
considers frame durations of 10 ms and analysis window lengths of 40 ms.
Since the voiced/unvoiced decision is the basis of the F0 computation, it is the first
algorithm described within this section. The decision is frame-based, and F0 will be
estimated only for voiced frames.
7.1.1 Voiced/unvoiced decision.
Voiced speech involves the vibration of the vocal folds in response to airflow from
the lungs. This vibration is periodic and can be examined independently of the
properties of the vocal tract. Its periodicity refers to the fundamental frequency of this
vibration, or the resulting periodicity in the speech signal, also called “pitch”.
Figure 7.1. Waveform of the glottal source.
In unvoiced speech the sound source is not a regular vibration, but rather vibrations
caused by turbulent airflow due to a constriction in the vocal tract. The sound created as a
result of the constriction is described as a noise source. It contains no dominating periodic
component and has a relatively flat spectrum, meaning that every frequency component is
represented equally (in fact, for some sounds the noise spectrum may slope down at
around 6 dB/octave). Looking at the time waveform of a noise source, only a random
pattern of movement around the zero axis is observed. In this context, without any
periodicity, pitch estimation makes no sense.
Figure 7.2. Different sources in speech production.
Therefore, for F0 estimation it is essential to define which frames are considered
voiced and which unvoiced. In contrast with the F0 and energy calculations, non-overlapping
windows are employed for the voiced/unvoiced decision. The algorithm uses only
values of the signal contained within a frame duration.
Voiced frames differentiate themselves from unvoiced frames by means of high
amplitude values, a relatively low zero-crossing rate and high energy values. The zero-crossing
rate is understood as the number of zero-crossings per time unit, defined from now on as the
frame length, i.e. 10 ms. Several procedures to decide between voiced and unvoiced frames
are introduced in [Hes83]. The algorithm used here applies thresholds, which are
presented in [Hes83], and is described in [Kie97]. As a result of that work, the following
measures and thresholds proved to be appropriate for the voiced/unvoiced decision:
Zero-crossing rate in Hz:          Zcross = n_cross · fs / N                             (7.1)

Normalised energy of the signal:   EneNorm = ( Σ_n s_n² ) / ( N · Range · MaxRange )     (7.2)

Normalised absolute maximum:       MaxNorm = max_n |s_n| / Range                         (7.3)
Where
    fs        = sampling frequency in Hz (here 16000)
    N         = frame length in samples (here 160)
    s_n       = n-th sample value of the signal
    n_cross   = number of zero-crossings during a frame
    Range     = difference between maximum and minimum value in the signal
    MaxRange  = maximum feasible range, dependent on the quantisation
                (here 16 bit → MaxRange = 65536)
The normalisation in (7.2) and (7.3) comes from the fact that the speaker may speak at
different energy levels at different times.
The decision rule is obtained by comparing theoretically based thresholds
(θ_n_cross, θ_EneNorm, θ_MaxNorm) with the vector whose components result from equations
(7.1) to (7.3):

    If    n_cross < θ_n_cross   and
          EneNorm > θ_EneNorm   and                                                  (7.4)
          MaxNorm > θ_MaxNorm
    then  the frame is classified as voiced,
    else  unvoiced.

Where (n_cross, EneNorm, MaxNorm)ᵀ is the vector obtained from (7.1) – (7.3).
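A minimal sketch of how the frame-based decision of equations (7.1)–(7.4) could be implemented is shown below; the threshold values are illustrative placeholders, not the tuned Verbmobil values reported in [Hag95] and [Kie97].

#include <math.h>

/* Frame-based voiced/unvoiced decision following equations (7.1)-(7.4).
 * The three thresholds are placeholders for illustration only. */
#define FS        16000.0   /* sampling frequency in Hz */
#define MAX_RANGE 65536.0   /* 16-bit quantisation      */

static const double TH_ZCROSS  = 3000.0;   /* placeholder */
static const double TH_ENENORM = 1.0e-4;   /* placeholder */
static const double TH_MAXNORM = 0.4;      /* placeholder */

int is_voiced(const short *s, int n)        /* n = frame length in samples */
{
    int    n_cross = 0;
    double energy = 0.0, smax = s[0], smin = s[0], absmax = 0.0;

    for (int i = 0; i < n; i++) {
        if (i > 0 && ((s[i-1] < 0 && s[i] >= 0) || (s[i-1] >= 0 && s[i] < 0)))
            n_cross++;                          /* count sign changes    */
        energy += (double)s[i] * s[i];          /* sum of squared values */
        if (s[i] > smax) smax = s[i];
        if (s[i] < smin) smin = s[i];
        if (fabs((double)s[i]) > absmax) absmax = fabs((double)s[i]);
    }
    double range = smax - smin;
    if (range <= 0.0) return 0;                 /* silent frame -> unvoiced */

    double zcross   = n_cross * FS / n;                   /* eq. (7.1) */
    double ene_norm = energy / (n * range * MAX_RANGE);   /* eq. (7.2) */
    double max_norm = absmax / range;                     /* eq. (7.3) */

    return zcross < TH_ZCROSS && ene_norm > TH_ENENORM &&
           max_norm > TH_MAXNORM;                         /* rule (7.4) */
}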
The definition of appropriate thresholds was optimised in order to reach the best
algorithm performance for the various speech samples available, according to some
theoretical background. The thresholds were selected through experiments made during the
development of the Verbmobil project [Hag95]. After some simple experiments based on trial
and error, further experiments were conducted using Neural Networks as a
classifier for the voiced/unvoiced decision at the frame level. It was observed that this
procedure provided thresholds whose values yield better results. Detailed information
and additional data about voiced/unvoiced decision methods can be found in [Rab78],
[Hes83] and [Kie97].
Since the speech signal conditions are similar in this Diploma Thesis, these thresholds
are retained for the calculations performed here. Before these values were adopted, it was
verified that they were able to classify voiced and unvoiced frames efficiently. The Praat
program was employed to compare the regions selected as voiced. Both programs coincided
consistently in which regions were classified as voiced. However, the Verbmobil program
seemed to yield more accurate boundaries in voiced regions, while Praat created, in
certain cases, too long regions, which included some undesirable unvoiced sounds.
7.1.2 Fundamental Frequency Contour.
7.1.2.1 Previous remark.
This section deals with the fundamental frequency (F0 or pitch) of a periodic signal,
which is the inverse of its period T (see figure 7.1). The period is defined as the smallest
positive member of the set of time shifts that leave the signal invariant, and it only makes
sense for a perfectly periodic signal. The speech signal results from a combination of a source
of sound modulated by a transfer (filter) function (see figure 7.3) determined by the shape
of the supra-laryngeal vocal tract, according to the source-filter theory described in
section 3.3.2. This theory stems from the experiments of Johannes Müller (1848), who
tested a functional theory of phonation by blowing air through larynges excised from
human cadavers.
Obviously, a signal cannot be switched on or off or modulated without losing its
perfect periodicity, and this combination causes the speech signal to be only quasi-periodic,
due either to small period-to-period variations in the vocal fold vibration or to changes in
the vocal tract shape. Therefore, the art of fundamental frequency estimation is to deal with
this information in a consistent and useful way.
Figure 7.3. Source filter model for voiced speech.
7.1.2.2 Difficulties in estimating pitch contour.
F0 is considered one of the most important features for the characterisation of
emotions and is the acoustic correlate of the perceived pitch. Its perception by the human ear
is non-linear and depends on the frequency. In addition, the human voice is not a pure sinusoid, but
a complex combination of diverse frequencies.
Estimating the pitch of voiced speech sounds has always been a complex problem.
Though it appears to be a rather simple task on the surface, there are many subtleties that
need to be kept in mind. F0 is usually defined, for voiced speech, as the rate of vibration
of the vocal folds. Periodic vibration at the glottis may produce speech that is less than
perfectly periodic, due to the changes in the shape of the vocal tract that filters the glottal
source waveform, making it hard to estimate the fundamental periodicity from the speech
waveform.
Therefore, F0 estimation involves a huge number of considerations; it can be
influenced by many factors such as phone-intrinsic parameters or coarticulation.
Furthermore, the excitation signal itself is not truly periodic, but shows small variations
in period duration (jitter) and in period amplitude (shimmer). These aperiodicities, in
the form of relatively smooth changes in amplitude, rate or glottal waveform shape (for
example the duty cycle of open and closed phases), or intervals where the vibration seems
to reflect several superimposed periodicities (diplophony), or where glottal pulses occur
without obvious regularity of time interval or amplitude (glottalisations, vocal creak or
fry), do not contribute to speech intelligibility, but to the naturalness of human speech.
Therefore, the mapping between physical acoustics and perceived prosody is neither
linear nor one-to-one; as was said, variations in F0 are the most direct cause of pitch
perception, but amplitude and duration also affect pitch and make its estimation more
intricate.
While there are many successfully implemented pitch estimation algorithms (see
[Che01, Hes83]), none of them works without making certain assumptions about the sound
being analysed, and each one has to face many difficulties and admit a certain failure rate.
The next paragraphs give a brief historical overview of the different methods tried. It can be
seen how, starting from the first method ever employed, they meet with diverse limitations.
The first method tried was simply to low-pass filter the speech signal in order to remove all
harmonics and then measure the fundamental frequency by any convenient means. This
method faces two difficulties. First, it had to be an adaptive filter, because pitch can easily
cover a 2-to-1 range and the filter always had to pass the fundamental and reject the second
harmonic. The filter frequency was set by tracking the pitch and predicting the
forthcoming pitch value; hence any error in one frame of speech could cause the filter to
select the wrong cut-off frequency in the next frame and so lose track of the pitch
altogether. The second difficulty arose from the fact that in many cases pitch had to be
estimated from speech in which the fundamental frequency was missing. For instance, in
telephone speech the frequency response drops off rapidly below 300 Hz; hence for many
male voices the fundamental frequency is absent, or so weak as to be lost in the system
noise.
In the absence of the fundamental, it is common to search for periodicities in a signal
by examining its autocorrelation function. For a periodic function, the autocorrelation will
show a maximum at a lag equal to the period of the function. A first problem is that
speech is not exactly periodic, because of changes in pitch and in formant frequencies.
Therefore, the maximum may be lower and broader than expected, causing problems in
setting the decision threshold. Another problem arises from the possibility that the first
formant frequency is equal to or below the pitch frequency. If its amplitude is
particularly high, this situation can yield a peak in the autocorrelation function that is
comparable to the peak belonging to the fundamental. As a result, a pitch-tracking process
is used; such a process can usually ride out a single error, but not a string of errors.
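The autocorrelation idea outlined above can be sketched as follows; the F0 search range of 50–500 Hz and the voicing threshold of 0.3 are assumptions made for illustration, not the method used in this thesis.

#include <math.h>

/* Pitch estimation of one analysis window by autocorrelation: the lag with
 * the largest autocorrelation inside the plausible F0 range is taken as the
 * period.  Returns 0.0 when no clear periodicity is found. */
double acf_pitch(const double *x, int n, double fs)
{
    int lag_min = (int)(fs / 500.0);   /* highest F0 considered: 500 Hz */
    int lag_max = (int)(fs / 50.0);    /* lowest  F0 considered:  50 Hz */

    double r0 = 0.0;
    for (int i = 0; i < n; i++) r0 += x[i] * x[i];
    if (r0 <= 0.0) return 0.0;         /* silent window */

    double best_r = 0.0;
    int    best_lag = 0;
    for (int lag = lag_min; lag <= lag_max && lag < n; lag++) {
        double r = 0.0;
        for (int i = 0; i + lag < n; i++)
            r += x[i] * x[i + lag];    /* autocorrelation at this lag */
        if (r > best_r) { best_r = r; best_lag = lag; }
    }
    /* accept the peak only if it is strong relative to r0; the 0.3
     * threshold illustrates the decision problem discussed above */
    if (best_lag == 0 || best_r / r0 < 0.3) return 0.0;
    return fs / best_lag;              /* lag in samples -> F0 in Hz */
}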
Pitch can be determined either from periodicity in the time domain or from regularly
spaced harmonics in the frequency domain. Consequently, pitch estimation techniques
can be classified into two main groups:
- Period-synchronous procedures: these methods try to follow the periodic
characteristics of the signal, e.g. positive zero-crossings, and estimate the signal
period from this information.
- Short-term analysis procedures (window based): the short-term variety of
estimators operates on a block (short-time frame) of speech samples and, for each
one of these frames, one pitch value is estimated. The series of estimated values yields
the fundamental frequency contour of the signal. There are different short-time
analysis procedures, e.g. cross- or autocorrelation, or algorithms that operate in the
frequency domain. Spectral procedures transform frames spectrally to enhance the
periodicity information in the signal. Periodicity appears as peaks in the spectrum at
the fundamental and its harmonics.
Period-synchronous procedures have the advantage of being generally faster and
present an adequate performance in most applications. Short-term methods are considered
more accurate and robust, due to the higher precision of calculating one changing
attribute in a shorter time interval. In addition, they are less affected by noise and do not
require complex post-processing. Consequently, a short-term analysis procedure is used
in this thesis for the F0 calculation.
7.1.2.3 Description of the algorithm.
The program used for the F0 and energy contour calculations is part of the prosodic
module employed in the second-phase prototype of the Verbmobil project. The procedure
was developed in previous works at the Chair for Pattern Recognition of the Friedrich-Alexander-University Erlangen-Nuremberg and is widely detailed in manifold works (see
[Kom89, Not91, Pen93, Har94, Kie97]). Consequently, only a brief description is given
here.
Fundamental frequency estimation through a window-based procedure
This procedure performs a short-term analysis, which works in the spectral domain
and provides sequential F0 computation. As was already clarified, since F0 only makes
sense for voiced frames, the voiced/unvoiced decision must be the first step when the F0
estimation problem is faced. The way this decision is made was detailed in section 7.1.1.
For the prosodic analysis of the human voice, F0 is usually expected to lie in the
interval between 35 Hz and 550 Hz. According to the Shannon theorem [Sha49], an
analog signal must be sampled with at least twice the highest frequency of the
signal to be recoverable without any losses. In order to respect this theorem,
voiced regions are low-pass-filtered with a cut-off frequency of 1100 Hz. Through this
limitation of the F0 maximum to 550 Hz, noise and errors affect the algorithm less.
Then, the low-pass-filtered signal is digitised using a lower sampling frequency
(downsampling) in order to reduce the number of signal values that must be processed.
Consequently, the F0 estimation process is accelerated. For the resulting frames, the
short-time spectrum is calculated through the Fast Fourier Transform (FFT, see [Nie83]).
The procedure is based on the assumption that the absolute maximum of the short-time
spectrum corresponds to one harmonic of the F0. The main difficulty of the
algorithm is to find a proper definition for the decision rule that chooses the
maximum of the spectrum inside a voiced frame. This decision is created here indirectly
through an implemented Dynamic Programming (DP) procedure. For every estimated F0
value (one per voiced frame), alternative candidates obtained through divisors (dividers) are
allowed. The dividers of all the frames in one voiced region hence yield a matrix, which is
used by DP to compose a specific low-cost function, employed to find the optimal F0 path.
This cost-function takes into account the distance to adjacent candidates and the distance to a
known target value. This target value is calculated, for reasons of robustness, for the voiced
frames with the maximum of the energy signal, using a multi-channel procedure.
Different possible candidates are calculated for every target value using correlation
methods (periodic AMDF procedures, see [Ros74]) and frequency domain procedures
(Seneff procedures, see [Sen78]), and the median of these values results in the target value
of the voiced interval. The arithmetic mean of all the target values of the speech signal is
the reference point R, which is applied for the divider determination within every voiced
frame. For each frame t, the spectrum from start-frame S to end-frame E of the voiced
interval is considered, and the frequency Ft with maximum energy in this spectrum is
calculated. With the help of the divisors k_t = F_t / R, the matrix J, containing the different F0
candidates, is defined as:

    J_{j,t} = F_t / (k_t + j)    if  k_t + j > 0
    J_{j,t} = undefined          otherwise                                           (7.5)
    with  j ∈ [−n, n],  t ∈ [S, E]
Preliminary tests showed that the correct F0 value is mostly included when five
candidates are considered (n = 2). Now, with the help of a recursive cost-function and by
means of DP, the best path through the matrix J can be found, which finally yields the
F0 contour of the voiced region.
In addition, the procedure has some other advantages. On the one hand, F0 values are
not estimated in isolation for every frame. Instead, the cost-function establishes a relation
with the nearest neighbours, so that their spectral characteristics are also taken into account.
On the other hand, proceeding this way, short irregular periods produce no perturbation of
the results. One additional benefit is that the computational expense for every single frame,
where the estimated value is calculated, is limited. For a further description of the cost-function see [Pen93] and [Kie97].
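A compact illustration of how dynamic programming can pick one candidate per frame from a matrix like J in (7.5) is sketched below. The local cost (distance to the target value R) and the transition cost (distance between neighbouring candidates) are simplified assumptions and do not reproduce the cost-function of [Pen93] and [Kie97].

#include <float.h>
#include <math.h>

#define NCAND 5   /* five candidates per frame (n = 2) */

/* Selects one F0 candidate per frame from J[frame][candidate] by DP.
 * Candidates <= 0 are treated as undefined.  Writes the resulting
 * contour of the voiced region (T frames) into f0_out. */
void dp_best_path(const double J[][NCAND], int T, double R, double *f0_out)
{
    double cost[T][NCAND];
    int    back[T][NCAND];

    for (int j = 0; j < NCAND; j++) {
        cost[0][j] = (J[0][j] > 0.0) ? fabs(J[0][j] - R) : 1.0e9;
        back[0][j] = 0;
    }
    for (int t = 1; t < T; t++)
        for (int j = 0; j < NCAND; j++) {
            double local = (J[t][j] > 0.0) ? fabs(J[t][j] - R) : 1.0e9;
            double best = DBL_MAX; int arg = 0;
            for (int k = 0; k < NCAND; k++) {
                double trans = fabs(J[t][j] - J[t - 1][k]);  /* smoothness */
                double c = cost[t - 1][k] + local + trans;
                if (c < best) { best = c; arg = k; }
            }
            cost[t][j] = best;
            back[t][j] = arg;
        }

    int j = 0;                                   /* cheapest final state */
    for (int k = 1; k < NCAND; k++)
        if (cost[T - 1][k] < cost[T - 1][j]) j = k;
    for (int t = T - 1; t >= 0; t--) {           /* backtrack            */
        f0_out[t] = J[t][j];
        if (t > 0) j = back[t][j];
    }
}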
Post-processing of the F0 Contour
Independently of the F0 calculation method employed, post-processing is
undoubtedly favourable, since direct application of the F0 values for further prosodic
feature calculations would be definitely inadequate. Post-processing of the F0 values is
motivated by several reasons:
- Automatic algorithms for F0 extraction generate errors.
- F0 values are not calculated for every single frame of the signal.
- Fluctuations between adjacent F0 values are disturbing under certain conditions.
- Calculations from the F0 contour are dependent on the voice reference (e.g. the
maximum).
Several possibilities for post-processing the fundamental frequency contour can be
found in [Hes83]. In the framework of this work, post-processing is accomplished in the
following steps:
- Smoothing of the F0 curve through a median filter.
- Zero-setting of all the F0 values between 35 Hz and 60 Hz (before interpolation).
- Interpolation of the unvoiced intervals.
- Semitone transformation and mean value subtraction.
Small failures of the algorithm can yield some undesirable noise. F0 curve
smoothing through a median filter is employed in order to remove some of these small
failures resulting from the algorithm. Smoothing increases the signal-to-noise ratio and
allows the signal characteristics (peak position, height, width, area, etc.) to be measured
more accurately.
Figure 7.4. Smoothing. The right peak is the result of smoothing the left noisy peak.
The zero-setting of all the values between 35 and 60 Hz before the interpolation is
mainly adequate when recordings are carried out by means of WOZ dialogues. Usually,
the start and end points of the uttered expression are classified as voiced due to the system
response and the contribution of the human voice also present in these parts. F0 values
contained in such intervals habitually fall in the range between 35 and 60 Hz. The
zero-setting thus accounts for the system response present in the utterance.
Though F0 values are not computed over unvoiced frames, a continuous F0 contour
is desirable for further feature calculation. Therefore, interpolation over the
unvoiced frames is absolutely required. Numerous alternatives exist for interpolation over
intervals whose F0 cannot be calculated. In the present Diploma Thesis, as
proposed in [Kie97], linear interpolation is applied, and extrapolation is used exclusively at
the beginning and at the end of the phrase.
In addition, in order to reproduce the human ear response, a semitone transformation
is performed over the resulting interpolated F0 contour using the following function:

    HT:  x ↦ c · ln(x)    if x > 0
         x ↦ 0            otherwise                                                  (7.6)
By choosing c = 12/ln(2), semitones relate to 1 Hz as reference value; for
normalisation, the mean value of the F0 contour is subtracted from every F0 value.
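The post-processing chain can be summarised in a short sketch; the 3-point median filter and the in-place processing are assumptions of this illustration, while the constant c = 12/ln(2) and the mean subtraction follow (7.6).

#include <math.h>

/* Post-processing of an interpolated F0 contour: 3-point median smoothing,
 * semitone transformation (7.6) and subtraction of the contour mean. */
static double median3(double a, double b, double c)
{
    if (a > b) { double t = a; a = b; b = t; }
    if (b > c) { double t = b; b = c; c = t; }
    if (a > b) { double t = a; a = b; b = t; }
    return b;
}

void postprocess_f0(double *f0, int n)
{
    const double c = 12.0 / log(2.0);

    double prev = f0[0];                          /* median smoothing       */
    for (int i = 1; i + 1 < n; i++) {
        double m = median3(prev, f0[i], f0[i + 1]);
        prev = f0[i];
        f0[i] = m;
    }
    double mean = 0.0;                            /* semitone transform     */
    for (int i = 0; i < n; i++) {
        f0[i] = (f0[i] > 0.0) ? c * log(f0[i]) : 0.0;
        mean += f0[i];
    }
    mean /= n;
    for (int i = 0; i < n; i++)                   /* mean value subtraction */
        f0[i] -= mean;
}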
7.1.3 Energy Contour.
The coupling of loudness perception with an acoustic measurement is as complex as
the coupling of tone pitch perception and the computable F0. The sensation of
loudness depends both on the frequency of the sound and on its duration and, the
other way round, pitch perception depends on the loudness (see [Zwi67]). This complex
interdependence is not directly taken into account in the following algorithm; the energy
and F0 values are stored in a feature vector and, consequently, an implicit standardisation
takes place.
Basic calculation procedures used for the computation of energy as the acoustic
correlate of perceived loudness are based on the relation between the physical acoustic
pressure p_s, measured in Pascal (1 Pa = 1 N/m²), and the acoustic intensity I_s,
whose unit is W/m². It can be stated that I_s is proportional to p_s². With the help of the
acoustic intensity reference value I_0 = 1 pW/m² and the acoustic pressure reference value
p_0 = 20 µPa, which represent the human auditory threshold at mid-range frequencies, the
absolute acoustic pressure level in decibels (dB) is given by:
    L = 10 · log( I_s / I_0 ) dB = 20 · log( p_s / p_0 ) dB                          (7.7)
The acoustic magnitude loudness quantifies the perceived intensity ratio between two
tones; a sound of 1 kHz with a loudness of 40 phon (acoustic pressure
level of 40 dB) is used as reference. In addition, loudness varies approximately with the
third root of the intensity.
Automatic computation of energy contour can be achieved through different methods.
In this Diploma Thesis a general method is employed using the following formula:
    E_m = Σ_{n=−∞..∞} T(s_n) · w_{m−n}                                               (7.8)
T[·] represents a convenient transformation of the signal values s_n, and w_n
corresponds to an adequate window function used to obtain precise segments of the signal.
Values outside the used window are usually set to 0, in order to facilitate finite procedures.
There are many possibilities for the choice of the transformation and the windowing
function. In the loudness calculation process, a Hamming window w_n^H (figure 7.5) has
been used, with the form:

    w_n^H = 0.54 − 0.46 · cos( 2πn / (N−1) )                                          (7.9)
Figure 7.5. The Hamming Window.
There, N represents the window size in samples. The rectangular window is proved to
give maximum sharpness but large side-lobes (ripples), while the Hamming window blurs
in frequency but produces much less leakage.
For the loudness calculation, the reference value I_0 is needed, which can no longer be
extracted from digitised signals. For a 16-bit quantisation and a maximum acoustic
pressure level of 60 dB, which represents a standard value during normal conversation, I_0
is computed with equation 7.7 as follows:

    L = 10 · log( (2^15)² / I_0 ) = 60 dB   ⇒   I_0 = (2^15)² / 10^6 ≈ 1073.74        (7.10)
Using Hamming windows w_n^H of 40 ms duration, i.e. 0.040 s · 16000 Hz = 640
samples (N = 640), the intensity value Ĩ_i of frame i can be estimated through the
following expression:

    Ĩ_i = ( Σ_{n=0..639} s_{i+n}² · w_n^H ) / ( Σ_{n=0..639} w_n^H )                  (7.11)
The effective loudness value Lh_i of frame i can therefore be estimated through its
relation to the intensity as follows:

    Lh_i = ( Ĩ_i / I_0 )^(1/3)                                                        (7.12)
In this Diploma Thesis, both loudness and energy describe this magnitude, and they
are used as synonyms. For further details on different examples of energy calculation
procedures or windowing functions, refer to the corresponding section in [Kie97].
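A sketch of the per-frame loudness computation according to (7.9), (7.11) and (7.12); the fixed window length of 640 samples and the reference value I_0 from (7.10) are taken from the text above, while the function name is only illustrative.

#include <math.h>

#define N_WIN 640                /* 40 ms window at 16 kHz          */
#define I0    1073.74            /* reference intensity, eq. (7.10) */
#define PI    3.14159265358979323846

/* Effective loudness of one analysis window of N_WIN samples,
 * following equations (7.9), (7.11) and (7.12). */
double frame_loudness(const double *s)
{
    double num = 0.0, den = 0.0;
    for (int n = 0; n < N_WIN; n++) {
        double w = 0.54 - 0.46 * cos(2.0 * PI * n / (N_WIN - 1));  /* (7.9) */
        num += s[n] * s[n] * w;
        den += w;
    }
    double intensity = num / den;    /* eq. (7.11) */
    return cbrt(intensity / I0);     /* eq. (7.12) */
}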
7.2 Prosodic Features
Previous research on feature extraction for emotion recognition has focused on
prosodic features, based on different linguistic units such as the utterance vector [Bat00],
the word vector [Hub98] or intervals [Ami01]. In the present work we attempt to recognise
emotions from the speech signal given a short command (approximately 2 to 4 seconds),
without exploiting context or linguistic information. In the long term, the
goal of the investigation initiated during this thesis is to have a speaker- and language-independent
emotion classifier. Such a challenging purpose leads us to deal only with
global acoustic features, computed for a whole utterance or command, which seem to
have the favour of many recent studies (see [Del96, Pet00]).
The term prosody, previously introduced in section 3.1, comprises a number of
attributes that can be classified into basic or compound characteristics.
The main basic prosodic attributes are loudness, pitch and duration-related attributes such
as duration, speaking rate and pauses. Compound attributes derive from them and are
intonation, accentuation, prosodic phrases, rhythm and hesitation.
With the aim to map emotions on the activation axis (see Chapter 2), we make a
classification depending on prosodic characteristics, since most studies point to them as
the features most related to emotions that differ in the activation dimension. With this aim
we extracted features that model logarithmic F0, energy and durational aspects. Here we
will mainly deal with acoustic prosodic features that are computed for the whole utterance.
During this work, different kinds of prosodic features have been used, mainly divided
into two groups:
P1 - Features related to basic prosodic attributes (i.e. energy and pitch) and the pitch
derivative. Most features have their roots in statistics over the values of all the frames in a
sentence and in linear regression coefficients of the contour. These parameters derive
from the studies of [Bat00] and [Del96].
P2 - Features related to compound prosodic attributes, which are more relative and
provide information closer to the intonation and to changes in the P1 features. These
parameters are based on the features proposed in [Tis93].
The calculations of both sets of features were written in the C programming language,
and the description of their extraction method is given below.
7.2.1 P1
In this section, features of the first set are presented. Each feature is referenced with a
number that corresponds to its index within the output vector from the C program which
computes this set of features (ppal.c).
7.2.1.1 Energy based features.
These features derive from the estimated energy contour. For every frame i an energy
value Ei exists. For further information about how this curve is obtained, see section
7.1.3.
P1.0 - ENER_MAX: Short-term energy maximum.
Maximum value of the energy curve in the whole utterance. The value is achieved by
inspection of the energy values of all the frames within one utterance and selecting the
maximum numeric value among them.
P1.1 - ENER_MAX_POS: Position of short-term-energy maximum.
Relative time position of the maximum energy value into the utterance. The
maximum energy value is P1.0 and its temporal position in the sentence is divided by the
utterance overall length. Calculations are made in frames:
    EneMaxPos = i_Emax / N                                                            (7.13)

Where
    i_Emax = frame position of the maximum energy value on the time axis.
    N = number of frames in the whole utterance.
P1.2 - ENER_MIN: Short-term-energy minimum.
Minimum value of the energy curve in the whole utterance. The value is achieved
by inspection of the energy values of all the frames within one utterance and selecting the
minimum numeric value among them.
P1.3 - ENER_MIN_POS: Position of short-term-energy minimum.
Relative time position of the minimum energy value in the utterance. The minimum
energy value is P1.2 and its temporal position in the sentence is divided by the utterance
overall length. Calculations are made in frames:

    EneMinPos = i_Emin / N                                                            (7.14)

Where
    i_Emin = frame position of the minimum energy value on the time axis.
    N = number of frames in the whole utterance.
P1.4 - ENER_REG_COEF: Regression coefficient for short-term-energy.
Slope coefficient of the regression line for the energy curve values in the utterance.
    EnergyRegCoef = S_ene,xy / S_ene,x                                                (7.15)

With

    S_ene,xy = Σ_{i=1..N} i · E_i − (1/N) · ( Σ_{i=1..N} i ) · ( Σ_{i=1..N} E_i )     (7.16)

    S_ene,x  = Σ_{i=1..N} i² − (1/N) · ( Σ_{i=1..N} i )²                              (7.17)
Where
i = frame position on the time axis.
Ei = Estimated energy in the ith frame according to the algorithm described in
section 7.1.3.
N = Number of frames in the whole utterance.
P1.5 - ENER_SQR_ERR: Mean square error for regression coefficient for short-term-energy.
Mean square error value between the regression line and the real energy curve.

    EneSqrErr = (1/N) · Σ_{i=1..N} [ E_i − μ_E − (S_ene,xy / S_ene,x) · (i − μ_i) ]²   (7.18)
With

    μ_E = (1/N) · Σ_{i=1..N} E_i                                                      (7.19)

    μ_i = (1/N) · Σ_{i=1..N} i                                                        (7.20)
Where
i = frame position on the time axis.
Ei = Estimated energy in the ith frame according to the algorithm described in
section 7.1.3.
N = Number of frames in the whole utterance.
P1.6 - ENER_MEAN: Mean of short-term-energy.
Mean energy value calculated over the whole utterance. Energy values of all the
frames in a sentence are summed and then divided by the total number of frames.
    MeanEne = (1/N) · Σ_{i=1..N} E_i                                                  (7.21)
P1.7 - ENER_VAR: Variance of short-term-energy.
Variance of the energy values over the whole utterance.

    VarEne = (1/N) · Σ_{i=1..N} E_i² − μ_E²                                           (7.22)

Where
    μ_E = energy mean (P1.6).
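The energy-based features P1.0–P1.7 can be obtained in a single pass over the energy contour, as in the following sketch; the struct and function names are illustrative, not the ones used in ppal.c. The F0-based features P1.8–P1.15 and the derivative features P1.28–P1.35 follow the same pattern, restricted to the voiced frames.

#include <float.h>

/* Utterance-level statistics of the energy contour E[0..N-1] (frame indices
 * are treated as 1..N, as in equations (7.13)-(7.22)). */
typedef struct {
    double max, max_pos, min, min_pos;   /* P1.0 - P1.3 */
    double reg_coef, sqr_err;            /* P1.4, P1.5  */
    double mean, var;                    /* P1.6, P1.7  */
} EnergyStats;

EnergyStats energy_stats(const double *E, int N)
{
    EnergyStats f = { -DBL_MAX, 0.0, DBL_MAX, 0.0, 0.0, 0.0, 0.0, 0.0 };
    double sum_i = 0, sum_e = 0, sum_ie = 0, sum_ii = 0, sum_ee = 0;

    for (int i = 1; i <= N; i++) {
        double e = E[i - 1];
        if (e > f.max) { f.max = e; f.max_pos = (double)i / N; }
        if (e < f.min) { f.min = e; f.min_pos = (double)i / N; }
        sum_i += i;  sum_e += e;  sum_ie += i * e;
        sum_ii += (double)i * i;  sum_ee += e * e;
    }
    f.mean = sum_e / N;                              /* (7.21) */
    f.var  = sum_ee / N - f.mean * f.mean;           /* (7.22) */

    double s_xy = sum_ie - sum_i * sum_e / N;        /* (7.16) */
    double s_x  = sum_ii - sum_i * sum_i / N;        /* (7.17) */
    f.reg_coef  = (s_x != 0.0) ? s_xy / s_x : 0.0;   /* (7.15) */

    double mu_i = sum_i / N, err = 0.0;
    for (int i = 1; i <= N; i++) {
        double fit = f.mean + f.reg_coef * (i - mu_i);   /* regression line */
        double d   = E[i - 1] - fit;
        err += d * d;
    }
    f.sqr_err = err / N;                             /* (7.18) */
    return f;
}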
7.2.1.2 Fundamental frequency based features.
These features are extracted from the estimated F0 curve, i.e. the logarithmic and
interpolated F0 curve. F0_i represents the F0 value of the ith frame. For a further
description of how this curve is obtained, see section 7.1.2.
Since the existence of a fundamental frequency only makes sense inside voiced frames,
all the outcomes related to F0 are confined to voiced regions, where a ‘voiced region’ is
understood as a speech interval containing more than three successive voiced frames. For
further information about the voiced/unvoiced decision see section 7.1.1.
P1.8 - F0_MAX: F0 maximum.
Maximum value of the F0 curve in the voiced parts of the utterance. The value is
achieved by inspection of the pitch values of all the frames labelled as voiced in the
utterance and selecting the maximum numeric value among them.
P1.9 - F0_MAX_POS: Position of F0 maximum on time axis.
Relative time position of the maximum F0 value in the utterance. The maximum
pitch value is P1.8 and its temporal position in the sentence is divided by the utterance
overall length. Calculations are made in frames:

    F0MaxPos = i_F0max / N                                                            (7.23)

Where
    i_F0max = frame position of the maximum F0 value on the time axis.
    N = number of frames in the whole utterance.
P1.10 - F0_MIN: F0 minimum.
Minimum value of the F0 curve in the voiced parts of the utterance. The value is
achieved by inspection of the pitch values of all the frames labelled as voiced in the
utterance and selecting the minimum numeric value among them.
P1.11 - F0_MIN_POS: Position of F0 minimum on time axis.
Relative time position of the minimum F0 value in the utterance. The minimum
pitch value is P1.10 and its temporal position in the sentence is divided by the utterance
overall length. Calculations are made in frames:

    F0MinPos = i_F0min / N                                                            (7.24)
Where
i_F0min = frame position of the minimum pitch value on the time axis.
P1.12 - F0_REG_COEF: Regression coefficient for F0.
Slope coefficient of the regression line for the F0 curve values in the utterance.
    F0RegCoef = S_F0,xy / S_F0,x                                                      (7.25)

With

    S_F0,xy = Σ_{i=1..N} i · F0_i − (1/N) · ( Σ_{i=1..N} i ) · ( Σ_{i=1..N} F0_i )    (7.26)

    S_F0,x  = Σ_{i=1..N} i² − (1/N) · ( Σ_{i=1..N} i )²                               (7.27)
Where
i = frame position on the time axis.
F0i = Estimated pitch in the ith frame according to the algorithm described in
7.1.2.
N = Number of frames in the whole utterance.
P1.13 - F0_SQR_ERR: Mean square error for regression coefficient.
Mean square error value between the regression line and the real F0 curve.

    F0SqrErr = (1/N) · Σ_{i=1..N} [ F0_i − μ_F0 − (S_F0,xy / S_F0,x) · (i − μ_i) ]²   (7.28)

With

    μ_F0 = (1/N) · Σ_{i=1..N} F0_i                                                    (7.29)

    μ_i  = (1/N) · Σ_{i=1..N} i                                                       (7.30)
Where
i = frame position on the time axis.
F0i = Estimated pitch in the ith frame according to the algorithm described in
section 7.1.2.
N = Number of frames in the whole utterance.
P1.14 - F0_MEAN: F0 mean.
Mean F0 value calculated over the voiced regions of the utterance. Pitch values of all
the voiced frames in a sentence are summed and then divided by the total number of
voiced frames.
    MeanF0 = (1/N) · Σ_{i=1..N} F0_i                                                  (7.31)
P1.15 - F0_VAR: F0 variance.
Variance of the pitch values over the voiced regions in the utterance.

    VarF0 = (1/N) · Σ_{i=1..N} F0_i² − μ_F0²                                          (7.32)

Where
    μ_F0 = pitch mean (P1.14).
P1.36 - Jitter.
Periodic jitter is defined as the relative mean absolute third-order difference of the
point process. This feature is exceptionally calculated using Praat and then included in the
feature vector. The algorithm is computed through the following formula:
N 1
jitter 
 2T
i 2
i
 Ti 1  Ti 1
N 1
T
i 2
(7.33)
i
Where
    T_i = ith interval.
    N = number of intervals.

For its computation, two arguments are required:
- Shortest period: shortest possible interval that will be considered. For intervals T_i
shorter than this, the (i−1)th, ith and (i+1)th terms in the formula are taken as zero. This
argument is set to a very small value, 0.1 ms.
- Longest period: longest possible interval that will be considered. For intervals T_i
longer than this, the (i−1)th, ith and (i+1)th terms in the formula are taken as zero.
Establishing the minimum frequency of periodicity as 50 Hz, the value for this parameter
is 20 ms; intervals longer than that will be considered unvoiced.
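A sketch of (7.33) follows; the exclusion of periods outside [shortest period, longest period] mimics the Praat arguments described above, but it is not the exact Praat implementation.

#include <math.h>

/* Jitter as in equation (7.33), computed from consecutive period lengths
 * T[0..N-1] given in seconds.  Periods outside [t_min, t_max] make the
 * corresponding terms drop out of the sums. */
double jitter(const double *T, int N, double t_min, double t_max)
{
    double num = 0.0, den = 0.0;
    for (int i = 1; i + 1 < N; i++) {
        if (T[i - 1] < t_min || T[i - 1] > t_max ||
            T[i]     < t_min || T[i]     > t_max ||
            T[i + 1] < t_min || T[i + 1] > t_max)
            continue;                         /* out-of-range terms dropped */
        num += fabs(2.0 * T[i] - T[i - 1] - T[i + 1]);
        den += T[i];
    }
    return (den > 0.0) ? num / den : 0.0;
}

Called as jitter(periods, n_periods, 0.0001, 0.020), this mirrors the 0.1 ms and 20 ms arguments mentioned above.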
7.2.1.3 Voiced/unvoiced regions based features.
These features have roots in the voiced/unvoiced information, which is obtained
through an algorithm that assigns 1 to voiced frames and 0 to unvoiced. For further
description about the decision algorithm, see 7.1.1.
P1.16 - F0_FIRST_VCD_FRAME.
F0 value for the first voiced frame in the utterance.
P1.17 - F0_LAST_VCD_FRAME.
F0 value for the last voiced frame in the utterance.
P1.18 - NUM_VOICED_REGIONS.
Number of regions containing more than three successive voiced frames. Regions
containing three or fewer voiced frames are not taken into consideration, although their
frames are counted as voiced.
P1.19 - NUM_UNVCD_REGIONS.
Number of regions with more than three successive unvoiced frames. Same
considerations as P1.18 are used to define regions.
P1.20 - NUM_VOICED_FRAMES.
Amount of voiced frames in the utterance. Isolated voiced frames as well as frames
belonging to a voiced region are counted.
P1.21 - NUM_UNVCD_FRAMES.
Number of unvoiced frames in the utterance. Isolated unvoiced frames as well as
frames belonging to an unvoiced region are counted.
P1.22 - LGTH_LNGST_V_REG.
Length of the longest voiced region. The number of frames for each voiced region is
counted and the highest amount is taken as feature P1.22.
P1.23 - LGTH_LNGST_UV_REG.
Length of longest unvoiced region. The number of frames for each unvoiced region is
counted and the highest amount is taken as feature P1.23.
P1.24 - RATIO_V_UN_FRMS.
Ratio of number of voiced frames and number of unvoiced frames.
    RatVcdUnvcdFrms = (number of voiced frames) / (number of unvoiced frames)         (7.33)
P1.25 - RATIO_V_UN_REG.
Ratio of number of voiced regions and number of unvoiced regions.
    RatVcdUnvcdRg = (number of voiced regions) / (number of unvoiced regions)         (7.34)
P1.26 - RATIO_V_ALL_FRMS.
Ratio of number of voiced frames and number of all frames.
    RatVcdAllFrms = (number of voiced frames) / N                                     (7.35)
P1.27 - RATIO_UV_ALL_FRMS.
Ratio of number of unvoiced frames and number of all frames.
    RatUnvcdAllFrms = (number of unvoiced frames) / N                                 (7.36)
7.2.1.4 Pitch contour derivative based features.
The derivative of the F0 contour is computed and similar operations are performed. The
calculations follow the same procedures as in the F0 case and therefore the features are
only listed here.
P1.28 - F0_DER_MAX.
F0 derivative maximum.
P1.29 - F0_DER_MAX_POS.
Relative position of F0 derivative maximum.
P1.30 - F0_DER_MIN.
F0 derivative minimum.
P1.31 - F0_DER_MIN_POS.
Relative position of F0 derivative minimum.
P1.32 - F0_DER_REG_COEF.
Regression coefficient for F0 derivative.
P1.33 - F0_DER_SQR_ERR.
Mean square error for regression coefficient for F0 derivative.
P1.34 -F0_DER_MEAN.
F0 derivative mean.
P1.35 - F0_DER_VAR.
F0 derivative variance.
7.2.2 P2.
This section introduces the features included in the second set. The program used to
calculate them is called complex_calcs.c (see chapter 10).
In order to obtain information associated with changes in the signal, the following
features result from relations among signal parameters, instead of being direct
measurement magnitudes. In this section, N corresponds to the number of voiced regions
in the utterance.
P2.0: Mean of the pitch means in every voiced region.
    F0AbsMean = (1/N) · Σ_{n=1..N} F0mean_n                                           (7.37)

Where
    F0mean_n = mean of the pitch values in the voiced region n.
P2.1: Variance of the pitch means in every region.
    VarF0Mean = (1/N) · Σ_{n=1..N} ( F0mean_n − F0AbsMean )²                          (7.38)

Where
    F0mean_n = mean of the pitch values in the voiced region n.
    F0AbsMean = P2.0.
P2.2: Mean of the maximum pitch values in every region.
    MaxF0Mean = (1/N) · Σ_{n=1..N} F0max_n                                            (7.39)

Where
    F0max_n = maximum of the pitch values within the voiced region n.
P2.3: Variance of the maximum pitch values in every region.
    VarF0Max = (1/N) · Σ_{n=1..N} ( F0max_n − MaxF0Mean )²                            (7.40)
Figure 7.6. F0 contour and points selected for calculations of P2.4 and P2.5.
P2.4: Pitch increasing per voiced region.
This feature takes four points into account inside each voiced part of the utterance (see
figure 7.6):
1. Beginning of the voiced region.
2. End of the voiced region.
3. Maximum pitch value.
4. Second maximum pitch value.
The sum of all pitch differences between two successive increasing points, each divided by
their respective time difference, is computed. The final value for this feature results from
the arithmetic mean of this calculation over all voiced parts contained in the utterance.
    PeaksIncrease = (1/N) · Σ_{regions} [ Σ_{increase} ( F0_i − F0_j ) / ( t_i − t_j ) ]      (7.41)
Where
i , j = represent one of the four points considered, where i appears before j
ti <tj
F0i < F0j
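The selection of the four points and the accumulation of the rising slopes for one voiced region could look as follows; using frame indices as the time axis and this way of breaking ties are assumptions of the sketch. P2.5 is obtained analogously by summing the falling slopes, and the final features are arithmetic means over all voiced regions.

/* Pitch increase inside one voiced region (contribution to P2.4).
 * The four points of figure 7.6 (start, end, maximum and second maximum
 * of the F0 contour) are ordered by time; slopes between successive
 * points with rising F0 are summed. */
double region_increase(const double *f0, int start, int end)
{
    int i_max = start, i_max2 = start;
    for (int i = start; i <= end; i++)               /* maximum        */
        if (f0[i] > f0[i_max]) i_max = i;
    for (int i = start; i <= end; i++)               /* second maximum */
        if (i != i_max && f0[i] > f0[i_max2]) i_max2 = i;

    int idx[4] = { start, end, i_max, i_max2 };
    for (int a = 1; a < 4; a++)                      /* sort by time   */
        for (int b = a; b > 0 && idx[b] < idx[b - 1]; b--) {
            int t = idx[b]; idx[b] = idx[b - 1]; idx[b - 1] = t;
        }
    double sum = 0.0;
    for (int a = 0; a + 1 < 4; a++) {
        int i = idx[a], j = idx[a + 1];
        if (j > i && f0[j] > f0[i])                  /* increasing pair */
            sum += (f0[j] - f0[i]) / (double)(j - i);
    }
    return sum;
}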
P2.5: Pitch decreasing per voiced region.
The same points as in P2.4 are taken into account (figure 7.6). In this case, the sum of all
pitch differences between two successive decreasing points, each divided by their respective
time difference, is calculated. The final value results from the arithmetic mean of this
calculation over all voiced parts contained in the utterance.
    PeaksDecrease = (1/N) · Σ_{regions} [ Σ_{decrease} ( F0_i − F0_j ) / ( t_i − t_j ) ]      (7.42)
Where
i and j represent one of the four points considered, where i appears before j
ti <tj
F0i > F0j
P2.6: Mean of the pitch ranges in every voiced region.
    MeanRange = (1/N) · Σ_{n=1..N} ( F0max_n − F0min_n )                              (7.43)
P2.7: Flatness.
Mean of the flatness (mean/max) of the pitch for every voiced region multiplied by
100.
    Flatness = (1/N) · Σ_{n=1..N} ( F0mean_n / F0max_n ) · 100                        (7.44)
P2.8: Mean of the relative duration from the beginning of the voiced part to the position
of the pitch maximum in every voiced region multiplied by 100.
    DurMaxPitch = (1/N) · Σ_{n=1..N} ( t_max_n − t_start_n ) · 100                    (7.45)
P2.9: Peaks increasing for the whole utterance.
The maximum of each voiced region is considered. Sum of all pitch differences
between two successive increasing points, divided by their respective time difference is
calculated. This feature is similar to P2.4 but generalised to the whole sentence.
    PeaksIncreaseUtt = Σ_{increase} ( F0max_i − F0max_j ) / ( t_i − t_j )             (7.46)
Where
ti , tj = positions of the maximum value for regions i and j (ti <tj).
F0max = the maximum pitch value in every region. Maximum of region j must
be higher than maximum of region i.
P2.10: Peaks decreasing for the whole utterance.
The maximum of each voiced region is considered. Sum of all pitch differences
between two successive decreasing points, divided by their respective time difference is
calculated. This feature is similar to P2.5 but generalised to the whole sentence.
    PeaksDecreaseUtt = Σ_{decrease} ( F0max_i − F0max_j ) / ( t_i − t_j )             (7.47)
Where
ti ,tj = positions of the maximum value for regions i and j (ti <tj).
F0max = the maximum value in every region. Maximum of region j must be
lower than maximum of region i.
P2.11: Mean of the voiced region duration.
    DurMean = (1/N) · Σ_{n=1..N} length_n                                             (7.48)
P2.12: Global energy mean.
Mean of the energy means in every voiced region multiplied by 100 and divided by
the absolute energy maximum of the whole utterance.
    EnerMean = [ (1/N) · Σ_{n=1..N} Emean_n · 100 ] / E_MAX_UTT                       (7.49)
P2.13: Mean of the relative duration from the beginning of the voiced region to the
position of the energy maximum in every voiced region. Multiplied by 100 and divided
by the absolute energy maximum of the whole utterance.
    EneDurStart = [ (1/N) · Σ_{n=1..N} ( t_max_n − t_start_n ) · 100 ] / E_MAX_UTT    (7.50)
Where
tstart = starting point of the voiced region.
tmax = energy maximum position of the region.
P2.14: Mean of the relative duration from the position of the energy maximum in every
voiced region to the end of the voiced region. Multiplied by 100 and divided by the
absolute energy maximum of the whole utterance.
    EneDurEnd = [ (1/N) · Σ_{n=1..N} ( t_end_n − t_max_n ) · 100 ] / E_MAX_UTT        (7.51)
Where
tend = end point of the voiced region.
tmax = energy maximum position of the region.
P2.15: Mean of the vehemence (mean/min) of the energy in every voiced region.
    EneVehemence = (1/N) · Σ_{n=1..N} ( Emean_n / Emin_n )                            (7.52)
P2.16: Mean of the flatness (mean/max) of the energy in every voiced region multiplied
by 100.
    EneFlatness = (1/N) · Σ_{n=1..N} ( Emean_n / Emax_n ) · 100                       (7.53)
P2.17: Relation between the maximum energy value of the whole utterance and its
position.
    EneMaxRatio = E_MAX_UTT / t_MAX_UTT                                               (7.54)
P2.17: Relation between the maximum of the voiced region and the maximum of the
utterance divided by the position of the voiced region maximum position and multiplied
by 100. Arithmetic mean of this calculation for all the voiced regions in the utterance.
    EnerMaxRatio_region = (1/N) · Σ_{n=1..N} ( Emax_n / E_MAX_UTT / t_MAX_UTT ) · 100 (7.55)
P2.18: Mean of the energy tremor in every voiced region.
Tremor refers to a regular variation in the signal and is computed as the number of
zero-crossings over a window of the energy curve derivative.

    TremorMean = (1/N) · Σ_{n=1..N} ene_tremor_n                                      (7.56)
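A small sketch of the tremor measure used in (7.56); taking the simple first difference of the energy contour as its derivative is an assumption of this illustration.

/* Energy tremor of one voiced region (used in P2.18): number of
 * zero-crossings of the first difference of the energy contour E
 * between the frames start and end of the region. */
int energy_tremor(const double *E, int start, int end)
{
    if (end - start < 2) return 0;
    int crossings = 0;
    double prev = E[start + 1] - E[start];
    for (int i = start + 1; i + 1 <= end; i++) {
        double diff = E[i + 1] - E[i];
        if ((prev < 0.0 && diff >= 0.0) || (prev >= 0.0 && diff < 0.0))
            crossings++;                 /* sign change of the derivative */
        prev = diff;
    }
    return crossings;
}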
7.3 Quality Features.
The classification of emotions using voice quality features is a relatively new field of
investigation, which is used and referred to in many recent studies concerned with
emotion recognition (see [Joh99, Alt00]). Since this proposal faces different obstacles, due
to the difficulty of estimating this kind of attributes, diverse sets of features and
methods were tried during this Diploma Thesis. Some of the described features have been
used in just a few experiments and others are more frequently employed, but all of them
are introduced here.
The software employed for the extraction of quality features is Praat¹, a
shareware program developed by Dr. Paul Boersma of the University of Amsterdam.
This section makes use of two different methods for the calculation of the mean value
of a given parameter within a voiced region:
- Mean1: Arithmetic mean of the parameter values over all the frames inside a
voiced region.
Mean1 f  
f
i
nframes
1  i  nframes
nframes
(7.57)
Where
nframes = number of frames inside a voiced region.
fi = feature value in the frame i.
- Mean2: First, the Mean1 of the parameter within a voiced region is computed.
Then, the single values of this parameter for every frame are checked, and the one which is
closest to the computed Mean1 is considered as the mean (Mean2) of this region. This
way, we assume that this value comes from the most representative part inside the voiced
region, since the mean is also influenced by the voiced region boundaries. It was
experimentally checked that the chosen frames normally match the core of the vowel.
    Mean2(f) = f_k ,   k = argmin_{1 ≤ i ≤ nframes} | f_i − fmean_n |                 (7.58)
¹ Further information can be found at www.praat.org.
Where
nframes = number of frames inside a voiced region
n = index of the region
fi = feature value in the frame i
fmean_n = Mean1 of the feature in region n
From now on, they are referred to as Mean1 and Mean2 in the subsequent feature
descriptions.
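The two averaging schemes can be written down directly from (7.57) and (7.58); the function names below are only illustrative.

#include <math.h>

/* Mean1 (7.57): arithmetic mean of a parameter over the frames of a
 * voiced region.  Mean2 (7.58): the single frame value closest to Mean1,
 * taken as the representative value of the region. */
double mean1(const double *f, int nframes)
{
    double sum = 0.0;
    for (int i = 0; i < nframes; i++) sum += f[i];
    return sum / nframes;
}

double mean2(const double *f, int nframes)
{
    double m = mean1(f, nframes);
    double best = f[0];
    for (int i = 1; i < nframes; i++)
        if (fabs(f[i] - m) < fabs(best - m)) best = f[i];
    return best;
}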
7.3.1 Harmonicity based features.
Since the harmonic-to-noise ratio is clearly related to voice quality (see Chapter 3),
this voice quality attribute has been said to provide valuable information about the
speaker's emotional state (see [Alt99, Alt00]). Harmonic-to-noise ratio estimation can be
considered an acoustic correlate of breathiness and roughness, in agreement with
[Alt00]. Therefore, voice quality cues, which help us to infer assumptions about the
speaker's emotional state, can be extracted from this attribute.
For its calculation, as well as for the remaining voice quality features, the Praat program
is utilised. Harmonicity is here expressed in dB; if 99% of the energy of the signal is in
the periodic part and 1% is noise, the HNR is 10·log10(99/1) = 20 dB. An HNR of 0 dB
means that there is equal energy in the harmonics and in the noise. The algorithm
performs acoustic periodicity detection on the basis of an accurate autocorrelation
method, as described in [Boe93]. Harmonicity values are given for individual frames, and
from them the concrete features employed for classification are calculated. The Praat
program requires four different parameters to calculate the harmonicity:
1. Time step (default: 0.01 s): the measurement interval (frame duration), in seconds.
2. Minimum pitch (default: 75 Hz): determines the length of the analysis window.
3. Silence threshold (default: 0.1): frames that do not contain amplitudes above this
threshold (relative to the global maximum amplitude), are considered silent.
4. Number of periods per window (default: 1): determines the level up to which the HNR is
guaranteed to be detected. More periods per window raise this detection ceiling, but the
algorithm becomes more sensitive to dynamic changes in the signal.
QH.0a: Harmonic to noise ratio maximum. Mean2. Default values.
Maximum of the Mean2 values for all the regions in the sentence when the
harmonicity is computed setting all the parameters in Praat to their default value.
    h2n_max = max_n ( h_n ),   1 ≤ n ≤ N                                              (7.59)
Where
hn = harmonic to noise ratio Mean2 value in region n
N = number of regions inside a sentence
QH.0b: Harmonic to noise ratio maximum. Mean2. 4.5 periods per window.
Maximum of the Mean2 values for all the regions in the sentence when the
harmonicity is computed setting the number of periods per window to 4.5, which is
considered an optimal value for speech: HNR values up to 37 dB are guaranteed to be
detected reliably. When the number of periods per window increases, the minimum pitch
parameter also has to be changed and, following the recommendations of the Praat software,
it is set to:

    F0_min = 6.4 / length                                                             (7.60)
Where
length = length of the speech segment where the harmonicity is computed.
This feature follows the same formula (7.59) but taking into account the new values
of the harmonicity.
QH.0c: Harmonic to noise ratio maximum. Mean1. 4.5 periods per window.
This feature is identical to QH.0b with the exception that it uses the Mean1 procedure
instead of the Mean2 procedure to calculate the mean value of the HNR in the analysed
region. It therefore also follows equation (7.59) with the new values of harmonicity, by just
substituting the term Mean2 with its analogous Mean1.
QH.0d: Harmonic to noise ratio maximum within a voiced region.
Each frame inside a voiced region contains a value of the HNR. The maximum of
these values within the given region is the feature QH.0d.
QH.1a: Harmonic to noise ratio range. Mean2. Default Values.
Once the Mean2 values are calculated for every single voiced region of the sentence,
the difference between the maximum and the minimum of these values in the sentence is the
feature QH.1a. When there is one unique region, this value becomes zero.
    h2n_range = h_n,max − h_n,min ,   1 ≤ n ≤ N                                       (7.61)
Where
h n , max = harmonic to noise ratio maximum value in the sentence.
h n , min = harmonic to noise ratio minimum value in the sentence.
QH.1b: Harmonic to noise ratio range. Mean2. 4.5 periods per window.
Once the Mean2 values are calculated for every single voiced region of the sentence,
the difference between the maximum and the minimum of these values in the sentence is the
feature QH.1b. When there is one unique region, this value becomes zero. The only
difference with QH.1a is that the parameter number of periods per window is set to 4.5
and, consequently, the minimum pitch comes from equation 7.60 (see QH.0b).
QH.1c: Harmonic to noise ratio range. Mean1. 4.5 periods per window.
Once the Mean1 values are calculated for every single voiced region of the sentence,
the difference between the maximum and the minimum of these values in the sentence is the
feature QH.1c. This feature is identical to QH.1b with a change in the criterion used to
calculate the mean value in the analysed region; Mean1 is employed as a replacement for
Mean2.
QH.2: Harmonic to noise ratio mean. Mean1. Default settings.
Arithmetic mean (Mean1) of all the HNR values calculated by frame within a voiced
region.
QH.3: Harmonic to noise ratio standard deviation within a voiced region. Default
settings.
Standard deviation of the HNR values within a voiced region.
7.3.2 Formant frequency based features.
The algorithm followed by Praat first resamples the signal to a sample rate of twice
the value of the maximum formant frequency parameter (approx. 5000 Hz). After this,
pre-emphasis is applied. The pre-emphasis factor is computed as α = exp(−2πFΔt), where
Δt is the sampling period of the sound and F the pre-emphasis frequency. Each sample x_i of
the sound, except the first one, is then changed, going down from the last sample:
x_i ← x_i − α·x_{i−1}.
For each analysis window, Praat applies a Gaussian-like window and computes the
LPC coefficients with the algorithm by Burg. The Burg algorithm is a recursive estimator
for auto-regressive models, where each step is estimated using the results from the
previous step. The implementation of the Burg algorithm is based on the routines memcof
and zroots in [Pre93]. This algorithm can initially find formants at very low or high
frequencies. From the values obtained for every single frame, some calculations are
extracted to be used as input for the emotional classification.
QF.0a: Minimum of f2Mean2 – f1Mean2 for all the voiced regions.
Difference between the Mean2 of the second and the first formant frequency for each
voiced region in a sentence. The minimum value of this difference among all the voiced
regions is taken as QF.0a.
This feature is used in some cases to select just one region and make it representative
of the sentence. This way, features are calculated in similar regions and their differences
will be more influenced by changes in the speaker's emotional state than by the nature of
the vowel. The reason to choose the minimum difference between first and second
formant is based on the formant structure of an /a/, which is appropriate for extracting
quality features, due to the shape of the vocal tract when it is uttered, and in which the
first and second formant frequencies are closest.

    Δf21 = min_n ( f2mean_n − f1mean_n ),   1 ≤ n ≤ N                                 (7.62)
Where
    f2mean_n = Mean2 of the second formant frequency in the voiced region n.
    f1mean_n = Mean2 of the first formant frequency in the voiced region n.
    N = number of voiced regions within the utterance.
QF.0b: Minimum of (f2–f1)Mean1 for all the voiced regions.
Mean1 of the difference between all the values of the second and the first formant
frequency within each voiced region in a sentence. The minimum value of this difference
among all the voiced regions is taken as QF.0b. Obviously, this feature is equivalent to
QF.0a, only substituting the concept of Mean2 by Mean1. However, the description
follows exactly the process as it was implemented in the software.
The same equation (7.62) can be applied here by only changing the term Mean2 to
Mean1.
QF.1a: First formant frequency. Mean2.
Frequency of the first formant in the region from where QF.0a is extracted,
calculated as the Mean2 within the voiced region.
QF.1b: First formant frequency. Mean1.
Frequency of the first formant in the region from where QF.0b is extracted,
calculated as the Mean1 within the voiced region.
QF.2a: Second formant frequency. Mean2.
Frequency of the second formant in the region selected by QF.0a, calculated as the
Mean2 within the voiced region.
QF.2b: Second formant frequency. Mean1.
Frequency of the second formant in the region selected by QF.0b, calculated as the
Mean1 within the voiced region.
QF.3a: Third formant frequency. Mean2.
Frequency of the third formant in the region selected by QF.0a, calculated as the
Mean2 within the voiced region.
QF.3b: Third formant frequency. Mean1.
Frequency of the third formant in the region selected by QF.0b, calculated as the
Mean1 within the voiced region.
QF.4a: Second formant ratio. Mean2.
Frequency of the second formant (QF.2a) divided by the difference between second
and first formants (QF.0a). All the formants are calculated through the Mean2 and belong
to the selected region (see QF.0a).
    f2ratio_Mean2 = f2_Mean2 / ( f2_Mean2 − f1_Mean2 ) = QF.2a / QF.0a                (7.63)
QF.4b: Second formant ratio. Mean1.
Frequency of the second formant (QF.2b) divided by the difference between second
and first formants (QF.0b). All the formants are calculated through the Mean1 and belong
to the selected region (see QF.0b).
    f2ratio_Mean1 = f2_Mean1 / ( f2_Mean1 − f1_Mean1 ) = QF.2b / QF.0b                (7.64)
QF.5: Maximum of the second formant ratio.
The maximum value of the second formant ratio calculated by frame within the region
selected by QF.0b.
    f2ratio_max = max_{1 ≤ i ≤ nframes} [ f2_i / ( f2_i − f1_i ) ]                    (7.65)
Where
    f1_i = value of the first formant frequency in frame i.
    f2_i = value of the second formant frequency in frame i.
    nframes = number of frames within the voiced region selected by QF.0b.
QF.6: Range of the second formant ratio.
Difference between the maximum and the minimum of the second formant ratio
calculated by frame within the region selected by QF.0b.

    f2ratio_range = max_{1 ≤ i ≤ nframes} [ f2_i / ( f2_i − f1_i ) ] − min_{1 ≤ i ≤ nframes} [ f2_i / ( f2_i − f1_i ) ]      (7.66)
Where
    f1_i = value of the first formant frequency in frame i.
    f2_i = value of the second formant frequency in frame i.
    i = frame index, 1 ≤ i ≤ nframes.
    nframes = number of frames within the voiced region selected by QF.0b.
QF.7a: Bandwidth of the first formant. Mean2.
Mean of all the Mean2 first formant bandwidth values in a sentence.
    bw1 = (1/N) · Σ_{n=1..N} b_{1,n}                                                  (7.67)
Where
b1,n is the first formant bandwidth Mean2 in region n.
N is the number of regions inside a sentence.
QF.7b: Bandwidth of the first formant. Mean1.
Mean of all the Mean1 first formant bandwidth values in a sentence. Substituting
Mean2 by Mean1, equation 7.67 is employed.
QF.7c: Bandwidth mean of the first formant within a region . Mean1.
Arithmetic mean (Mean1) of all the first formant bandwidth values calculated by
frame, inside a voiced region.
QF.8a: Bandwidth of the second formant. Mean2.
Mean of all the Mean2 second formant bandwidth values in a sentence.
bw_2 = \frac{\sum_{n=1}^{N} b_{2,n}}{N}, \quad 1 \le n \le N    (7.68)

Where
b_{2,n} = second formant bandwidth (Mean2) in region n,
N = number of regions inside a sentence.
QF.8b: Bandwidth of the second formant. Mean1.
Mean of all the Mean1 second formant bandwidth values in a sentence. Substituting Mean1 for Mean2, equation (7.68) is employed.
QF.8c: Bandwidth mean of the second formant within a region. Mean1.
Arithmetic mean (Mean1) of all the second formant bandwidth values calculated frame by frame inside a voiced region.
QF.9a: Bandwidth of the third formant. Mean2.
Mean of all the Mean2 third formant bandwidth values in a sentence.
bw_3 = \frac{\sum_{n=1}^{N} b_{3,n}}{N}, \quad 1 \le n \le N    (7.69)

Where
b_{3,n} = third formant bandwidth (Mean2) in region n,
N = number of regions inside a sentence.
QF.9b: Bandwidth of the third formant. Mean1.
Mean of all the Mean1 third formant bandwidth values in a sentence. Substituting Mean1 for Mean2, equation (7.69) is employed.
QF.9c: Bandwidth mean of the third formant within a region. Mean1.
Arithmetic mean (Mean1) of all the third formant bandwidth values calculated frame by frame inside a voiced region.
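As a sketch of the sentence-level averaging in equations (7.67)–(7.69), assuming the per-region (Mean2 or Mean1) bandwidth values have already been computed; the numeric values below are purely hypothetical:

    import numpy as np

    def sentence_bandwidth_mean(bw_regions):
        # bw_1 = (1/N) * sum_n b_{1,n}, equation (7.67); same form for (7.68) and (7.69).
        return float(np.mean(bw_regions))

    # Hypothetical per-region first formant bandwidth values (Hz):
    qf7a = sentence_bandwidth_mean([85.0, 110.0, 96.0])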
QF.10: Maximum of the first formant frequency in the selected region.
Maximum value of the first formant frequency in the region selected by QF.0b.
f_{1,max} = \max_{1 \le i \le nframes} (f_{1,i})    (7.70)

Where
f_{1,i} = value of the first formant frequency in frame i,
nframes = number of frames within the selected region.
QF.11: Maximum of the second formant frequency in the selected region.
Maximum value of the second formant frequency in the region selected by QF.0b.
Same equation (7.70) for the second formant frequency case.
QF.12: Maximum of the third formant frequency in the selected region.
Maximum value of the third formant frequency in the region selected by QF.0b.
Same equation (7.70) for the third formant frequency case.
QF.13: Range of the first formant frequency in the selected region.
Difference between the maximum and the minimum of the first formant frequency for
the region selected by QF.0b.
f_{1,range} = \max_{1 \le i \le nframes} (f_{1,i}) - \min_{1 \le i \le nframes} (f_{1,i})    (7.71)

Where
f_{1,i} = value of the first formant frequency in frame i,
nframes = number of frames within the voiced region selected by QF.0b.
QF.14: Range of the second formant frequency in the selected region.
Difference between the maximum and the minimum of the second formant frequency
for the region selected by QF.0b. Same equation (7.71) for the second formant frequency
case.
QF.15: Range of the third formant frequency in the selected region.
Difference between the maximum and the minimum of the third formant frequency
for the region selected by QF.0b. Same equation (7.71) for the third formant frequency
case.
QF.16: Standard deviation of the first formant frequency in the selected region.
Standard deviation of all the first formant frequency values within the region selected
by QF.0b.
QF.17: Standard deviation of the second formant frequency in the selected region.
Standard deviation of all the second formant frequency values within the region
selected by QF.0b.
QF.18: Standard deviation of the third formant frequency in the selected region.
Standard deviation of all the third formant frequency values within the region selected
by QF.0b.
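The region-level statistics of QF.10–QF.18 reduce to simple operations on a frame-wise formant track; a minimal sketch, assuming the track of the selected region is available as a numpy array, could look as follows:

    import numpy as np

    def formant_track_stats(f: np.ndarray):
        f_max = np.nanmax(f)                          # e.g. QF.10, equation (7.70)
        f_range = np.nanmax(f) - np.nanmin(f)         # e.g. QF.13, equation (7.71)
        f_std = np.nanstd(f)                          # e.g. QF.16
        return f_max, f_range, f_std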
7.3.3 Energy-based features.
QE.0 – QE.3: Energy band distribution.
The energy is calculated within four different frequency bands in order to decide whether the band contains mainly harmonics of the fundamental frequency or turbulent noise. The frequency band distribution is taken from a study [Kla97] focused on the perceptual importance of several voice quality parameters. The four frequency bands proposed are:
1. From 0 Hz to F0 Hz (where F0 is the fundamental frequency).
2. From 0 Hz to 1 kHz.
3. From 2.5 kHz to 3.5 kHz.
4. From 4 kHz to 5 kHz.
From each band distribution, the following features are calculated:
QE.0a – QE.3a: The energy contained in the corresponding band is calculated for all the voiced parts of the utterance. Then, these values are divided by the energy over all frequencies of the voiced parts of the utterance.
EneBand_j = \frac{\sum_{n=1}^{N} EneBand_{j,n}}{\sum_{n=1}^{N} ene_n}, \quad j = 1, 2, 3, 4    (7.72)

Where
j = index corresponding to each one of the energy bands (1, 2, 3 or 4),
N = number of voiced regions within the utterance.
QE.0b – QE.3b: The energy values are calculated over a single region. The energy in each band is divided by the energy over all frequencies within the given region.
EneBand_j^{(n)} = \frac{EneBand_{j,n}}{ene_n}, \quad j = 1, 2, 3, 4    (7.73)

Where
j = index corresponding to each one of the energy bands (1, 2, 3 or 4),
n = index of the region.
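A sketch of the per-region band energy computation of equation (7.73), assuming the region waveform, its sampling rate and its fundamental frequency are already available; the band edges follow the four bands listed above:

    import numpy as np

    def band_energy_ratios(x: np.ndarray, fs: float, f0: float):
        power = np.abs(np.fft.rfft(x)) ** 2
        freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
        total = power.sum()                            # ene_n: energy over all frequencies
        bands = [(0.0, f0), (0.0, 1000.0), (2500.0, 3500.0), (4000.0, 5000.0)]
        return [power[(freqs >= lo) & (freqs < hi)].sum() / total for lo, hi in bands]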
QE.4: Voiced energy ratio, sentence based.
Ratio of the energy contained in the voiced regions to the energy over the whole utterance.
EneRate = \frac{\sum_{n=1}^{N} ene_n}{AbsEne}, \quad 1 \le n \le N    (7.74)

Where
AbsEne = energy contained in the whole utterance,
N = number of voiced regions within the utterance.
QE.5: Relative energy of one voiced region.
Energy of the voiced region divided by the energy of the whole utterance.
EneRel_n = \frac{ene_n}{AbsEne}    (7.75)

Where
n = index corresponding to one voiced region,
AbsEne = energy contained in the whole utterance.
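Both voiced energy ratios reduce to simple divisions once the per-region energies and the overall utterance energy are known; a minimal sketch, with those quantities assumed precomputed, is:

    import numpy as np

    def voiced_energy_features(ene: np.ndarray, abs_ene: float):
        ene_rate = ene.sum() / abs_ene      # QE.4, equation (7.74)
        ene_rel = ene / abs_ene             # QE.5, equation (7.75), one value per region
        return ene_rate, ene_rel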
7.3.4 Spectral measurements.
The algorithm used by Praat to calculate the spectrum is the continuous interpretation of the Fast Fourier Transform (see [Bra65, Wea89, Lat92]). If the sound is expressed in pascals (Pa), the spectrum is expressed in Pa·s, or Pa/Hz. The frequency integral over the spectrum equals the time integral over the sound.
For some features concerning spectral measurements, inverse filtering of the speech
signal is performed. Inverse filtering can be seen as the inverse computation of the speech
production model depicted in figure 7.7. Praat obtains the filter with the help of the
technique of linear prediction. This technique tries to approximate a given frequency
spectrum with a small number of peaks, for which it finds the mid-frequencies and the
bandwidths. Doing this for an overlapping sequence of windowed parts of a sound signal
(i.e. a short-term analysis), we get a quasi-stationary approximation of the signal's
spectral characteristics as a function of time. For a speech signal, the peaks are identified
with the resonances (formants) of the vocal tract. Since the spectrum of a vowel spoken
by an average human being falls off with approximately 6 dB per octave, pre-emphasis is
applied to the signal before the linear-prediction analysis, so that the algorithm will not
try to match only the lower parts of the spectrum.
Figure 7.7. Mathematical model of speech production.
For an average human voice, tradition assumes five formants in the range between 0
and 5500 Hertz. This number comes from a computation of the formants of a straight
tube, which has resonances at wavelengths of four tube lengths, four thirds of a tube
length, four fifths, and so on. For a straight tube 16 centimetres long, the longest such wavelength is 64 cm, which, with a sound velocity of 352 m/s, means a resonance frequency of 352/0.64 = 550 Hertz. The other resonances will be at 1650, 2750, 3850, and
4950 Hertz. For the linear prediction analysis in Praat, this 5500-Hz band limit is implemented by resampling the original speech signal to 11 kHz.

Figure 7.8. Spectrum of the /a/ vowel uttered in the sentence “Kick den Ball”, extracted from the speaker A commands database. Panel (a) shows the original spectrum of the uttered vowel (0–8000 Hz); panel (b) shows the source of the sound obtained after inverse filtering (0–5500 Hz).

Then, a linear-prediction
analysis on the resampled sound is performed. Analysis is done with 16 linear-prediction
parameters (which will yield at most eight formant-bandwidth pairs in each time frame),
with an analysis window effectively 10 milliseconds long, with time steps of 5
milliseconds (so that the windows will overlap), and with a pre-emphasis frequency of 50
Hz (which is the point above which the sound will be amplified by 6 dB/octave prior to
the analysis proper). This analysis provides the filter (figure 7.9) which, applied inversely to the original speech sample (figure 7.8a), yields the source signal (figure 7.8b). Since the LPC
analysis was designed to yield a spectrally flat filter (through the use of pre-emphasis),
the source signal will represent everything in the speech signal that cannot be attributed to
the resonating cavities. Thus, the "source signal" will consist of the glottal volume-velocity source (with an expected spectral slope of -12 dB/octave for vowels) and the radiation characteristics at the lips, which cause a 6 dB/octave spectral rise, so that the resulting spectrum of the "source signal" is actually the derivative of the glottal flow, with an expected spectral slope of -6 dB/octave.

Figure 7.9. Filter of the vocal tract when the /a/ vowel is uttered in the sentence “Kick den Ball”, extracted from the speaker A commands database. The filter is obtained through LPC analysis (0–5500 Hz).
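The following sketch illustrates the inverse-filtering chain described above (resampling to 11 kHz, pre-emphasis from 50 Hz, a 16-coefficient LPC fit and application of the inverse filter). It is a simplified, single-segment approximation of what Praat does with overlapping 10-ms windows, and it assumes the librosa and scipy libraries and a hypothetical file name:

    import numpy as np
    import librosa
    from scipy.signal import lfilter

    x, sr = librosa.load("a_vowel.wav", sr=None)               # hypothetical /a/ segment
    x11 = librosa.resample(x, orig_sr=sr, target_sr=11000)     # 5.5 kHz band limit
    fs = 11000.0

    alpha = np.exp(-2.0 * np.pi * 50.0 / fs)                   # pre-emphasis from 50 Hz
    x_pre = np.append(x11[0], x11[1:] - alpha * x11[:-1])

    a = librosa.lpc(x_pre, order=16)                           # 16 LPC coefficients, a[0] == 1
    source = lfilter(a, [1.0], x_pre)                          # inverse filter: prediction error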
QS.0: Open quotient related features.
Open quotient is a spectral measurement whose variations have been associated with changes in the glottal source quality. Therefore, along with the ideas presented in Chapter 3, it could be a useful parameter in order to determine the emotional state of the speaker. Following the hypothesis that the amplitude difference of the first and second harmonics of the inverse-filtered voice signal (H1*-H2*) is a reliable spectral indicator of the relative length of the opening phase, and therefore a spectral correlate of the open quotient (see [Dov97, Hen01]), two open quotient related features, with and without inverse filtering, are computed.
QS.0a: Difference between first and second harmonic amplitudes of the spectrum of the
speech signal, within the selected region.
QS.0b: Difference between first and second harmonic amplitudes of the spectrum of the
speech signal after inverse filtering, within the selected region.
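A minimal sketch of the H1−H2 measurement, assuming the region waveform (or its inverse-filtered source, for QS.0b), the sampling rate and the region's fundamental frequency are available; the peak-picking tolerance is an assumption, not part of the original description:

    import numpy as np

    def h1_minus_h2(x: np.ndarray, fs: float, f0: float) -> float:
        win = np.hanning(len(x))
        spec_db = 20.0 * np.log10(np.abs(np.fft.rfft(x * win)) + 1e-12)
        freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)

        def harmonic_db(f):
            # Strongest bin near the expected harmonic, to tolerate small F0 errors.
            band = (freqs > 0.8 * f) & (freqs < 1.2 * f)
            return spec_db[band].max()

        return harmonic_db(f0) - harmonic_db(2.0 * f0)         # H1 - H2 in dB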
QS.1: Spectral tilt related features.
Spectral tilt has also been related to glottal source variations. It is one of the major acoustic parameters that reliably differentiates phonation types in many languages, and it can be understood as the degree to which intensity drops off as frequency increases. Spectral tilt can be quantified by comparing the amplitude of the fundamental to that of higher-frequency harmonics, e.g. the second harmonic, the harmonic closest to the first formant, or the harmonic closest to the second formant. Spectral tilt is characteristically
most steeply positive for creaky vowels and most steeply negative for breathy vowels.
The amplitude of the first harmonic (H1) compared to the amplitude of the second
formant (A2), which acts as an indicator of the spectral tilt at the mid formant
frequencies, is here used as a voice quality feature for emotion classification. The
parameter is expected to be large and positive for breathy voices and small and/or
negative for creaky voices.
QS.1a: Difference between the first harmonic amplitude and the spectral amplitude at the second formant frequency. Calculated over the spectrum of the speech signal in the selected region.
QS.1b: Difference between the first harmonic amplitude and the spectral amplitude at the second formant frequency. Calculated over the spectrum of the speech source obtained by means of inverse filtering of the original speech segment.
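Analogously, a sketch of the H1−A2 measurement of QS.1a/QS.1b, assuming the same inputs as in the H1−H2 sketch plus the second formant frequency of the region; as before, the details of peak picking are assumptions:

    import numpy as np

    def h1_minus_a2(x: np.ndarray, fs: float, f0: float, f2: float) -> float:
        win = np.hanning(len(x))
        spec_db = 20.0 * np.log10(np.abs(np.fft.rfft(x * win)) + 1e-12)
        freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
        h1 = spec_db[(freqs > 0.8 * f0) & (freqs < 1.2 * f0)].max()
        a2 = spec_db[np.argmin(np.abs(freqs - f2))]            # spectral amplitude at F2
        return h1 - a2                                         # H1 - A2 in dB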