Chapter 7
7.1 Basic Prosodic Attributes
This section explains the calculations and procedures used to obtain the basic feature
contours. These essential attributes (i.e. pitch and energy) are the starting point for
deriving more complex features, which carry valuable information for our purposes.
The software used in section 7.1 is part of the Verbmobil
long-term project of the Federal Ministry of Education, Science, Research and
To obtain feasible estimates and to avoid the difficulties caused by the non-stationary
nature of speech, it is assumed that the properties of the signal change relatively
slowly with time. This allows a short-time window of speech to be examined in order to
extract relevant parameters that are presumed to be fixed within the duration of the window.
Most techniques yield parameters averaged over the course of the time window. Thus, if
dynamic parameters are to be modelled, the signal must be divided into successive
windows or analysis frames so that the parameters can be calculated often enough to
follow relevant changes. Consequently, to obtain F0 and energy contours, smaller
fragments of speech, called frames, are considered.
For each frame, one F0 value and one energy value are computed. Each value is
calculated over an analysis window that is longer than the frame. All speech samples
inside the analysis window are taken into account, so consecutive analysis windows
always overlap. Frame durations of 10 ms and 20 ms are commonly used in
speech processing, while window lengths for F0 and energy calculations usually lie
between 25 ms and 40 ms. The analysis performed in the present work
uses a frame duration of 10 ms and an analysis window length of 40 ms.
Since the voiced/unvoiced decision is the basis of the F0 computation, it is the first
algorithm described in this section. The decision is frame-based, and F0 is estimated
only over voiced frames.
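The framing described above can be sketched as follows. This is only an illustration: the function name `frame_windows` is hypothetical, and the sketch assumes that each 40 ms analysis window starts at the beginning of its 10 ms frame, which the text does not specify (the window could equally be centred on the frame).

```python
def frame_windows(num_samples, fs=16000, frame_ms=10, window_ms=40):
    """Return frame length, window length and the (start, end) sample
    indices of the analysis window belonging to each frame."""
    frame_len = fs * frame_ms // 1000    # 160 samples at 16 kHz
    window_len = fs * window_ms // 1000  # 640 samples at 16 kHz
    windows = []
    start = 0
    # Assumption: windows are left-aligned with their frame and only
    # complete windows are kept.
    while start + window_len <= num_samples:
        windows.append((start, start + window_len))
        start += frame_len               # advance by one frame, not one window
    return frame_len, window_len, windows


# One second of speech sampled at 16 kHz.
frame_len, window_len, windows = frame_windows(16000)
overlap = window_len - frame_len         # samples shared by neighbouring windows
```

With these settings neighbouring analysis windows share 480 of their 640 samples, which is why one value per 10 ms frame still follows the signal smoothly.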
7.1.1 Voiced/unvoiced decision.
Voiced speech involves the vibration of the vocal folds in response to airflow from
the lungs. This vibration is periodic and can be examined independently of the
properties of the vocal tract. Its periodicity is described by the fundamental frequency
of the vibration, or of the resulting periodicity in the speech signal, also called “pitch”.
Figure 7.1. Waveform of the glottal source.
In unvoiced speech the sound source is not a regular vibration but rather irregular
vibrations caused by turbulent airflow at a constriction in the vocal tract. The sound
created by the constriction is described as a noise source. It contains no dominant periodic
component and has a relatively flat spectrum, meaning that every frequency component is
represented equally (in fact, for some sounds the noise spectrum may slope down at
around 6 dB/octave). In the time waveform of a noise source, only a random
pattern of movement around the zero axis is observed. Without any
periodicity, pitch estimation makes no sense.
Figure 7.2. Different sources in speech production.
Therefore, for F0 estimation it is essential to define which frames are considered
voiced and which unvoiced. In contrast to the F0 and energy calculations, non-overlapping
windows are employed for the voiced/unvoiced decision: the algorithm uses only
the signal values contained within one frame.
Voiced frames differ from unvoiced frames in their high amplitude values, relatively
low zero-crossing rate and high energy values. The zero-crossing rate is the number of
zero-crossings per time unit, where the time unit is taken here as the frame length of
10 ms. Several procedures for deciding between voiced and unvoiced frames
are introduced in [Hes83]. The algorithm used here applies thresholds, which are
presented in [Hes83], and is described in [Kie97]. As a result of that work, thresholds
on the following quantities proved suitable for the voiced/unvoiced decision:
Zero-crossing rate in Hz:
\[ Zcross = n_{cross} \cdot \frac{F_s}{N} \qquad (7.1) \]
Normalised energy of the signal:
\[ EneNorm = \frac{\frac{1}{N} \sum_{n=1}^{N} s_n^2}{Range \cdot MaxRange} \qquad (7.2) \]
Normalised absolute maximum:
\[ MaxNorm = \frac{\max_n |s_n|}{Range} \qquad (7.3) \]
F_s = Sampling frequency in Hz (here 16000)
N = Frame length in samples (here 160)
s_n = Value of the n-th sample of the signal
n_cross = Number of zero-crossings during a frame
Range = Difference between maximum and minimum value in the signal
MaxRange = Maximum feasible range, dependent on the quantisation
(here 16 bit, so MaxRange = 65536)
The normalisation in (7.2) and (7.3) accounts for the fact that a speaker may speak at
different energy levels at different times.
The decision rule compares theoretically motivated thresholds
($\theta_{n\_cross}$, $\theta_{EneNorm}$, $\theta_{MaxNorm}$) with a vector whose components
result from equations (7.1) to (7.3):
\[ Zcross < \theta_{n\_cross} \;\text{ and }\; EneNorm > \theta_{EneNorm} \;\text{ and }\; MaxNorm > \theta_{MaxNorm} \;\Rightarrow\; \text{Voiced} \]
\[ (Zcross,\; EneNorm,\; MaxNorm)^T = \text{vector obtained from (7.1) to (7.3)} \]
The thresholds were optimised to give the best algorithm performance on the available
speech samples, guided by theoretical considerations. They were selected through
experiments carried out during the development of the Verbmobil project [Hag95]. After
some simple trial-and-error experiments, further experiments were conducted using
neural networks as classifiers for the voiced/unvoiced decision at the frame level. It
was observed that this procedure provided thresholds that yield better results. Detailed
information and additional data about voiced/unvoiced decision methods can be found in
[Rab78], [Hes83] and [Kie97].
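The threshold comparison can be illustrated with a short sketch. Several things here are assumptions rather than the Verbmobil implementation: the threshold values are invented for the example (the optimised values are not given in the text), the function names are hypothetical, and the normalisation (mean squared amplitude divided by Range times MaxRange, absolute maximum divided by Range) is one reading of equations (7.1) to (7.3).

```python
import math


def vuv_features(frame, fs, signal_range, max_range=65536):
    """Zero-crossing rate in Hz, normalised energy, normalised absolute maximum."""
    n = len(frame)
    # A crossing is counted whenever two neighbouring samples differ in sign.
    n_cross = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcross = n_cross * fs / n
    ene_norm = (sum(x * x for x in frame) / n) / (signal_range * max_range)
    max_norm = max(abs(x) for x in frame) / signal_range
    return zcross, ene_norm, max_norm


def is_voiced(features, thresholds):
    """Voiced only if all three threshold comparisons hold."""
    zcross, ene_norm, max_norm = features
    th_zcross, th_ene, th_max = thresholds
    return zcross < th_zcross and ene_norm > th_ene and max_norm > th_max


FS = 16000
SIGNAL_RANGE = 20000                           # hypothetical max - min of the recording
HYPOTHETICAL_THRESHOLDS = (3000.0, 1e-4, 0.05) # invented for illustration only

# A 200 Hz tone (voiced-like) and a weak sign-alternating sequence (noise-like).
voiced_frame = [10000 * math.sin(2 * math.pi * 200 * n / FS + 0.5) for n in range(160)]
noisy_frame = [50 if n % 2 == 0 else -50 for n in range(160)]

voiced_decision = is_voiced(vuv_features(voiced_frame, FS, SIGNAL_RANGE), HYPOTHETICAL_THRESHOLDS)
noisy_decision = is_voiced(vuv_features(noisy_frame, FS, SIGNAL_RANGE), HYPOTHETICAL_THRESHOLDS)
```

The tone passes all three comparisons, while the weak alternating sequence fails on every one of them (very high zero-crossing rate, very low energy and amplitude).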
Since the speech signal conditions in this Diploma Thesis are similar, these thresholds
are retained for all calculations carried out in it. Before the values were adopted, it was
verified that they classify voiced and unvoiced frames reliably. The Praat
program was used to compare the regions selected as voiced. Both programs agreed
consistently on which regions were classified as voiced. However, the Verbmobil program
seemed to yield more accurate boundaries for voiced regions, while Praat produced, in
certain cases, overly long regions, which included some undesirable unvoiced sounds in the
7.1.2 Fundamental Frequency Contour.
Previous remark.
This section deals with the fundamental frequency (F0 or pitch) of a periodic signal,
which is the inverse of its period T (see figure 7.1). The period is defined as the smallest
positive member of the set of time shifts that leave the signal invariant, and it only
makes sense for a perfectly periodic signal. The speech signal results from a sound source
modulated by a transfer (filter) function (see figure 7.3) determined by the shape
of the supra-laryngeal vocal tract, according to the source-filter theory described in
section 3.3.2. This theory stems from the experiments of Johannes Müller (1848), who
tested a functional theory of phonation by blowing air through larynges excised from
human cadavers.
Obviously, a signal cannot be switched on or off or modulated without losing its
perfect periodicity, and this combination makes the speech signal only quasi-periodic,
owing to small period-to-period variations either in the vocal cord vibration or in the
vocal tract shape. Therefore, the art of fundamental frequency estimation is to deal with
this information in a consistent and useful way.
Figure 7.3. Source filter model for voiced speech.
Difficulties in estimating pitch contour.
F0 is considered one of the most important features for the characterisation of
emotions and is the acoustic correlate of perceived pitch. Its perception by the human ear
is non-linear and depends on the frequency. In addition, the human voice is not a pure
sinusoid, but a complex combination of diverse frequencies.
Estimating the pitch of voiced speech sounds has always been a complex problem.
Though it appears to be a rather simple task on the surface, there are many subtleties that
need to be kept in mind. F0 is usually defined, for voiced speech, as the rate of vibration
of the vocal folds. Periodic vibration at the glottis may produce speech that is less
perfectly periodic, because of changes in the shape of the vocal tract that filters the
glottal source waveform, which makes it hard to estimate the fundamental periodicity from
the speech signal.
Therefore, F0 estimation involves a large number of considerations; it can be
influenced by many factors such as phone-intrinsic parameters or coarticulation.
Furthermore, the excitation signal itself is not truly periodic, but shows small variations
in period duration (jitter) and in period amplitude (shimmer). These aperiodicities take
several forms: relatively smooth changes in amplitude, rate or glottal waveform shape (for
example the duty cycle of open and closed phases); intervals where the vibration seems
to reflect several superimposed periodicities (diplophony); and intervals where glottal
pulses occur without obvious regularity of time interval or amplitude (glottalisations,
vocal creak or fry). They do not contribute to speech intelligibility, but to the
naturalness of human speech.
Therefore, the mapping between physical acoustics and perceived prosody is neither
linear nor one-to-one; as stated above, variations in F0 are the most direct cause of pitch
perception, but amplitude and duration also affect pitch and make its estimation more
difficult.
While there are many successfully implemented pitch estimation algorithms (see
[Che01, Hes83]), none of them works without making certain assumptions about the sound
being analysed, and every one of them faces many difficulties and fails in certain cases.
The next paragraphs give a brief historical overview of the methods tried. It can be
seen that, from the very first method employed, they all meet with diverse limitations.
The first method tried was simply to low-pass filter the speech signal in order to remove
all harmonics and then measure the fundamental frequency by any convenient means. This
method faced two difficulties. First, the filter had to be adaptive, because pitch can easily
cover a 2-to-1 range and the filter always had to pass the fundamental and reject the second
harmonic. The filter frequency was set by tracking the pitch and predicting the
forthcoming pitch value; hence any error in one frame of speech could cause the filter to
select the wrong cut-off frequency in the next frame and so lose track of the pitch
altogether. The second difficulty arose from the fact that in many cases pitch had to be
estimated from speech in which the fundamental frequency was missing. For instance, in
telephone speech the frequency response drops off rapidly below 300 Hz; hence for many
male voices the fundamental frequency is absent or so weak as to be lost in the system
noise.
In the absence of the fundamental, it is common to search for periodicities in a signal
by examining its autocorrelation function. For a periodic function, the autocorrelation
shows a maximum at a lag equal to the period of the function. A first problem is that
speech is not exactly periodic, because of changes in pitch and in formant frequencies.
Therefore, the maximum may be lower and broader than expected, causing problems in
setting the decision threshold. Another problem arises from the possibility that the first
formant frequency is equal to or below the pitch frequency. If its amplitude is
particularly high, this situation can yield a peak in the autocorrelation function that is
comparable to the peak belonging to the fundamental. For this reason a pitch tracking
process is used. Such a process can usually ride out a single error, but not a string of
errors.
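The autocorrelation idea can be made concrete with a minimal sketch. It is not the method used in this thesis; the function name `autocorr_pitch` is hypothetical, the search range of 35 Hz to 550 Hz is borrowed from the F0 interval used elsewhere in this chapter, and no decision threshold or pitch tracking is included.

```python
import math


def autocorr_pitch(frame, fs, f_min=35.0, f_max=550.0):
    """Estimate F0 of one frame from the highest autocorrelation peak.

    Only lags that correspond to frequencies between f_min and f_max
    are searched. Returns 0.0 if no positive peak is found.
    """
    min_lag = int(fs / f_max)
    max_lag = min(int(fs / f_min), len(frame) - 1)
    best_lag, best_value = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation at this lag.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag if best_lag else 0.0


fs = 16000
# A 40 ms frame (640 samples) of a pure 200 Hz tone, period 80 samples.
frame = [math.sin(2 * math.pi * 200 * n / fs) for n in range(640)]
f0 = autocorr_pitch(frame, fs)
```

For this ideal tone the strongest peak lies at a lag of 80 samples. Real speech produces lower and broader peaks, which is exactly the difficulty described above.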
Pitch can be determined either from periodicity in the time domain or from regularly
spaced harmonics in the frequency domain. Consequently, pitch estimation techniques
can be classified into two main groups:
- Period-synchronous procedures. These methods try to follow the periodic
characteristics of the signal, e.g. positive zero-crossings, and estimate the signal
period from this information.
- Short-term analysis procedures (window-based). Short-term estimators operate on a
block (short-time frame) of speech samples and estimate one pitch value for each of
these frames. The series of estimated values yields the fundamental frequency contour
of the signal. There are different short-time analysis procedures, e.g. cross-correlation,
autocorrelation or algorithms that operate in the frequency domain. Spectral procedures
transform frames spectrally to enhance the periodicity information in the signal.
Periodicity appears as peaks in the spectrum at the fundamental and its harmonics.
Period-synchronous procedures have the advantage of being generally faster and
perform adequately in most applications. Short-term methods are considered
more accurate and robust, because a changing attribute is calculated more precisely
over a shorter time interval. In addition, they are less affected by noise and do not
require complex post-processing. Consequently, a short-term analysis procedure is used
in this thesis for F0 calculation.
Description of the algorithm.
The program used for the F0 and energy contour calculations is part of the prosodic
module employed in the second-phase prototype of the Verbmobil project. The procedure
was developed in previous work at the Chair for Pattern Recognition of the
Friedrich-Alexander-University Erlangen-Nürnberg and is described in detail in several
works (see [Kom89, Not91, Pen93, Har94, Kie97]). Consequently, only a brief description
is given here.
Fundamental frequency estimation through a window-based procedure
This procedure performs a short-term analysis that works in the spectral domain
and computes F0 sequentially. As already explained, F0 only makes sense for voiced
frames, so the voiced/unvoiced decision must be the first step in F0 estimation. The way
this decision is made was detailed in section 7.1.1.
For the prosodic analysis of the human voice, F0 is usually expected to lie in the
interval between 35 Hz and 550 Hz. According to the Shannon theorem [Sha49], an
analog signal must be sampled at a rate of at least twice its highest frequency in order
to be recovered without loss. To respect this theorem,
voiced regions are low-pass filtered with a cut-off frequency of 1100 Hz. Limiting the
F0 maximum to 550 Hz in this way reduces the influence of noise and errors on the
algorithm. The low-pass-filtered signal is then resampled at a lower sampling frequency
(downsampling) in order to reduce the number of signal values that must be processed,
which accelerates the F0 estimation. For the resulting frames, the
short-time spectrum is calculated with the Fast Fourier Transform (FFT, see [Nie83]).
The procedure is based on the assumption that the absolute maximum of the short-time
spectrum corresponds to a harmonic of F0. The main difficulty of the
algorithm is to define a decision rule that selects the right maximum of the spectrum
inside a voiced frame. This decision is made here indirectly,
through a Dynamic Programming (DP) procedure. For every F0 value to be estimated
(one per voiced frame), several divisor values are allowed. The divisors of
all the frames in one voiced region thus form a matrix, which DP uses together with
a cost function to find the optimal, lowest-cost F0 path. This cost function
takes into account the distance to adjacent candidates and the distance to a
known target value. For reasons of robustness, this target value is calculated for the
voiced frame with the maximum signal energy using a multi-channel procedure.
Several candidates are calculated for every target value using correlation
methods (AMDF procedure, see [Ros74]) and frequency-domain procedures
(Seneff procedure, see [Sen78]), and the median of these values is the target value
of the voiced interval. The arithmetic mean of all the target values of the speech signal is
the reference point R, which is used to determine the divisors within every voiced
frame. For each frame t from the start frame S to the end frame E of the voiced
interval, the spectrum is considered and the frequency F_t with maximum energy in this
spectrum is calculated. With the help of the divisors k_t = F_t/R, the matrix J, containing
diverse F0 candidates, is defined:
\[ J_{j,t} = \begin{cases} \dfrac{F_t}{k_t + j} & \text{if } k_t + j > 0 \\ \text{undef.} & \text{otherwise} \end{cases} \qquad \text{with } j \in [-n; n],\; t \in [S; E] \]
Preliminary tests showed that the correct F0 value is in most cases included when
five candidates are considered (n = 2). With the help of a recursive cost function and by
means of DP, the best path through the matrix J can then be found, which finally yields the
F0 contour of the voiced region.
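The construction of the candidate matrix and the path search can be sketched as follows. This is an illustration under stated assumptions, not the Verbmobil code: the divisor k_t is rounded to the nearest integer, the function names are hypothetical, and the cost terms (distance in octaves to the previous candidate plus a weighted distance to the target value) are invented stand-ins for the cost function of [Pen93] and [Kie97], which is not reproduced in this chapter.

```python
import math


def candidate_matrix(peaks, reference, n=2):
    """One row per frame t with the 2n+1 candidates F_t / (k_t + j)."""
    matrix = []
    for f_t in peaks:
        k_t = round(f_t / reference)  # assumption: integer divisor
        row = [f_t / (k_t + j) if k_t + j > 0 else None  # None marks "undef."
               for j in range(-n, n + 1)]
        matrix.append(row)
    return matrix


def best_path(matrix, target, weight=0.5):
    """Dynamic-programming search for the cheapest candidate sequence.

    Illustrative cost: octave distance between neighbouring candidates
    plus weight * octave distance to the target value.
    """
    def local(c):
        return weight * abs(math.log2(c / target))

    prev = [(local(c), [c]) if c is not None else None for c in matrix[0]]
    for row in matrix[1:]:
        cur = []
        for c in row:
            if c is None:
                cur.append(None)
                continue
            options = [(cost + abs(math.log2(c / path[-1])), path)
                       for cost, path in (p for p in prev if p is not None)]
            cost, path = min(options, key=lambda o: o[0])
            cur.append((cost + local(c), path + [c]))
        prev = cur
    return min((p for p in prev if p is not None), key=lambda p: p[0])[1]


# Spectral maxima that sit on the 1st, 2nd, 1st and 3rd harmonic of 200 Hz.
peaks = [200.0, 400.0, 200.0, 600.0]
matrix = candidate_matrix(peaks, reference=200.0)
contour = best_path(matrix, target=200.0)
```

In this toy case the spectral maximum jumps between harmonics, but every row of the matrix still contains the 200 Hz candidate, so the smoothest path recovers a flat contour.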
The procedure has some further advantages. First, F0 values are
not estimated in isolation for every frame. Instead, the cost function establishes a relation
with the nearest neighbours, so that their spectral characteristics are also taken into account.
Second, short irregular periods do not perturb the results. A further benefit is that the
computational expense for every single frame in which an estimated value is calculated
is limited. For a further description of the cost function see [Pen93] and [Kie97].
Post-processing of the F0 Contour
Whatever F0 calculation method is employed, post-processing is advisable, since using
the raw F0 values directly for further prosodic feature calculations would be inadequate.
There are several reasons for post-processing the F0 values:
- Automatic algorithms for F0 extraction generate errors.
- F0 values are not calculated for every single frame of the signal.
- Fluctuations between adjacent F0 values are disturbing under certain conditions.
- Calculations from the F0 contour depend on the voice reference (e.g.
Several possibilities for post-processing the fundamental frequency contour can be
found in [Hes83]. In this work, post-processing is carried out in the following steps:
- Smoothing of the F0 curve through a median filter.
- Zero-setting of all F0 values between 35 Hz and 60 Hz (before interpolation).
- Interpolation over the unvoiced intervals.
- Semitone transformation and mean value subtraction.
Small failures of the algorithm can produce undesirable noise in the contour. Smoothing
the F0 curve with a median filter removes some of these small failures. Smoothing
increases the signal-to-noise ratio and allows the signal characteristics (peak position,
height, width, area, etc.) to be measured more accurately.
Figure 7.4. Smoothing. The right peak is the result of smoothing the left noisy peak.
Zero-setting all values between 35 and 60 Hz before the interpolation is
mainly appropriate when recordings are made by means of WOZ dialogues. In such
recordings, the start and end of the utterance are usually classified as voiced, because
these parts contain the system response as well as some contribution of the human voice.
F0 values in such intervals typically fall in the range between 35 and 60 Hz. Zero-setting
thus takes the system response into account for the whole utterance.
Although F0 values are not computed over unvoiced frames, a continuous F0 contour
is desirable for further feature calculation. Therefore, interpolation over the
unvoiced frames is required. Numerous alternatives exist for interpolating over intervals
whose F0 cannot be calculated. In the present Diploma Thesis, as
proposed in [Kie97], linear interpolation is applied, with extrapolation used exclusively
at the beginning and at the end of the phrase.
In addition, in order to reproduce the response of the human ear, a semitone
transformation is applied to the resulting interpolated F0 contour using the following
function:
\[ HT: x \mapsto \begin{cases} c \cdot \ln(x) & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \]
With c = 12/ln(2), the semitones are expressed relative to a reference value of 1 Hz. To
normalise the F0 values, the mean value of the F0 contour is subtracted from each
F0 value.
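The four post-processing steps can be sketched end to end. The sketch makes several assumptions that are not taken from the text: a median filter width of three frames, edge samples repeated at the borders of the filter, and constant extrapolation (holding the nearest voiced value) at the beginning and end of the phrase. All function names are hypothetical, and unvoiced frames are represented by an F0 of 0.

```python
import math
from statistics import median


def median_smooth(values, width=3):
    """Median filter; indices are clamped so the window is always full."""
    half = width // 2
    last = len(values) - 1
    return [median(values[min(max(j, 0), last)] for j in range(i - half, i + half + 1))
            for i in range(len(values))]


def zero_low_values(values, low=35.0, high=60.0):
    """Set F0 values between 35 Hz and 60 Hz to zero (treated as unvoiced)."""
    return [0.0 if low <= v <= high else v for v in values]


def interpolate(values):
    """Linear interpolation between voiced frames, constant hold at the edges."""
    voiced = [i for i, v in enumerate(values) if v > 0]
    if not voiced:
        return list(values)
    out = list(values)
    for i in range(voiced[0]):                     # assumption: hold first voiced value
        out[i] = values[voiced[0]]
    for i in range(voiced[-1] + 1, len(values)):   # assumption: hold last voiced value
        out[i] = values[voiced[-1]]
    for a, b in zip(voiced, voiced[1:]):
        for i in range(a + 1, b):
            out[i] = values[a] + (values[b] - values[a]) * (i - a) / (b - a)
    return out


def to_semitones(values):
    """Semitone transform with c = 12/ln(2), then mean subtraction."""
    c = 12 / math.log(2)
    st = [c * math.log(v) if v > 0 else 0.0 for v in values]
    mean = sum(st) / len(st)
    return [s - mean for s in st]


raw = [0.0, 100.0, 100.0, 400.0, 100.0, 0.0, 0.0, 200.0, 200.0, 200.0, 50.0, 0.0]
contour = interpolate(zero_low_values(median_smooth(raw)))
semitones = to_semitones(contour)
```

In the example the isolated 400 Hz value is removed by the median filter, the 50 Hz value is zeroed before interpolation, and the final contour spans exactly one octave (12 semitones) around a mean of zero.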
7.1.3 Energy Contour.
Relating loudness perception to an acoustic measurement is as complex as
relating pitch perception to the computable F0. The sensation of
loudness depends both on the frequency of the sound and on its duration and,
conversely, pitch perception depends on loudness (see [Zwi67]). These complex
dependencies are not taken into account explicitly in the following algorithm;
the energy and F0 values are stored together in one vector, so that an implicit
standardisation takes place.
The basic calculation procedures used to compute energy as the acoustic
correlate of perceived loudness are based on the relation between the physical acoustic
pressure p_s, measured in pascal (1 Pa = 1 N/m²), and the acoustic intensity I_s,
whose unit is W/m². I_s is proportional to p_s². With the help of the acoustic
intensity reference value I_0 = 1 pW/m² and the acoustic pressure reference value
p_0 = 20 µPa, which correspond to the human auditory threshold at mid-range frequencies,
the acoustic pressure level in decibels (dB) is given by:
\[ L = 10 \cdot \log_{10} \frac{I_s}{I_0}\ \mathrm{dB} = 20 \cdot \log_{10} \frac{p_s}{p_0}\ \mathrm{dB} \qquad (7.7) \]
The acoustic magnitude loudness quantifies the ratio of perceived sound intensity between
two tones; a 1 kHz tone with a loudness level of 40 phon (acoustic pressure
level of 40 dB) is used as the reference. In addition, loudness varies in proportion to the
cube root of the intensity.
The energy contour can be computed automatically by different methods.
In this Diploma Thesis a general method is employed, using the following formula:
\[ E_m = \sum_{n=-\infty}^{\infty} T[s_n] \cdot w_{m-n} \]
Here T[.] represents a suitable transformation of the signal values s_n, and w_n
is a window function that selects a segment of the signal.
Values outside the window are usually set to 0, so that the sum is finite.
There are many possible choices for the transformation and the window
function. In the loudness calculation, a Hamming window w_n^H (figure 7.5) is
used, of the form:
\[ w_n^H = 0.54 - 0.46 \cos\left( \frac{2 \pi n}{N - 1} \right), \qquad 0 \le n \le N - 1 \]
Figure 7.5. The Hamming Window.
Here N is the window size in samples. A rectangular window gives
maximum sharpness but large side-lobes (ripples), while a Hamming window blurs in
frequency but produces much less leakage.
For the loudness calculation the reference value I_0 is needed, which can no longer be
extracted from the digitised signal. For 16-bit quantisation and a maximum acoustic
pressure level of 60 dB, which is a standard value for normal conversation, I_0
is obtained from equation 7.7 as follows:
\[ L = 10 \cdot \log_{10} \frac{(2^{15})^2}{I_0} = 60\ \mathrm{dB} \;\Rightarrow\; I_0 = \frac{(2^{15})^2}{10^6} \approx 1073.74 \]
Using Hamming windows w_n^H of 40 ms duration, i.e. 0.040 s × 16000 Hz = 640
samples (N = 640), the intensity value I_i of frame i can be estimated through the
following expression:
\[ I_i = \frac{\sum_{n=0}^{N-1} s_{i+n}^2 \cdot w_n^H}{\sum_{n=0}^{N-1} w_n^H} \]
The effective loudness value Lh_i of frame i can therefore be estimated from its
relation to the intensity as follows:
\[ Lh_i = \sqrt[3]{\frac{I_i}{I_0}} \]
Throughout this Diploma Thesis, the terms loudness and energy both describe this
magnitude and are used as synonyms. For further details on energy calculation
procedures and window functions, refer to the corresponding section in [Kie97].
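The loudness computation of this section can be sketched in a few lines. The sketch rests on assumptions: the signal transformation is taken to be the squared sample value, the windowed sum is normalised by the sum of the window weights, and loudness is taken as the cube root of the intensity relative to I_0. These choices follow the relations stated in the text but are not confirmed details of the original program, and the function names are hypothetical.

```python
import math


def hamming(size):
    """Hamming window w_n = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)) for n in range(size)]


def reference_intensity(bits=16, max_level_db=60.0):
    """I_0 such that a full-scale sample corresponds to max_level_db."""
    return (2 ** (bits - 1)) ** 2 / 10 ** (max_level_db / 10)


def frame_loudness(signal, start, window, i0):
    """Cube root of the window-weighted mean squared amplitude over I_0."""
    segment = signal[start:start + len(window)]
    weighted = sum(s * s * w for s, w in zip(segment, window))
    intensity = weighted / sum(window)
    return (intensity / i0) ** (1.0 / 3.0)


i0 = reference_intensity()                # about 1073.74 for 16 bit and 60 dB
window = hamming(640)                     # 40 ms at 16 kHz
full_scale = [32768.0] * 640              # constant full-scale signal
loudness = frame_loudness(full_scale, 0, window, i0)
```

A full-scale constant signal has an intensity of 10^6 times I_0, so its loudness value under these assumptions is the cube root of 10^6, i.e. about 100.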
7.2 Prosodic Features
Previous research on feature extraction for emotion recognition has focused on
prosodic features based on different linguistic units, such as the utterance vector [Bat00],
the word vector [Hub98] or intervals [Ami01]. In the present work we attempt to recognise
emotions from the speech signal given a short command (approximately 2 to 4 seconds),
without making use of context or linguistic information. In the long term, the
goal of the investigation initiated in this thesis is a speaker- and language-independent
emotion classifier. This challenging goal leads us to deal only with
global acoustic features, computed for a whole utterance or command, which are
favoured by many recent studies (see [Del96, Pet00]).
The term prosody, previously introduced in section 3.1, comprises a number of
attributes that can be classified into basic or compound characteristics.
Main basic prosodic attributes are loudness, pitch and duration related attributes such
as duration, speaking rate and pause. Compound attributes derive from them and are
intonation, accentuation, prosodic phrases, rhythm and hesitation.
In order to map emotions onto the activation axis (see Chapter 2), we base the
classification on prosodic characteristics, since most research identifies them as
the features most closely related to emotions that differ in the activation dimension. To
this end we extracted features that model logarithmic F0, energy and durational aspects.
Here we deal mainly with acoustic prosodic features that are computed for the whole
utterance.
During this work different kinds of prosodic features were used, divided
into two main groups:
P1 - Features related to basic prosodic attributes (i.e. energy and pitch) and the pitch
derivative. Most features are based on statistics of the values over all the frames in a
sentence and on linear regression coefficients of the contour. These parameters derive
from studies by [Bat00] and [Del96].
P2 - Features related to compound prosodic attributes, which are more relative and
provide information closer to the intonation and to changes in the P1 features. These
parameters are based on the features proposed in [Tis93].
The calculations of both feature sets were written in the C programming language, and
their extraction methods are described below.
7.2.1 P1
In this section, the features of the first set are presented. Each feature is referenced by a
number that corresponds to its index within the output vector of the C program that
computes this set of features (ppal.c).
Energy based features.
These features derive from the estimated energy contour. For every frame i an energy
value E_i exists. For further information about how this curve is obtained, see section
7.1.3.
P1.0 - ENER_MAX: Short-term energy maximum.
Maximum value of the energy curve in the whole utterance. The value is achieved by
inspection of the energy values of all the frames within one utterance and selecting the
maximum numeric value among them.
P1.1 - ENER_MAX_POS: Position of short-term-energy maximum.
Relative time position of the maximum energy value in the utterance. The temporal
position of the maximum energy value (P1.0) in the sentence is divided by the
overall length of the utterance. Calculations are made in frames:
\[ EneMaxPos = \frac{i_{Emax}}{N} \]
i_Emax = frame position of the maximum energy value on the time axis.
N = number of frames in the whole utterance.
P1.2 - ENER_MIN: Short-term-energy minimum.
Minimum value of the energy curve in the whole utterance. The value is obtained
by inspecting the energy values of all the frames within one utterance and selecting the
minimum numeric value among them.
P1.3 - ENER_MIN_POS: Position of short-term-energy minimum.
Relative time position of the minimum energy value in the utterance. The temporal
position of the minimum energy value (P1.2) in the sentence is divided by the
overall length of the utterance. Calculations are made in frames:
\[ EneMinPos = \frac{i_{Emin}}{N} \]
i_Emin = frame position of the minimum energy value on the time axis.
N = number of frames in the whole utterance.
P1.4 - ENER_REG_COEF: Regression coefficient for short-term-energy.
Slope coefficient of the regression line for the energy curve values in the utterance.
\[ EnergyRegCoef = \frac{S_{ene,xy}}{S_{ene,x}} \]
\[ S_{ene,xy} = \sum_{i=1}^{N} i \cdot E_i - \frac{1}{N} \sum_{i=1}^{N} i \cdot \sum_{i=1}^{N} E_i \]
\[ S_{ene,x} = \sum_{i=1}^{N} i^2 - \frac{1}{N} \left( \sum_{i=1}^{N} i \right)^2 \]
i = frame position on the time axis.
E_i = Estimated energy in the ith frame according to the algorithm described in
section 7.1.3.
N = Number of frames in the whole utterance.
P1.5 - ENER_SQR_ERR: Mean square error for regression coefficient for short-term-energy.
Mean square error value between the regression line and the real energy curve.
\[ EneSqrErr = \frac{1}{N} \sum_{i=1}^{N} \left( E_i - \left( \bar{E} - \frac{S_{ene,xy}}{S_{ene,x}} \cdot \bar{i} \right) - \frac{S_{ene,xy}}{S_{ene,x}} \cdot i \right)^2 \]
\[ \bar{E} = \frac{1}{N} \sum_{i=1}^{N} E_i \qquad \bar{i} = \frac{1}{N} \sum_{i=1}^{N} i \]
i = frame position on the time axis.
E_i = Estimated energy in the ith frame according to the algorithm described in
section 7.1.3.
N = Number of frames in the whole utterance.
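The regression slope (P1.4) and the mean square error of the regression line (P1.5) can be computed together. The following is a minimal sketch rather than the code of ppal.c; the function name `regression_features` is hypothetical, and frames are indexed from 1 to N as in the definitions above.

```python
def regression_features(values):
    """Return (slope, mean square error) of the least-squares line
    fitted to values[0..N-1] at frame positions 1..N."""
    n = len(values)
    idx = range(1, n + 1)
    sum_i = sum(idx)
    sum_v = sum(values)
    # S_xy and S_x as sums of products minus the product of sums over N.
    s_xy = sum(i * v for i, v in zip(idx, values)) - sum_i * sum_v / n
    s_x = sum(i * i for i in idx) - sum_i ** 2 / n
    slope = s_xy / s_x
    intercept = sum_v / n - slope * sum_i / n
    mse = sum((v - intercept - slope * i) ** 2 for i, v in zip(idx, values)) / n
    return slope, mse


# A perfectly linear contour has zero error; a short irregular one does not.
line_slope, line_mse = regression_features([2 * i + 1 for i in range(1, 11)])
zigzag_slope, zigzag_mse = regression_features([1.0, 3.0, 2.0])
```

The same function applies unchanged to the F0 contour, since the F0 regression features are defined by the same formulas.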
P1.6 - ENER_MEAN: Mean of short-term-energy.
Mean energy value calculated over the whole utterance. Energy values of all the
frames in a sentence are summed and then divided by the total number of frames.
\[ MeanEne = \frac{1}{N} \sum_{i=1}^{N} E_i \]
P1.7 - ENER_VAR: Variance of short-term-energy.
Variance of the energy values over the whole utterance.
\[ VarEne = \frac{1}{N} \sum_{i=1}^{N} \left( E_i - \mu_E \right)^2 \]
µ_E = Energy mean (P1.6).
Fundamental frequency based features.
These features are extracted from the estimated F0 curve, i.e. the
logarithmic and interpolated F0 curve. F0_i represents the F0 value of the ith frame. For
a further description of how this curve is obtained, see section 7.1.2.
Since fundamental frequency only makes sense inside voiced frames,
all the outcomes related to F0 are confined to voiced regions, where a ‘voiced region’ is
understood as a speech interval containing more than three successive voiced frames. For
further information about the voiced/unvoiced decision see section 7.1.1.
P1.8 - F0_MAX: F0 maximum.
Maximum value of the F0 curve in the voiced parts of the utterance. The value is
achieved by inspection of the pitch values of all the frames labelled as voiced in the
utterance and selecting the maximum numeric value among them.
P1.9 - F0_MAX_POS: Position of F0 maximum on time axis.
Relative time position of the maximum F0 value in the utterance. The temporal
position of the maximum pitch value (P1.8) in the sentence is divided by the
overall length of the utterance. Calculations are made in frames:
\[ F0MaxPos = \frac{i_{F0max}}{N} \]
i_F0max = frame position of the maximum F0 value on the time axis.
N = number of frames in the whole utterance.
P1.10 - F0_MIN: F0 minimum.
Minimum value of the F0 curve in the voiced parts of the utterance. The value is
achieved by inspection of the pitch values of all the frames labelled as voiced in the
utterance and selecting the minimum numeric value among them.
P1.11 - F0_MIN_POS: Position of F0 minimum on time axis.
Relative time position of the minimum F0 value in the utterance. The temporal
position of the minimum pitch value (P1.10) in the sentence is divided by the
overall length of the utterance. Calculations are made in frames:
\[ F0MinPos = \frac{i_{F0min}}{N} \]
i_F0min = frame position of the minimum pitch value on the time axis.
P1.12 - F0_REG_COEF: Regression coefficient for F0.
Slope coefficient of the regression line for the F0 curve values in the utterance.
\[ F0RegCoef = \frac{S_{F0,xy}}{S_{F0,x}} \]
\[ S_{F0,xy} = \sum_{i=1}^{N} i \cdot F0_i - \frac{1}{N} \sum_{i=1}^{N} i \cdot \sum_{i=1}^{N} F0_i \]
\[ S_{F0,x} = \sum_{i=1}^{N} i^2 - \frac{1}{N} \left( \sum_{i=1}^{N} i \right)^2 \]
i = frame position on the time axis.
F0_i = Estimated pitch in the ith frame according to the algorithm described in
section 7.1.2.
N = Number of frames in the whole utterance.
P1.13 - F0_SQR_ERR: Mean square error for regression coefficient.
Mean square error value between the regression line and the real F0 curve.
\[ F0SqrErr = \frac{1}{N} \sum_{i=1}^{N} \left( F0_i - \left( \overline{F0} - \frac{S_{F0,xy}}{S_{F0,x}} \cdot \bar{i} \right) - \frac{S_{F0,xy}}{S_{F0,x}} \cdot i \right)^2 \]
\[ \overline{F0} = \frac{1}{N} \sum_{i=1}^{N} F0_i \qquad \bar{i} = \frac{1}{N} \sum_{i=1}^{N} i \]
i = frame position on the time axis.
F0_i = Estimated pitch in the ith frame according to the algorithm described in
section 7.1.2.
N = Number of frames in the whole utterance.
P1.14 - F0_MEAN: F0 mean.
Mean F0 value calculated over the voiced regions of the utterance. Pitch values of all
the voiced frames in a sentence are summed and then divided by the total number of
voiced frames.
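$$MeanF0 = \frac{1}{N_v} \sum_{i=1}^{N_v} F0_i$$
$N_v$ = number of voiced frames in the utterance.
P1.15 - F0_VAR: F0 variance.
Variance of the F0 values over the voiced regions in the utterance.
$$VarF0 = \frac{1}{N_v} \sum_{i=1}^{N_v} F0_i^2 - \mu_{F0}^2$$
$\mu_{F0}$ = pitch mean (P1.14).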
P1.36 - Jitter.
Periodic jitter is defined as the relative mean absolute third-order difference of the
point process. This feature is exceptionally calculated using Praat and then included in the
feature vector. It is computed through the following formula:
$$jitter = \frac{\sum_{i=2}^{N-1} \left| 2T_i - T_{i-1} - T_{i+1} \right|}{\sum_{i=2}^{N-1} T_i}$$
$T_i$ = ith interval.
N = number of intervals.
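The jitter formula can be illustrated with a short Python sketch. It is not Praat's implementation: the function name is hypothetical, the optional `shortest` and `longest` bounds are one reading of the two period arguments Praat requires, and returning 0.0 when no term is usable is an assumption of this sketch.

```python
def periodic_jitter(intervals, shortest=None, longest=None):
    """Relative mean absolute difference |2*T_i - T_(i-1) - T_(i+1)| over sum of T_i.

    Hypothetical helper. A term is skipped (taken as zero) when any of its
    three intervals lies outside the optional [shortest, longest] bounds.
    """
    def usable(t):
        if shortest is not None and t < shortest:
            return False
        if longest is not None and t > longest:
            return False
        return True

    num = 0.0
    den = 0.0
    for i in range(1, len(intervals) - 1):
        prev, cur, nxt = intervals[i - 1], intervals[i], intervals[i + 1]
        if not (usable(prev) and usable(cur) and usable(nxt)):
            continue
        num += abs(2 * cur - prev - nxt)
        den += cur
    # Assumption: report 0.0 when nothing could be measured.
    return num / den if den > 0 else 0.0
```

Perfectly regular intervals give a jitter of zero, while intervals alternating between 10 ms and 11 ms give about 0.19.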
For its computation, two arguments are required:
- Shortest period: Shortest possible interval that will be considered. For intervals Ti
shorter than this, the (i-1)th, ith, and (i+1)th terms in the formula are taken as zero. This
argument is set to a very small value, 0.1 ms.
- Longest period: Longest possible interval that will be considered. For intervals Ti
longer than this, the (i-1)th, ith, and (i+1)th terms in the formula are taken as zero.
Establishing the minimum frequency of periodicity as 50 Hz, the value for this parameter
is 20 ms; intervals longer than that will be considered unvoiced.
Voiced/unvoiced regions based features.
These features have roots in the voiced/unvoiced information, which is obtained
through an algorithm that assigns 1 to voiced frames and 0 to unvoiced. For further
description about the decision algorithm, see 7.1.1.
F0 value for the first voiced frame in the utterance.
F0 value for the last voiced frame in the utterance.
Number of regions containing more than three successive voiced frames. Regions
containing three or fewer voiced frames are not taken into consideration, although their
frames are still counted as voiced.
Number of regions with more than three successive unvoiced frames. Same
considerations as P1.18 are used to define regions.
Number of voiced frames in the utterance. Isolated voiced frames as well as frames
belonging to a voiced region are counted.
Number of unvoiced frames in the utterance. Isolated unvoiced frames as well as
frames belonging to an unvoiced region are counted.
Length of the longest voiced region. The number of frames for each voiced region is
counted and the highest amount is taken as feature P1.22.
Length of longest unvoiced region. The number of frames for each unvoiced region is
counted and the highest amount is taken as feature P1.23.
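The region and frame counts described so far can be sketched from the 0/1 voicing decision with a run-length pass. The function and key names are hypothetical; `min_len=4` encodes "more than three successive frames".

```python
from itertools import groupby


def voicing_features(voiced, min_len=4):
    """Count regions, frames and longest runs in a 0/1 voicing sequence.

    Hypothetical helper: `voiced` holds 1 for voiced and 0 for unvoiced frames.
    """
    runs = [(flag, len(list(group))) for flag, group in groupby(voiced)]
    voiced_runs = [n for flag, n in runs if flag == 1]
    unvoiced_runs = [n for flag, n in runs if flag == 0]
    return {
        # Only runs longer than three frames count as regions.
        "voiced_regions": sum(1 for n in voiced_runs if n >= min_len),
        "unvoiced_regions": sum(1 for n in unvoiced_runs if n >= min_len),
        # Isolated frames are still counted as frames.
        "voiced_frames": sum(voiced_runs),
        "unvoiced_frames": sum(unvoiced_runs),
        "longest_voiced": max(voiced_runs, default=0),
        "longest_unvoiced": max(unvoiced_runs, default=0),
    }
```

The frame and region ratios then follow by dividing the corresponding entries of the returned dictionary.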
Ratio of number of voiced frames and number of unvoiced frames.
$$RatVcdUnvcdFrms = \frac{\sum voiced\_frames}{\sum unvoiced\_frames}$$
Ratio of number of voiced regions and number of unvoiced regions.
$$RatVcdUnvcdRg = \frac{\sum voiced\_regions}{\sum unvoiced\_regions}$$
Ratio of number of voiced frames and number of all frames.
$$RatVcdAllFrms = \frac{\sum voiced\_frames}{N}$$
Ratio of number of unvoiced frames and number of all frames.
$$RatUnvcdAllRg = \frac{\sum unvoiced\_frames}{N} \qquad (7.36)$$
Pitch contour derivative based features.
The derivative of F0 is computed and similar operations are performed. The
calculations follow identical procedures as the F0 case and therefore they are just
P1.28 - F0_DER_MAX.
F0 derivative maximum.
P1.29 - F0_DER_MAX_POS.
Relative position of F0 derivative maximum.
P1.30 - F0_DER_MIN.
F0 derivative minimum.
P1.31 - F0_DER_MIN_POS.
Relative position of F0 derivative minimum.
P1.32 - F0_DER_REG_COEF.
Regression coefficient for F0 derivative.
P1.33 - F0_DER_SQR_ERR.
Mean square error for regression coefficient for F0 derivative.
P1.34 - F0_DER_MEAN.
F0 derivative mean.
P1.35 - F0_DER_VAR.
F0 derivative variance.
7.2.1 P2.
This section introduces the features included in the second set. The program used to
calculate them is called complex_calcs.c (see chapter 10).
In order to obtain information associated with changes in the signal, the following
features are derived from relations among signal parameters rather than from direct
measurements. In this section, N corresponds to the number of voiced regions
in the utterance.
P2.0: Mean of the pitch means in every voiced region.
$$F0AbsMean = \frac{1}{N} \sum_{n=1}^{N} \overline{F0}_n$$
$\overline{F0}_n$ = Mean of the pitch values in the voiced region n.
P2.1: Variance of the pitch means in every region.
$$VarF0Mean = \frac{1}{N} \sum_{n=1}^{N} \left( \overline{F0}_n - F0AbsMean \right)^2$$
$\overline{F0}_n$ = Mean of the pitch values in the voiced region n.
F0AbsMean = P2.0.
P2.2: Mean of the maximum pitch values in every region.
$$MaxF0Mean = \frac{1}{N} \sum_{n=1}^{N} F0_{max,n}$$
$F0_{max,n}$ = Maximum of the pitch values within the voiced region n.
P2.3: Variance of the maximum pitch values in every region.
$$VarF0Max = \frac{1}{N} \sum_{n=1}^{N} \left( F0_{max,n} - MaxF0Mean \right)^2$$
Figure 7.6. F0 contour and points selected for calculations of P2.4 and P2.5.
P2.4: Pitch increasing per voiced region.
This feature takes four points into account inside each voiced part of the utterance (see
figure 7.6):
1. Beginning of the voiced region.
2. End of the voiced region.
3. Maximum pitch value.
4. Second maximum pitch value.
The sum of all pitch differences between two successive increasing points, divided by
their respective time difference is computed. The final value for this feature results from
the arithmetic mean of this calculation over all voiced parts contained in the utterance.
$$PeaksIncrease = \frac{1}{N} \sum_{n=1}^{N} \left( \sum_{increase} \frac{F0_i - F0_j}{t_i - t_j} \right)$$
i, j = two of the four points considered, where i appears before j:
$t_i < t_j$
$F0_i < F0_j$
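As a rough sketch of P2.4, the snippet below sums the slopes of all rising segments between time-ordered points of each voiced region and averages over regions. The function names are hypothetical, and the selection of the four points per region is assumed to have been done beforehand.

```python
def region_increase(points):
    """Sum of slopes over successive point pairs whose pitch rises.

    Hypothetical helper: `points` is a list of (time, f0) tuples for one region.
    """
    pts = sorted(points)  # order the points by time
    total = 0.0
    for (t_i, f_i), (t_j, f_j) in zip(pts, pts[1:]):
        if f_j > f_i:
            total += (f_j - f_i) / (t_j - t_i)
    return total


def peaks_increase(regions):
    """Arithmetic mean of the per-region increase over all voiced regions."""
    return sum(region_increase(points) for points in regions) / len(regions)
```

The decreasing counterpart (P2.5) would use the same loop with the comparison reversed.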
P2.5: Pitch decreasing per voiced region.
Same points as in P2.4 are taken into account (figure 7.6). In this case, the sum of all
pitch differences between two successive decreasing points, divided by their respective
time difference, is calculated. The final value results from the arithmetic mean of this
calculation over all voiced parts contained in the utterance.
$$PeaksDecrease = \frac{1}{N} \sum_{n=1}^{N} \left( \sum_{decrease} \frac{F0_i - F0_j}{t_i - t_j} \right)$$
i, j = two of the four points considered, where i appears before j:
$t_i < t_j$
$F0_i > F0_j$
P2.6: Mean of the pitch ranges in every voiced region.
$$MeanRange = \frac{1}{N} \sum_{n=1}^{N} \left( F0_{max,n} - F0_{min,n} \right)$$
P2.7: Flatness.
Mean of the flatness (mean/max) of the pitch for every voiced region multiplied by
100.
$$Flatness = \frac{1}{N} \sum_{n=1}^{N} \frac{\overline{F0}_n}{F0_{max,n}} \cdot 100$$
P2.8: Mean of the relative duration from the beginning of the voiced part to the position
of the pitch maximum in every voiced region multiplied by 100.
$$DurMaxPitch = \frac{1}{N} \sum_{n=1}^{N} \left( t_{max,n} - t_{start,n} \right) \cdot 100$$
P2.9: Peaks increasing for the whole utterance.
The maximum of each voiced region is considered. Sum of all pitch differences
between two successive increasing points, divided by their respective time difference is
calculated. This feature is similar to P2.4 but generalised to the whole sentence.
$$PeaksIncreaseUtt = \sum_{increase} \frac{F0_{max,i} - F0_{max,j}}{t_i - t_j}$$
ti , tj = positions of the maximum value for regions i and j (ti <tj).
F0max = the maximum pitch value in every region. Maximum of region j must
be higher than maximum of region i.
P2.10: Peaks decreasing for the whole utterance.
The maximum of each voiced region is considered. Sum of all pitch differences
between two successive decreasing points, divided by their respective time difference is
calculated. This feature is similar to P2.5 but generalised to the whole sentence.
$$PeaksDecreaseUtt = \sum_{decrease} \frac{F0_{max,i} - F0_{max,j}}{t_i - t_j}$$
ti ,tj = positions of the maximum value for regions i and j (ti <tj).
F0max = the maximum value in every region. Maximum of region j must be
lower than maximum of region i.
P2.11: Mean of the voiced region duration.
$$DurMean = \frac{1}{N} \sum_{n=1}^{N} length_n$$
P2.12: Global energy mean.
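Mean of the energy means in every voiced region multiplied by 100 and divided by
the absolute energy maximum of the whole utterance.
$$EnerMean = \frac{1}{N} \sum_{n=1}^{N} \frac{\overline{E}_n}{E_{max}} \cdot 100$$
$\overline{E}_n$ = mean of the energy values in the voiced region n.
$E_{max}$ = absolute energy maximum of the whole utterance.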
P2.13: Mean of the relative duration from the beginning of the voiced region to the
position of the energy maximum in every voiced region. Multiplied by 100 and divided
by the absolute energy maximum of the whole utterance.
EneDurStart 
E max n
max n
 t startn
tstart = starting point of the voiced region.
tmax = energy maximum position of the region.
P2.14: Mean of the relative duration from the position of the energy maximum in every
voiced region to the end of the voiced region. Multiplied by 100 and divided by the
absolute energy maximum of the whole utterance.
EneDurEnd 
E max n
end n
 t max n
tend = end point of the voiced region.
tmax = energy maximum position of the region.
P2.15: Mean of the vehemence (mean/min) of the energy in every voiced region.
EneVehemence 
P2.16: Mean of the flatness (mean/max) of the energy in every voiced region multiplied
by 100.
EneFlatness 
P2.17: Relation between the maximum energy value of the whole utterance and its
EneMaxRatio 
P2.17: Relation between the maximum of the voiced region and the maximum of the
utterance, divided by the position of the voiced region maximum and multiplied
by 100. Arithmetic mean of this calculation for all the voiced regions in the utterance.
Emax n
EnerMaxRatio region 
 100
P2.18: Mean of the energy tremor in every voiced region.
Tremor refers to a regular variation in the signal and is computed as the number of
zero-crossings over a window of the energy curve derivative.
$$TremorMean = \frac{1}{N} \sum_{n=1}^{N} ene\_tremor_n$$
7.3 Quality Features.
The classification of emotions using voice quality features is a recent field of
investigation, which is used and referred to in many recent studies concerned with
emotion recognition (see [Joh99, Alt00]). Since this approach faces several obstacles, due
to the difficulty of estimating this kind of attribute, diverse sets of features and
methods were tried during this Diploma Thesis. Some of the described features have been
used in just a few experiments and others are more frequently employed, but all of them are
introduced here.
The software employed to deal with quality features extraction is PRAAT1, a
shareware program developed by Dr. Paul Boersma of the University of Amsterdam.
This section makes use of two different methods for the calculation of the mean value
of a given parameter within a voiced region:
- Mean1: Arithmetic mean of the parameter values over all the frames inside a
voiced region.
$$Mean1(f) = \frac{1}{nframes} \sum_{i=1}^{nframes} f_i$$
nframes = number of frames inside a voiced region.
fi = feature value in the frame i.
- Mean2: First, the Mean1 of the parameter within a voiced region is computed.
Then, single values of this parameter for every frame are checked and the one which is
closest to the computed Mean1 is considered as the mean (Mean2) of this region. This
way, we assume that this value comes from the most representative part inside the voiced
region, since the mean is influenced also by voiced region boundaries. It was
experimentally checked that the chosen frame normally matches the core of the vowel.
$$Mean2(f) = f_{i^*}, \qquad i^* = \arg\min_{1 \le i \le nframes} \left| f_i - \bar{f}_n \right|$$
Further information can be found under
nframes = number of frames inside a voiced region
n = index of the region
fi = feature value in the frame i
$\bar{f}_n$ = Mean1 of the feature in region n
From now on, these two procedures are referred to as Mean1 and Mean2 in the subsequent
feature calculation descriptions.
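The two averaging procedures can be sketched as follows. The function names are hypothetical; when two frames are equally close to the arithmetic mean, this sketch simply keeps the first one, which is an assumption rather than documented behaviour.

```python
def mean1(values):
    """Arithmetic mean of the per-frame values of one voiced region."""
    return sum(values) / len(values)


def mean2(values):
    """Per-frame value closest to Mean1 (first one wins in case of a tie)."""
    m = mean1(values)
    return min(values, key=lambda v: abs(v - m))


frames = [1.0, 2.0, 6.0]
print(mean1(frames))  # 3.0
print(mean2(frames))  # 2.0, the observed value nearest to 3.0
```

Unlike Mean1, Mean2 always returns a value that was actually measured in some frame of the region.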
7.3.1 Harmonicity based features.
Since the harmonic to noise ratio is clearly related to the voice quality (see Chapter 3),
this voice quality attribute has been said to provide valuable information about the
speaker's emotional state (see [Alt99, Alt00]). Harmonic to noise ratio estimation can be
considered as an acoustic correlate for breathiness and roughness, in agreement with
[Alt00]. Therefore, voice quality cues, which help us to make inferences about the
speaker’s emotional state, can be extracted from this attribute.
For its calculation, as well as for the remaining voice quality features, the Praat program
is used. Harmonicity is here expressed in dB; if 99% of the energy of the signal is in
the periodic part, and 1% is noise, the HNR is 10*log10(99/1) = 20 dB. An HNR of 0 dB
means that there is equal energy in the harmonics and in the noise. The algorithm
performs acoustic periodicity detection on the basis of an accurate autocorrelation
method, as described in [Boe93]. Harmonicity values are given for individual frames, and
the concrete classification features are calculated from them. The Praat
program requires four different parameters to calculate the harmonicity:
1. Time step (default: 0.01 s): the measurement interval (frame duration), in seconds.
2. Minimum pitch (default: 75 Hz): determines the length of the analysis window.
3. Silence threshold (default: 0.1): frames that do not contain amplitudes above this
threshold (relative to the global maximum amplitude), are considered silent.
4. Number of periods per window (default: 1): determines the level up to which the HNR is
guaranteed to be detected. More periods per window raise this level, but the
algorithm becomes more sensitive to dynamic changes in the signal.
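The dB convention used for the harmonicity can be checked with a small worked example. The function name is hypothetical; it only converts a known periodic energy fraction into dB and does not reproduce Praat's periodicity detection.

```python
import math


def hnr_db(periodic_fraction):
    """HNR in dB when `periodic_fraction` of the signal energy is periodic."""
    noise_fraction = 1.0 - periodic_fraction
    return 10.0 * math.log10(periodic_fraction / noise_fraction)


print(round(hnr_db(0.99), 2))  # about 20 dB for 99% periodic energy
print(hnr_db(0.5))             # 0 dB when harmonics and noise carry equal energy
```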
QH.0a: Harmonic to noise ratio maximum. Mean2. Default values.
Maximum of the Mean2 values for all the regions in the sentence when the
harmonicity is computed setting all the parameters in Praat to their default value.
$$h2n_{max} = \max_{1 \le n \le N} (h_n) \qquad (7.59)$$
hn = harmonic to noise ratio Mean2 value in region n
N = number of regions inside a sentence
QH.0b: Harmonic to noise ratio maximum. Mean2. 4.5 periods per window.
Maximum of the Mean2 values for all the regions in the sentence when the
harmonicity is computed setting the number of periods per window to 4.5, which is
considered an optimal value for speech: HNR values up to 37 dB are guaranteed to be
detected reliably. When the number of periods per window increases the minimum pitch
parameter has to be also changed and, following the recommendations of Praat software,
it’s set to:
F 0 min 
length = length of the speech segment where the harmonicity is computed.
This feature follows the same formula (7.59) but taking into account the new values
of the harmonicity.
QH.0c: Harmonic to noise ratio maximum. Mean1. 4.5 periods per window.
This feature is identical to QH.0b with the exception that it uses the Mean1 instead of
the Mean2 procedure to calculate the mean value of the HNR in the analysed region. It
follows therefore also equation (7.59) with the new values of harmonicity, by just
substituting the term Mean2 for its analogous Mean1.
QH.0d: Harmonic to noise ratio maximum within a voiced region.
Each frame inside a voiced region contains a value of the HNR. The maximum of
these values within the given region is the feature QH.0d.
QH.1a: Harmonic to noise ratio range. Mean2. Default Values.
Once the Mean2 values are calculated for every single voiced region of the sentence,
the difference between the maximum and the minimum of these values in a sentence is the
feature QH.1a. When there is only one region, this value becomes zero.
$$h2n_{range} = h_{n,max} - h_{n,min}, \qquad 1 \le n \le N$$
$h_{n,max}$ = harmonic to noise ratio maximum value in the sentence.
$h_{n,min}$ = harmonic to noise ratio minimum value in the sentence.
QH.1b: Harmonic to noise ratio range. Mean2. 4.5 periods per window.
Once the Mean2 values are calculated for every single voiced region of the sentence,
the difference between the maximum and the minimum of these values in a sentence is the
feature QH.1b. When there is one unique region, this value becomes zero. The only
difference with QH.1a is that the parameter number of periods per window is set to 4.5
and, consequently, the minimum pitch comes from equation 7.60 (see QH.0b).
QH.1c: Harmonic to noise ratio range. Mean1. 4.5 periods per window.
Once the Mean1 values are calculated for every single voiced region of the sentence,
the difference between the maximum and the minimum of these values in a sentence is the
feature QH.1c. This feature is identical to QH.1b except for the criterion used to calculate the
mean value in the analysed region; the Mean1 criterion is employed as a replacement for
Mean2.
QH.2: Harmonic to noise ratio mean. Mean1. Default settings.
Arithmetic mean (Mean1) of all the HNR values calculated by frame within a voiced
region.
QH.3: Harmonic to noise ratio standard deviation within a voiced region. Default
settings.
Standard deviation of the HNR values within a voiced region.
7.3.2 Formant frequency based features.
The algorithm followed by Praat first resamples the signal to a sample rate of twice
the value of the Maximum formant frequency parameter (approx. 5000). After this,
pre-emphasis is applied. The pre-emphasis factor is computed as $\alpha = \exp(-2 \pi F \Delta t)$,
where F is the pre-emphasis frequency and $\Delta t$ is the sampling period of the sound.
Each sample $x_i$ of the sound, except $x_1$, is then
changed, going down from the last sample: $x_i = x_i - \alpha x_{i-1}$.
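The pre-emphasis step can be sketched as below. This is an illustration under stated assumptions, not Praat's code: the function and parameter names are hypothetical, and F is taken to be the pre-emphasis frequency in Hz.

```python
import math


def pre_emphasize(samples, pre_emphasis_hz, sample_rate):
    """Apply x_i = x_i - alpha * x_(i-1), with alpha = exp(-2*pi*F*dt).

    Hypothetical helper. Working backwards from the last sample means every
    update still sees the original value of its predecessor.
    """
    alpha = math.exp(-2.0 * math.pi * pre_emphasis_hz / sample_rate)
    out = list(samples)
    for i in range(len(out) - 1, 0, -1):
        out[i] -= alpha * out[i - 1]
    return out  # the first sample is left unchanged
```

On a constant signal the filter leaves the first sample as it is and reduces every later sample to `1 - alpha` times its value, which shows how low frequencies are attenuated.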
For each analysis window, Praat applies a Gaussian-like window, and computes the
LPC coefficients with the algorithm by Burg. The Burg algorithm is a recursive estimator
for auto-regressive models, where each step is estimated using the results from the
previous step. The implementation of the Burg algorithm is based on the routines memcof
and zroots in [Pre93]. This algorithm can initially find formants at very low or high
frequencies. From the values obtained for every single frame, some calculations are
extracted to be used as input for the emotional classification.
QF.0a: Minimum of f2Mean2 – f1Mean2 for all the voiced regions.
Difference between the Mean2 of the second and the first formant frequency for each
voiced region in a sentence. The minimum value of this difference among all the voiced
regions is taken as QF.0a.
This feature is used in some cases to select just one region and make it representative
of the sentence. This way, features are calculated in similar regions and their differences
will be more influenced by changes in the speaker’s emotional state than by the nature of
the vowel. The reason to choose the minimum difference between first and second
formant is based on the formant structure of an /a/, which is appropriate to extract quality
features, due to the shape of the vocal tract when it’s uttered, and in which first and
second formant frequencies are closer.
$$f_{21} = \min_{1 \le n \le N} \left( f_{2,n} - f_{1,n} \right) \qquad (7.62)$$
$f_{2,n}$ = Mean2 of the second formant frequency in the voiced region n.
$f_{1,n}$ = Mean2 of the first formant frequency in the voiced region n.
N = Number of voiced regions within the utterance.
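The region selection behind QF.0a can be sketched as a search for the smallest F2 - F1 distance. The function names and the example formant values are hypothetical; the inputs are assumed to be per-region Mean2 (or Mean1) formant frequencies in Hz.

```python
def select_region(f1_means, f2_means):
    """Return (index, f2 - f1) of the region with the smallest formant distance."""
    diffs = [f2 - f1 for f1, f2 in zip(f1_means, f2_means)]
    n = min(range(len(diffs)), key=diffs.__getitem__)
    return n, diffs[n]


def second_formant_ratio(f1, f2):
    """Second formant divided by the distance between second and first formant."""
    return f2 / (f2 - f1)


# Three voiced regions; the middle one has an /a/-like formant structure.
f1 = [300.0, 700.0, 400.0]
f2 = [2200.0, 1200.0, 1900.0]
region, distance = select_region(f1, f2)
print(region, distance)  # 1 500.0
```

The index returned here is what the later features refer to as "the selected region".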
QF.0b: Minimum of (f2–f1)Mean1 for all the voiced regions.
Mean1 of the difference between all the values of the second and the first formant
frequency within each voiced region in a sentence. The minimum value of this difference
among all the voiced regions is taken as QF.0b. Obviously, this feature is equivalent to
QF.0a by only substituting the concept of Mean2 by Mean1. However, the description
follows exactly the way the software was implemented.
The same equation (7.62) can be applied here by only changing the terms Mean2 for
Mean1.
QF.1a: First formant frequency. Mean2.
Frequency of the first formant in the region from where QF.0a is extracted,
calculated as the Mean2 within the voiced region.
QF.1b: First formant frequency. Mean1.
Frequency of the first formant in the region from where QF.0b is extracted,
calculated as the Mean1 within the voiced region.
QF.2a: Second formant frequency. Mean2.
Frequency of the second formant in the region selected by QF.0a, calculated as the
Mean2 within the voiced region.
QF.2b: Second formant frequency. Mean1.
Frequency of the second formant in the region selected by QF.0b, calculated as the
Mean1 within the voiced region.
QF.3a: Third formant frequency. Mean2.
Frequency of the third formant in the region selected by QF.0a, calculated as the
Mean2 within the voiced region.
QF.3b: Third formant frequency. Mean1.
Frequency of the third formant in the region selected by QF.0b, calculated as the
Mean1 within the voiced region.
QF.4a: Second formant ratio. Mean2.
Frequency of the second formant (QF.2a) divided by the difference between second
and first formants (QF.0a). All the formants are calculated through the Mean2 and belong
to the selected region (see QF.0a).
$$f2ratio_{Mean2} = \frac{f2_{Mean2}}{f2_{Mean2} - f1_{Mean2}} = \frac{QF.2a}{QF.0a}$$
QF.4b: Second formant ratio. Mean1.
Frequency of the second formant (QF.2b) divided by the difference between second
and first formants (QF.0b). All the formants are calculated through the Mean1 and belong
to the selected region (see QF.0b).
$$f2ratio_{Mean1} = \frac{f2_{Mean1}}{f2_{Mean1} - f1_{Mean1}} = \frac{QF.2b}{QF.0b}$$
QF.5: Maximum of the second formant ratio.
The maximum value of the second formant ratio calculated by frame within the region
selected by QF.0b.
$$f2ratio_{max} = \max_{1 \le i \le nframes} \left( \frac{f2_i}{f2_i - f1_i} \right)$$
$f1_i$ = Value of the first formant frequency in the frame i.
$f2_i$ = Value of the second formant frequency in the frame i.
nframes = Number of frames within the voiced region selected by QF.0b.
QF.6: Range of the second formant ratio.
Difference between the maximum and the minimum of the second formant ratio
by frame within the region selected by QF.0b.
$$f2ratio_{range} = \max_{1 \le i \le nframes} \left( \frac{f2_i}{f2_i - f1_i} \right) - \min_{1 \le i \le nframes} \left( \frac{f2_i}{f2_i - f1_i} \right)$$
$f1_i$ = Value of the first formant frequency in the frame i.
$f2_i$ = Value of the second formant frequency in the frame i.
i = frame index, $1 \le i \le nframes$.
nframes = Number of frames within the voiced region selected by QF.0b.
QF.7a: Bandwidth of the first formant. Mean2.
Mean of all the Mean2 first formant bandwidth values in a sentence.
$$bw_1 = \frac{1}{N} \sum_{n=1}^{N} b_{1,n} \qquad (7.67)$$
$b_{1,n}$ is the first formant bandwidth Mean2 in region n.
N is the number of regions inside a sentence.
QF.7b: Bandwidth of the first formant. Mean1.
Mean of all the Mean1 first formant bandwidth values in a sentence. Substituting
Mean2 by Mean1, equation 7.67 is employed.
QF.7c: Bandwidth mean of the first formant within a region. Mean1.
Arithmetic mean (Mean1) of all the first formant bandwidth values calculated by
frame, inside a voiced region.
QF.8a: Bandwidth of the second formant. Mean2.
Mean of all the Mean2 second formant bandwidth values in a sentence.
$$bw_2 = \frac{1}{N} \sum_{n=1}^{N} b_{2,n} \qquad (7.68)$$
$b_{2,n}$ is the second formant bandwidth Mean2 in region n.
N is the number of regions inside a sentence.
QF.8b: Bandwidth of the second formant. Mean1.
Mean of all the Mean1 second formant bandwidth values in a sentence. Substituting
Mean2 by Mean1, equation 7.68 is employed.
QF.8c: Bandwidth mean of the second formant within a region. Mean1.
Arithmetic mean (Mean1) of all the second formant bandwidth values calculated by
frame, inside a voiced region.
QF.9a: Bandwidth of the third formant. Mean2.
Mean of all the Mean2 third formant bandwidth values in a sentence.
$$bw_3 = \frac{1}{N} \sum_{n=1}^{N} b_{3,n} \qquad (7.69)$$
$b_{3,n}$ is the third formant bandwidth Mean2 in region n.
N is the number of regions inside a sentence.
QF.9b: Bandwidth of the third formant. Mean1.
Mean of all the Mean1 third formant bandwidth values in a sentence. Substituting
Mean2 by Mean1, equation 7.69 is employed.
QF.9c: Bandwidth mean of the third formant within a region. Mean1.
Arithmetic mean (Mean1) of all the third formant bandwidth values calculated by
frame, inside a voiced region.
QF.10: Maximum of the first formant frequency in the selected region.
Maximum value of the first formant frequency in the region selected by QF.0b.
$$f_{1,max} = \max_{1 \le i \le nframes} (f_{1,i}) \qquad (7.70)$$
$f_{1,i}$ = Value of the first formant frequency in the frame i.
nframes = Number of frames within the selected region.
QF.11: Maximum of the second formant frequency in the selected region.
Maximum value of the second formant frequency in the region selected by QF.0b.
Same equation (7.70) for the second formant frequency case.
QF.12: Maximum of the third formant frequency in the selected region.
Maximum value of the third formant frequency in the region selected by QF.0b.
Same equation (7.70) for the third formant frequency case.
QF.13: Range of the first formant frequency in the selected region.
Difference between the maximum and the minimum of the first formant frequency for
the region selected by QF.0b.
$$f_{1,range} = \max_{1 \le i \le nframes} (f_{1,i}) - \min_{1 \le i \le nframes} (f_{1,i}) \qquad (7.71)$$
$f_{1,i}$ = Value of the first formant frequency in the frame i.
nframes = Number of frames within the voiced region selected by QF.0b.
QF.14: Range of the second formant frequency in the selected region.
Difference between the maximum and the minimum of the second formant frequency
for the region selected by QF.0b. Same equation (7.71) for the second formant frequency
case.
QF.15: Range of the third formant frequency in the selected region.
Difference between the maximum and the minimum of the third formant frequency
for the region selected by QF.0b. Same equation (7.71) for the third formant frequency
case.
QF.16: Standard deviation of the first formant frequency in the selected region.
Standard deviation of all the first formant frequency values within the region selected
by QF.0b.
QF.17: Standard deviation of the second formant frequency in the selected region.
Standard deviation of all the second formant frequency values within the region
selected by QF.0b.
QF.18: Standard deviation of the third formant frequency in the selected region.
Standard deviation of all the third formant frequency values within the region selected
by QF.0b.
7.3.3 Energy based features.
QE.0 – QE.3: Energy band distribution.
The energy is calculated within four different frequency bands in order to decide
whether the band contains mainly harmonics of the fundamental frequency or turbulent
noise. The frequency band distribution is taken from a study [Kla97] focused on the
perceptual importance of several voice quality parameters. The four frequency bands
proposed are:
1. From 0 Hz to F0 Hz (where F0 is the fundamental frequency).
2. From 0 Hz to 1 kHz.
3. From 2.5 kHz to 3.5 kHz.
4. From 4 kHz to 5 kHz.
For each band, the following features are calculated:
QE.0a – QE.3a: The energy contained in the corresponding band is calculated for all the
voiced parts of the utterance. Then, these values are divided by the energy over all
frequencies of the voiced parts of utterance.
$$EneBand_j = \frac{\sum_{n=1}^{N} EneBand_{j,n}}{\sum_{n=1}^{N} ene_n}, \qquad j = 1,2,3,4$$
j = index corresponding to each one of the energy bands (1, 2, 3 or 4)
N = number of voiced regions within the utterance.
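A minimal sketch of the band energy ratios follows. It assumes that a power spectrum is already available for each voiced region as parallel lists of bin frequencies and power values; the function names and this data layout are hypothetical and do not describe how Praat is called.

```python
def band_energies(freqs, power, f0):
    """Energy inside each of the four bands for one region's power spectrum."""
    bands = [(0.0, f0), (0.0, 1000.0), (2500.0, 3500.0), (4000.0, 5000.0)]
    return [
        sum(p for f, p in zip(freqs, power) if lo <= f <= hi)
        for lo, hi in bands
    ]


def band_energy_ratios(band_energy, total_energy):
    """Sum each band over all voiced regions and divide by the total energy.

    band_energy[n][j] is the energy of band j in region n; total_energy[n] is
    the energy over all frequencies in region n.
    """
    total = sum(total_energy)
    n_bands = len(band_energy[0])
    return [
        sum(region[j] for region in band_energy) / total
        for j in range(n_bands)
    ]
```

Passing a single region to `band_energy_ratios` gives the region-based variant, in which each band is divided by the energy of that region only.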
QE.0b – QE.3b: The energy values are calculated only in one region. Energy in each
band is divided by the energy over all frequencies within the given region.
$$EneBand_j = \frac{EneBand_{j,n}}{ene_n}, \qquad j = 1,2,3,4$$
j = index corresponding to each one of the energy bands (1, 2, 3 or 4)
n = index of the region.
QE.4: Voiced energy ratio sentence based.
Ratio of the energy contained in voiced regions to the energy over the whole utterance.
$$EneRate = \frac{\sum_{n=1}^{N} ene_n}{AbsEne}$$
AbsEne = energy contained in the whole utterance.
N = number of voiced regions within the utterance.
QE.5: Relative energy of one voiced region.
Energy of the voiced region divided by the energy in all the utterance.
$$EneRel_n = \frac{ene_n}{AbsEne}$$
n = index corresponding to one voiced region.
AbsEne= energy contained in all the utterance.
7.3.4 Spectral measurements.
The algorithm used by Praat to calculate the spectrum is the continuous interpretation
of the Fast Fourier Transform (see [Bra65, Wea89, Lat92]). If the sound is expressed in
Pascal (Pa), the spectrum is expressed in Pa·s, or Pa/Hz. The frequency integral over the
spectrum equals the time integral over the sound.
For some features concerning spectral measurements, inverse filtering of the speech
signal is performed. Inverse filtering can be seen as the inverse computation of the speech
production model depicted in figure 7.7. Praat obtains the filter with the help of the
technique of linear prediction. This technique tries to approximate a given frequency
spectrum with a small number of peaks, for which it finds the mid-frequencies and the
bandwidths. Doing this for an overlapping sequence of windowed parts of a sound signal
(i.e. a short-term analysis), we get a quasi-stationary approximation of the signal's
spectral characteristics as a function of time. For a speech signal, the peaks are identified
with the resonances (formants) of the vocal tract. Since the spectrum of a vowel spoken
by an average human being falls off with approximately 6 dB per octave, pre-emphasis is
applied to the signal before the linear-prediction analysis, so that the algorithm will not
try to match only the lower parts of the spectrum.
Figure 7.7. Mathematical model of the speech production.
Figure 7.8. Spectrum of the /a/ vowel uttered in the sentence “Kick den Ball” extracted from speaker A
commands database. Figure a represents the original spectrum of the uttered vowel, whereas figure b
represents the source of the sound obtained after inverse filtering. Horizontal axes: frequency (Hz).
Figure 7.9. Filter of vocal tract when the /a/ vowel is uttered in the sentence “Kick den Ball” extracted
from speaker A commands database. Filter is obtained through LPC analysis. Horizontal axis: frequency (Hz).
For an average human voice, tradition assumes five formants in the range between 0
and 5500 Hertz. This number comes from a computation of the formants of a straight
tube, which has resonances at wavelengths of four tube lengths, four thirds of a tube
length, four fifths, and so on. For a straight tube 16 centimetres long, the longest
wavelength is 64 cm, which, with a sound velocity of 352 m/s, means a resonance
frequency of 352/0.64 = 550 Hertz. The other resonances will be at 1650, 2750, 3850, and
4950 Hertz. For the linear prediction in Praat, this 5500-Hz band limiting has to be
implemented by resampling the original speech signal to 11 kHz. Then, a linear-prediction
analysis on the resampled sound is performed. Analysis is done with 16 linear-prediction
parameters (which will yield at most eight formant-bandwidth pairs in each time frame),
with an analysis window effectively 10 milliseconds long, with time steps of 5
milliseconds (so that the windows will overlap), and with a pre-emphasis frequency of 50
Hz (which is the point above which the sound will be amplified by 6 dB/octave prior to
the analysis proper). This analysis will provide the filter (figure 7.9), which applied to the
original speech sample (figure 7.8a), yields the source signal (figure 7.8b). Since the LPC
analysis was designed to yield a spectrally flat filter (through the use of pre-emphasis),
the source signal will represent everything in the speech signal that cannot be attributed to
the resonating cavities. Thus, the "source signal" will consist of the glottal volume-
velocity source (with an expected spectral slope of -12 dB/octave for vowels) and the
radiation characteristics at the lips, which cause a 6 dB/octave spectral rise, so that the
resulting spectrum of the "source signal" is actually the derivative of the glottal flow, with
an expected spectral slope of -6 dB/octave.
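The straight-tube figures quoted in this discussion can be reproduced with a short worked example. The function name is hypothetical; it models the tube as a quarter-wavelength resonator, so the k-th resonance lies at (2k - 1) · c / (4L).

```python
def tube_resonances(length_m, speed=352.0, count=5):
    """Resonance frequencies in Hz of a straight tube of the given length."""
    return [(2 * k - 1) * speed / (4.0 * length_m) for k in range(1, count + 1)]


# A 16 cm tube: resonances near 550, 1650, 2750, 3850 and 4950 Hz.
print([round(f) for f in tube_resonances(0.16)])
```

All five values lie below 5500 Hz, which is why the signal is band-limited to that range before the linear-prediction analysis.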
QS.0: Open quotient related features.
Open quotient is a spectral measurement whose variations have been associated with
changes in the glottal source quality. Therefore, along with the ideas presented in Chapter
3, it could be a useful parameter in order to determine the emotional state of the speaker.
Following the hypothesis that the amplitude difference of the first and second harmonics
of the inverse-filtered voice signal (H1*-H2*) is a reliable spectral indicator of the
relative length of the opening phase and therefore a spectral correlate of the open
quotient (see [Dov97, Hen01]), two open quotient related features, with and without
inverse filtering, are computed.
QS.0a: Difference between first and second harmonic amplitudes of the spectrum of the
speech signal, within the selected region.
QS.0b: Difference between first and second harmonic amplitudes of the spectrum of the
speech signal after inverse filtering, within the selected region.
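As an illustration of the H1-H2 measurement, the sketch below projects a signal onto the first two harmonics and returns their amplitude difference in dB. It is not the procedure used with Praat: the function names are hypothetical, F0 is assumed to be known, no windowing is applied, and the segment is assumed to span a whole number of pitch periods.

```python
import cmath
import math


def harmonic_amplitude(x, freq, fs):
    """Amplitude of the sinusoidal component of `x` at `freq` Hz (single-bin DFT)."""
    acc = sum(s * cmath.exp(-2j * math.pi * freq * k / fs) for k, s in enumerate(x))
    return 2.0 * abs(acc) / len(x)


def h1_minus_h2_db(x, f0, fs):
    """Difference in dB between the first and second harmonic amplitudes."""
    h1 = harmonic_amplitude(x, f0, fs)
    h2 = harmonic_amplitude(x, 2.0 * f0, fs)
    return 20.0 * math.log10(h1 / h2)


# Synthetic test tone: second harmonic at half the amplitude of the first.
fs, f0, n = 8000, 100.0, 800
tone = [
    math.sin(2 * math.pi * f0 * k / fs) + 0.5 * math.sin(2 * math.pi * 2 * f0 * k / fs)
    for k in range(n)
]
print(round(h1_minus_h2_db(tone, f0, fs), 2))  # about 6.02 dB
```

The same function could be applied to the inverse-filtered signal to obtain the second variant of the feature.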
QS.1: Spectral tilt related features.
Spectral tilt has also been related to glottal source variations. It is one of the major
acoustic parameters that reliably differentiate phonation types in many languages, and it
can be understood as the degree to which intensity drops off as frequency increases.
Spectral tilt can be quantified when comparing the amplitude of the fundamental to that of
higher frequency harmonics, e.g. the second harmonic, the harmonic closest to the first
formant, or the harmonic closest to the second formant. Spectral tilt is characteristically
most steeply positive for creaky vowels and most steeply negative for breathy vowels.
The amplitude of the first harmonic (H1) compared to the amplitude of the second
formant (A2), which acts as an indicator of the spectral tilt at the mid formant
frequencies, is here used as a voice quality feature for emotion classification. The
parameter is expected to be large and positive for breathy voices and small and/or
negative for creaky voices.
QS.1a: Difference between the first harmonic amplitude and the spectral amplitude at the
second formant frequency. Calculated over the spectrum of the speech signal in the
selected region.
QS.1b: Difference between the first harmonic amplitude and the spectral amplitude at the
second formant frequency. Calculated over the spectrum of the speech source calculated
by means of the inverse filtering of the original speech segment.