Chapter 10

10.1 Data processing
This chapter describes the functionality of all the software modules implemented and
used in the framework of this thesis. Each module served a specific function during the
experiments and was then integrated into the final scheme. To give a structured overview,
the system is divided into three main blocks of functionality:
a) Data preprocessing. Its function is to adapt the recorded speech data to satisfy the
format requirements of compute_basismerkmale.c, which is included in the next block.
This group includes only the program call.c, which is described in section 10.1.1.
b) Feature calculation. To provide the NN classifier with input patterns, a vector of
features must first be calculated. First, three basic features (fundamental frequency,
energy contour and voiced/unvoiced decision) are estimated by compute_basismerkmale.c.
From these basic features, more complex ones, i.e. P1, are computed in ppal.c. Finally,
voice quality features and jitter are calculated by means of Praat. Although the P2
prosodic features were discarded during the experimental research and are not integrated
in the final system, the software employed for their calculation, complex_calcs.c, belongs
to this group and is also described in the section on feature calculation.
c) Classification. Once the features are computed, this block uses them as the
classifier inputs and makes the final decision. Two neural networks and a small program
that combines their results make up this group; they are described in section 10.1.3.
[Figure 10.1: block diagram not recoverable from the source. Surviving labels: (.nist file);
Features calculation; Basic prosodic features calcs.; Complex prosodic features calcs.;
calcs P2; quality features; Activation level; Evaluation level.]
Figure 10.1. Block schema of the final system.
10.1.1 Data preprocessing
The first step in processing the audio data recorded for the emotional database, in order
to obtain features that allow an efficient classification of the speech signal into
emotional categories, is the extraction of basic features such as fundamental frequency,
energy and voiced/unvoiced decision. For this task, the basic prosodic module of the
VERBMOBIL¹ project is used. However, the audio format that this software expects for its
input files does not match the format of the recorded speech.
The format of the signal files expected by the software is PhonDat. Signal files with a
PhonDat header contain a binary header of constant length (512 bytes). The signal
samples (2 bytes per sample) start after this header and are always in LoHi byte order
(Intel format), which means that the least significant byte comes first. The header
contains a defined structure with information such as sampling frequency, resolution in
bits, etc. On the other hand, the format used for the recordings, which is actually the one
used for most applications inside the speech technology group, is NIST. The NIST speech
header format was defined by the National Institute of Standards and Technology, USA. It
is used in many American speech corpora. The NIST SPHERE header is an object-oriented,
1024-byte blocked ASCII structure which precedes the waveform data. The header is
composed of a fixed-format portion followed by an object-oriented variable portion. The
sample byte order is HiLo, which means that the most significant byte is followed by the
least significant byte. As a result, two main problems must be faced for the data
conversion: header length and speech data format (byte order).
¹ Verbmobil is a long-term project of the German Federal Ministry of Education, Science,
Research and Technology (BMBF, Projektträger DLR). Its aim is to give Germany an
international top position in language technology and its economic application in the next
millennium through the cooperation and concentration of as many specialists as possible
from industry and science. The long-term aim is the development of a mobile translation
system for the translation of spontaneous speech in face-to-face situations.
The prosody module accepts an input argument indicating the length of the header. The
information contained in the header of the input files is not used by the program, which
only takes the signal data to make its computations. Therefore, this information can be
omitted. The adopted solution is to strip the header from the input NIST files and set this
parameter to 0. This way, the software interprets all bytes as speech data, starting from
the beginning of the file.
The byte ordering problem can only be solved by swapping the bytes: the two bytes of each
sample are exchanged before the basic prosodic module reads the speech file.
The data preprocessing module is implemented as a single program, call.c, written in C,
whose input is the emotional recording in NIST format and whose output is the same file
without the header and with the bytes of each sample swapped.
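The conversion described above can be illustrated with a minimal sketch. It is not the source of call.c, which is not reproduced in this chapter. The function names and the buffer size are hypothetical, and the header is assumed to occupy exactly one 1024-byte block with 2-byte samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: the NIST header is exactly one 1024-byte block.
   Longer headers are not handled in this sketch. */
#define NIST_HEADER_BYTES 1024

/* Exchange the two bytes of every 16-bit sample in place.
   A trailing odd byte, if any, is left untouched. */
void swap_sample_bytes(unsigned char *buf, size_t nbytes)
{
    for (size_t i = 0; i + 1 < nbytes; i += 2) {
        unsigned char tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}

/* Copy `in` to `out`, dropping the header and swapping the sample bytes.
   Returns the number of sample bytes written, or -1 on error. */
long strip_header_and_swap(FILE *in, FILE *out)
{
    unsigned char buf[4096];   /* even size keeps the byte pairs together */
    size_t n;
    long total = 0;

    assert(in != NULL && out != NULL);
    if (fseek(in, NIST_HEADER_BYTES, SEEK_SET) != 0)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        swap_sample_bytes(buf, n);
        if (fwrite(buf, 1, n, out) != n)
            return -1;
        total += (long)n;
    }
    return ferror(in) ? -1 : total;
}
```

Both streams should be opened in binary mode. The resulting file can then be passed to the prosody module with the header length argument set to 0.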
10.1.2 Feature calculation
Four main programs are employed to calculate features of different kinds:
a) compute_basismerkmale.c is the basic prosody module of the VERBMOBIL
project. It extracts the fundamental frequency contour, the energy contour and the
voiced/unvoiced decision of each frame from an input speech file. The algorithms used to
calculate these basic features are detailed in section 7.1.
b) ppal.c is the principal program used for the extraction of complex prosodic
features. It calculates the whole set of P1 features (see section 7.2.1) with the exception
of jitter, which is calculated by Praat. The program inputs are the vectors resulting from
the compute_basismerkmale stage, i.e. vectors containing the frame-based basic feature
calculations. The output of this module is a vector of 36 features, whose calculations are
described in 7.2.1, to which the jitter is added in order to complete the P1 feature set.
This output vector is the input of one of the neural network classifiers, the one that
discriminates among the three levels of arousal (see chapter 2).
c) complex_calcs.c is the program that calculates all the P2 features employed during
the experimental trials. The program input is the output of the compute_basismerkmale
module. These data are mathematically combined, following the algorithms detailed in
7.2.2, and yield a vector of twenty prosodic features that serves as the input of a
classifier based on the activation dimension. As previously mentioned, the most
significant results obtained in the experiments do not use this module. However, it was
entirely implemented during the development of this work, based on a promising study
d) Praat is a system for doing phonetics by computer. The computer program Praat
is a research, publication, and productivity tool for phoneticians. It has been principally
developed by Paul Boersma and David Weenink at the University of Amsterdam. Its
functions are to analyse, synthesize, and manipulate speech. All the quality features
used in the present work, and described in section 7.3, are calculated by means of
Praat. This tool is well suited to the investigation because the complexity of voice
quality measurements, which are mainly based on spectral properties of the speech, would
otherwise make their calculation very time-consuming. However, Praat processes are too
slow for a real-time system, which should be taken into account in further development.
The input of the Praat script is the original signal without preprocessing, since this tool
accepts multiple audio formats, including NIST. The output is a vector of a variable
number of quality features, depending on the experiment. In the final system, the output
vector is combined with the decision made by the activation level classifier in order to
make the final decision.
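As an illustration of how frame-based basic features (the input of ppal.c and complex_calcs.c) can be reduced to utterance-level values, the following sketch computes a few statistics of the fundamental frequency contour over voiced frames. The statistics chosen here (mean, minimum, maximum, range) and all names are illustrative assumptions. They are not claimed to be the P1 or P2 features, which are defined in sections 7.2.1 and 7.2.2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for a few utterance-level F0 statistics. */
typedef struct {
    double mean, min, max, range;
    size_t voiced_frames;
} F0Stats;

/* Reduce a frame-based F0 contour to utterance-level statistics,
   using only the frames whose voiced/unvoiced flag is non-zero.
   All fields are zero if no frame is voiced. */
F0Stats f0_stats(const double *f0, const int *voiced, size_t nframes)
{
    F0Stats s = {0.0, 0.0, 0.0, 0.0, 0};
    double sum = 0.0;

    assert(f0 != NULL && voiced != NULL);
    for (size_t i = 0; i < nframes; i++) {
        if (!voiced[i])
            continue;
        if (s.voiced_frames == 0 || f0[i] < s.min) s.min = f0[i];
        if (s.voiced_frames == 0 || f0[i] > s.max) s.max = f0[i];
        sum += f0[i];
        s.voiced_frames++;
    }
    if (s.voiced_frames > 0) {
        s.mean = sum / (double)s.voiced_frames;
        s.range = s.max - s.min;
    }
    return s;
}
```

Restricting the statistics to voiced frames avoids mixing in frames for which no fundamental frequency is defined.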
10.1.3 Classification
The software used to train and test the neural network classifiers is the Stuttgart
Neural Network Simulator (SNNS 4.2), developed at the University of Stuttgart and
maintained at the University of Tübingen.
SNNS is a software simulator for neural networks on Unix workstations developed at
the Institute for Parallel and Distributed High Performance Systems (IPVR) at the
University of Stuttgart. The goal of the SNNS project is to create an efficient and flexible
simulation environment for research on and application of neural nets.
The SNNS simulator consists of two main components:
1) Simulator kernel written in C.
2) Graphical user interface under X11R4 or X11R5.
The simulator kernel operates on the internal network data structures of the neural
nets and performs all operations of learning and recall. It can also be used without the
other parts as a C program embedded in custom applications. It supports arbitrary
network topologies and it supports the concept of sites. SNNS can be extended by the
user with user defined activation functions, output functions, site functions and learning
procedures, which are written as simple C programs and linked to the simulator kernel.
The graphical user interface XGUI (X Graphical User Interface), built on top of the
kernel, gives a 2D and a 3D graphical representation of the neural networks and controls
the kernel during the simulation run. In addition, the 2D user interface has an integrated
network editor which can be used to directly create, manipulate and visualize neural nets
in various ways.
A network created with SNNS can be converted into C source code. This source is
generated from a network file using the tool SNNS2C, included with the simulator.
After network training with SNNS is complete, SNNS2C is used to integrate the trained
network as a C function into a separate application.
The final system makes use of three different networks, each one trained according to
the best performance achieved during the experiments (chapters 8 and 9).
10.2 Analysis of the results
Most experiments of chapters 8 and 9 are evaluated on the basis of their confusion matrix
and the means of their neural network outputs. Both calculations were implemented in the
program confusion_th.c. This program reads a result file from the output of the SNNS
neural network simulator and has two possible outputs: a confusion matrix or mean outcome
values.
The confusion matrix of the testing set. Consider a class A. All the patterns
belonging to class A are counted. Each of these patterns, after being tested by the
network, is either correctly recognized by the classifier or confused with another class.
The number of times that patterns of this class are classified into each one of the
categories, including class A itself (correct recognition), divided by the total number of
occurrences of the class, gives the recognition or confusion rate for that category. This
procedure is applied to each of the defined classes, and the resulting confusion matrix
looks as follows:
              Class 1     Class 2     ...     Class N
Class 1       N11/N1      N12/N1      ...     N1N/N1
Class 2       N21/N2      N22/N2      ...     N2N/N2
...
Class N       NN1/NN      NN2/NN      ...     NNN/NN
Table 10.1. General appearance of a confusion matrix generated by confusion_th.c.
Here Nij is the number of patterns of class i classified as category j. When i and j
coincide, the pattern is correctly recognized. Nn is the total number of occurrences of
class n at the input of the test. The ratios on the diagonal (grey-highlighted in the
original table) are the correct recognition rates. The matrices are presented in
percentages, so these ratios are multiplied by 100.
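The calculation of the ratios Nij/Ni in percent can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of confusion_th.c. The function name, the fixed number of classes and the representation of the test results as two arrays of class indices are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CLASSES 3   /* assumed number of classes for this sketch */

/* Fill pct[i][j] with 100 * Nij / Ni, where Nij counts the patterns of
   true class i classified as category j and Ni is the number of patterns
   of class i. Rows of classes that never occur are left at zero. */
void confusion_matrix_pct(const int *true_cls, const int *pred_cls, size_t n,
                          double pct[NUM_CLASSES][NUM_CLASSES])
{
    int count[NUM_CLASSES][NUM_CLASSES] = {{0}};
    int total[NUM_CLASSES] = {0};

    for (size_t k = 0; k < n; k++) {
        assert(true_cls[k] >= 0 && true_cls[k] < NUM_CLASSES);
        assert(pred_cls[k] >= 0 && pred_cls[k] < NUM_CLASSES);
        count[true_cls[k]][pred_cls[k]]++;
        total[true_cls[k]]++;
    }
    for (int i = 0; i < NUM_CLASSES; i++)
        for (int j = 0; j < NUM_CLASSES; j++)
            pct[i][j] = total[i] ? 100.0 * count[i][j] / total[i] : 0.0;
}
```

Each row of the result sums to 100 for every class that occurs in the test set, and the diagonal entries are the correct recognition rates.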
confusion_th.c also allows two different thresholds to be set in order to select the final
winner for a pattern. After the winner candidate among the outputs is chosen through the
WTA rule, a decision based on these two thresholds determines whether it can actually be
considered the winner. When this functionality is enabled, the confusion matrix is
calculated after the threshold decision. The thresholds are given as arguments to the
program and are defined as follows:
Threshold 1: Minimum value of the output to be considered the winner. The
program checks whether the winner candidate given by the NN exceeds this value; if not,
the pattern is classified into the central class.
Threshold 2: Maximum value of the opposite emotion or emotional group. When
an utterance is classified into one emotion, e.g. angry, the program checks that the output
values for the emotions situated on the opposite side of the axis, e.g. bored and sad for
the arousal dimension, do not exceed this value. Otherwise, the pattern is classified into
the central category. This is based on the experimental observation (see Chapter 8) that
the mean output values of opposite emotions are well differentiated, in both the
five-output and the three-output case.
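The two-threshold decision can be sketched for a three-output case as follows. The class indices (0 = low, 1 = central, 2 = high), the table of opposite classes and the function names are assumptions made for this illustration. The actual mapping used by confusion_th.c is not given in this chapter.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_OUT 3
#define CENTRAL 1   /* assumed index of the central class */

/* Assumed layout: OPPOSITE[i][j] != 0 marks output j as lying on the
   opposite side of the axis from class i. */
static const int OPPOSITE[NUM_OUT][NUM_OUT] = {
    {0, 0, 1},   /* low     is opposed to high */
    {0, 0, 0},   /* central has no opposite    */
    {1, 0, 0}    /* high    is opposed to low  */
};

/* Winner-takes-all: index of the largest output. */
int wta(const double *out, size_t n)
{
    size_t best = 0;
    assert(out != NULL && n > 0);
    for (size_t i = 1; i < n; i++)
        if (out[i] > out[best])
            best = i;
    return (int)best;
}

/* Apply both thresholds to the WTA candidate and return the final class. */
int decide(const double *out, double th1, double th2)
{
    int w = wta(out, NUM_OUT);

    if (out[w] <= th1)               /* Threshold 1: winner not strong enough */
        return CENTRAL;
    for (int j = 0; j < NUM_OUT; j++)
        if (OPPOSITE[w][j] && out[j] > th2)   /* Threshold 2: opposite too high */
            return CENTRAL;
    return w;
}
```

With th1 = 0.5 and th2 = 0.3, for example, the outputs (0.1, 0.2, 0.9) keep the high class as winner, whereas (0.4, 0.2, 0.9) fall back to the central class because the opposite output exceeds the second threshold.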
The mean outcome values of the testing set. For a teaching class A, the average
value of each of the outputs is calculated over all the patterns of that class. The average
is computed for each of the defined classes and for each of the output nodes (in our
experiments, each output node corresponds to a classification category), and a
representative graph is plotted.
[Figure 10.2: bar chart not recoverable from the source. Surviving axis labels: class 1,
class 2, ..., class N.]
Figure 10.2. Graphical representation of the mean output values of the training set generated by
The horizontal axis of figure 10.2 represents the teaching class and the vertical bars
are the mean values of the neural network output.
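The averaging behind this graph can be sketched as follows. The function name and the flat row-major array layout are assumptions made for this illustration and are not taken from confusion_th.c.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the mean of every output node for every teaching class.
   out  holds n rows of num_out values (row-major), one row per pattern.
   mean holds num_classes rows of num_out values and is overwritten.
   Rows of classes without patterns are left at zero. */
void mean_outputs(const int *teach_cls, const double *out, size_t n,
                  size_t num_classes, size_t num_out, double *mean)
{
    assert(teach_cls != NULL && out != NULL && mean != NULL);
    for (size_t c = 0; c < num_classes; c++) {
        size_t count = 0;

        for (size_t j = 0; j < num_out; j++)
            mean[c * num_out + j] = 0.0;
        for (size_t k = 0; k < n; k++) {
            if (teach_cls[k] < 0 || (size_t)teach_cls[k] != c)
                continue;
            for (size_t j = 0; j < num_out; j++)
                mean[c * num_out + j] += out[k * num_out + j];
            count++;
        }
        if (count > 0)
            for (size_t j = 0; j < num_out; j++)
                mean[c * num_out + j] /= (double)count;
    }
}
```

Each row of the result corresponds to one group of bars in the figure: the teaching class selects the group, and each value in the row is the height of the bar for one output node.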