1. Sound Recognition
Figure 1 – The Sound Recognizer
As illustrated in Figure 1, the sound recognizer is composed of four main modules: training,
testing, classifier and data analysis (all implemented in MATLAB). Below we describe each of
these modules in detail.
1.1. Training the Recognizer
Figure 2 – The Training Module
First, the recognizer must be trained to obtain a feature set that can be used to correctly
classify data samples. This is accomplished by the training module, which consists of three
main processes, as depicted in Figure 2.
The first task of the training module is to process the training data samples, that is, to transform them from raw audio into a more suitable representation. This process starts by converting, if necessary, all the data samples to mono, by averaging the two channels (summing their values and dividing by two). From a practical viewpoint, this collapses the spatialization of the sound, which may be distributed along the stereo panorama, to a centred hearing point. Next, the amplitude of the signals is normalized, which guarantees that all the sounds have a similar volume level, reducing the possibility of discrepancies in their spectral properties. Finally, we cut each sound sample into 20 segments of 0.2 seconds each, spaced uniformly along its length (if the sound lasts less than 4 seconds, the segments overlap), and calculate the magnitude spectrogram of each segment. To guarantee that each of these matrices is non-negative, we compute the modulus of each spectrogram.
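As an illustration, a minimal MATLAB sketch of this preprocessing stage could look as follows. The original code is not shown in the text, so the function name preprocess_sound and its details are our own; we assume the spectrogram function from the Signal Processing Toolbox and input samples of at least 0.2 seconds.

    function specs = preprocess_sound(filename)
    % Sketch of the preprocessing step: mono conversion, amplitude
    % normalization, segmentation and magnitude spectrograms.
    [x, fs] = audioread(filename);
    if size(x, 2) == 2                    % stereo -> mono: average the channels
        x = (x(:, 1) + x(:, 2)) / 2;
    end
    x = x / max(abs(x));                  % normalize the amplitude
    nSeg   = 20;                          % 20 segments of 0.2 s each
    segLen = round(0.2 * fs);
    % Uniformly spaced segment starts; they overlap when the sound is < 4 s.
    starts = round(linspace(1, length(x) - segLen + 1, nSeg));
    specs  = cell(1, nSeg);
    for i = 1:nSeg
        seg      = x(starts(i) : starts(i) + segLen - 1);
        specs{i} = abs(spectrogram(seg)); % modulus -> non-negative matrix
    end
    end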
After the magnitude spectrograms for all the sound segments are computed, we concatenate them into a matrix S, which will be used as the input for the next process, the non-negative matrix factorization (NMF). We also construct a matrix C that stores the size of each spectrogram. Later on, this will aid us in the feature extraction process, as it gives us the indexes that identify each of the training sounds in the merged spectrogram.
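A rough sketch of this bookkeeping, under the same assumptions (allSpecs standing for a cell array gathering the segment spectrograms of all training samples):

    % Concatenate all segment spectrograms column-wise into S and record,
    % for each segment, its number of columns in C.
    S = [];
    C = zeros(1, numel(allSpecs));
    for i = 1:numel(allSpecs)
        S = [S, allSpecs{i}];             %#ok<AGROW>
        C(i) = size(allSpecs{i}, 2);      % column count of segment i
    end
    % cumsum(C) then gives the last column of each segment inside S (and P).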
The NMF algorithm requires the definition of two parameters: the cost function, for which we use a divergence function, and the update rule, for which we use multiplicative updates. Given a non-negative matrix S, NMF calculates two, also non-negative, matrices Θ and Ρ such that S = Θ ∗ Ρ, where S is a magnitude spectrogram (such as the one computed previously), Θ is the mixing matrix (containing the spectra of the sounds and representing the sound features, or axes of the new space in which the data is represented) and Ρ is the source signal matrix (containing the temporal envelopes of the sounds, that is, the sounds’ coordinates in the new space: the feature values). Ρ will be processed in the following feature extraction phase, and Θ will later be crucial for the testing phase.
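The multiplicative updates for the divergence cost are those of Lee and Seung (2000; see the Bibliography). A compact MATLAB sketch, with a random initialization and an arbitrary fixed iteration count of our choosing, might be:

    function [Theta, P] = nmf_divergence(S, k, nIter)
    % NMF with the divergence cost and multiplicative updates
    % (after Lee & Seung, 2000). S: non-negative m x n matrix; k: rank.
    [m, n] = size(S);
    Theta = rand(m, k);                   % random non-negative initialization
    P     = rand(k, n);
    for it = 1:nIter
        R     = S ./ (Theta * P + eps);   % element-wise ratio S / (Theta*P)
        Theta = Theta .* (R * P') ./ (repmat(sum(P, 2)', m, 1) + eps);
        R     = S ./ (Theta * P + eps);
        P     = P .* (Theta' * R) ./ (repmat(sum(Theta, 1)', 1, n) + eps);
    end
    end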
Figure 3 – Section cut of a Ρ matrix
To characterize each segment, we first determine the respective subset of matrix Ρ (Figure 3) using the indexes in matrix C (which indicate the interval of columns in Ρ that corresponds to the processed sound segment). Each of these column subsets holds the spectral properties of a given sound segment across the dimensions (rows) of the matrix. We create a feature vector for that segment by calculating, for every dimension of the matrix, the average value and the median of the spectral properties, as well as the spectral energy (the sum of the values).
This process produces a training matrix, Ftraining, composed of the feature vectors of the individual sound segments. We can easily map a specific sound to its feature vector, as both have the same ordering; that is, the position of a sound’s spectrogram in S is the same as the position of its feature vector in matrix Ftraining.
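A sketch of this feature extraction, reusing the matrix C built earlier to delimit each segment’s columns in Ρ (variable names are, again, our assumptions):

    % Build Ftraining: one row of features per sound segment. For each
    % segment's slice of P we take the mean, median and sum (spectral
    % energy) along every dimension (row) of the slice.
    bounds = [0, cumsum(C)];              % segment boundaries inside P
    k = size(P, 1);                       % number of NMF components
    Ftraining = zeros(numel(C), 3 * k);
    for i = 1:numel(C)
        Pseg = P(:, bounds(i)+1 : bounds(i+1));
        Ftraining(i, :) = [mean(Pseg, 2); median(Pseg, 2); sum(Pseg, 2)]';
    end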
1.2. The Testing Module
Figure 4 – Testing Module
The testing module is, akin to the training module, composed of three processes, as shown in Figure 4, which are described in this section.
The samples to be processed for testing can be either individual sounds or sequences of sounds. The audio processing of these sounds is the same as in the training of the system, as explained in the previous section: the sound is read, converted to mono if it is in stereo, then normalized, and the magnitude spectrogram of each 0.2-second segment is calculated.
Logically, the testing feature values are extracted from a matrix Ρ that we obtain from a test sample. However, differently from the training process, we do not use the NMF algorithm to compute this matrix. Instead, we rely on the mathematical manipulation of the equation S = Θ ∗ Ρ to obtain

Ρ_test segment = Θ_training⁻¹ ∗ S_test segment.

This allows us to use the matrix Θ obtained from the training process to project the spectral properties of S_test segment onto those that define the training data. By doing so we create Ρ_test segment, from which we extract the feature vectors that define the sound segment. As in the training module, we calculate the feature vector of each segment by computing the average value and median of the spectral properties, as well as the spectral energy (the sum of the values), along all the matrix’s dimensions. These feature vectors are then gathered to create matrix Ftesting.
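Since Θ is in general non-square, a practical reading of Θ_training⁻¹ is the Moore–Penrose pseudo-inverse; this is our assumption, as the text does not specify how the inversion is performed. A sketch:

    % Project a test segment's magnitude spectrogram onto the training basis
    % and extract its feature row. pinv is used because Theta is generally
    % non-square (assumption; the text only states Theta is inverted).
    ThetaInv = pinv(Theta);               % computed once, after training
    Ptest    = ThetaInv * Stest;          % coordinates in the training space
    ftest    = [mean(Ptest, 2); median(Ptest, 2); sum(Ptest, 2)]';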
1.3. Classifier
The system uses a k-NN classifier with the Euclidean distance metric, where k is determined dynamically by the formula k = ⌈√mean(sounds)⌉, where sounds holds the number of training samples of each training class.
The classifier’s feature space is given by the matrix Ftraining, against which the test sounds represented by the matrix Ftesting are classified. A matrix with the k nearest neighbours is calculated for each test sound. The class of a test segment is assigned by majority vote among its neighbours; the class of a complete sound is the class most frequently assigned to its segments.
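A sketch of this step using knnsearch from the Statistics and Machine Learning Toolbox (the variable names, such as trainLabels and counts, are our assumptions):

    % k-NN classification with the Euclidean distance (knnsearch's default).
    % trainLabels: class index of each row of Ftraining;
    % counts: number of training samples per class.
    k   = ceil(sqrt(mean(counts)));            % dynamic neighbourhood size
    idx = knnsearch(Ftraining, Ftesting, 'K', k);
    segClass   = mode(trainLabels(idx), 2);    % majority vote per segment
    soundClass = mode(segClass);               % majority vote over the
                                               % segments of one test sound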
1.4. Data Analysis
While not functionally relevant to the system itself, this final module is important because it automatically compiles the results into a form that is much easier to interpret and analyse. The system creates a series of spreadsheets, one per test set, with information about each tested sound: the neighbours calculated by the classifier, the percentage of these neighbours that are effectively correct, the class assigned to the sound and whether it is correct, as well as the overall classification accuracy for the given data set.
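As an illustration, the per-sound part of such a spreadsheet could be produced with writetable (a sketch; the column layout and variable names are our assumptions):

    % Write one spreadsheet per test set with the per-sound results.
    T = table(soundNames, pctCorrectNeighbours, assignedClass, isCorrect, ...
        'VariableNames', {'Sound', 'PctCorrectNeighbours', 'AssignedClass', 'Correct'});
    writetable(T, 'testset1_results.xlsx');   % one file per test set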
1.5. Results
The tests involved two data sets created from real-world audio recordings. It is important to note that, as mentioned earlier, at most 4 seconds (20 segments of 0.2 seconds) are taken from each sound sample for the training of the system.
The first data set is composed of sounds representing three different concepts, “water”, “car”
and “train”. The different sounds used for each of the concepts are as follows:
- Training samples:
  - four sounds for “water”, for a total of 51.910 seconds;
  - four sounds for “car”, for a total of 1 minute and 17.380 seconds;
  - and four sounds for “train”, for a total of 3 minutes and 37.484 seconds.
- Testing samples:
  - three sounds for “water”, for a total of 18.903 seconds and 96 segments;
  - two sounds for “car”, for a total of 24.975 seconds and 125 segments;
  - and one sound for “train”, for a total of 4 minutes and 40.204 seconds and 1400 segments.
The results for this data set are presented in Table 1, which shows the recognition rates for whole audio samples, and Table 2, which shows the recognition rates over the total number of segments obtained from the audio files.
             Water    Car    Train
    Water      2       0       0
    Car        1       2       0
    Train      0       0       1
    Total     67%    100%    100%
Table 1 – Confusion matrix for the complete samples of data set 1 (columns: actual class; rows: assigned class)
             Water    Car    Train
    Water     73       8       0
    Car       23     111     317
    Train      0       6    1083
    Total     76%     89%     77%
Table 2 – Confusion matrix for the sound segments of data set 1 (columns: actual class; rows: assigned class)
The second data set adds to the first the concept “people”, for which we used a single sound of 23.472 seconds for training, and another sound for testing, with a length of 2 minutes and 2.984 seconds and 615 segments.
The recognition rates are presented in Table 3 and Table 4, in a fashion similar to the results for the first data set.
             Water    Car    Train   People
    Water      2       0       0       0
    Car        1       1       0       0
    Train      0       0       1       0
    People     0       1       0       1
    Total     67%     50%    100%    100%
Table 3 – Confusion matrix for the complete samples of data set 2 (columns: actual class; rows: assigned class)
             Water    Car    Train   People
    Water     75       7       0       0
    Car       20      61     342      35
    Train      0       6    1053       0
    People     1      51       5     580
    Total     78%     49%     75%     94%
Table 4 – Confusion matrix for the sound segments of data set 2 (columns: actual class; rows: assigned class)
For the first data set, only one of the complete samples belonging to the “water” class was misclassified, lowering the recognition rate of that class to 67%. Looking at the results for the individual 0.2-second segments, most segments of this class were correctly classified. While the segment-level rates for the other classes were lower than their sample-level rates, accuracy remained above 75% for all classes.
The second data set, with the addition of the new class “people”, continued to produce good overall results. Two complete samples were misclassified: one of the “car” class and the same “water” sample that was misclassified in the first data set. Analyzing the recognition rates for the sound segments, we notice that “car” was the only class with a decrease in accuracy; both “water” and “train” essentially maintained theirs, and the new “people” class attained a very high value (94%).
1.6. Bibliography
Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562, 2000.
Ling Ma, Ben Milner, and Dan Smith. Acoustic environment classification. ACM Trans. Speech Lang. Process., 3(2):1–22, 2006.