1. Sound Recognition

Figure 1 – The Sound Recognizer

As illustrated in Figure 1, the sound recognizer is composed of four main modules: training, testing, classifier and data analysis (all implemented in MATLAB). Below we describe each of these modules in detail.

1.1. Training the Recognizer

Figure 2 – The Training Module

First, the recognizer must be trained to obtain a feature set that can be used to correctly classify data samples. This is accomplished by the training module, which consists of three main processes, as depicted in Figure 2.

The first task of the training module is to process the training data samples, that is, to transform them from raw data into a more suitable representation. This process starts by converting, if necessary, all the data samples to mono, by summing the values of both channels and dividing by two. From a practical viewpoint, this collapses the spatialization of the sound, which may be distributed along the stereo panorama, to a centred hearing point. Next, the amplitude of the signals is normalized, which guarantees that all the sounds have a similar volume level, reducing the possibility of discrepancies in their spectral properties. Finally, we cut each sound sample into 20 segments of 0.2 seconds, spread uniformly over its length (if the sound is shorter than 4 seconds, the segments overlap), and calculate the magnitude spectrogram of each segment. To guarantee that each of these matrices is non-negative, we take the modulus of each spectrogram.

After the magnitude spectrograms of all the sound segments are computed, we concatenate them into a matrix S, which is used as the input for the next process, the non-negative matrix factorization (NMF). We also construct a matrix C that stores the size of each spectrogram; later on, this aids the feature extraction process, as it gives us the indexes that identify each of the training sounds in the merged spectrogram.

The NMF algorithm requires the definition of two parameters: the cost function, for which we use a divergence function, and the update rule, for which we use multiplicative updates. Given a non-negative matrix S, NMF calculates two matrices Θ and Ρ, also non-negative, such that S ≈ Θ ∗ Ρ, where S is a magnitude spectrogram (such as the one computed previously), Θ is the mixing matrix (containing the spectra of the sounds, and representing the sound features, or axes of the new space in which the data is represented) and Ρ is the source signal matrix (containing the temporal envelopes of the sounds, that is, the sounds’ coordinates in the new space: the feature values). Ρ is processed in the following feature extraction phase, and Θ is later crucial for the testing phase.

Figure 3 – Section cut of a Ρ matrix

To characterize each segment, we first determine the respective subset of matrix Ρ (Figure 3) using the indexes in matrix C, which indicate the interval of columns of Ρ that belong to the processed sound segment. Each such subset holds the spectral properties of a given sound segment across the dimensions of the matrix. We create the segment’s feature vector by calculating, over all the matrix’s dimensions, the average value and the median of the spectral properties, as well as the spectral energy (the sum of the values). The two sketches below illustrate this pipeline.
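The preprocessing stage might look as follows in MATLAB. This is a minimal sketch rather than the actual implementation: the file name is hypothetical, spectrogram (Signal Processing Toolbox) is used with its default window settings since the text does not specify them, and the sound is assumed to be at least 0.2 seconds long.

    % Sketch of the audio preprocessing described above.
    [x, fs] = audioread('water_01.wav');   % hypothetical training sample

    % Stereo to mono: sum of both channels divided by two.
    if size(x, 2) == 2
        x = (x(:, 1) + x(:, 2)) / 2;
    end

    % Normalize the amplitude so all sounds have a similar volume level.
    x = x / max(abs(x));

    % Cut 20 segments of 0.2 s, spread uniformly over the sound's length;
    % if the sound is shorter than 4 s, the segments overlap.
    nSeg   = 20;
    segLen = round(0.2 * fs);
    starts = round(linspace(1, numel(x) - segLen + 1, nSeg));

    S = [];                % concatenated magnitude spectrograms
    C = zeros(nSeg, 1);    % number of spectrogram columns per segment
    for i = 1:nSeg
        seg  = x(starts(i) : starts(i) + segLen - 1);
        spec = abs(spectrogram(seg));   % modulus makes it non-negative
        S    = [S, spec];
        C(i) = size(spec, 2);
    end

For a full training set, this loop would be run over every sample, appending to the same S and C.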
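The factorization itself can be sketched with the multiplicative updates for the divergence cost from Lee and Seung (2000; see the bibliography). The rank K and the number of iterations are assumptions, as the text does not state them, and the element-wise divisions rely on MATLAB implicit expansion (R2016b or later).

    % NMF of S with multiplicative updates for the divergence cost.
    K = 20;  nIter = 200;      % assumed rank and iteration count
    [nBins, nCols] = size(S);
    Theta = rand(nBins, K);    % mixing matrix: spectra / sound features
    P     = rand(K, nCols);    % source signals: temporal envelopes
    for it = 1:nIter
        R = S ./ (Theta * P + eps);                       % ratio term
        P = P .* (Theta' * R) ./ (sum(Theta, 1)' + eps);
        R = S ./ (Theta * P + eps);
        Theta = Theta .* (R * P') ./ (sum(P, 2)' + eps);
    end

    % Feature extraction: slice P with the indexes in C and compute,
    % per dimension, the mean, the median and the energy (sum).
    nSegTotal = numel(C);
    Ftraining = zeros(nSegTotal, 3 * K);
    col = 1;
    for i = 1:nSegTotal
        Pseg = P(:, col : col + C(i) - 1);
        col  = col + C(i);
        Ftraining(i, :) = [mean(Pseg, 2)', median(Pseg, 2)', sum(Pseg, 2)'];
    end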
This process produces a training matrix, Ftraining, composed of the feature vectors of each individual sound segment. We can easily map a specific sound to its feature vector, as both have the same ordering: the position of a sound’s spectrogram in S is the same as the position of its feature vector in Ftraining.

1.2. The Testing Module

Figure 4 – Testing Module

The testing module is, like the training module, composed of three processes, shown in Figure 4 and described in this section. The samples processed for testing can be either individual sounds or sequences of sounds. The audio processing of these sounds is the same as in the training of the system, as explained in the previous section: the sound is read, converted to mono if it is in stereo, then normalized, and the magnitude spectrogram of each 0.2-second segment is calculated.

The testing feature values are likewise extracted from a matrix Ρ obtained from the test sample. However, unlike the training process, we do not use the NMF algorithm to compute this matrix. Instead, we rely on the mathematical manipulation of the relation S = Θ ∗ Ρ to obtain

    Ρtest segment = Θ⁻¹training ∗ Stest segment

This allows us to use the matrix Θ obtained from the training process to project the spectral properties of Stest segment onto those that define the training data. By doing so we create Ρtest segment, from which we extract the feature vectors that define the sound segment. As in the training module, we calculate the feature vector of each segment by computing the average value and the median of the spectral properties, as well as the spectral energy (the sum of the values), over all the matrix’s dimensions. These feature vectors are then gathered into a matrix Ftesting.

1.3. Classifier

The system uses a k-NN classifier with a Euclidean distance metric, where k is determined dynamically by the formula ceiling(√mean(sounds)), with sounds being a vector holding the number of training samples of each training class. The classifier’s feature space is composed of the matrix Ftraining, against which we classify the various test sounds represented by the matrix Ftesting. A matrix with the k nearest neighbours is calculated for each test sound. A test segment is assigned the class that occurs most often among its neighbours; a complete sound is assigned the class that occurs most often among its segments. The sketch below puts the projection and the classification together.
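This combined sketch is again illustrative and rests on stated assumptions: Stest and Ctest come from running the preprocessing sketch above on a test sound, trainLabels is a hypothetical vector holding the class index of each row of Ftraining, and Θ⁻¹ is read as the Moore–Penrose pseudo-inverse (pinv), since Θ is in general not square. knnsearch is the Statistics and Machine Learning Toolbox nearest-neighbour search.

    % Projection: P_test = pinv(Theta) * S_test.
    Ptest = pinv(Theta) * Stest;

    % Same per-segment features as in training (mean, median, energy).
    nTest    = numel(Ctest);
    Ftesting = zeros(nTest, 3 * size(Ptest, 1));
    col = 1;
    for i = 1:nTest
        Pseg = Ptest(:, col : col + Ctest(i) - 1);
        col  = col + Ctest(i);
        Ftesting(i, :) = [mean(Pseg, 2)', median(Pseg, 2)', sum(Pseg, 2)'];
    end

    % Dynamic k: ceiling of the square root of the mean class size.
    k = ceil(sqrt(mean(sounds)));

    % k nearest neighbours (Euclidean distance) of every test segment.
    idx = knnsearch(Ftraining, Ftesting, 'K', k);

    % Majority vote: each segment takes the most frequent class among
    % its neighbours; the complete sound takes the most frequent class
    % among its segments.
    segClass   = mode(trainLabels(idx), 2);
    soundClass = mode(segClass);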
1.4. Data Analysis

While not functionally relevant to the system itself, this final module is important because it automatically compiles the results in a form that is much easier to interpret and analyse. The system creates a series of spreadsheets, one for each test set, with information about each sound that was tested: the neighbours calculated by the classifier for the sound, the percentage of these neighbours that are effectively correct, the class that was assigned to the sound and its correctness, as well as the overall percentage of correct classifications for the given data set.

1.5. Results

The tests included two data sets created from real-world audio recordings. It is important to note that, as mentioned earlier, the maximum length taken from each sound sample for the training of the system is 4 seconds.

The first data set is composed of sounds representing three different concepts: “water”, “car” and “train”. The sounds used for each of the concepts are as follows:

Training samples
o four sounds for “water”, for a total of 51.910 seconds;
o four sounds for “car”, for a total of 1 minute and 17.380 seconds;
o four sounds for “train”, for a total of 3 minutes and 37.484 seconds.

Testing samples
o three sounds for “water”, for a total of 18.903 seconds and 96 segments;
o two sounds for “car”, for a total of 24.975 seconds and 125 segments;
o one sound for “train”, for a total of 4 minutes and 40.204 seconds and 1400 segments.

The results for this data set are presented in Table 1, which shows the recognition rates for the whole audio samples, and Table 2, which shows the recognition rates for the individual segments obtained from the audio files. In all the tables below, columns correspond to the actual class and rows to the assigned class; the Total row gives the percentage of each class’s samples that were correctly classified.

Assigned \ Actual   Water    Car   Train
Water                   2      0       0
Car                     1      2       0
Train                   0      0       1
Total                 67%   100%    100%

Table 1 – Confusion matrix for the complete samples of data set 1

Assigned \ Actual   Water    Car   Train
Water                  73      8       0
Car                    23    111     317
Train                   0      6    1083
Total                 76%    89%     77%

Table 2 – Confusion matrix for the sound segments of data set 1

The second data set adds to the first the concept “people”, for which we used a single sound of 23.472 seconds for training, and another sound of 2 minutes and 2.984 seconds (615 segments) for testing. The recognition rates are presented in Table 3 and Table 4, in the same fashion as the results of the first data set.

Assigned \ Actual   Water    Car   Train   People
Water                   2      0       0        0
Car                     1      1       0        0
Train                   0      0       1        0
People                  0      1       0        1
Total                 67%    50%    100%     100%

Table 3 – Confusion matrix for the complete samples of data set 2

Assigned \ Actual   Water    Car   Train   People
Water                  75      7       0        0
Car                    20     61     342       35
Train                   0      6    1053        0
People                  1     51       5      580
Total                 78%    49%     75%      94%

Table 4 – Confusion matrix for the sound segments of data set 2

For the first data set, only one of the complete samples belonging to the “water” class was misclassified, lowering the recognition rate of that class to 67%. The results for the individual 0.2-second segments show that most segments of this class were also correctly classified. While the segment-level rates for the other classes were lower than their sample-level counterparts, accuracy remained high, above 75% for all classes.

The second data set, with the addition of the new class “people”, continued to produce good overall results. Two complete samples were misclassified: one of the “car” class and the same “water” sample that was misclassified in the first data set. Analysing the recognition rates for the sound segments, we notice that “car” was the only class whose accuracy decreased, as both “water” and “train” maintained theirs and the new “people” class attained a very high value (94%).

1.6. Bibliography

Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562, 2000.

Ling Ma, Ben Milner, and Dan Smith. Acoustic environment classification. ACM Trans. Speech Lang. Process., 3(2):1–22, 2006.