Bathiya Senevirathna

The Effect of Noise on Automatic Speech Segmentation Algorithms

1. Introduction

We first familiarized ourselves with using a numerical analysis environment to analyze sound files. The software we used was MATLAB, developed by MathWorks. We used MATLAB’s built-in ‘wavread’ function to import sound file data into a numerical array. Using this command introduced us to the concept of sampling frequency. Sound in the physical world is analog, and in order to analyze it by computer it is necessary to convert this continuous signal into discrete, digital data. The sampling frequency is the rate at which samples of the analog signal are taken to form digital data points. This value is usually given in samples per unit time and is very important to this speech analysis research because the sampling rate essentially sets the quality of the data available to work with. After some experimenting with recording and playback of sounds at different sampling rates, we set about analyzing actual speech.

2. The Waveform

Figure 2.1 below shows an energy plot of the original sound I recorded (“She sells sea shells on the sea shore”) at a sampling rate of 10 kHz for 8 seconds. As can be seen, each spoken word stands out quite prominently when looking at its energy. We then went about trying to automatically identify words using an algorithm based on the energy levels of the sound data.

Figure 2.1 – Plot of original sound "She sells sea shells on a sea shore", 8 s @ 10 kHz (axes: Amplitude vs. Time in seconds)

First we set an unvoiced threshold, an energy level below which we assume the sound to represent silence. Then we set a word boundary threshold, the minimum length of time an unvoiced segment must last to count as a boundary between words.
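The two thresholds described above can be sketched in a few lines. This is only an illustration in Python, not the MATLAB code used in this work. It assumes that the "energy" of a sample is its absolute amplitude and that unvoiced stretches shorter than the word boundary threshold are absorbed into the surrounding word. The names label_voiced, count_words, energy_threshold and min_gap_s are hypothetical.

```python
def label_voiced(samples, rate_hz, energy_threshold, min_gap_s):
    """Return one boolean per sample: True where the sample is treated as voiced."""
    # Step 1 (assumption): a sample is voiced when its absolute amplitude
    # reaches the unvoiced threshold.
    voiced = [abs(x) >= energy_threshold for x in samples]

    # Step 2: an unvoiced run between two voiced samples that is shorter than
    # the word boundary threshold is too short to separate words, so it is
    # relabeled as voiced. Leading and trailing silence is left untouched.
    min_gap = int(round(min_gap_s * rate_hz))
    n = len(voiced)
    i = 0
    while i < n:
        if voiced[i]:
            i += 1
            continue
        j = i
        while j < n and not voiced[j]:
            j += 1
        # samples i .. j-1 form one unvoiced run
        if i > 0 and j < n and (j - i) < min_gap:
            for k in range(i, j):
                voiced[k] = True
        i = j
    return voiced


def count_words(voiced):
    """Count voiced segments, i.e. transitions from unvoiced to voiced."""
    return sum(1 for k, v in enumerate(voiced) if v and (k == 0 or not voiced[k - 1]))
```

For a recording stored as a list of amplitudes between -1 and 1, a call such as count_words(label_voiced(samples, 10000, 0.05, 0.1)) would give the number of segments found for one choice of the two thresholds.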
After some testing and experimenting, the most coherent results came from an energy threshold of 0.05 and a word length of 0.1 s. Figure 2.2 below shows my segmented speech, with the red trace representing unvoiced speech.

Figure 2.2 – Plot of segmented speech (panel title: Labeled Speech; legend: Unvoiced, Voiced; axes: Amplitude vs. Time in seconds)

The thresholds I used actually fractured the word “shells” into two segments, so nine words were found instead of eight.

3. Adding Noise

The automatic speech segmentation algorithm worked quite well, so we then decided to investigate what effect adding white Gaussian noise to the speech would have. The first step was to generate an array of random numbers with as many elements as there are samples in my speech file (8 s x 10 kHz = 80,000 samples). We then standardized this data. One parameter that we changed before adding the noise to the speech file was its scale, the ratio of the amplitude of my speech to the amplitude of the noise. I started with a scale of 5, i.e. my speech amplitude was 5 times the noise amplitude. As expected, my original thresholds still worked fine. In fact, after I increased the energy threshold to 0.055, the word “shells” was correctly segmented as one word. This did not happen when there was no noise. Next I reduced the scale to 4. It became apparent from the graph that I had to increase the energy threshold to help identify words.

Figure 3.1 – Plot of noisy speech (1/4 scaled noise) (panel title: Labeled Speech; legend: Unvoiced, Voiced; axes: Amplitude vs. Time in seconds)

My new parameters of 0.07 (+0.02) for energy and 0.15 s (+0.05 s) for word length were still enough to identify the correct words. When the scale was reduced to 2, the algorithm had a harder time distinguishing words from noise.
I had to increase my energy threshold to 0.12 (+0.07 over the original 0.05) but kept the same word length as with the scale of 4. It worked well, but the word “sea” was lost in the noise, which was expected because of the soft “s” sound. This seems to be the emerging pattern as the noise gets louder: the softer sounds are lost first. Since the words in this particular recorded speech have so many soft sounds (“She sells sea shells on the sea shore”), the success of this automatic speech segmentation algorithm falls very quickly.

Figure 3.2 – Plot of noisy speech (1/2 scaled noise) (panel title: Labeled Speech; legend: Unvoiced, Voiced; axes: Amplitude vs. Time in seconds)

Finally, with a 1:1 noise:speech scale, it was very difficult to find any words at all automatically.
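The noise step used throughout this section can be sketched as follows. This is a Python illustration rather than the MATLAB code used in this work, and it rests on two assumptions: that "standardized" means shifting and dividing the random numbers so they have zero mean and unit standard deviation, and that a scale of k means the standardized noise is divided by k before being added (consistent with the "1/4 scaled noise" and "1/2 scaled noise" figure captions). The names standardized_noise and add_noise are hypothetical.

```python
import random
import statistics


def standardized_noise(n, seed=0):
    """Generate n Gaussian random numbers, then standardize them.

    Assumption: standardizing means zero mean and unit standard deviation.
    """
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean = statistics.mean(raw)
    std = statistics.pstdev(raw)
    return [(x - mean) / std for x in raw]


def add_noise(speech, scale, seed=0):
    """Add white Gaussian noise to speech.

    Assumption: a scale of k divides the standardized noise by k, so a larger
    scale means quieter noise relative to the speech.
    """
    noise = standardized_noise(len(speech), seed)
    return [s + z / scale for s, z in zip(speech, noise)]
```

Under these assumptions, add_noise(speech, 4) would correspond to the 1/4 scaled case and add_noise(speech, 1) to the 1:1 case, and an 8 s recording at 10 kHz needs 8 x 10,000 = 80,000 noise samples.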