Final Project – Speaker Recognition
Scott A. Pigg
5/8/2001

Abstract

It has been put forth that each person's voice might prove to be a feature that can be used to distinguish a particular person from others in much the same way as fingerprints have been used (Klevans, 16). If this assertion is true, it should be possible to mathematically analyze certain features of one's voice that can serve as distinguishing characteristics. One such feature is the pitch of one's voice, and computer algorithms have been developed which can perform pitch determination. Pitch is most useful in distinguishing between speakers of different gender or between juvenile and adult speakers. However, one can easily mask the pitch of one's voice (Klevans, 22), so a more reliable method for identification is desirable.

The resonances, or peaks, in the power spectral density (PSD) of one's voice signal are referred to as formants (Speech Production, 60). These formants are caused by the cavities of one's vocal tract (Klevans, 23). Formants may be the feature that can be used to distinguish between different speakers of the same gender. If this is the case, the Euclidean distance between the formants of different voice samples must be analyzed.

A recording of an individual's voice can be sampled and transformed into a digital file, which can be analyzed with computer software designed for sophisticated computations, such as MATLAB. In order to make this conversion, it is necessary to use a sufficiently high sampling rate. Once a digital file has been formed, a simple vector can be used to represent the speech signal. By manipulating the entries in this vector, one can manipulate the order of the sounds in the speech file. It is also a simple matter to add or filter out background noise. The tools of a computer can equally be applied to determine the Fourier transform, the PSD, the pitch, and the formants, and to manipulate large numbers of voice files.

Introduction

This experiment is the final project in a course involving the analysis of signals and systems. In this experiment, the tools of the computer software MATLAB will be employed to deal with speech files. It is intended to demonstrate methods for manipulating such files. The formants and average pitches of eighty-three different audio files will be analyzed in order to determine the accuracy in distinguishing between different speech files, both of the same speaker and of different speakers. The files have already been recorded, sampled, and converted into digital files, and the functions used are supplied by MATLAB or by the instructor, Dr. Hairong Qi. An outline of the manipulations to be performed on the various speech files is:

- Rearrange the order of a file from “signals and systems ECE 310” to “ECE 310 signals and systems.”
- Add Gaussian noise to the background of a file.
- Filter a signal recorded in a noisy background to exclude the noise.
- Use formant and pitch analysis to determine the three files closest to one's own file.

Approach

Preliminaries.
- Record four files of each person's voice in the class. Three should say “signals and systems ECE 310”; of these, one is at a slightly faster speed and one is with a noisy background. The fourth file should be each person saying their name.
- Download the pertinent files from the webpage panda.ece.utk.edu/hqi/ece310/project/final.htm by double clicking on the wavfiles.tar.gz link.
- Use gunzip and tar to decompress and untar the files (panda).
- Open MATLAB.

Part I.
- Use “wavread” to read one's speech file recorded at the slower speed into an m-file.
- Divide the number of samples in the signal by the sampling rate in order to determine the time length of the signal.
- Establish a time vector starting at 1/fs and progressing in 1/fs intervals.
- Plot the signal and estimate where the “ECE 310” part of the signal begins.
- Using “for” loops, move the “ECE 310” in front of the “signals and systems.”
- Use “wavwrite” to write the modified signal to a file.
- Using Windows Media Player, listen to the signal for accuracy.

Part II.
- Read one's speech file recorded at a slightly faster speed into an m-file.
- Establish a time vector.
- Use “randn”, in a fashion similar to that described on the above-mentioned website, to create a random vector the same size as the signal.
- Add the random vector to the original signal.
- Plot the original and modified signals.
- Write the modified signal to a file.
- Listen to the signal for accuracy.

Part III.
- Read one's speech file recorded with a noisy background into an m-file.
- Establish a time vector.
- Plot the shifted Fourier transform of the original signal.
- Determine the proper cut-off and generate a low-pass filter using the MATLAB function “butter.”
- Apply the low-pass filter to the original signal using “filter.”
- Plot the shifted Fourier transform of the modified signal.
- Write the modified signal to a file.
- Listen to the signal for accuracy.

Part IV.
- Read in one's speech file recorded at a slower speed.
- Apply the function “formant” (panda) to the signal to get the PSD.
- Apply the function “pickmax” (panda) to the PSD returned by “formant” to get the indices of the maxima.
- Normalize the indices by dividing by 128.
- Store the indices in a variable.
- Apply the function “pitch” (panda) and store the average pitch in a different variable.
- Establish two vectors, one to hold the Euclidean distance between the indices of the former file and those of each of the 83 speech files, and one to hold the corresponding difference in average pitch.
- Use “sprintf” and a “for” loop, in a fashion similar to that described on the website mentioned in the Preliminaries, in order to read in each file.
- During each iteration of the “for” loop, calculate the PSD of the present file, find and normalize its indices, and store the Euclidean distance between the indices of the very first file read in and those of the present file in the appropriate entry of the first vector; then calculate the average pitch and store the difference between the average pitch of the first file read in and that of the present file in the appropriate entry of the second vector.
- After exiting the loop, use “sort” to arrange the two vectors in order of smallest difference.
- Display the first four entries of the index vector returned by “sort”, which gives the location of each sorted entry in the vector before sorting. These numbers correspond to the four closest matches to the first file read in. The first number should correspond to the first file read in itself.

Experimental Results

Part I.

File “a01.wav” was found to be the file said at the slower speed. Using MATLAB, an m-file was opened and the file was read into the m-file using “wavread” (see appendix). A time vector was established by dividing the number of elements in the vector (30,000) by the sampling frequency (8,000 Hz) to find the maximum time of 3.75 sec. Since MATLAB vectors are indexed starting with 1, the time was designated to begin at 1/8000 sec and to proceed at 1/8000 sec intervals. The signal was plotted against this time vector in order to estimate where the “ECE 310” part begins. It was estimated that the “ECE 310” part of the signal began around 1.75 sec. Multiplying this time by the sampling frequency (8,000 Hz) led to the estimate that the “ECE 310” part of the signal began at entry 14,000 in the vector.

A new vector, the same size as the original, was created. Using a “for” loop, the last section of the original vector was copied to the first part of the new vector. Using another “for” loop, the first part of the original vector was copied to the last part of the new vector. “wavwrite” was used to write the modified signal out to the file “rearrange.wav”.
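
The copy performed by the two “for” loops can also be written with vector indexing. The sketch below is only an illustration on a short synthetic vector, not the actual recording; the names y, k, x, and x_loop and the split point are assumptions chosen for the example, where k is the number of samples that precede the part to be moved.

```matlab
% Minimal sketch: move the tail of a signal in front of its head.
% y and k are illustrative stand-ins, not values from the experiment.
y = (1:10)';              % synthetic column "signal" of 10 samples
k = 6;                    % assumed number of samples before the split

x = [y(k+1:end); y(1:k)]; % tail first, then head; same length as y

% Equivalent loop form, mirroring the two-loop approach
n = numel(y);
x_loop = zeros(n, 1);
for i = 1:n-k
    x_loop(i) = y(i+k);       % last n-k samples go to the front
end
for i = n-k+1:n
    x_loop(i) = y(i-(n-k));   % first k samples go to the back
end
```

For a real recording, a split time of t0 seconds corresponds to roughly k = round(t0*fs) samples; for example, 1.75 sec at 8,000 Hz gives entry 14,000.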

Windows Media Player was used to listen to the modified signal. The original estimate was found to be incorrect, and trial and error was used until the correct cut-off was found to be entry 16,000 in the original signal.

The result of the experiment done in Part I was a .wav file titled “rearrange.wav”. The “ECE 310” part of the original file, the file said at a slower speed, was moved to the beginning of the file. “rearrange.wav” consisted of the message “ECE 310 signals and systems,” as opposed to the original, which consisted of “signals and systems ECE 310.”

Part II.

The file said at a faster speed was found to be “a02.wav”. This file was read into MATLAB in a similar fashion to Part I. This signal was found to be represented by a 27,425 element vector with the same sampling frequency as Part I. A time vector with a maximum time of 3.428125 sec was generated. Using the “randn” function and code copied from the webpage (panda) mentioned in the Preliminaries, a random vector the same size as the original signal was generated with a Gaussian distribution, a standard deviation of 0.05, and a mean of 0. This random vector was then added to the original signal to create a noisy representation. “wavwrite” was used to write the modified signal out to the file “addnoise.wav”. Using Windows Media Player, the modified signal was listened to in order to check the new signal for adequate distortion.

The result of the experiment done in Part II was a .wav file titled “addnoise.wav”, in which Gaussian noise was added to the file said at a faster speed. A single plot with the original signal plotted above the modified signal was also generated.

Part III.

File “a63.wav” was found to be the file said with a noisy background. It was read into MATLAB and the function “fft” was used to take the Fourier transform of the signal. The magnitude of the Fourier transform was then plotted.
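
When plotting the shifted transform, it can help to label the horizontal axis in hertz rather than in bin numbers. The following is a sketch on a synthetic test tone; the variable names and the 1,000 Hz tone are assumptions made for illustration and are not taken from the experiment.

```matlab
% Minimal sketch: frequency axis in Hz for an fftshift-ed spectrum.
% The signal is a synthetic tone, used only for illustration.
fs = 8000;                      % sampling rate in Hz
N  = 800;                       % number of samples
t  = (1:N)'/fs;                 % time vector starting at 1/fs
y  = cos(2*pi*1000*t);          % assumed 1,000 Hz test tone

Y = fftshift(fft(y));           % zero frequency moved to the center

% Entry k of Y corresponds to frequency f(k); floor/ceil keep the
% axis correct for both even and odd N.
f = (-floor(N/2):ceil(N/2)-1)' * fs/N;

[~, imax] = max(abs(Y));        % strongest bin (one of the +/- pair)
peakHz = abs(f(imax));          % magnitude of its frequency

% plot(f, abs(Y)); xlabel('Frequency (Hz)'); ylabel('|Y(f)|');
```

An axis in hertz also makes it easier to choose a cut-off for “butter”, which takes the cut-off as a fraction of half the sampling rate: a cut-off read from the plot at fc hertz is entered as fc/(fs/2).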

The “butter” command was used to generate a low-pass filter (panda) with a normalized cut-off frequency of 0.15 and order 3. Since “butter” expresses the cut-off as a fraction of half the sampling rate, 0.15 corresponds to 600 Hz if the file was sampled at 8,000 Hz as in Parts I and II. The “filter” command was used to apply the low-pass filter to the signal. The Fourier transform of the modified signal was plotted and the modified signal was written to the file “filter.wav.” Media Player was used to check the modified signal for adequate filtering.

The result of the experiment done in Part III was a .wav file titled “filter.wav”. The original signal was said in the midst of a noisy background, and the modified signal had the noise filtered out. A single plot consisting of the Fourier transform of the original signal plotted over the Fourier transform of the filtered signal was also generated.

Part IV.

File “a01.wav” was read into an m-file. The function “formant” (panda) was applied to obtain the PSD of the file. The function “pickmax” (panda) was used to find the indices of the maxima in the PSD. The function “pitch” (panda) was used to obtain the average pitch of the signal contained in the file. Two vectors were created, one to hold the norm of the difference between the indices of the first file and those of each of the 83 files, and one to hold the corresponding difference in average pitch. Using “sprintf”, the 83 files were read into MATLAB (panda), their indices and average pitch were calculated, and the norm of the difference between each file and the first file was recorded in the appropriate vector. “sort” was then used to sort both vectors, and the first four entries in each vector were printed out to determine which files were predicted to be by the same speaker.

The predictions of this experiment were very poor. The three closest matching files by the analysis of formant indices were first “a04.wav”, then “a82.wav”, and lastly “a15.wav”. The three closest files by the analysis of the average pitch were first “a15.wav”, then “a19.wav”, and lastly “a71.wav”. None of these were spoken by the same speaker as “a01.wav”.
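
The matching step reduces to computing a distance from one reference feature vector to the feature vector of every file and sorting the results. The sketch below illustrates this with made-up feature values; the matrix feats and its numbers are assumptions for illustration only and do not come from “formant”, “pickmax”, or “pitch”.

```matlab
% Minimal sketch: rank files by Euclidean distance to a reference.
% Each row of feats is a hypothetical feature vector for one file;
% the numbers are invented for illustration only.
feats = [0.10 0.30 0.55;    % file 1 (reference)
         0.40 0.60 0.90;    % file 2
         0.11 0.29 0.56;    % file 3
         0.20 0.35 0.50];   % file 4

ref   = feats(1, :);                     % reference feature vector
nFile = size(feats, 1);
dist  = zeros(1, nFile);
for i = 1:nFile
    dist(i) = norm(ref - feats(i, :));   % Euclidean (2-norm) distance
end

[sortedDist, order] = sort(dist);        % ascending: closest first
closest = order(2:end);                  % skip order(1), the reference itself
```

Because the reference is also compared with itself, its distance is zero and it sorts first, which is why the first index reported should be the reference file.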

Discussion

In this experiment, I gained my first practical understanding of the application of signals and systems theory to a real problem. I was able to use computer software to manipulate digital audio signals by rearranging their contents, adding background noise to them, and filtering out noise that was already there. The cause of the inability to accurately distinguish between different audio files in Part IV is beyond the scope of my understanding; however, the project did afford me the opportunity to sharpen my MATLAB skills and to glimpse an application of the material I have learned in class. I have found the concept of manipulating speech files and matching speakers to be a very intriguing task to tackle. In the future, I would like to learn more about the manipulation of signals and how to apply it to find solutions to real problems.

Reference

panda.ece.utk.edu/hqi/ece310/project/final.htm
Klevans, Richard L. and Rodman, Robert D. Voice Recognition. Artech House: Boston, 15-31.
Speech Production, Labeling, and Characteristics. 51-74.

Appendix

Part I.

% Function: This code reads in the wave file from the first round at the slower speed, moves
%           ECE 310 in front of signals and systems, plots out the original and modified
%           signals, and writes the modified signal out to rearrange.wav
% Author:   Scott Pigg
% Date:     April 19, 2001

clear all;
clf;

[y,fs,nbits]=wavread('a01.wav');   % read in slower signal
t=1/fs:1/fs:3.75;                  % establish time vector
x=y;                               % establish vector of same size in order to hold modifications

subplot(211)
plot(t,y)                          % plot original signal

for i=1:16000,                     % move last part of original signal to first part of new signal
    x(i)=y(i+14000);
end

for i=16001:30000,                 % move first part of original signal to last part of new signal
    x(i)=y(i-16000);
end

subplot(212)
plot(t,x)                          % plot modified signal

wavwrite(x,'a:\project310\Experimental Results\rearrange.wav')   % write modified signal

Part II.

% Function: This code reads in the wave file from the first round said at a faster speed,
%           adds random noise to the signal, plots out the original and modified signals,
%           then writes the noisy signal to a file named addnoise.wav.
% Author:   Scott Pigg
% Date:     April 20, 2001

clear all;
clf;

[y,fs,nbits]=wavread('a02.wav');   % read in signal said at faster speed
t=(1:length(y))'/fs;               % establish time vector, one entry per sample

sigma = 0.05;                                  % std deviation
mu = 0;                                        % mean
n = randn(size(y))*sigma + mu*ones(size(y));   % get random signal

z=y+n;                             % apply random signal to original signal

subplot(211);
plot(t,y);                         % plot original signal
title('Part II. Original signal');
subplot(212);
plot(t,z);                         % plot modified signal
title('Modified signal');

wavwrite(z,'a:/addnoise.wav');     % write modified signal to addnoise.wav

Part III.

% Function: This code will read in the noisy take and apply a third-order low-pass
%           Butterworth filter with a normalized cut-off of 0.15 (a fraction of half the
%           sampling rate) to the signal. The filtered signal is then written to filter.wav.
% Author:   Scott Pigg
% Date:     April 24, 2001

clear all;
clf;

[y,fs,nbits]=wavread('a63.wav');   % read in the noisy file
t=1/fs:1/fs:3.428125;              % establish time vector

y1=fft(y);                         % Fourier Transform of original signal
y2=fftshift(y1);

No=length(y2);                     % number of samples
f=-No/2:No/2-1;
subplot(211);                      % plot the magnitude of the Fourier Transform
plot(f,abs(y2));                   % at the top of a 2 row plot
title('FFT of Original Signal');

order = 3;                         % order of filter
cut = .15;                         % cut-off frequency of filter (fraction of fs/2)
[B, A] = butter(order, cut);       % get low-pass filter
x = 4*filter(B, A, y);             % apply filter to noisy signal

x1=fft(x);                         % Fourier Transform of filtered signal
x2=fftshift(x1);

subplot(212);                      % plot the magnitude of the Fourier Transform
plot(f,abs(x2));                   % at the bottom of a 2 row plot
title('FFT of Filtered Signal');

wavwrite(x,'a:\filter.wav');       % write filtered signal to filter.wav

Part IV.

% Function: This code reads in each of the 83 .wav files, calculates the PSD, the indices,
%           the pitch, calculates the difference between the indices and pitch of each of
%           the 83 files and that of "a01.wav" (my file said at a slower speed), stores them
%           in two vectors, and sorts them to find the best matches.
% Author:   Scott Anthony Pigg
% Date:     May 7, 2001
% Acknowledgement: The functions formant(), pickmax(), pitch(), pitchacorr() which is called
%           by pitch(), and the code for sprintf were provided by Dr. Hairong Qi.

clear all;

[x, fs, nbits] = wavread('a01.wav');
[P, F] = formant(x);
[Pm,I1] = pickmax(P);
I1 = I1/128;
[t, f0, avgF01] = pitch(x, fs);

indexKennel=zeros(1,83);
pitchKennel=zeros(1,83);

% read in the files, calculate PSD, indices of maximums, and pitch, and store in vectors the
% difference between index and pitch of file and that of "a01.wav"
for i=1:83
    if i<10
        filename = sprintf('a0%i.wav', i);
    else
        filename = sprintf('a%i.wav', i);
    end
    [x, fs, nbits] = wavread(filename);

    % find formant of each signal and store in appropriate column of indexKennel
    [P, F] = formant(x);
    % find the maximums and indices of the PSD (P) of x
    [Pm,I] = pickmax(P);
    I = I/128;
    indexKennel(1,i)=(norm(I1-I));

    % find pitch of each signal and store in appropriate column of pitchKennel
    [t, f0, avgF0] = pitch(x, fs);
    pitchKennel(1,i)=(norm(avgF01-avgF0));
end

% sort the differences in indices and pitch, and output the 4 smallest differences in each
[A,Index]=sort(indexKennel);
Index(1,1:4)
[B,Pitch]=sort(pitchKennel);
Pitch(1,1:4)