Final Project – Speaker Recognition

Scott A. Pigg
5/8/2001
Abstract
It has been put forth that each person’s voice might prove to be a feature that can
be used to distinguish a particular person from others in much the same way as
fingerprints have been used (Klevans, 16). If this assertion is true, it should be possible
to mathematically analyze certain features of one’s voice that can serve as distinguishing
characteristics. One such feature is the pitch of one’s voice, and computer algorithms
have been developed which can perform pitch determination. Pitch is most useful in
distinguishing between speakers of different gender or between juvenile and adult
speakers. However, one can easily mask the pitch of one’s voice (Klevans, 22), so a
more reliable method for identification is desirable. The resonances, or peaks, in the
power spectral density (PSD) of one’s voice signal are referred to as formants (Speech
Production, 60). These formants are caused by the cavities of one’s vocal tract (Klevans,
23). Formants may therefore be the feature that can distinguish between different
speakers of the same gender. To test this, the Euclidean distance between the
formants of different voice samples must be analyzed.
A recording of an individual’s voice can be sampled and transformed into a
digital file, which can be analyzed with computer software, such as MATLAB, designed
for sophisticated computations. In order to make this conversion, it is necessary to use a
sufficiently high sampling rate. Once a digital file has been formed, a simple vector can
be used to represent the speech signal. By manipulating the entries in this vector, one can
manipulate the order of the sounds in the speech file. It is also a simple matter to add
or filter out background noise. The same tools can be applied to compute the Fourier
transform, the PSD, the pitch, and the formants, and to process large numbers of voice
files.
Introduction
This experiment is the final project in a course involving the analysis of signals
and systems. In this experiment, the tools of the computer software MATLAB will be
employed to work with speech files. It is intended to demonstrate methods for manipulating
such files. The formants and average pitches of eighty-three different audio files will be
analyzed in order to determine the accuracy in distinguishing between different speech
files, both of the same speaker and different speakers. The files have already been
recorded, sampled, and converted into digital files, and the functions used are supplied by
MATLAB or by the instructor, Dr. Hairong Qi. The manipulations to be
performed on the various speech files are:
• Rearrange the order of a file from “signals and systems ECE 310” to
“ECE 310 signals and systems.”
• Add Gaussian noise to the background of a file.
• Filter a signal recorded in a noisy background to exclude the noise.
• Use formant and pitch analysis to determine the three files closest to
one’s own file.
Approach
Preliminaries.
Record four files of each person’s voice in the class. Three are of the speaker saying
“signals and systems ECE 310”; of these, one is at a slightly faster speed and one has a
noisy background. The fourth is of the speaker saying their name. Download the
pertinent files from the webpage panda.ece.utk.edu/hqi/ece310/project/final.htm by
double-clicking on the wavfiles.tar.gz link. Use gunzip and tar to decompress and untar
the files (panda). Open MATLAB.
Part I.
• Use “wavread” to read one’s speech file recorded at the slower speed into an m-file.
• Divide the size of the signal by the sampling rate in order to determine the time
length of the signal.
• Establish a time vector starting at 1/fs and progressing in 1/fs intervals.
• Plot the signal and estimate where the “ECE 310” part of the signal begins.
• Using “for” loops, move the “ECE 310” in front of the “signals and systems.”
• Use “wavwrite” to write the modified signal to a file.
• Using Windows Media Player, listen to the signal for accuracy.
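The duration and time-vector steps above can be sketched as follows. This is only an illustration: a vector of zeros stands in for the speech signal that “wavread” would return, and the sample count and sampling frequency are the values reported for “a01.wav” in the Experimental Results.

```matlab
% Sketch only: a placeholder vector stands in for the speech signal.
fs = 8000;                   % sampling frequency in Hz
N  = 30000;                  % number of samples in the signal
y  = zeros(N, 1);            % placeholder signal (assumption: not real speech)

duration = length(y) / fs;   % time length of the signal in seconds
t = (1:length(y))' / fs;     % time vector: starts at 1/fs, steps of 1/fs

fprintf('duration = %.2f s, first time = %g s, last time = %.2f s\n', ...
        duration, t(1), t(end));
```

Building the time vector from the length of the signal, rather than from a typed-in end time, keeps it the same size as the signal for any file.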
Part II.
• Read one’s speech file recorded at slightly faster speed into an m-file.
• Establish a time vector.
• Use “randn” in a similar fashion as is described on the above-mentioned website
to create a random vector the same size as the signal.
• Add the random vector to the original signal.
• Plot the original and modified signal.
• Write the modified signal to a file.
• Listen to the signal for accuracy.
Part III.
• Read one’s speech file recorded with a noisy background into an m-file.
• Establish a time vector.
• Plot the shifted Fourier transform of the original signal.
• Determine the proper cut-off and generate a low-pass filter using the MATLAB
function “butter.”
• Apply the low-pass filter to the original signal using “filter.”
• Plot the shifted Fourier transform of the modified signal.
• Write the modified signal to a file.
• Listen to the signal for accuracy.
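When choosing a cut-off from the shifted Fourier transform, it can help to label the horizontal axis in Hz rather than in bin numbers. The sketch below is an illustration only: a synthetic 1000 Hz tone stands in for the recorded speech, and the sampling frequency of 8000 Hz is an assumption.

```matlab
% Sketch only: a synthetic tone stands in for the recorded speech.
fs = 8000;                        % sampling frequency in Hz (assumed)
N  = 8000;                        % number of samples (one second, even count)
t  = (1:N)' / fs;                 % time vector
y  = sin(2*pi*1000*t);            % test tone at 1000 Hz

Y  = fftshift(fft(y));            % transform with zero frequency in the center
f  = (-N/2 : N/2-1)' * fs / N;    % frequency of each shifted bin in Hz

[peak, k] = max(abs(Y));          % strongest component of the spectrum
peakHz = abs(f(k));               % its frequency with the sign removed
fprintf('strongest component near %g Hz\n', peakHz);
```

Calling plot(f, abs(Y)) would then show the two peaks of the tone at plus and minus 1000 Hz.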
Part IV.
• Read in one’s speech file recorded at a slower speed.
• Apply the function “formant” to the signal to get the PSD.
• Apply the function “pickmax” to the PSD returned by “formant” (panda) to get
the indices of the maximums.
• Normalize the indices by dividing by 128.
• Store the indices in a memory variable.
• Apply the function “pitch” (panda) and store the average pitch in a different memory
variable.
• Establish two vectors to hold the Euclidean distance between the indices and
average pitch of the former file and those of each of the 83 speech files.
• Use “sprintf” and a “for” loop in a fashion similar to that described on the website
mentioned in the preliminaries in order to read in each file. During each iteration
of the “for” loop, calculate the PSD of each file, calculate the indices of each file,
normalize the indices, store the Euclidean distance between the indices of the very
first file read in and the present file in the appropriate entry of the vector,
calculate the average pitch, and store the difference between the average pitch of
the first file read in and the present file in the appropriate entry of the vector.
• After exiting the loop, use “sort” to arrange the two vectors in order of smallest
difference.
• Display the first four entries of the vector that shows the location of each entry in
the vector before sorting.
• These numbers correspond to the four closest matches to the first file read in.
The first number should correspond to the first file read in.
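The matching steps above amount to a nearest-neighbor search by Euclidean distance. The sketch below is an illustration only: the feature values are made up and stand in for the normalized indices that “pickmax” would return, so no speech files or instructor-supplied functions are needed.

```matlab
% Sketch only: made-up feature vectors stand in for real formant indices.
ref = [0.10 0.25 0.40];             % features of the reference file
others = [0.10 0.25 0.40;           % file 1: identical to the reference
          0.50 0.60 0.90;           % file 2: far away
          0.12 0.24 0.41;           % file 3: very close
          0.30 0.30 0.30];          % file 4: moderately close

nFiles = size(others, 1);
dist = zeros(1, nFiles);
for i = 1:nFiles
    dist(i) = norm(ref - others(i, :));   % Euclidean distance to the reference
end

[sortedDist, order] = sort(dist);   % smallest distance first
closest = order(1:3);               % positions of the three nearest files
disp(closest);
```

Because the reference is compared with itself, the first entry has a distance of zero, which is why the first number displayed should correspond to the first file read in.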
Experimental Results
Part I.
File “a01.wav” was found to be the file said at the slower speed. Using
MATLAB, an m-file was opened and the file was read into the m-file using “wavread”
(see appendix). A time vector was established by dividing the number of elements in the
vector (30,000) by the sampling frequency (8,000 Hz) to find the maximum time of 3.75
sec. Since MATLAB vectors are indexed starting with 1, the time was designated to
begin at 1/8000 sec and to proceed at 1/8000 sec intervals. The resulting vector was
plotted in order to get an idea of where to start. It was estimated that the “ECE 310” part
of the signal began around 1.75 sec. Multiplying this time by the sampling frequency
(8,000 Hz) led to the estimate that the “ECE 310” part of the signal began at entry 14,000
in the vector. A new vector, the same size as the original, was created. Using a “for”
loop, the last section of the original vector was copied to the first part of the new vector.
Using another “for” loop, the first part of the original vector was copied to the last part of
the new vector. “wavwrite” was used to write the modified signal out to the file
“rearrange.wav”. Windows Media Player was used to listen to the modified signal. The
original estimate was found to be incorrect, and trial and error was used
until the correct cut-off was found to be entry 16,000 in the original signal. The result of
Part I was a .wav file titled “rearrange.wav”, in which the “ECE 310”
part of the original file, the file said at a slower speed, was moved to the beginning.
“rearrange.wav” consisted of the message “ECE 310 signals and systems,” as
opposed to the original, which consisted of “signals and systems ECE 310.”
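The same rearrangement can also be written with vector indexing instead of “for” loops. The sketch below is an illustration only: a ten-sample vector stands in for the speech signal, and the variable cut is a hypothetical name for the last sample of the first phrase.

```matlab
% Sketch only: a short placeholder vector stands in for the speech signal.
fs = 8000;                          % sampling frequency in Hz
cutEstimate = round(1.75 * fs);     % a time of 1.75 sec maps to entry 14000

y   = (1:10)';                      % placeholder signal
cut = 6;                            % samples 1..cut hold the first phrase,
                                    % samples cut+1..end hold the second
x = [y(cut+1:end); y(1:cut)];       % second phrase first, then the first
disp(x');
```

With this form, a trial-and-error search for the split point only requires changing the single value of cut.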
Part II.
The file said at a faster speed was found to be “a02.wav”. This file was read into
MATLAB in a similar fashion to Part I. This signal was found to be represented by a
27,425 element vector with the same sampling frequency as Part I. A time vector with a
maximum time of 3.428125 sec was generated. Using the “randn” function, a random vector
the same size as the original signal was generated. Using code copied and pasted from
the webpage (panda) mentioned in the Preliminaries, the vector was generated with a
Gaussian distribution, a specified standard deviation of 0.05, and a mean of 0. This random
vector was then added to the original signal to create a noisy representation. “wavwrite”
was used to write the modified signal out to the file “addnoise.wav”. Using Windows
Media Player, the modified signal was listened to in order to check the new signal for
adequate distortion. The result of Part II was a .wav file titled
“addnoise.wav”, in which Gaussian noise was added to the file said at a faster speed. A single
plot with the original signal plotted above the modified signal was also generated.
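The noise generation can be checked numerically before listening to the result. The sketch below is an illustration only: a synthetic sine wave stands in for the speech signal, while the standard deviation and mean are the values reported above.

```matlab
% Sketch only: a synthetic sine stands in for the speech signal.
fs    = 8000;                       % sampling frequency in Hz
y     = sin(2*pi*200*(1:fs)'/fs);   % placeholder signal, one second long
sigma = 0.05;                       % standard deviation of the noise
mu    = 0;                          % mean of the noise

n = sigma*randn(size(y)) + mu;      % Gaussian noise, same size as y
z = y + n;                          % noisy signal

fprintf('sample mean %.4f, sample std %.4f\n', mean(n), std(n));
```

The printed sample statistics should land close to the requested mean and standard deviation, though they vary slightly from run to run because the noise is random.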
Part III.
File “a63.wav” was found to be the file said with a noisy background. It was
read into MATLAB and the function “fft” was used to take the Fourier transform of the
signal. The magnitude of the Fourier transform was then plotted. The “butter” command
was used to generate a low-pass filter (panda) with normalized cut-off frequency 0.15 and order 3.
The “filter” command was used to apply the low-pass filter to the signal. The Fourier
transform of the modified signal was plotted and the modified signal was written to the
file “filter.wav.” Media Player was used to check the modified signal for adequate
filtering. The result of Part III was a .wav file titled “filter.wav”.
The original signal was said in the midst of a noisy background, and the modified signal
had the noise filtered out. A single plot consisting of the Fourier transform of the
original signal plotted over the Fourier transform of the filtered signal was also generated.
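The cut-off passed to “butter” is a fraction of half the sampling frequency, so it can be converted to Hz with one line of arithmetic. The sketch below is an illustration only and assumes that “a63.wav” uses the same 8,000 Hz sampling frequency reported for the other files.

```matlab
% Sketch only: fs is an assumption; the report states it for a01 and a02.
fs = 8000;                  % sampling frequency in Hz (assumed)
Wn = 0.15;                  % normalized cut-off passed to butter
fc = Wn * (fs/2);           % cut-off in Hz; 1.0 corresponds to fs/2
WnBack = fc / (fs/2);       % converting a cut-off in Hz back again

fprintf('cut-off of %.2f corresponds to about %.0f Hz\n', Wn, fc);
```

Under that assumption, a cut-off of 0.15 corresponds to roughly 600 Hz.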
Part IV.
File “a01.wav” was read into an m-file. The function “formant” (panda) was applied
to obtain the PSD of the file. The function “pickmax” (panda) was used to find the
indices of the maximums in the PSD. The function “pitch” (panda) was used to obtain
the average pitch of the signal contained in the file. Vectors were created to hold the
norm of the difference between the indices and average pitch of the first file and those of
each of the 83 files. Using “sprintf”, the 83 files were read into MATLAB (panda), their
indices and average pitch were calculated, and the norm of the difference between each
file and the first file was recorded in a vector. “sort” was then used to sort both vectors,
and the first four entries in each vector were printed out to determine which files were
predicted to be by the same speaker. The predictions of this experiment were very poor.
The three closest matching files by the analysis of formant indices were first “a04.wav”,
then “a82.wav”, and lastly “a15.wav”. The three closest files by the analysis of the
average pitch were first “a15.wav”, then “a19.wav”, and lastly “a71.wav”. None of these
were spoken by the same speaker as “a01.wav”.
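The file names used in the loop follow a fixed pattern, so they can be generated with a single “sprintf” format. The sketch below is an illustration only: it builds the names without reading any audio, and the cell array names is a hypothetical variable.

```matlab
% Sketch only: builds the file names without reading any audio.
names = cell(1, 83);
for i = 1:83
    names{i} = sprintf('a%02d.wav', i);   % %02d pads the number to two digits
end
fprintf('%s ... %s\n', names{1}, names{83});
```

The zero-padded format removes the need for a separate branch for file numbers below ten.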
Discussion
In this experiment, I gained my first practical understanding of the application of signals
and systems theory to a real problem. I was able to use computer software to
manipulate digital audio signals by rearranging their contents, adding background noise
to them, and filtering out noise that was already there. The cause of the inability to
accurately distinguish between different audio files in Part IV is beyond the scope of my
understanding; however, the experiment did afford me the opportunity to sharpen my MATLAB
skills and to glimpse an application of the material I have learned in class. I have found the
concept of manipulating speech files and matching speakers to be a very intriguing
task to tackle. In the future, I would like to learn more about the manipulation of signals
and how to apply it to find solutions to real problems.
References
• panda.ece.utk.edu/hqi/ece310/project/final.htm
• Klevans, Richard L. and Rodman, Robert D. Voice Recognition. Artech House:
Boston. 15-31.
• Speech Production, Labeling, and Characteristics. 51-74.
Appendix
Part I.
% Function: This code reads in the wave file from the first round at the slower speed, moves
%           ECE 310 in front of signals and systems, plots out the original and modified
%           signals, and writes the modified signal out to rearrange.wav
% Author:   Scott Pigg
% Date:     April 19, 2001
clear all;
clf;
[y,fs,nbits]=wavread('a01.wav');    % read in slower signal
t=1/fs:1/fs:3.75;                   % establish time vector
x=y;                                % establish vector of same size in order to hold modifications
subplot(211)
plot(t,y)                           % plot original signal
for i=1:16000,                      % move last part of original signal to first part of new signal
    x(i)=y(i+14000);
end
for i=16001:30000,                  % move first part of original signal to last part of new signal
    x(i)=y(i-16000);
end
subplot(212)
plot(t,x)                           % plot modified signal
wavwrite(x,'a:\project310\Experimental Results\rearrange.wav')    % write modified signal
Part II.
% Function: This code reads in the wave file from the first round said at a faster speed,
%           adds random noise to the signal, plots out the original and modified signals,
%           then writes the noisy signal to a file named addnoise.wav.
% Author:   Scott Pigg
% Date:     April 20, 2001
clear all;
clf;
[y,fs,nbits]=wavread('a02.wav');                % read in signal said at faster speed
t=(1:length(y))/fs;                             % establish time vector (one entry per sample)
sigma = 0.05;                                   % std deviation
mu = 0;                                         % mean
n = randn(size(y))*sigma + mu*ones(size(y));    % get random signal
z=y+n;                                          % apply random signal to original signal
subplot(211);
plot(t,y);                                      % plot original signal
title('Part II. Original signal');
subplot(212);
plot(t,z);                                      % plot modified signal
title('Modified signal');
wavwrite(z,'a:/addnoise.wav');                  % write modified signal to addnoise.wav
Part III.
% Function: This code will read in the noisy take and use a low-pass Butterworth filter
%           with order 3 and a normalized cut-off of 0.15 to filter the signal.
%           The filtered signal is then written to filter.wav.
% Author:   Scott Pigg
% Date:     April 24, 2001
clear all;
clf;
[y,fs,nbits]=wavread('a63.wav');    % read in the noisy file
t=1/fs:1/fs:3.428125;               % establish time vector
y1=fft(y);                          % Fourier Transform of original signal
y2=fftshift(y1);
No=length(y2);                      % number of points in the transform
f=-No/2:No/2-1;
subplot(211);                       % plot the magnitude of the Fourier Transform
plot(f,abs(y2));                    % at the top of a 2 row plot
title('FFT of Original Signal');
order = 3;                          % order of filter
cut = .15;                          % cut-off frequency of filter
[B, A] = butter(order, cut);        % get low-pass filter
x = 4*filter(B, A, y);              % apply filter to noisy signal
x1=fft(x);                          % Fourier Transform of filtered signal
x2=fftshift(x1);
subplot(212);                       % plot the magnitude of the Fourier Transform
plot(f,abs(x2));                    % at the bottom of a 2 row plot
title('FFT of Filtered Signal');
wavwrite(x,'a:\filter.wav');        % write filtered signal to filter.wav
Part IV.
% Function: This code reads in each of the 83 .wav files, calculates the PSD, the indices,
%           the pitch, calculates the difference between the indices and pitch of each of
%           the 83 files and that of "a01.wav" (my file said at a slower speed), stores them
%           in two vectors, and sorts them to find the best matches.
% Author:   Scott Anthony Pigg
% Date:     May 7, 2001
% Acknowledgement: The functions formant(), pickmax(), pitch(), pitchacorr() which is called
%           by pitch(), and the code for sprintf were provided by Dr. Hairong Qi.
clear all;
[x, fs, nbits] = wavread('a01.wav');
[P, F] = formant(x);
[Pm,I1] = pickmax(P);
I1 = I1/128;
[t, f0, avgF01] = pitch(x, fs);
indexKennel=zeros(1,83);
pitchKennel=zeros(1,83);
% read in the files, calculate PSD, indices of maximums, pitch, and store in vectors the
% difference between index and pitch of file and that of "a01.wav"
for i=1:83
    if i<10
        filename = sprintf('a0%i.wav', i);
    else
        filename = sprintf('a%i.wav', i);
    end
    [x, fs, nbits] = wavread(filename);
    % find formant of each signal and store in appropriate column of indexKennel
    [P, F] = formant(x);
    % find the maximums and indices of the PSD (P) of x
    [Pm,I] = pickmax(P);
    I = I/128;
    indexKennel(1,i)=(norm(I1-I));
    % find pitch of each signal and store in appropriate column of pitchKennel
    [t, f0, avgF0] = pitch(x, fs);
    pitchKennel(1,i)=(norm(avgF01-avgF0));
end
% sort the differences in indices and pitch, and output the 4 smallest differences in each
[A,Index]=sort(indexKennel);
Index(1,1:4)
[B,Pitch]=sort(pitchKennel);
Pitch(1,1:4)