Development of a Voice Recognition Program May 9, 2002 T2 Michael Kohanski, Anna Marie Lipski, Justin Tannir, Tony Yeung ABSTRACT The objective of the following experiment was to develop a digital voice recognition program utilizing LabView and MatLab 5.3R11 that will be able to identify people uniquely by their voice. As seen from the demonstration, the programs developed equip us with the necessary tools to record, filter, and analyze different voice samples and compare them to the archived sample. Furthermore, any threshold value can be set, depending on the security level wanted.. The spectrogram program subtracts two fingerprints to get a relative difference. A large relative distance indicates little correlation between the voices. The results show that this program can accurately determine the difference between a male and a female voice and to a certain extent differentiate between different male voices. The peak comparison program is able to differentiate between four speakers saying the word “open” on multiple occasions. It compares the multiple repeats of the word to stored fingerprints of the four speakers saying, “open”. This program is a good foundation for allowing a user to identify a person based upon voice fingerprinting. BACKGROUND Voice recognition is not speaker-independent. It amplifies the idiosyncratic speech characteristics that individuate a person and suppresses the linguistic characteristics. The goal of this experiment was to identify, through comparative measures, the characteristic fingerprints of certain pitches and sounds. Speech recognition has been actively studied since the 1950s, however recent developments in computer and telecommunications technology have improved speech recognition capabilities. In a practical sense, speech recognition will solve problems, improve productivity, and change the way we run our lives. Academically, speech recognition holds considerable promise as well as challenges in the years to come for scientists and product developers. Two applications of digital speech processing, pitch recognition and voice fingerprinting, were modeled in order to create a working program in MatLab 5.3R11 that could differentiate among different frequencies. Pitch recognition was modeled initially in order to create a program that could classify and distinguish between different notes, and to determine the frequencies by FFT analysis at which each note resonated. After doing this we could then further develop our program to analyze and determine the differences in human voice. The voice fingerprinting method was utilized in order to differentiate voices among those who participated. Voice fingerprint refers to the distinct manner in which the air from an individual’s lungs passes through the vocal chords and resonates off of the pallet. It is through noise filtering and Fast Fourier Transform (FFT) analysis; comprehensive comparisons can distinctly identify the owner of a sample voice. Moreover, graphical representations of the voiceprint took the form of a speech spectrogram, which consists of a representation of a spoken utterance in which time is displayed on the abscissa and frequency on the ordinate. Voice Recognition consists of two major tasks: feature extraction and pattern recognition. Feature extraction attempts to discover characteristics of the speech signal unique to the individual speaker while pattern recognition refers to the matching of features in such a way as to determine, within probabilistic limits, whether two sets of features are from the same or different individual (peak to peak analysis of voices). Voices can be analyzed in either the temporal domain or the frequency domain. Features of a voice in the time domain are either evident in the raw signal, or they become more evident by deriving a secondary signal that has particular properties. The major time-domain parameters of interest are duration and amplitude. Features of the voice are identified by spectral analysis in the frequency domain. The FFT is the most commonly used technique that calculates the spectrum from a signal, and it was the method of choice in the following experiment. It provides a measure of the frequencies found in a given segment of a signal by decomposing it into its sine components. Furthermore, it allows for pitch extraction and the identification of fundamental frequencies, which was an essential component of the experiment. Areas of analysis included: parameter extraction and distance measurements. Parameter extraction, as discussed before, consists of preprocessing an electrical signal to transform it into a usable digital form, applying algorithms to extract only speaker-related information from the signal, and determining the quality of the extracted parameters. The two parameters of choice were pitch and frequency representations. Pitch allows for distinguishing between a male and a female voice, and frequency allows for the representation of a signal in various time frames. This is equivalent to a spectrogram in numerical form. The numerical form of a spectrogram was computed utilizing the FFT algorithm. It is from the distance measurements that one can determine how much two voices differ from each other. Here, a threshold value can be decided (which can be set to whatever value desired, depending on the amount of security needed), and if the distance between the unknown speaker and the known speaker is greater than the threshold value, the unknown is rejected. THEORY & METHOD Peak Analysis Program The characteristic properties of one’s voice is a function/product of many parameters including the dynamics of sound as it passes through the pharynx, the vibration of the larynx, the shape of the mouth and how sound reverberates off of the palette. Due to the uniqueness complexity of one’s voice, a voice has fingerprint qualities, i.e. no two people’s voice patterns are exactly alike. The method of FFT analysis maps the frequency fingerprint of a person’s voice, in order to systematically determine if another sample is or is not from the same subject. This program gathers the FFT fingerprint from the subject in question and compares it to four previously stored fingerprints in order to determine if the subject’s voice matches one of the archived voices. The program itself is divided into five functions. wav_gather.m: This is the main function and it searches out the file(*.wav) of the subject, checks/converts the data to mono from stereo (ie takes only input from one speaker) and outputs to wav_plot.m wav_plot.m: Sends the raw data to wav_filter.m to be filtered. Then it gathers the filtered data and plots the FFT, spectral density and original waveform wav_filter.m: Filters the raw data with high pass, low pass, bandpass, bandstop, or no filter. peak_compare.m: Sends the FFT to peak_finder.m to get the peaks. The peaks are then returned and compared against archived fingerprints. Peak_finder.m: Analyzes filtered data and picks out peaks via a complex downward scan of the FFT up-to a preset threshold value. Spectrogram Program In addition to the frequency domain, analysis can also be performed in the timefrequency domain using the spectrogram function in Matlab. The spectrogram performs a shorttime Fourier Transform on the input signal in the time domain and transforms it into an intensity frequency-time plot. The short time Fourier Transform is based on the Fourier Transform but the input signal, x(t), is multiplied by a shifted window function, w(t-τ). Equation 1 is the ShortTime Fourier Transform equation. (Eq. 1) Instead of performing a Fourier Transform on the whole signal, the signal is segmented into many sections and a short-time Fourier Transform is performed on each windowed segment. Figure 2 is a spectrogram of the A string (440Hz) on a violin. The top figure is the signal in a voltage-time plot and the bottom figure is the spectrogram. Each point (x1,y1) on the spectrogram indicates the intensity of the frequency at that particular time point. Pixel with red color has the highest intensity and that with blue has the lowest intensity. Figure 1 Input signal in a voltage-time plot and in a spectrogram. Since the spectrogram is essentially a “fingerprint” of a person’s voice, it shows particular features specific to the subject himself. For instance, the parallel bands observed in the above example show the harmonics of the A note. Based on the voice fingerprinting, different subjects will have different parallel bands as well as other features such as delays and time shifts. In this project, the spectrograms of two different subjects are subtracted from each other and the result is the relative difference in frequency intensity. The largest difference in the frequency intensity indicates that those two subjects vary the most. It is also important to note that the two voices are pre-aligned to obtain the best overlap between the two. This is achieved by using the crosscorrelation function between the two voices and obtaining the best correlation coefficient and the lag time. The following parameters were selected while using the peak compare program. Filter value is a normalized value, where an input of 0.5 corresponds to a value of 50% of the frequency. Percent difference in peaks corresponds to the maximum deviation in total number of peaks between an archive and a test sound. This will stop false positives from occurring for a test or an archive being a subset of the other. Maximum difference between the peaks corresponds to the maximum deviation in the frequency axis allowed between peaks in an archive versus a test subject for it to be considered the same peak. In general, this is a direct function the value used to define peak separation in the peak_finder.m function and has been defaulted to r/80, where r corresponds to the total number of data points in the FFT. parameter value/setting used filter type high pass filter value-normalized 0.2 peak-compare Yes percent diff in peaks 20 maximum diff between peak frequencies 100 Table 1 Parameter setting for the Peak Analysis programs METHODS & MATERIALS Materials 16 Bit Sound Card Microphone LabView Tuning forks MatLab 5.3R11 Programs Created ● Wave Gather (wav_gath) ● Wave Filter (wav_filt) ● Wave Plot (wav_plot) ● Peak Finder (peak_find) ● Peak Compare (peak_compare) Figure 2 Representation of the protocol. The specific protocol of the voice data collection required that all four of the participants speak into the microphone while running LabView. The settings for LabView data acquisition were at a sampling rate of 44100Hz. Four different individuals spoke chosen words into the microphone at different times: “subject,” and “open.” A total of four subjects participated in the experiment. Data was then saved in LabView and then re-opened, filtered, graphed, and analyzed in MatLab 5.3R11 by FFT analysis using the programs that were created. The archived voice was compared to all other voices recorded. The overall method of data acquisition was as followed: sound wave created by person’s speech was transuded into an analog electrical signal via mic the signal is sampled & quantized resulting in a digital representation of the analog signal. Wav_gather wav_plot Peak_comp wave_filt Peak_finder Figure 3 Order of Signal Analysis After signal collection, the analysis segment of the experiment utilized the programs made for the experiment. The data collected in LabView is transferred to MatLab and first, wave gather gathers the analog signal. Then wave plot plots the FFT of the signal, wave filter filters the original signal, and finally peak comparison and peak finder are used for the analysis of the signal, along with another signal to determine the owner of the voiceprint as well as if the two signals match. RESULTS Peak Finder & Compare Program Figure 4 is a raw wave file (voltage vs. time). ><(((*> m-open-normal2.wav Waveform <*)))>< 0.3 0.25 0.2 0.15 0.1 0.05 0 -0.05 -0.1 -0.15 -0.2 0 0.05 0.1 0.15 0.2 time 0.25 0.3 0.35 0.4 Figure 4 Raw signal in time domain. Table 2 is a sample of results from the peak compare program. Anna #2 Justin #2 Justin FAIL PASS Archived Vioce Data Mike Tony FAIL FAIL PASS FAIL Deviation Standard % Peak Difference Anna PASS FAIL +/- 100 20 Table 2 Sample of results from peak compare program The FFT on the word “open” spoken by Mike in trial #2 is presented in figure 5. As labeled in the text box, the red marker labels the peaks obtained from the FFT directly beneath it in blue. It could be seen that the red peaks overlap almost exactly on top of the peaks from the FFT in blue. This shows that the peak finder program is able to identify the peaks on a FFT accurately. Figure 5 An illustration of the red peaks identified by the peak finder program The FFT on the word “open” spoken by Anna in trial #1 and #2 is presented in figure 6. As labeled in the text box, the red represents the peaks obtained from FFT of trial #1 and the blue is the FFT of trial #2. It could be seen that the red peaks overlap to a great extent to the blue peaks from the FFT directly beneath it. Figure 6 The red peaks of Anna in trial #1 are compared to the blue peaks from trial #2 The FFT on the word “open” spoken by Anna in trial #2 and Justin in trial #1 is presented in the following Figure 7. It could be seen that the red peaks from Justin’s FFT do not match Anna’s peaks in the FFT (in blue) as closely as in the above figure, Figure 6. Figure 7 Red peaks from Justin are compared to Blue peaks from Anna Spectrogram Program The relative difference in frequency intensity between two speakers is presented in the following figure. A positive relative difference shows that the first subject (listed below each column) has higher frequency intensity than the second subject. However, the difference can be the result of delays, shifts or pitches at which the word is being voiced. The error bars are the standard deviations in the measurement. Relative Difference in Frequency Intensity between Speakers Relative Difference in Frequency Intensity 3500 3000 2500 2000 1500 1000 500 0 -500 Tony-Tony Justin-Tony Mike-Tony Speaker Figure 8 Relative difference in frequency intensity between speakers. Anna-Tony ANALYSIS Various FFT fingerprints are presented in Figures 5 for the word “open”. The word’s pronunciation and syllabic simplicity gives rise to the correspondingly simplistic FFT representations, which are appropriate for graphical display. More complex words often contain many more peaks, and the resultant FFTs become increasingly more difficult to subjectively analyze. Furthermore, a slightly different inflection of a word will give rise to a different FFT, The adjustable parameters in the peak compare program (percent difference in peaks and maximum difference between peak frequencies [Table 1]) exist to compensate for slight deviations in voice between trials, and prove effective in at least two of the subjects. The last two subjects did not exhibit such similar FFT fingerprints between trials, the causes of which could range from the distance and angle of the microphone to the constriction in the voice passage. Particularly good results are found in the trials of subject “Anna”, in which the second trial matches in good precision to the first, but exhibits no other extraneous matches. (Refer to Figure 6.) Subject “Justin” also demonstrates correspondence between the first and second trial, but also matches subject “Mike”; upon examining the FFTs, it is clear that the two voices are quite similar. This similarity can be justified in the simplicity of the word (i.e. the small number of peaks in the FFT) and the fact that the program itself is still in relatively early stages of development. In addition to this, The FFT fingerprinting in this experiment can be extended to alterations on the topic, such as matching voices to percentage values. Yet another application is on-the-fly analysis of subjects for the purpose of voice authentication, which could be coupled with spectrogram analysis to amplify security The relative difference in frequency intensity between speakers are plotted in Figure 8. The word “subject” was spoken by all speakers and the average relative difference from ten trials are shown. The difference can be a combination of different patterns of the voice and different pitches at which the word was being spoken. Thus, the variation can be due to the fact that the delay in between “sub” and “ject” is different between any two speakers or “sub-ject” all together was spoken at higher or lower pitches. It was seen that the difference between Tony and himself again was the smallest among the four groups. There was a significant increase in the relative difference for the other three groups Justin-Tony, Mike-Tony and Anna-Tony. The word “subject” compared between Anna, Tony had the most difference, and this result is reasonable and consistent with speech patterns in real life. It was also interesting to see that the program was able to differentiate between male speakers. It was shown that between Tony and himself, and between Tony and either Mike or Justin, the program finds the least difference between Tony and himself and significantly larger relative difference between Tony and the rest of the male speakers. Thus, it shows that the subtraction method of the spectrograms is sensitive enough to differentiate between the main speaker and the rest of the male speakers. As discussed above, one disadvantage of using the subtraction method is the inability to differentiate between patterns of voice and pitches of the word. Thus, an improvement of the present spectrogram analysis would be to utilize other algorithms such that these differences can be detected. One possible method is to identify two areas on the spectrogram with very high frequency intensity and determines the change in time between these two points. This change in time in one spectrogram can be compared to other ones on different spectrograms. Similar method like this one would thus be able to distinguish speakers based on speech patterns. With this improvement, voice recognition program would be able to differentiate speakers of the same sex much more accurately based on his or her distinctive way of speech. CONCLUSION As seen from the demonstration, the programs developed equip us with the necessary tools to record, filter, and analyze different voice samples and compare them to the archived sample. Furthermore, any threshold value can be set, depending on the security level wanted.. The spectrogram program subtracts two fingerprints to get a relative difference. A large relative distance indicates little correlation between the voices. The results show that this program can accurately determine the difference between a male and a female voice and to a certain extent differentiate between different male voices. The peak comparison program is able to differentiate between four speakers saying the word “open” on multiple occasions. It compares the multiple repeats of the word to stored fingerprints of the four speakers saying, “open”. This program is a good foundation for allowing a user to identify a person based upon voice fingerprinting.