3D Echo Cancellation in a Home Environment

by Gina F. Yip

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

February 6, 2001

Copyright 2001 Gina F. Yip. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author__________________________________________________________________
Department of Electrical Engineering and Computer Science
February 6, 2001

Certified by______________________________________________________________
David L. Waring
VI-A Company Supervisor
Telcordia Technologies

Certified by______________________________________________________________
David H. Staelin
Thesis Supervisor

Accepted by_____________________________________________________________
Arthur C. Smith
Chairman, Department Committee on Graduate Theses

3D Echo Cancellation in a Home Environment

by Gina F. Yip

Submitted to the Department of Electrical Engineering and Computer Science February 6, 2001 In Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science

ABSTRACT

This thesis describes the work done to research, implement, and compare algorithms for the cancellation of echoes in a home environment, where the room impulse response is unknown and variable. The general problem, in which the speaker's movements are completely unrestricted, is very hard, and research in this area has only begun in the last several years. This thesis therefore addresses a simplified version of the problem, in which the impulse response of the multipath environment is assumed to be stationary within the duration of a verbal command. Given this assumption, which is reasonable for most situations, algorithms based on the complex cepstrum, autocorrelation, and delay-and-sum methods of echo cancellation were chosen and developed for the study. Many simulation tests were run to determine the behavior of the algorithms under different echo environments. The test signals were based on the simple delay-and-attenuation echo model with one microphone, and on a more realistic echo model, generated by the Cool Edit Pro software, with one or three microphones. The performance metrics were the number of errors and the percent improvement in speech recognition by Dragon Systems' Naturally Speaking software. The results showed vast improvement for the cepstral domain methods on the simple echo signals, but the results were mixed for the complex-model, one-microphone cases. With three microphones, however, the delay-and-sum algorithm showed consistent improvement. Given that research in this specific area of 3D echo cancellation in a home environment, where 3D refers to the moving speech source, is still in its early stage, the results are encouraging.

VI-A Company Supervisor: David L. Waring
Title: Director of Broadband Access & Premises Internetworking Group, Telcordia Technologies

Thesis Supervisor: David H. Staelin
Title: Professor of Electrical Engineering & Computer Science, Assistant Director of Lincoln Lab

Acknowledgements

Resounding thanks to my supervisor at Telcordia, Dave Waring, for being extremely supportive in providing me everything I needed to complete the project. Loud thanks to my thesis advisor, Professor David H. Staelin, for his technical advice and guidance.
Thanks to my mentor, Craig Valenti, at Telcordia for helping me get the project off the ground and for reading my thesis, and thanks to Murray Spiegel for his sound advice. Also, thanks to Stefano Galli, Kevin Lu, Joanne Spino, Brenda Fields, and everyone else at Telcordia who helped me along the way. Thanks to Jason, my officemate and fellow 6A intern, for being my sounding board and lunch buddy. A shout of thanks to my friends, who kept me sane during these long, quiet months in Morristown, NJ: Anne, Jenny, Linda, Lucy, Nkechi, Teresa, Xixi, and Yu. Finally, deep gratitude to my parents for their love, support, and sacrifices through the years!

Table of Contents

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
CHAPTER 1
  1.1 HOME NETWORKING
    1.1.1 Ideal Home Networking: Smart Houses
    1.1.2 Problems
  1.2 RELATED WORK
    1.2.1 Visual Tracking by MIT Media Lab
    1.2.2 Array Processing
    1.2.3 Blind Source Separation and Deconvolution (BSSD)
    1.2.4 Adaptive Processing
    1.2.5 Simpler Techniques
  1.3 SCOPE OF THESIS
  1.4 STRUCTURE OF THESIS
CHAPTER 2
  2.1 MAIN ALGORITHMS
    2.1.1 MPD
    2.1.2 C2I
    2.1.3 DSA
  2.2 ACTUAL METHODS IMPLEMENTED
CHAPTER 3
  3.1 BASIC ECHO MODEL
  3.2 COMPLEX ECHO ENVIRONMENT SIMULATION
CHAPTER 4
  4.1 GOALS
  4.2 SPEECH DATA USED
  4.3 METHODS
  4.4 RESULTS
    4.4.1 Simple Echo Environments
    4.4.2 Complex Echoes, One Microphone
    4.4.3 Complex Echoes, Three Microphones
    4.4.4 Different Training Environments
CHAPTER 5
  5.1 CONCLUSIONS
  5.2 FUTURE WORK
    5.2.1 Testing in Real Echo Environments
    5.2.2 Types of Microphones
    5.2.3 Microphone Placement
    5.2.4 Real Time
    5.2.5 Continual or Rapid Speaker Movement
    5.2.6 Multiple Speakers
  5.3 FINAL THOUGHTS
APPENDIX A
APPENDIX B
  B.1 TEST FUNCTIONS
  B.2 SUPPORT FUNCTIONS
  B.3 SOURCE CODE
    B.3.1 Main Algorithms
    B.3.2 Test Functions
    B.3.3 Support Functions
APPENDIX C
  C.1 RESULTS FOR SIMPLE MODEL
  C.2 TABLES FOR COMPLEX MODEL SIGNALS WITH ONE MICROPHONE
  C.3 TABLES FOR COMPLEX SIGNALS WITH THREE MICROPHONES
  C.4 DIFFERENT TRAINING ENVIRONMENTS
REFERENCES

List of Figures

Figure 2-1: Complex cepstrum of the min-phase component of a signal with an echo at delay = 0.5s, attenuation = 0.5
Figure 2-2: Zoomed in version of Figure 2-1
Figure 2-3: Block diagram of the MPD algorithm
Figure 2-4: Complex cepstrum from Figure 2-1, after the spikes were taken out, using MPD
Figure 2-5: The spikes that were detected and taken out by MPD
Figure 2-6: Block diagram for the C2I algorithm
Figure 2-7: Autocorrelation of the original clean signal
Figure 2-8: Autocorrelation of the signal with an echo at delay = 0.5s, attenuation = 0.5
Figure 2-9: Autocorrelation of the resultant signal after processing the reverberant signal with C2I
Figure 2-10: Impulse response of an echo at delay = 0.5s, attenuation = 0.5
Figure 2-11: Impulse response estimated by C2I
Figure 3-1: Simple model of an echo as a reflection that is a delayed copy of the original signal
Figure 3-2: Screen shot of the 3-D Echo Chamber menu in Cool Edit Pro 1.2
Figure 4-1: Female subject's breakdown of errors for varying delays, with attenuation held constant at 0.5
Figure 4-2: Male subject's breakdown of errors for varying delays, with attenuation held constant at 0.5
Figure 4-3: Female subject's breakdown of errors for varying attenuation factors, with delay held constant at 11025 samples (0.5 seconds)
Figure 4-4: Male subject's breakdown of errors for varying attenuation factors, with delay held constant at 11025 samples (0.5 seconds)
Figure 4-5: Percent improvement as a function of delay and of attenuation for male and female subjects
Figure 4-6: Female subject's breakdown of errors for complex, one microphone signals
Figure 4-7: Male subject's breakdown of errors for complex, one microphone signals
Figure 4-8: Percent improvement vs. signal environment, female subject
Figure 4-9: Percent improvement vs. signal environment, male subject
Figure 4-10: Female subject's breakdown of errors for complex, multiple microphone signals
Figure 4-11: Male subject's breakdown of errors for complex, multiple microphone signals
Figure 4-12: Percent improvement vs. echo environment, female subject
Figure 4-13: Percent improvement vs. echo environment, male subject
Figure 4-14: How C2I and MPD2 perform on simple echo signals under different training environments
Figure 4-15: How C2I and MPD2 perform on complex reverberation, one microphone signals under different training environments
Figure 4-16: How C2Is, DSA, MPDs, MPDs2, MPDs3, SCP perform on complex reverberation, multi-microphone signals under different training environments
Figure A-1: Block diagram of the complex cepstrum
Chapter 1
Introduction

1.1 Home Networking
Home networking can refer to anything from simply having a few interconnected computers in a house, to having appliances that are wired to the Internet, to having fully connected "smart houses." The last definition is the one used in this thesis.

1.1.1 Ideal Home Networking: Smart Houses
As the digital revolution rages on, the notion of smart houses is no longer just a science fiction writer's creation. These houses are computerized and networked to receive and execute verbal commands, such as to open the door, turn on the lights, and turn on appliances. Ideally, microphones are placed throughout the house, and the homeowner is free to move about and speak naturally, without having to focus his speech in any particular direction or being encumbered by handheld or otherwise attached microphones. However, many problems must be solved before science fiction becomes reality.

1.1.2 Problems
Specifically, speech recognition is crucial to the success of home networking, since home security, personal safety, and the overall system's effectiveness are all affected by this component's ability to decode the speech input, recognize commands, and distinguish between different people's voices. However, the performance of current speech recognition technology is drastically degraded by distance from the microphone(s), background noise, and room reverberation. Therefore, to increase the speed and accuracy of the speech recognition process, a pre-filtering operation should be used to adjust gain, eliminate noise, and cancel echoes. Of these desired functions, echo cancellation will be one of the hardest to design. Hence, the topic of this master's thesis research is providing "clean" speech to the voice recognition engine by canceling the 3D echoes that are produced when a person is speaking and moving about in a home environment.

1.2 Related Work
Much work has been done on echo cancellation. One especially famous project is Stockham et al.'s restoration of Caruso's singing voice from old phonographic recordings [1]. However, additional factors in the home environment complicate matters. For instance, different objects and materials in the house absorb and reflect sound waves differently, and many of these objects are not permanent, or at least they are not always placed in the same location. Additionally, the processing must be done in real time (or pseudo real time), so speed and efficiency, which were less crucial in the Caruso project, need to be considered. For example, in the Caruso project, the researchers used a modern recording of the same song to estimate the impulse response of the original recording environment, but this is impractical for the task at hand. Finally, when the source of the signal is moving around, there is a Doppler effect, and the system must either track the source's location to accurately estimate the multipath echoes, adapt to the changing location of the source, or work independently of the source's location. Therefore, while an overwhelming amount of work has been done on removing echoes, very few methods actually address the problem of unknown and changing source locations. For instance, in recent years, many solutions for applications such as hands-free telephony and smart cars have been published [2], [3], but in all of these cases, the speaker does not move very much, and the general direction of the speaker remains relatively constant.
1.2.1 Visual Tracking by MIT Media Lab
One method, proposed by the MIT Media Lab, addresses the tracking problem visually [4]. This solution uses cameras and a software program called Pfinder [5] to track human forms and steers the microphone beam accordingly. However, the use of video cameras and image processing may be expensive, both computationally and monetarily. Also, while people may be willing to have microphones in their houses, they may still be uncomfortable with the possible violations of privacy due to having cameras in their homes.

1.2.2 Array Processing
In addition, there has been a lot of research on using large microphone arrays to do beamforming [6]. However, these approaches require anywhere from tens to hundreds of microphones, which can be very expensive, especially for private homes with multiple rooms. Also, the math becomes very complicated, so processing speed and processing power may become issues.

1.2.3 Blind Source Separation and Deconvolution (BSSD)
Another MIT student, Alex Westner, examined in his master's thesis ways to separate audio mixtures using BSSD algorithms. These algorithms were adaptive and based on higher order statistics. Although his project focused on ways to separate multiple speech sources, it had the potential of shedding some light on the echo cancellation problem at hand. After all, one way to view echo cancellation is as the deconvolution of an unknown room impulse response from a reverberant signal (also known as blind deconvolution). Also, the original speaker and the echoes could be viewed as multiple sources. However, further reading revealed that the BSSD algorithms assumed that the sources were statistically independent, which is not the case for echoes, since echoes are generally attenuated and delayed copies of the original source. Also, Westner found that even a small amount of reverberation severely impairs the performance of these algorithms [7].

1.2.4 Adaptive Processing
Adaptive processing algorithms are very popular for noise cancellation, though they are sometimes used for echo cancellation as well. However, since this project focuses specifically on the context of home networking, it is reasonable to assume that utterances will tend to be limited to a few seconds (e.g., "Close the refrigerator door." or "Turn off the air conditioner."). Therefore, the algorithms (normally iterative or recursive) are not likely to converge within the duration of the signals [8].

1.2.5 Simpler Techniques
Therefore, this thesis will focus on simpler, classical approaches: cancellation in the cepstral domain, estimating the multipath impulse response from the reverberant signal's autocorrelation, and, for the multiple microphone case, delaying and summing the signals (also known as cross spectra processing or delay and sum beamforming). In addition, the first two methods are combined with the third when there are multiple microphones, and the multi-microphone cepstral domain processing case is based on work done by Liu, Champagne, and Kabal in [9].

1.3 Scope of Thesis
Based on the background research described above, it seems that even with large arrays or highly complex algorithms, developing a system that effectively removes echoes in a highly variable environment remains extremely challenging. However, there are some assumptions that can be made, based on the nature of home networking, which will better define and simplify the problem.
As suggested previously, the echo environment is not stationary (i.e., objects and speakers are not in fixed locations), so the algorithms cannot assume any fixed impulse responses. This rules out predetermining the room impulse response by sending out a known signal and recording the signal that reaches the microphone. However, a key assumption, as mentioned in Section 1.2.4, is that utterances will tend to be short, so that the multipath environment can be considered stationary within the duration of an utterance. In other words, the person is not moving very fast while speaking. Note, though, that "movement" refers to any change in position, including turning one's head. A change in the direction that the speaker faces will alter the multipath more drastically than other forms of movement. Therefore, in order for the stationary assumption to hold, the speaker must keep his head motionless while uttering a command. Another assumption is that the detection of silence is possible, which is valid, since most speech recognition software programs already have this feature. As a result, pauses can be used to separate utterances.

Therefore, the purpose of this thesis is to develop, simulate, and compare echo cancellation algorithms in the context of smart houses. There are many other issues, such as dealing with multiple simultaneous speakers or external speech sources (e.g., televisions and radios). However, these problems are beyond the scope of this thesis.

1.4 Structure of Thesis
• Chapter 1 gives background information, motivation, and an overview of the problem, as well as defining the scope of the thesis.
• Chapter 2 describes the algorithms that were chosen and implemented.
• Chapter 3 describes the echo environments and how they were simulated.
• Chapter 4 explains the experiments that were set up and run, and the various metrics used to compare the different algorithms and methods.
• Chapter 5 gives conclusions and suggests future work to be done in this area.

Chapter 2
Methods

2.1 Main Algorithms
The actual methods implemented are combinations of three basic ideas: MPD (Min-phase Peak Detection), C2I (Correlation to Impulse), and DSA (Delay, Sum, Average).

2.1.1 MPD
This algorithm is based on the observation by Kabal et al. in [9] that, in the cepstral domain, the minimum-phase* component of a reverberant speech signal shows distinct decaying spikes at times that are multiples of each echo's delay. For instance, if there is an echo that begins at t = 0.5s, then there will be noticeable impulses at t = 0.5n, where n = 1, 2, 3, … The height of these impulses depends on the echo intensity. Please see Appendix A for a detailed explanation of the complex cepstrum, why the echoes show up as spikes, and how zeroing them out results in canceling the echoes. Figures 2-1 and 2-2 show the complex cepstrum of the minimum-phase component of a signal with an echo at a delay of 0.5s, attenuated by 0.5.

* A signal is said to be minimum phase if its z-transform contains no poles or zeros outside the unit circle.

Figure 2-1: Complex cepstrum of the min-phase component of a signal with an echo at delay = 0.5s, attenuation = 0.5

Figure 2-2: Zoomed in version of Figure 2-1

Given the above characterizations, the MPD algorithm works as follows:
1) decompose the signal into its all-pass (ap) and minimum-phase (mp) components [10]
2) take the complex cepstrum of the mp component (cm)
3) put cm through a comb filter (done by a method called rfindpeaks2, which detects impulsive values and zeros them out)
4) take the inverse complex cepstrum of the altered cm
5) recombine with the all-pass component

Figure 2-3 shows the algorithm in block diagram form: the input x[n] is split by the all-pass/minimum-phase decomposition into ap[n] and mp[n]; the complex cepstrum of mp[n] gives cm[n], which is comb filtered to produce cm'[n]; the inverse complex cepstrum yields mp'[n], which is recombined with the all-pass component to give the output s'[n].

Figure 2-3: Block diagram of the MPD algorithm
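To make the procedure concrete, the following is a minimal Matlab sketch of the MPD idea. It is not the mpd2 implementation listed in Appendix B; in particular, the quefrency cutoff and threshold standing in for rfindpeaks2 are illustrative assumptions, and the signal length is assumed to be even.

    function y = mpd_sketch(x)
    % Sketch of MPD: zero out echo spikes in the complex cepstrum of the
    % minimum-phase component, then recombine with the all-pass component.
    x = x(:);  N = length(x);                      % assume N is even
    X = fft(x);
    % all-pass / minimum-phase decomposition via the real cepstrum
    c  = real(ifft(log(abs(X) + eps)));            % real cepstrum
    w  = [1; 2*ones(N/2-1,1); 1; zeros(N/2-1,1)];  % causal folding window
    cm = w .* c;          % complex cepstrum of the minimum-phase component
    MP = exp(fft(cm));    % minimum-phase spectrum
    AP = X ./ MP;         % all-pass spectrum (left untouched)
    % crude comb filter standing in for rfindpeaks2
    q    = (0:N-1)';
    cand = q > 200;                       % skip the low-quefrency speech content
    thr  = 8 * median(abs(cm(cand)));     % ad hoc spike threshold
    cm(cand & abs(cm) > thr) = 0;         % zero out the echo spikes
    % inverse cepstrum of the altered cm, then recombine
    y = real(ifft(exp(fft(cm)) .* AP));
    end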
Figures 2-4 and 2-5 show the result of applying the algorithm to a signal with a simple echo at t = 0.5s, attenuation = 0.5.

Figure 2-4: Complex cepstrum from Figure 2-1, after the spikes were taken out, using MPD

Figure 2-5: The spikes that were detected and taken out by MPD

From Figures 2-4 and 2-5, it is clear that this method should work very well for simple echoes. However, when signals from complex and highly reverberant environments are used, the ambience effect will not be removed. This is because, rather than having only discrete echoes, there are also echoes that are so closely spaced that they are perceived as a single decaying sound. These closely spaced echoes cannot be distinguished from the original signal content in the cepstral domain.

2.1.2 C2I
This algorithm takes advantage of the observation that the autocorrelation function of the reverberant signal will have peaks at the echo delay(s). Therefore, the autocorrelation can be used to estimate the multipath impulse response. The following algorithm was used:

Let: x[n] = reverberant signal of length N
Rx[n] = autocorrelation of x (xcorr(x) in Matlab)

1) Find Rx[n].
2) Rx[n] is symmetric, with length 2N−1, where N = length of x[n], and the maximum of Rx is Rx[N] (this is the zero-lag value, which would normally be indexed as Rx[0], but Matlab indexes from 1), so it is only necessary to look at the non-negative lags. Therefore, let Rx2[n] = Rx[N:2N−1].
3) Use the findpeaks2 method (similar to rfindpeaks from MPD) to find the spikes in Rx2[n], which make up a scaled version of the estimated impulse response, h'[n].
4) To actually get h'[n], the spikes are normalized such that h'[0] = 1.
5) The estimated original signal, s'[n], is found as IFFT(X[k]/H'[k]), where X[k] = FFT(x[n]) and H'[k] = FFT(h'[n]).*

In Figure 2-6, the algorithm is translated into block diagram form: the autocorrelation Rx[n] of the input x[n] is used to estimate the impulse response h'[n]; the FFT of h'[n] gives H'[k], which is inverted and multiplied with X[k] to give S'[k], and the IFFT yields s'[n].

Figure 2-6: Block diagram for the C2I algorithm
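As with MPD, a compact Matlab sketch helps make the steps concrete. This is not the c2i function from Appendix B: the fixed lag cutoff and threshold standing in for findpeaks2 are illustrative assumptions, and the zero-valued H'[k] issue discussed next is only noted in a comment.

    function [s_est, h_est] = c2i_sketch(x)
    % Sketch of C2I: estimate the multipath impulse response from the
    % autocorrelation of the reverberant signal, then invert it.
    x   = x(:);  N = length(x);
    Rx  = xcorr(x);                 % length 2N-1; zero lag sits at index N
    Rx2 = Rx(N:end) / Rx(N);        % non-negative lags, scaled so lag 0 -> 1
    % crude peak picking standing in for findpeaks2
    h_est = zeros(N,1);
    h_est(1) = 1;                               % direct path
    lag  = (0:N-1)';
    cand = lag > 200;                           % ignore very short lags
    thr  = 10 * median(abs(Rx2(cand)));         % ad hoc threshold
    idx  = find(cand & Rx2 > thr);              % candidate echo delays
    h_est(idx) = Rx2(idx);                      % scaled echo amplitudes
    % invert the estimated response in the frequency domain
    % (if H has zero-valued samples, fall back to time-domain deconv)
    H = fft(h_est);
    s_est = real(ifft(fft(x) ./ H));
    end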
It is important to note that H'[k] may contain samples with the value zero. In this case, the algorithm would not work, due to the inversion step. Instead, direct deconvolution in the time domain is done using the deconv(x, h') command in Matlab. However, this takes considerably longer (minutes, as compared to seconds when using the frequency domain method). Figures 2-7 to 2-11 show how this algorithm works in a simple case, where the echo attenuation (alpha) is 0.5 and the echo delay is 0.5 seconds (or 11025 samples).

* The FFT (Fast Fourier Transform) is an algorithm for computing the DFT (Discrete Fourier Transform). The DFT is made up of samples of the DTFT (Discrete Time Fourier Transform), which is a continuous function. For a discrete time domain signal x[n], its "Fourier Transform" generally refers to its DTFT, which is expressed as either X(ω) or X(e^jω), while its DFT is generally expressed as X[k].

Figure 2-7: Autocorrelation of the original clean signal

Figure 2-8: Autocorrelation of the signal with an echo at delay = 0.5s, attenuation = 0.5

Figure 2-9: Autocorrelation of the resultant signal after processing the reverberant signal with C2I

Figures 2-10 and 2-11 show the actual impulse response and the one estimated by C2I, respectively.

Figure 2-10: Impulse response of an echo at delay = 0.5s, attenuation = 0.5

Figure 2-11: Impulse response estimated by C2I (the echo amplitude is estimated as 0.4038 rather than 0.5)

This algorithm is not likely to be as good as MPD at eliminating simple echoes, because the estimated impulse response is not exact, even for the most basic case, as illustrated by Figure 2-11. Meanwhile, as illustrated by Figures 2-3 to 2-5, the MPD algorithm can very effectively detect all of the spikes for a simple case. However, it is harder to predict how C2I will perform in a complex environment, so it is still worthwhile to consider this algorithm.

2.1.3 DSA
When there are multiple microphones, speech will generally reach the different microphones at different times. Therefore, to combine these signals, it is important to line the signals up first. This can be accomplished by finding the cross correlation between two signals and locating the maximum of the cross correlation function. Knowing the location of the maximum allows the relative delay to be calculated, and the signals can then be lined up and added. For more than two microphones, the first two signals are lined up and summed, that sum is used to line up with and add to the third signal, and so on. The sum is then divided by the number of input signals, thereby yielding the average. By lining the signals up and taking the average, the original speech signal adds constructively, while the echoes are generally attenuated.
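The following Matlab sketch illustrates the delay-and-sum idea for a matrix of microphone signals. It is a simplification rather than the dsa2 function in Appendix B: xcorr (Signal Processing Toolbox) is assumed to be available, and a circular shift is used for brevity where zero-padding would avoid wrap-around at the signal edges.

    function y = dsa_sketch(m)
    % Sketch of DSA: align each microphone signal with the running sum
    % using the peak of the cross correlation, accumulate, then average.
    M = size(m, 2);             % columns of m are the microphone signals
    y = m(:,1);
    for k = 2:M
        [r, lags] = xcorr(y, m(:,k));   % cross correlate sum with next mic
        [~, i] = max(r);
        d = lags(i);                    % relative delay in samples
        y = y + circshift(m(:,k), d);   % line up (circularly) and add
    end
    y = y / M;                          % average over the microphones
    end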
2.2 Actual Methods Implemented
The algorithms mentioned above can be combined in various ways, especially when there are multiple microphones, because the averaging can be done at different stages. The following is a list of the main methods that have been coded and tested:
• mpd2(v) – takes in the sound vector v and performs the basic MPD algorithm
• c2i(v) – takes in the sound vector v and performs the C2I algorithm
• mpds(m) – takes in the matrix m (whose columns are sound vectors), averages the all-pass components using the DSA algorithm, takes a normal average (without lining up) of the min-phase components, and does MPD (steps 2–5)
• mpds2(m) – takes in the matrix m, averages the all-pass components using the DSA algorithm, takes the complex cepstrum of each min-phase component and eliminates its impulses, averages the cepstra, and then takes the inverse cepstrum and recombines
• mpds3(m) – similar to the previous two, except that the averaging of the min-phase components takes place after the inverse cepstrum has been taken for each processed signal
• scp(m) – Spatial Cepstral Processing – takes in the matrix m, does DSA on the all-pass components, averages the min-phase components in the cepstral domain (no peak detection), and recombines
• dsa2(m) – takes in the matrix m and does plain DSA (i.e., without separating the all-pass and min-phase components)
• c2is(m) – takes in the matrix m, applies the C2I algorithm to each column vector, and then does DSA averaging on the resultant vectors

Of course, other functions have also been coded in Matlab in order to support these methods. A comprehensive list of all the methods and their respective descriptions is included as Appendix B.

Chapter 3
Simulation of Multipath Environments

This chapter explains the echo models and simulations. A detailed description of the actual speech corpora generated is given in the Experiments section of the next chapter.

3.1 Basic Echo Model
The most basic echo model is that of a copy of the original signal that has reflected off a surface and is therefore delayed and attenuated relative to the "direct path" copy. Figure 3-1 illustrates this model.

Figure 3-1: Simple model of an echo as a reflection that is a delayed copy of the original signal

The two main parameters for each echo are the delay and the attenuation. To generate sound files based on this echo model, a Matlab function, addecho2wav(A, alpha, delay), was implemented. A is the original signal, represented as a vector with values between −1 and 1, inclusive. Alpha is a row vector whose elements indicate the attenuation of each echo, and delay is a row vector of the corresponding delays. Therefore, this function can add multiple echoes. In general, a reverberant signal is represented as a convolution (denoted by *) between the original signal and the room impulse response:

x[n] = s[n] * h[n]     (3.1)

For the simple model, the form of the impulse response can be generalized as

h[n] = δ[n] + α1 • δ[n − delay1] + α2 • δ[n − delay2] + … + αN • δ[n − delayN],     (3.2)

where N is the number of echoes, δ[n] is the unit impulse function, and "•" denotes multiplication. Given 3.2, the reverberant signal can also be expressed as follows:

x[n] = s[n] + α1 • s[n − delay1] + α2 • s[n − delay2] + … + αN • s[n − delayN]     (3.3)

This is considered a simple model, because it does not take many other factors into consideration, such as room size, damping ratios of different surfaces, and positioning of microphones and sound sources.
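For illustration, a minimal Matlab sketch of such a function is shown below. It follows equation (3.3) directly; the rescaling at the end, which keeps the samples within Matlab's [−1, 1] range, is an assumption about how clipping is avoided, not necessarily how the addecho2wav function in Appendix B handles it.

    function x = add_echoes_sketch(s, alpha, delay)
    % Sketch of the simple echo model:
    %   x[n] = s[n] + sum_k alpha_k * s[n - delay_k]
    % s     - clean signal, values in [-1, 1]
    % alpha - row vector of echo attenuations
    % delay - row vector of the corresponding delays, in samples
    %         (each delay is assumed to be shorter than the signal)
    s = s(:);
    x = s;
    for k = 1:length(alpha)
        x = x + alpha(k) * [zeros(delay(k), 1); s(1:end - delay(k))];
    end
    x = x / max(abs(x));   % rescale so the samples stay within [-1, 1]
    end

Calling such a function with alpha = [0.5] and delay = [11025], for example, would reproduce the single echo at 0.5 seconds used in the examples of Chapter 2.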
The next section discusses how to simulate more realistic reverberant signals.

3.2 Complex Echo Environment Simulation
A well-known mathematical model for simulating room reverberation is the Image Method [11]. Instead of implementing this method to simulate reverberant data, a popular audio editing software program, Cool Edit Pro 1.2, was used. This powerful (and fast) tool includes functions such as filtering, noise reduction, and 3D echo simulation, as well as multi-track mixing. Figure 3-2 is a screen shot of the 3D Echo Chamber menu.

Figure 3-2: Screen shot of the 3-D Echo Chamber menu in Cool Edit Pro 1.2

As Figure 3-2 shows, this feature allows the specification of the room dimensions, speaker and microphone locations, damping factors of the room's surfaces, number of echoes, etc. However, while Cool Edit Pro generates a fairly realistic reverberant signal, the software does have some limitations. For instance, it assumes that the speech source is a point source (i.e., speech radiates equally in all directions), which is not true, because the direction a person is facing affects the signal that will be received by the microphone. Also, the software does not allow the user to specify which type of microphone is being used. An omni-directional microphone is assumed, which, as the name suggests, picks up sound from all directions with equal gain. Other types of microphones with different beam patterns are available, and they may be more practical for the room environment. Nevertheless, it is still possible to evaluate and compare the effectiveness of various echo cancellation algorithms, despite the points mentioned above. For instance, while the use of different microphones may improve the signal to noise ratios, it should not affect how well one algorithm performs relative to another algorithm. The same is true for having a directional source, which means that the signal content will be lower at some microphones. Therefore, these factors may affect the overall performance of the speech recognition system, but not the relative performance of the algorithms.

Chapter 4
Experiments and Results

4.1 Goals
The experiments described in this chapter were designed to answer the following questions:
1) Under the simple echo model and using one microphone, how are C2I and MPD2 affected by echo attenuation (intensity) and by echo delay?
2) Under the complex echo model and using one microphone, how do C2I and MPD2 perform in low, medium, and high echo environments?
3) Under the complex echo model and using multiple (three) microphones, how do C2Is, DSA, MPDs, MPDs2, MPDs3, and SCP perform in low, medium, and high echo environments?
4) How does the training environment affect the algorithms' performance in the above cases?

4.2 Speech Data Used
The clean speech corpora were recorded using a low noise, directional microphone (AKG D3900) connected to a pre-amp (Symetrix SX202), which feeds the signals into the embedded ESS sound card of a Compaq desktop PC, thereby creating digital sound files in the .wav format. The software used for the recordings is Cool Edit Pro.
The sampling rate is 22 kHz, and each clip has 96000 samples, which translates to about 4.3 seconds in length. Each wave file contains one sentence that ranges from five to nine words long. There are sixteen such sentences used for testing, and there were two speakers: one male and one female. These clean signals were then digitally processed to add different levels of echoes.

For the simple echo model described in Section 3.1, echoes of varying attenuation factors and delays were added. Specifically, the (attenuation, delay) pairs are (0.25, 11025), (0.50, 11025), (0.75, 11025), (0.5, 5513), and (0.5, 22050), where attenuation is a scalar, and delay is in number of samples. For the complex model, there are many variables and an infinite number of combinations of the different parameters. Therefore, in the interest of time, the test environments are simplified as low, medium, and high echo cases. Table 4-1 specifies the parameters of the different environments.

Table 4-1: Parameters for the different echo environments

                                Low Echo           Medium Echo          High Echo
Room Size (ft)                  25 x 25 x 10       50 x 50 x 10         50 x 50 x 10
Source Coordinates (ft)         (12.5, 12.5, 6)    (25, 25, 5)          (25, 25, 5)
Mic1 Coordinates (ft)           (25, 25, 5)        (15, 35, 5)          (15, 35, 5)
Mic2 Coordinates (ft)           (0.01, 25, 5)      (25, 35, 5)          (25, 35, 5)
Mic3 Coordinates (ft)           (12.5, 0.01, 5)    (40, 35, 5)          (25, 40, 5)
Number of Echoes                20                 350                  1200
Surface Reflectivities
(Floor, Ceiling, Walls)         (0.2, 0.7, 0.7)    (0.85, 0.85, 0.85)   (1, 1, 1)

4.3 Methods
The metrics for measuring the effectiveness of the algorithms are the number of errors in speech recognition by Dragon Systems' Naturally Speaking software and the percent improvement in recognition. The number of errors is broken down into the number of misrecognized (wrong) words, added words, and missing words. For example:

Original: the little blankets lay around on the floor
Recognized: the little like racing lay around onward or

"Like," "onward," and "or" are counted as wrong for "blankets," "on," and "the." "Racing" is counted as an added word, and there is also a missing word at the end, since the original sentence had three words after "around," but the recognized result only had two. These errors were counted manually and tallied for each test case.

The percent improvement was also calculated for each algorithm in each test case. Percent improvement is defined as follows:

% Improvement ≡ 100 × (# of Errors for Unprocessed − # of Errors for Processed) / (# of Errors for Unprocessed − # of Errors for Clean)     (4.1)

While Unprocessed is signal environment specific, Clean is training environment specific. For instance, to determine the % Improvement of C2I on one microphone, complex, low echo signals, the number of errors for the unprocessed, one microphone, complex, low echo signals is used. On the other hand, Clean remains the same for all test cases within the same training environment.
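As a purely hypothetical illustration of equation 4.1: if the unprocessed signals in some environment cause 40 recognition errors, the clean signals cause 4 errors under the same training, and an algorithm's processed signals cause 22 errors, then the improvement is 100 × (40 − 22)/(40 − 4) = 50%. A score of 100% means processing recovered clean-signal performance, 0% means it made no difference, and a negative score means the processed signals were recognized worse than the unprocessed ones.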
The comprehensive tables of these results are included as Appendix C. Meanwhile, the figures in the following sections summarize the findings from the trials.

Other metrics, such as the mean square error (MSE) relative to the clean signal and the signal to noise ratio (SNR), were also considered. However, due to delays in the reverberant signal, the MSE will not provide a good measure of how much of the echoes have been cancelled by the algorithms. The SNR is also inappropriate, because the algorithms adjust the gains of the signals to prevent clipping.* SNR is normally defined as 10 log (signal power/noise power), which could be calculated with the following formula:

SNR = 10 log( Σ so²[n] / ( Σ xo²[n] − Σ so²[n] ) ),     (4.2)

where each sum runs over n = 1 … N, and so[n] and xo[n] are the non-DC-biased versions of the original clean signal s[n] and the reverberant signal x[n], respectively. Mathematically, they are defined as

so[i] = s[i] − mean(s[n]), for i = 1 … N     (4.3)
xo[i] = x[i] − mean(x[n]), for i = 1 … N     (4.4)

The problem arises when x[n] is normalized to the maximum volume that does not result in clipping, because the denominator is then not really the noise power, since the signal in x[n] has been either amplified or attenuated. However, even without the normalization, there would be a problem with the SNR calculation, because the direct path signal is attenuated in the complex model, so signal power is lost. Therefore, the denominator could be negative, in which case the log expression is no longer meaningful.

* Matlab represents the volume of a signal at each sample as a decimal between −1 and 1, inclusive. If a sound vector contains a value outside of this range, then that sample is set to −1 or 1, depending on the original value's sign. This is commonly referred to as "clipping."

4.4 Results
Each of the following sections addresses one of the four questions posed in Section 4.1. The results are presented as numbers of errors, as well as percent improvement.

4.4.1 Simple Echo Environments
The following results show the effects of varying delays and varying attenuations on speech recognition performance.

Figure 4-1: Female subject's breakdown of errors for varying delays (5513, 11025, and 22050 samples), with attenuation held constant at 0.5

Figure 4-2: Male subject's breakdown of errors for varying delays (5513, 11025, and 22050 samples), with attenuation held constant at 0.5

Looking only at the unprocessed cases, one notices that the breakdown of errors is similar for both the male and female subjects. Even though the total number of errors decreases from d = 11025 to d = 22050 for the male subject, the following trends prevail:
- The number of words added increases with the delay interval.
- The number of misrecognized words increases at first, but then decreases.
- The number of missing words does not change much, although it does decrease a little as the delay increases.

These trends make sense, because the echo overlaps with the original signal, so the longer the delay, the more "clean" speech appears at the beginning of the signal. This accounts for the decrease in misrecognized words when the delay becomes large. However, the number of words added increases, because the signal duration becomes longer.
As for the missing words, they tend to be short words, such as "a," "to," and so on, so there is no clear reason why delay should affect the number of missing words.

Figure 4-3: Female subject's breakdown of errors for varying attenuation factors (0.25, 0.5, and 0.75), with delay held constant at 11025 samples

Figure 4-4: Male subject's breakdown of errors for varying attenuation factors (0.25, 0.5, and 0.75), with delay held constant at 11025 samples

For constant delay and varying attenuation factors, the number of misrecognized words tends to increase with the increasing attenuation factor (which actually means less attenuation, or higher echo intensity), while the other types of errors are not affected as much. This is consistent with the argument in the previous case. With attenuation being the only variable, the number of misrecognized words increases as the echo intensity gets stronger, because the words are more distorted. This can, but does not necessarily, cause more added words, as one can see from the differences in the female and male cases.

Figure 4-5: Percent improvement as a function of delay and of attenuation for male and female subjects

The graphs in Figure 4-5 show the following trends:
- MPD2 increases first and then decreases with delay and attenuation.
- C2I's performance deteriorates more quickly than MPD2's.
- C2I is more sensitive to attenuation than it is to delay.

The first observation can be explained by the nature of the MPD2 algorithm. Specifically, it has to do with how the echo's spikes are detected (refer to Section 2.1.1). When the echo intensity is low, the echo's spikes are small, so they become harder to detect. For small delays, the early spikes are also harder to detect, because the original signal's cepstral content has not decreased enough for the spikes to stand out. These properties also explain why MPD2's performance does not decrease as drastically as C2I's when the intensity or delay increases. To explain the third observation, that C2I is more sensitive to echo intensity, recall that in Section 2.1.2 it was pointed out that there are errors in estimating the echo's intensity. Therefore, as the echo intensity increases, these errors become more noticeable.
4.4.2 Complex Echoes, One Microphone

Figure 4-6: Female subject's breakdown of errors for complex, one microphone signals

Figure 4-7: Male subject's breakdown of errors for complex, one microphone signals

For the complex model, the breakdown of errors is fairly consistent between the two subjects. However, there are two obvious differences from the simple model's errors. First, there are very few words added, and second, there are more words missing. These observations show that the two models are indeed very different, which can account for the poor performance of C2I and MPD2 in these cases.

Figure 4-8: Percent improvement vs. signal environment, female subject

Figure 4-9: Percent improvement vs. signal environment, male subject

Next, using the results for the complex model, percent improvement is plotted against the different echo environments. The results for the male and female subjects are very consistent for the medium and high echo environments. For both, the percent improvement was slightly greater for the high echo case than the medium echo case. This can be explained by the observation that the signals in the high echo environment are so distorted that it is hard even for humans to understand them. Therefore, they have less room for "negative improvement," which occurs when the processed signal has more recognition errors than the unprocessed signal. One possible source of the extra errors is the rounding off that takes place when transforming a signal to another domain and back. In the low echo environment, the results are drastically different between the two subjects, with a large positive percent improvement for the female, and a large negative improvement for the male. Unfortunately, there is no obvious explanation for this.

4.4.3 Complex Echoes, Three Microphones
Figure 4-10: Female subject's breakdown of errors for complex, multiple microphone signals

Figure 4-11: Male subject's breakdown of errors for complex, multiple microphone signals

Figures 4-10 and 4-11 show that the breakdown of errors is fairly consistent with the one microphone case's data in the previous section. However, the number of errors is higher for the unprocessed signals in the low and medium echo environments with three microphones. This makes sense, because there are extra distortions that arise from simply adding the signals from three microphones without accounting for their relative delays. Such a difference does not show up in the high echo environment, because, as mentioned in the previous section, the signals are already very distorted. Therefore, the number of errors is already at a maximum.

Figure 4-12: Percent improvement vs. echo environment, female subject

Figure 4-13: Percent improvement vs. echo environment, male subject

In Figures 4-12 and 4-13, the percent improvement is plotted for the three microphone, complex signals. The trends are more consistent between the two subjects, compared to the one microphone, complex environments. In almost all of the cases, except for the female subject in the medium echo environment, DSA has the highest percent improvement. This is somewhat surprising, since DSA is simply the delay and sum method, which means that the extra work done in the other algorithms actually made the signals worse.

4.4.4 Different Training Environments
Most commercial speech recognition software programs, such as Dragon Systems' Naturally Speaking, are user specific and require an initial training session for each user. This process allows the software to "learn" the characteristics of a user's speech, and it is generally accomplished by having the person read sentences, as prompted by the program.
A different user had to be created for each of the 14 training environments. However, the typical type of training described above could only be done for the "clean" environment, unless an actual effects box, capable of adding different types of echoes and performing the different algorithms, was built and put between the microphone and the computer's sound card. Of course, building such a device was not feasible, given the nature and the timeframe of this project. Hence, for all of the other training environments, the mobile training feature of the software had to be used.

Mobile training is intended for people who want to record their dictations onto tape or digital recorders and later transfer their speech to the computer for transcription by the software. Since the impulse response of a recorder is likely to be different from that of a microphone, it is necessary to have a different training process for mobile users, rather than to have them use the regular live training process with a microphone and then try to transcribe speech from a recorder. Mobile training generally involves recording about 20 minutes of speech, using a script provided by the software, and then instructing the software to read the sound file.

In the following experiments, the training data was recorded and saved as a .wav file, using the microphone setup that was described in Section 4.2. This file, without any processing, was used for the "clean, mobile" training environment. The file was also processed accordingly to create all of the other training environments.

In the following results, simple refers to d = 11025 samples, alpha = 0.5, and complex refers to the low echo environment. These environments were chosen because they were generally the ones under which the algorithms showed the highest percentages of improvement. Some of the other environments may have been so adverse that no discernible differences would appear under the different training environments.

Figure 4-14: How C2I and MPD2 perform on simple echo signals under different training environments (percent improvement for the female subject across all 14 training environments, from "clean" and "clean, mobile" through the simple and multiple echo environments and their processed versions).

Figure 4-15: How C2I and MPD2 perform on complex reverberation, one microphone signals under different training environments (percent improvement for the female subject across the same 14 training environments).

The original theory and purpose behind trying different training environments was that an algorithm might perform better when the training data was from the same environment, since the speech recognition software uses pattern matching to some extent to identify sounds and words. However, this turns out not to be the case, as shown by Figures 4-14 and 4-15.
The likely explanation for the results lies in the nature of the mobile training process, because it does not allow the user to correct recognition errors. Therefore, when the training data is unrecognizable to the software, the software is not actually learning how specific words sound in a particular echo environment. One way to address this problem is to do live training (and maybe even testing) in a real reverberant environment. However, this still does not allow for the training of processed environments, which was the main goal of this experiment. Another possibility would have been to use "pseudolive" training, where a tape player plays the desired training signals to a microphone, thereby fooling the software into "thinking" that there is a real person speaking. However, this too may not work: if the reverberant signal is so distorted that the software is not satisfied with how a word sounds, it will keep asking for the word to be repeated (alternatively, the word can be skipped). This process would also be extremely tedious with so many training environments.

Incidentally, for one-microphone tests, "clean, mobile" seems to yield the highest percentage of improvement for both of the algorithms. The fact that it does better than the "clean" environment suggests that similarity between training environment and signal environment does help. Namely, the similarity arises from the test files being transcribed as .wav files, which is how "clean, mobile" was trained, versus using a microphone, which was how "clean" was trained.

Figure 4-16: How C2Is, DSA, MPDs, MPDs2, MPDs3, and SCP perform on complex reverberation, multi-microphone signals under different training environments (percent improvement for the female subject across all 14 training environments).

As with the one microphone cases, there is no correlation here between an algorithm and an environment trained under the same algorithm. The DSA training environment yielded the most consistently high improvement percentages; while other environments produced higher improvement for certain algorithms, they also had lower minimums. Interestingly, DSA also had the best performance in most of the training environments, though SCP had the highest improvement percentage in the DSA training environment. The reason behind this relationship is not obvious at this point.

Chapter 5
Conclusions and Future Directions

5.1 Conclusions

The goal of this project was to research, develop, and compare algorithms for echo cancellation in the home environment. During the course of this project, it became obvious that the problem of echo cancellation with unknown and variable room impulse responses is a very general and hard one. However, with some practical assumptions that simplified the problem, it was possible to identify some promising algorithms to implement and test.
After performing many tests and examining the results, the following observations can be made:

- The complex, realistic reverberation model is very different from the simple echo model, and the algorithms that work well in the simple case do not carry over very well to the complex model.

- Having multiple microphones is an effective way to improve speech, but in a very high echo environment nothing is effective. However, for most rooms, the surface reflectivities will not be as great as those used in the high, or even the medium, echo environments.

- Different algorithms work better under different environments. Therefore, it may be feasible to implement a system that can choose among a number of algorithms, as well as arguments to their functions, based on user input on the room parameters; a rough sketch of such a selection layer follows this list.
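As an illustration only, the short sketch below shows what such a selection layer might look like in Matlab, reusing the algorithm functions from Appendix B (dsa2, c2i, mpd2). The function name select_dereverb, the room parameter structure, and the reflectivity threshold are hypothetical; no such selector was implemented or evaluated in this thesis.

function f = select_dereverb(M, room)
%f = select_dereverb(M, room)
%Hypothetical selection layer (not implemented in this thesis): picks one of
%the existing algorithms based on coarse, user-supplied room parameters.
%M    = matrix whose columns are the microphone signals (one or three)
%room = struct with fields nmics (number of microphones) and
%       reflectivity (rough surface reflectivity between 0 and 1)
if room.nmics >= 3
    f = dsa2(M);                   %delay and sum was the most consistent 3 mic method
elseif room.reflectivity < 0.3     %illustrative threshold only
    f = c2i(M(:,1));               %cepstral methods helped most in mild echo
else
    f = mpd2(M(:,1));
end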
It is important to realize that while echo cancellation is a very general area, and much work has been done in this field in the last several decades, the efforts on room dereverberation in the context of smart houses are still relatively new. The idea of using very few microphones, as opposed to large arrays, is an even more novel approach. Therefore, the research presented in this thesis is still at a very early stage. Although some of the results are mixed, some of them, especially in the three microphone cases, are also very encouraging.

Putting issues of cost aside, the results may seem to suggest that using many microphones would solve the problem. However, the results presented in [6] show that even with 46 microphones, the word recognition error rate was slightly over 50%. Note that the test environments and methods were different from those of this thesis, so there is no way to compare the relative performances. The point here is that using many microphones alone would not solve the problem at hand. A lot more needs to be done, and the next section addresses some of these open areas.

5.2 Future Work

This section raises and reiterates some issues that are related to echo cancellation applied to the problem of speech recognition in home networking. However, it is by no means a complete analysis of the requirements for making smart houses a reality.

5.2.1 Testing in Real Echo Environments

Although Cool Edit Pro does a good job of simulating room echo environments, it does have certain limitations, as mentioned in Section 3.2. Also, it is hard to specify some of the parameters, such as surface reflectivity, in order to model a realistic room. Therefore, while using simulations is efficient for this initial study, it will ultimately be necessary to test the echo cancellation algorithms in real rooms.

5.2.2 Types of Microphones

Also mentioned in Section 3.2 is the fact that Cool Edit Pro's simulation is based on omni-directional microphones. Other choices may be more suitable for the overall performance of speech recognition in the home environment. A good guide to microphones can be found at http://www.audio-technica.com/guide/type/index.html.

5.2.3 Microphone Placement

Optimal sensor placement is another large area of study, and it takes into consideration the acoustic characteristics of a room. Also, for smart houses, the placement of the microphone(s) depends on the layout of the room and the objects in it, as well as the likelihood of people facing certain directions.

5.2.4 Real Time

For smart houses to be practical, the echo cancellation system has to work in real time. The work in this thesis was done using Matlab v5 on a Windows NT, Pentium III 450 MHz, 128 MB RAM system, with mainly the echo cancellation capabilities of the algorithms, rather than speed and efficiency, in mind. The next step may be to improve and optimize the algorithms, translate them to DSP assembly code, and run them on a DSP processor. On a related note, other classes of algorithms that are not practical under the current development platform, such as adaptive processing, may also be considered if a real time development platform is used.

5.2.5 Continual or Rapid Speaker Movement

Although it is not likely that a person will move very much while giving a command to the house, in the ideal vision of smart houses it is desirable that there be no restrictions on the person's movements. In a more immediate sense, it is also true that even if the person moves a little, the multipath echoes change, so that the current assumptions, though valid, are not perfect. Therefore, it will be worthwhile to explore methods of quickly tracking the speaker's movements.

5.2.6 Multiple Speakers

As mentioned in Section 1.2.3, this is yet another area of study (specifically, BSSD) that is relevant to home networking. Although it does not deal directly with echo cancellation, it is necessary in order to make speech control of home networks realistic, since undoubtedly there will be more than one person speaking at some point.

5.3 Final Thoughts

After all is said and done, the fundamental question that remains is, "Will this really work in practice?" The answer is, "It depends." As mentioned before, the performance of echo cancellation algorithms is very sensitive to the echo environment. Therefore, while voice control of the home network may work very well in the living room, it may not work nearly as well in the basement. It is also important to realize that while echo cancellation, noise cancellation, and other forms of speech enhancement are essential to successful speech recognition, recognition errors can occur even on clean speech. Furthermore, developing a truly "smart" speech interface that can understand humans beyond a limited vocabulary is another great challenge in the field of artificial intelligence research. Therefore, while it is reasonable to expect some basic functional form of smart houses to emerge in the near future, the truly smart house (a la The Jetsons) is still a long way from becoming reality.

Appendix A
The Complex Cepstrum

Using the complex cepstrum to cancel out echoes is also known as homomorphic deconvolution. In general, homomorphic systems are nonlinear in the classical sense, but through the combination of different operations, they satisfy a generalization of the principle of superposition [10]. The complex cepstrum, in particular, "changes" convolution into addition, with the aid of the Fourier Transform and logarithms. The following block diagram illustrates how the complex cepstrum of a signal is derived:

s[n] --> DTFT --> S(w) --> log --> Ŝ(w) --> inverse DTFT --> ŝ[n]

Figure A-1: Block diagram of the complex cepstrum

The complex cepstrum ŝ[n] is therefore defined as IDTFT(log(DTFT(s[n]))). To see why this changes convolution into addition, let's look at a signal x[n] such that

x[n] = s[n] * h[n],                                  (A.1)

where * denotes convolution. Now, let's follow through the calculation of the complex cepstrum:

X(w) = S(w) • H(w)                                   (A.2)
log(X(w)) = log(S(w)) + log(H(w))                    (A.3)
X̂(w) = Ŝ(w) + Ĥ(w)                                   (A.4)
x̂[n] = ŝ[n] + ĥ[n]                                   (A.5)

The next step is to show that the spikes in x̂[n] do indeed belong to ĥ[n]. Let's look at the impulse response of a simple echo with delay d and attenuation factor α (which may be positive or negative):

h[n] = δ[n] + α • δ[n−d]                             (A.6)

Taking the Fourier Transform yields

H(w) = 1 + α • e^(−jwd),                             (A.7)

and taking the logarithm gives

Ĥ(w) = log(1 + α • e^(−jwd)).                        (A.8)

Generally, the direct path signal will be greater than the echoes in amplitude, so it is valid to assume that |α| < 1. However, if this is not true because of some strange room configuration, the algorithm will still be okay, due to the minimum phase-all pass factorization, which ensures that all of the poles and zeros of the z-transform are inside the unit circle. Given |α| < 1, the right side of Equation A.8 can be expanded using the power series log(1 + x) = x − x²/2 + x³/3 − ... :

Ĥ(w) = −∑_{n=1}^{∞} [(−α)^n / n] • e^(−jwdn)         (A.9)

Since the DTFT is defined as

X(w) = ∑_{n=−∞}^{∞} x[n] • e^(−jwn),                 (A.10)

the change of variables m = dn yields

Ĥ(w) = −∑_{m=d,2d,3d,...} [(−α)^(m/d) / (m/d)] • e^(−jwm).   (A.11)

It then follows that

ĥ[m] = −(−α)^(m/d) / (m/d)   for m = d, 2d, 3d, ...,         (A.12)
ĥ[m] = 0                     otherwise.

Finally, substituting back for m gives the following result:

ĥ[dn] = −(−α)^n / n   for n ≥ 1,                             (A.13)
ĥ[dn] = 0             for n < 1,

where n ∈ Z, the set of all integers. Therefore, ĥ[n] contains exponentially decaying impulses at every integer multiple of d. By zeroing out these spikes and then taking the inverse complex cepstrum, the result of applying the MPD algorithm is an estimated version of s[n]. For multiple echoes, the math generally becomes much more complicated, but the presence of impulses at multiples of each echo's delay is still observed. For further discussions of the complex cepstrum, refer to [10] and [12].
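As a quick numerical check of Equation A.13, the short Matlab sketch below builds a simple echo impulse response and computes its complex cepstrum with cceps, the same routine used by the code in Appendix B. The delay and attenuation values are illustrative and are not taken from the thesis experiments, and because cceps computes a DFT-based approximation of the complex cepstrum, the match to A.13 is close rather than exact.

%Sketch: the complex cepstrum of a simple echo shows decaying spikes at
%multiples of the echo delay, as predicted by Equation A.13. The values of
%d and a below are illustrative, not taken from the thesis test signals.
d = 500;                                   %echo delay in samples
a = 0.5;                                   %echo attenuation, |a| < 1
h = [1; zeros(d-1,1); a; zeros(2000,1)];   %h[n] = delta[n] + a*delta[n-d]
hhat = cceps(h);                           %complex cepstrum of h
%Predicted spikes: a at n = d, -a^2/2 at n = 2d, a^3/3 at n = 3d
disp([hhat(d+1), hhat(2*d+1), hhat(3*d+1)]);
disp([a, -a^2/2, a^3/3]);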
Appendix B
Matlab Functions

The functions coded can be broken down into three subcategories: test, algorithm, and support. The algorithm functions were already listed and described in Section 2.2. Note that through the course of the thesis work, many functions were coded and later changed or discarded. This accounts for the numbers that appear at the end of some of the function names. The first two subsections of this appendix give a high-level explanation of the test and support functions, and the third section includes the source code for all of the functions.

B.1 Test Functions

Test functions, as their name suggests, are used to automate testing.
While they test different functions, they all do basically the same things:

- open the appropriate clean original files
- open the unprocessed files (or create them in the simple model tests)
- create the appropriate output directories for the new files created/to be created
- process the unprocessed speech
- find the MSE between the original and the processed files
- find the SNR of the processed files
- write the results to a text file
- write the processed speech to new files (also write the new unprocessed speech created in test_simple)

(Although the MSE and SNR were not used as metrics in the final analysis, they are still calculated by the test methods.)

Here's the list of the test functions:

• test_c2i(path, template)
  path = output directory path, excluding the "test_c2i" part
  template = filename template
  ex: if input files are *tr5_m1.wav, then the template is "tr5_m1" and output files are *tr5_m1_c2i.wav

• test_c2is(room)
  room = name of test room, for instance, "tr5" refers to the low echo room configuration

• test_mpd2(path, template)

• test_multi(room)
  Tests DSA, MPDs, MPDs2, MPDs3, SCP

• test_simple(alpha, delay)
  alpha = vector of attenuation(s) of the echo(s) to be added
  delay = vector of delay(s) of the echo(s) to be added

B.2 Support Functions

• addecho2wav(A, alpha, delay)
  f = addecho2wav(A, alpha, delay)
  A = wave vector
  alpha = row vector of attenuations
  delay = row vector of delays
  Checks to make sure that alpha and delay are the same size, iterates through the alpha and delay vectors to create the echoes and add them to A, and returns the sum f.

• allpass(A)
  ap = allpass(A) returns only the all pass component of the vector A
  [ap, mp] = allpass(A) returns both the all pass and the minimum phase components of A

• deconvolve(x, h)
  s = deconvolve(x, h)
  x = unprocessed signal
  h = impulse response (actual or estimated)
  Assumes x = s*h and deconvolves h from x using FFTs. If FFT(h) contains a sample with the value 0, the built-in Matlab function deconv(x, h) is called.

• delay(A, d)
  f = delay(A, d)
  A = a column vector
  d = delay factor
  Returns a version of A, delayed by d samples, through zero padding the first d samples.

• findpeaks2(A, b, e, N, alpha)
  f = findpeaks2(A, b, e, N, alpha)
  A = cepstral domain vector
  b = begin
  e = end
  N = frame size
  alpha = threshold factor
  Finds large positive and negative spikes in A(b:e) and zeros them out, returning the altered cepstrum. A(b:e) is cut up into consecutive frames of size N. At any given time, the maxima of three consecutive frames are compared. To get rid of positive peaks, if max(frame i) > alpha*mean(max(frame i-1), max(frame i+1)), then the value at max(frame i) is set to 0. A similar rule is used to get rid of the negative peaks. The process is iterated for i = 2 : (# of frames - 1). Used by C2I.

• mixdown(room)
  room = name of test room, for instance, "tr5" refers to the low echo room configuration
  Adds the inputs from the three microphones (generated by Cool Edit Pro), divides by three, and writes the new sound vector into a .wav file.

• mse(s, x)
  m = mse(s, x)
  s = original signal
  x = processed or unprocessed signal
  Takes the difference between s and x, squares the components of the difference vector, takes the sum of the vector, and returns the result as the mean squared error.

• rfindpeaks2(A, b, e, N, alpha)
  Recursive version of findpeaks2; continues to call itself until no peaks are detected. Used by MPD2.
• snr(s, x) f = snr(s, x) s = original signal x = processed or unprocessed signal Returns the signal to noise ratio of x, given that s is the clean version of it. The method is easily explained with the source code: s = s-mean(s); %subtract DC components x = x - mean(x); S = sum(s.^2); X = sum(x.^2); %energy of s %energy of x r = (S.^2)/(S.^2-X.^2); % signal to noise ratio, denominator is the noise % energy • wavplay2(v) v = sound vector with sampling rate of 22050 samples/second Calls wavplay(v, 22050). The default sampling rate for wavplay is 11025. - 61 - B.3 Source Code B.3.1 Main Algorithms function [s,h]=c2i(v) % % % % [s,h] = c2i(v) v = reverberant signal s = estimation of original, h = estimation of impulse response, based on peaks in the auto correlation of v x = xcorr(v); %autocorrelation x = x(96000:191999); %because of symmetry, don't need first half y = findpeaks2(x, 1, 96000, 1000, 3); %get rid of spikes z = x-y; %impulse response is the difference between the original %and the one without the spikes z(1) = x(1); %first peak was not taken out by rfindpeaks2 h = z/max(abs(z)); sp = deconvolve(v,h); s = sp/max(abs(sp)); function f = c2is(M) %c = c2is(M) % performs c2i for multiple mic inputs % M = matrix whose columns are sound vectors % output is the dsa of each individually processed signal [r c] = size(M); %r = # of rows, c = # of columns if c>5, warning('more than 5 sound vectors?'); end C = zeros(r,c); for i = 1:c, s = c2i(M(:,i)); C(:,i) = s; end f = dsa2(C); - 62 - function f = dsa2(M) % delay-sum-avg method % m = matrix whose columns are sound vectors [r c] = size(M); %r = # of rows, c = # of columns if c>5, warning('more than 5 sound vectors?'); end S = M(:,1); for i = 2:c, T = M(:,i); X = xcorr(S,T); [x y] = max(X); d = y-96000; if d>0, D = delay(S, d); %S is earlier than T if d>0 S = D+T; elseif d<0, D = delay(T,-d); S = D+S; else S = S+T; %case for d=0 end end S = S/c; f = S/max(abs(S)); %divide by number of columns %normalize volume function [m, old_cm, new_cm] = mpd2(s) %m = %[m, %the %[m, %cm2 %min %s = mpd(s, delay) returns the processed signal cm] = mpd(s, delay) returns the processed signal and unprocessed cepstrum cm, cm2] = mpd(s, delay) = processed cepstrum phase echo removal method sound vector, delay = echo delay [ap, mp] = allpass(s); [cm nd] = cceps(mp); old_cm = cm; cm2 = rfindpeaks2(cm, 1000, 95000, 100,5); new_cm = cm2; mp2 = icceps(cm2, nd); MP = fft(mp2); AP = fft(ap); Sf = MP.*AP; - 63 - s2 = ifft(Sf); s2 = real(s2); m = s2/max(abs(s2)); function f = mpds(M) %f = mpds(M) %takes in matrix M, whose columns are sound vectors %takes the avg of the min phase components, transforms avg to cepstral %domain and calls rfindpeaks2 %lines up and averages the all pass components [r c] = size(M); Mp = zeros(r,c); Ap = zeros(r,c); for i = 1:c, [ap mp] = allpass(M(:,i)); Mp(:,i) = mp; Ap(:,i) = ap; end amp = avg(Mp); %avg min phase [cm nd] = cceps(amp); cm2 = rfindpeaks2(cm, 1000, 95000, 75, 2); amp2 = icceps(cm2, nd); %%line up and average all pass components S = Ap(:,1); if c>1, for i = 2:c, T = Ap(:,i); X = xcorr(S,T); [x y] = max(X); d = y-96000; if d>0, D = delay(S, d); %S is earlier than T if d>0 S = D+T; elseif d<0, D = delay(T,-d); S = D+S; else S = S+T; %case for d=0 end end end aap = S/c; %avg all pass component %%reconstruct signal from aap and amp - 64 - MP = fft(amp2); AP = fft(aap); Sf = MP.*AP; s2 = ifft(Sf); s2 = real(s2); f = s2/max(abs(s2)); function f = mpds2(M) %f = mpds2(M) %takes in matrix M, whose columns are sound vectors %calls 
findpeaks2 on each cm, takes the average, and then convert back %to the time domain %lines up and averages the all pass components [r c] = size(M); Cm2 = zeros(r,c); Ap = zeros(r,c); Nd = zeros(1,c); for i = 1:c, [ap mp] = allpass(M(:,i)); [cm nd] = cceps(mp); Nd(i) = nd; cm2 = rfindpeaks2(cm, 1000, 95000, 100, 2); Cm2(:,i) = cm2; Ap(:,i) = ap; end acm2 = sum(Cm2,2)/c; %avg altered cepstral min phase amp2 = icceps(acm2, median(Nd)); %inverse cepstrum %%line up and average all pass components S = Ap(:,1); if c>1, for i = 2:c, T = Ap(:,i); X = xcorr(S,T); [x y] = max(X); d = y-96000; if d>0, D = delay(S, d); %S is earlier than T if d>0 S = D+T; elseif d<0, D = delay(T,-d); S = D+S; else S = S+T; %case for d=0 end end end - 65 - aap = S/c; %avg all pass component %%reconstruct signal from aap and amp MP = fft(amp2); AP = fft(aap); Sf = MP.*AP; s2 = ifft(Sf); s2 = real(s2); f = s2/max(abs(s2)); %normalize to prevent clipping function f = mpds3(M) %f = mpds3(M) %takes in matrix M, whose columns are sound vectors %another variation of mpds %findpeaks2 performed on individual cm's, but averaging is still done %in the time domain [r c] = size(M); Mp2 = zeros(r,c); Ap = zeros(r,c); Nd = zeros(1,c); for i = 1:c, [ap mp] = allpass(M(:,i)); [cm nd] = cceps(mp); cm2 = rfindpeaks2(cm, 1000, 95000, 100, 2); mp2 = icceps(cm2,nd); Mp2(:,i) = mp2; Ap(:,i) = ap; end amp2 = sum(Mp2,2)/c; %avg altered min phase %%line up and average all pass components S = Ap(:,1); if c>1, for i = 2:c, T = Ap(:,i); X = xcorr(S,T); [x y] = max(X); d = y-96000; if d>0, D = delay(S, d); %S is earlier than T if d>0 S = D+T; elseif d<0, D = delay(T,-d); S = D+S; else - 66 - S = S+T; %case for d=0 end end end aap = S/c; %avg all pass component %%reconstruct signal from aap and amp MP = fft(amp2); AP = fft(aap); Sf = MP.*AP; s2 = ifft(Sf); s2 = real(s2); f = s2/max(abs(s2)); function m = scp(M) % % % % m = scp(M) takes in the matrix M, does DSA on the all-pass components, averages the min-phase components in the cepstral domain (no peak detection), and recombines [r c] = size(M); Mp = zeros(r,c); Ap = zeros(r,c); for i = 1:c, [ap mp] = allpass(M(:,i)); Mp(:,i) = mp; Ap(:,i) = ap; end Cm = zeros(r,c); Nd = zeros(1,c); for i=1:c, [cm nd] = cceps(Mp(:,i)); Cm(:,i) = cm; Nd(i) = nd; end acm = avg(Cm); amp = icceps(acm, median(Nd)); %average the complex cepstrums %average min phase %%line up and avg all pass component S = Ap(:,1); if c>1, for i = 2:c, T = Ap(:,i); - 67 - X = xcorr(S,T); [x y] = max(X); d = y-96000; if d>0, D = delay(S, d); %S is earlier than T if d>0 S = D+T; elseif d<0, D = delay(T,-d); S = D+S; else S = S+T; %case for d=0 end end end aap = S/c; %avg all pass component MP = fft(amp); AP = fft(aap); Sf = MP.*AP; s2 = ifft(Sf); s2 = real(s2); m = s2/max(abs(s2)); B.3.2 Test Functions function f= test_c2i(path, template); % test script for c2i % path = output directory under d:\speech\{gina,murray}22\, excluding the test_c2i part % template = filename template % ex: if input files are *tr5_m1.wav and output files are *tr5_m1_c2i.wav % then the template is tr5_m1 ms_results = []; gy_results = []; mkdir(['d:\speech\gina22\' path '\'],'test_c2i'); mkdir(['d:\speech\gina11\' path '\'],'test_c2i'); mkdir(['d:\speech\murray22\' path '\'],'test_c2i'); mkdir(['d:\speech\murray11\' path '\'],'test_c2i'); gy_path = ['d:\speech\gina22\' path '\test_c2i\']; ms_path = ['d:\speech\murray22\' path '\test_c2i\']; gy_path11 = ['d:\speech\gina11\' path '\test_c2i\']; ms_path11 = ['d:\speech\murray11\' path '\test_c2i\']; - 68 - for i = 1:20 
if (i~=9) & (i~=18), %skip 9 and 18 gy_file = ['S' sprintf('%i',i) 'gy22_' template '.wav']; %disp(gy_file); ms_file = ['S' sprintf('%i',i) 'ms22_' template '.wav']; %disp(ms_file); gy = wavread(gy_file); ms = wavread(ms_file); m1 = c2i(gy); wavwrite(m1, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' template '_c2i.wav']); m2 = c2i(ms); wavwrite(m2, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_' template '_c2i.wav']); %downsample for speech recognition gy11 = resample(m1,1,2); wavwrite(gy11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_' template '_c2i.wav']); ms11 = resample(m2,1,2); wavwrite(ms11, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_' template '_c2i.wav']); % load originals gy_orig = wavread(['S' sprintf('%i',i) 'gy22.wav']); ms_orig = wavread(['S' sprintf('%i',i) 'ms22.wav']); % normalize orignals gy_orig2 = gy_orig/max(abs(gy_orig)); ms_orig2 = ms_orig/max(abs(ms_orig)); gy_mse = mse(gy_orig2, m1); ms_mse = mse(ms_orig2, m2); gy_snr = snr(gy_orig2, m1); ms_snr = snr(ms_orig2, m2); gy_results = [gy_results [gy_mse; gy_snr]]; ms_results = [ms_results [ms_mse; ms_snr]]; end end gy_avg_mse = mean(gy_results(1,:)); gy_avg_snr = mean(gy_results(2,:)); ms_avg_mse = mean(ms_results(1,:)); ms_avg_snr = mean(ms_results(2,:)); - 69 - gy_fid = fopen(['d:\speech\test_results\gy_c2i_' template '.txt'], 'w'); fprintf(gy_fid, '%6.3f %6.3f\n', gy_results); fprintf(gy_fid, '%s\n', 'gy average mse'); fprintf(gy_fid, '%6.3f\n', gy_avg_mse); fprintf(gy_fid, '%s\n', 'gy average snr'); fprintf(gy_fid, '%6.3f\n', gy_avg_snr); fclose(gy_fid); ms_fid = fopen(['d:\speech\test_results\ms_c2i_' template '.txt'], 'w'); fprintf(ms_fid, '%6.3f %6.3f\n', ms_results); fprintf(ms_fid, '%s\n', 'ms average mse'); fprintf(ms_fid, '%6.3f\n', ms_avg_mse); fprintf(ms_fid, '%s\n', 'ms average snr'); fprintf(ms_fid, '%6.3f\n', ms_avg_snr); fclose(ms_fid); function f = test_c2is(room) gy_mse_results gy_snr_results ms_mse_results ms_snr_results gy_path = ms_path = gy_path11 ms_path11 = = = = []; []; []; []; ['d:\speech\gina22\' room '\test_multi\']; ['d:\speech\murray22\' room '\test_multi\']; = ['d:\speech\gina11\' room '\test_multi\']; = ['d:\speech\murray11\' room '\test_multi\']; for i = 1:20 if (i~=9) & (i~=18), %skip 9 and 18 gy_file1 = ['S' sprintf('%i',i) 'gy22_' room '_m1.wav']; gy_file2 = ['S' sprintf('%i',i) 'gy22_' room '_m2.wav']; gy_file3 = ['S' sprintf('%i',i) 'gy22_' room '_m3.wav']; ms_file1 = ['S' sprintf('%i',i) 'ms22_' room '_m1.wav']; ms_file2 = ['S' sprintf('%i',i) 'ms22_' room '_m2.wav']; ms_file3 = ['S' sprintf('%i',i) 'ms22_' room '_m3.wav']; gy1 = wavread(gy_file1); gy2 = wavread(gy_file2); gy3 = wavread(gy_file3); ms1 = wavread(ms_file1); ms2 = wavread(ms_file2); ms3 = wavread(ms_file3); GY = [gy1 gy2 gy3]; MS = [ms1 ms2 ms3]; - 70 - gy_c2is = c2is(GY); wavwrite(gy_c2is, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' room '_c2is.wav']); ms_c2is = c2is(MS); wavwrite(ms_c2is, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_' room '_c2is.wav']); %resample for speech recognition gy_c2is11 = resample(gy_c2is,1,2); wavwrite(gy_c2is11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_' room '_c2is.wav']); ms_c2is11 = resample(ms_c2is,1,2); wavwrite(ms_c2is11, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_' room '_c2is.wav']); % load originals gy_orig = wavread(['S' sprintf('%i',i) 'gy22.wav']); ms_orig = wavread(['S' sprintf('%i',i) 'ms22.wav']); % normalize orignals gy_orig2 = gy_orig/max(abs(gy_orig)); ms_orig2 = ms_orig/max(abs(ms_orig)); gy_mse_c2is = mse(gy_orig2, gy_c2is); ms_mse_c2is = mse(ms_orig2, ms_c2is); 
gy_snr_c2is = snr(gy_orig2, gy_c2is); ms_snr_c2is = snr(ms_orig2, ms_c2is); gy_mse_results = [gy_mse_results gy_mse_c2is]; %append new results gy_snr_results = [gy_snr_results gy_snr_c2is]; ms_mse_results = [ms_mse_results ms_mse_c2is]; ms_snr_results = [ms_snr_results ms_snr_c2is]; end end %get averages gy_avg_mses = gy_avg_snrs = ms_avg_mses = ms_avg_snrs = mean(gy_mse_results); %average across the row mean(gy_snr_results); mean(ms_mse_results); mean(ms_snr_results); gy_fid = fopen(['d:\speech\test_results\gy_' room '_multi_c2is.txt'], 'w'); fprintf(gy_fid, '%s\n', 'MSE'); fprintf(gy_fid, '%s\n','c2is'); fprintf(gy_fid, '%6.3f\n', gy_mse_results); fprintf(gy_fid, '%s\n', 'gy average mse'); - 71 - fprintf(gy_fid, fprintf(gy_fid, fprintf(gy_fid, fprintf(gy_fid, fprintf(gy_fid, fclose(gy_fid); '%6.3f\n\n', gy_avg_mses); '%s\n','SNR'); '%6.3f\n', gy_snr_results); '%s\n', 'gy average snr'); '%6.3f\n', gy_avg_snrs); ms_fid = fopen(['d:\speech\test_results\ms_' room '_multi_c2is.txt'], 'w'); fprintf(ms_fid, '%s\n', 'MSE'); fprintf(ms_fid, '%s\n','c2is'); fprintf(ms_fid, '%6.3f\n', ms_mse_results); fprintf(ms_fid, '%s\n', 'ms average mse'); fprintf(ms_fid, '%6.3f\n\n', ms_avg_mses); fprintf(ms_fid, '%s\n', 'SNR'); fprintf(ms_fid, '%6.3f\n', ms_snr_results); fprintf(ms_fid, '%s\n', 'ms average snr'); fprintf(ms_fid, '%6.3f\n', ms_avg_snrs); fclose(ms_fid); function f= test_mpd2(path, template); % % % % % % test script for mpd2 path = output directory under d:\speech\{gina,murray}22\, excluding the test_mpd2 part template = filename template ex: if input files are *tr5_m1.wav and output files are *tr5_m1_mpd2.wav, then the template is tr5_m1 ms_results = []; gy_results = []; mkdir(['d:\speech\gina22\' path '\'],'test_mpd2'); mkdir(['d:\speech\gina11\' path '\'],'test_mpd2'); mkdir(['d:\speech\murray22\' path '\'],'test_mpd2'); mkdir(['d:\speech\murray11\' path '\'],'test_mpd2'); gy_path = ['d:\speech\gina22\' path '\test_mpd2\']; ms_path = ['d:\speech\murray22\' path '\test_mpd2\']; gy_path11 = ['d:\speech\gina11\' path '\test_mpd2\']; ms_path11 = ['d:\speech\murray11\' path '\test_mpd2\']; for i = 1:20 if (i~=9) & (i~=18), %skip 9 and 18 gy_file = ['S' sprintf('%i',i) 'gy22_' template '.wav']; ms_file = ['S' sprintf('%i',i) 'ms22_' template '.wav']; gy = wavread(gy_file); ms = wavread(ms_file); - 72 - [m1 c1 d1] = mpd2(gy); wavwrite(m1, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' template '_mpd2.wav']); [m2 c2 d2] = mpd2(ms); wavwrite(m2, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_' template '_mpd2.wav']); %downsample for speech recognition gy11 = resample(m1,1,2); wavwrite(gy11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_' template '_mpd2.wav']); ms11 = resample(m2,1,2); wavwrite(ms11, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_' template '_mpd2.wav']); % load originals gy_orig = wavread(['S' sprintf('%i',i) 'gy22.wav']); ms_orig = wavread(['S' sprintf('%i',i) 'ms22.wav']); % normalize orignals gy_orig2 = gy_orig/max(abs(gy_orig)); ms_orig2 = ms_orig/max(abs(ms_orig)); gy_mse = mse(gy_orig2, m1); ms_mse = mse(ms_orig2, m2); gy_snr = snr(gy_orig2, m1); ms_snr = snr(ms_orig2, m2); gy_results = [gy_results [gy_mse; gy_snr]]; ms_results = [ms_results [ms_mse; ms_snr]]; end end gy_avg_mse = mean(gy_results(1,:)); gy_avg_snr = mean(gy_results(2,:)); ms_avg_mse = mean(ms_results(1,:)); ms_avg_snr = mean(ms_results(2,:)); %disp(transpose(gy_results)); disp(gy_avg_mse); disp(gy_avg_snr); %disp(transpose(ms_results)); disp(ms_avg_mse); disp(ms_avg_snr); gy_fid = fopen(['d:\speech\test_results\gy_mpd2_' 
template '.txt'], 'w'); fprintf(gy_fid, '%6.3f %6.3f\n', gy_results); fprintf(gy_fid, '%s\n', 'gy average mse'); fprintf(gy_fid, '%6.3f\n', gy_avg_mse); - 73 - fprintf(gy_fid, '%s\n', 'gy average snr'); fprintf(gy_fid, '%6.3f\n', gy_avg_snr); fclose(gy_fid); ms_fid = fopen(['d:\speech\test_results\ms_mpd2_' template '.txt'], 'w'); fprintf(ms_fid, '%6.3f %6.3f\n', ms_results); fprintf(ms_fid, '%s\n', 'ms average mse'); fprintf(ms_fid, '%6.3f\n', ms_avg_mse); fprintf(ms_fid, '%s\n', 'ms average snr'); fprintf(ms_fid, '%6.3f\n', ms_avg_snr); fclose(ms_fid); function f = test_multi(room) %room = name of room % tests mpds, mpds2, mpds3, dsa2, scp gy_mse_results gy_snr_results ms_mse_results ms_snr_results gy_path = ms_path = gy_path11 ms_path11 = = = = []; []; []; []; ['d:\speech\gina22\' room '\test_multi\']; ['d:\speech\murray22\' room '\test_multi\']; = ['d:\speech\gina11\' room '\test_multi\']; = ['d:\speech\murray11\' room '\test_multi\']; for i = 1:20 if (i~=9) & (i~=18), %skip 9 and 18 gy_file1 = ['S' sprintf('%i',i) 'gy22_' room '_m1.wav']; gy_file2 = ['S' sprintf('%i',i) 'gy22_' room '_m2.wav']; gy_file3 = ['S' sprintf('%i',i) 'gy22_' room '_m3.wav']; ms_file1 = ['S' sprintf('%i',i) 'ms22_' room '_m1.wav']; ms_file2 = ['S' sprintf('%i',i) 'ms22_' room '_m2.wav']; ms_file3 = ['S' sprintf('%i',i) 'ms22_' room '_m3.wav']; gy1 = wavread(gy_file1); gy2 = wavread(gy_file2); gy3 = wavread(gy_file3); ms1 = wavread(ms_file1); ms2 = wavread(ms_file2); ms3 = wavread(ms_file3); GY = [gy1 gy2 gy3]; MS = [ms1 ms2 ms3]; gy_mpds = mpds(GY); - 74 - wavwrite(gy_mpds, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' room '_mpds.wav']); gy_mpds2 = mpds2(GY); wavwrite(gy_mpds2, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' room '_mpds2.wav']); gy_mpds3 = mpds3(GY); wavwrite(gy_mpds3, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' room '_mpds3.wav']); gy_dsa2 = dsa2(GY); wavwrite(gy_dsa2, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' room '_dsa2.wav']); gy_scp = scp(GY); wavwrite(gy_scp, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' room '_scp.wav']); ms_mpds = mpds(MS); wavwrite(ms_mpds, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_' room '_mpds.wav']); ms_mpds2 = mpds2(MS); wavwrite(ms_mpds2, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_' room '_mpds2.wav']); ms_mpds3 = mpds3(MS); wavwrite(ms_mpds3, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_' room '_mpds3.wav']); ms_dsa2 = dsa2(MS); wavwrite(ms_dsa2, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_' room '_dsa2.wav']); ms_scp = scp(MS); wavwrite(ms_scp, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_' room '_scp.wav']); %resample for speech recognition gy_mpds11 = resample(gy_mpds, 1,2); wavwrite(gy_mpds11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_' room '_mpds.wav']); gy_mpds211 = resample(gy_mpds2, 1,2); wavwrite(gy_mpds211, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_' room '_mpds2.wav']); gy_mpds311 = resample(gy_mpds3, 1,2); wavwrite(gy_mpds311, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_' room '_mpds3.wav']); gy_dsa211 = resample(gy_dsa2, 1,2); wavwrite(gy_dsa211, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_' room '_dsa2.wav']); gy_scp11 = resample(gy_scp, 1,2); wavwrite(gy_scp11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_' room '_scp.wav']); ms_mpds11 = resample(ms_mpds, 1,2); wavwrite(ms_mpds11, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_' room '_mpds.wav']); ms_mpds211 = resample(ms_mpds2, 1,2); wavwrite(ms_mpds211, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_' room '_mpds2.wav']); - 75 - ms_mpds311 = resample(ms_mpds3, 1,2); wavwrite(ms_mpds311, 11025, [ms_path11 'S' 
sprintf('%i',i) 'ms11_' room '_mpds3.wav']); ms_dsa211 = resample(ms_dsa2, 1,2); wavwrite(ms_dsa211, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_' room '_dsa2.wav']); ms_scp11 = resample(ms_scp, 1,2); wavwrite(ms_scp11, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_' room '_scp.wav']); % load originals gy_orig = wavread(['S' sprintf('%i',i) 'gy22.wav']); ms_orig = wavread(['S' sprintf('%i',i) 'ms22.wav']); % normalize orignals gy_orig2 = gy_orig/max(abs(gy_orig)); ms_orig2 = ms_orig/max(abs(ms_orig)); gy_mse_mpds = mse(gy_orig2, gy_mpds); gy_mse_mpds2 = mse(gy_orig2, gy_mpds2); gy_mse_mpds3 = mse(gy_orig2, gy_mpds3); gy_mse_dsa2 = mse(gy_orig2, gy_dsa2); gy_mse_scp = mse(gy_orig2, gy_scp); %column vector of gy_mse_* gy_mse_v = [gy_mse_mpds; gy_mse_mpds2; gy_mse_mpds3; gy_mse_dsa2; gy_mse_scp]; ms_mse_mpds = mse(ms_orig2, ms_mpds); ms_mse_mpds2 = mse(ms_orig2, ms_mpds2); ms_mse_mpds3 = mse(ms_orig2, ms_mpds3); ms_mse_dsa2 = mse(ms_orig2, ms_dsa2); ms_mse_scp = mse(ms_orig2, ms_scp); ms_mse_v = [ms_mse_mpds; ms_mse_mpds2; ms_mse_mpds3; ms_mse_dsa2; ms_mse_scp]; gy_snr_mpds = snr(gy_orig2, gy_mpds); gy_snr_mpds2 = snr(gy_orig2, gy_mpds2); gy_snr_mpds3 = snr(gy_orig2, gy_mpds3); gy_snr_dsa2 = snr(gy_orig2, gy_dsa2); gy_snr_scp = snr(gy_orig2, gy_scp); gy_snr_v = [gy_snr_mpds; gy_snr_mpds2; gy_snr_mpds3; gy_snr_dsa2; gy_snr_scp]; ms_snr_mpds = snr(ms_orig2, ms_mpds); ms_snr_mpds2 = snr(ms_orig2, ms_mpds2); ms_snr_mpds3 = snr(ms_orig2, ms_mpds3); ms_snr_dsa2 = snr(ms_orig2, ms_dsa2); ms_snr_scp = snr(ms_orig2, ms_scp); - 76 - ms_snr_v = [ms_snr_mpds; ms_snr_mpds2; ms_snr_mpds3; ms_snr_dsa2; ms_snr_scp]; gy_mse_results = [gy_mse_results gy_mse_v]; %append new results gy_snr_results = [gy_snr_results gy_snr_v]; ms_mse_results = [ms_mse_results ms_mse_v]; ms_snr_results = [ms_snr_results ms_snr_v]; end end %get averages gy_avg_mses = zeros(5,1); gy_avg_snrs = zeros(5,1); ms_avg_mses = zeros(5,1); ms_avg_snrs = zeros(5,1); for i = 1:5 gy_avg_mses(i) = mean(gy_mse_results(i,:)); %average across the row gy_avg_snrs(i) = mean(gy_snr_results(i,:)); ms_avg_mses(i) = mean(ms_mse_results(i,:)); ms_avg_snrs(i) = mean(ms_snr_results(i,:)); end gy_fid = fopen(['d:\speech\test_results\gy_' room '_multi.txt'], 'w'); fprintf(gy_fid, '%s\n', 'MSE'); fprintf(gy_fid, '%s\n','mpds mpds2 mpds3 dsa2 scp'); fprintf(gy_fid, '%6.3f %6.3f %6.3f %6.3f %6.3f\n', gy_mse_results); fprintf(gy_fid, '%s\n', 'gy average mse'); fprintf(gy_fid, '%6.3f %6.3f %6.3f %6.3f %6.3f\n\n', gy_avg_mses); fprintf(gy_fid, '%6.3f %6.3f %6.3f %6.3f %6.3f\n', gy_snr_results); fprintf(gy_fid, '%s\n', 'gy average snr'); fprintf(gy_fid, '%6.3f %6.3f %6.3f %6.3f %6.3f\n', gy_avg_snrs); fclose(gy_fid); ms_fid = fopen(['d:\speech\test_results\ms_' room '_multi.txt'], 'w'); fprintf(ms_fid, '%s\n', 'MSE'); fprintf(ms_fid, '%s\n','mpds mpds2 mpds3 dsa2 scp'); fprintf(ms_fid, '%6.3f %6.3f %6.3f %6.3f %6.3f\n', ms_mse_results); fprintf(ms_fid, '%s\n', 'ms average mse'); fprintf(ms_fid, '%6.3f %6.3f %6.3f %6.3f %6.3f\n\n', ms_avg_mses); fprintf(ms_fid, '%6.3f %6.3f %6.3f %6.3f %6.3f\n', ms_snr_results); fprintf(ms_fid, '%s\n', 'ms average snr'); fprintf(ms_fid, '%6.3f %6.3f %6.3f %6.3f %6.3f\n', ms_avg_snrs); - 77 - fclose(ms_fid); function f= test_simple(alpha, delay) %test simple echo signals ms_results = []; gy_results = []; version = [sprintf('%2.2f', alpha) '_' sprintf('%i',delay)]; mkdir('d:\speech\gina22\simple\', version); mkdir('d:\speech\murray22\simple\', version); mkdir('d:\speech\gina11\simple\', version); mkdir('d:\speech\murray11\simple\', 
version); gy_path = ['d:\speech\gina22\simple\' version '\']; %disp(gy_path); ms_path = ['d:\speech\murray22\simple\' version '\']; gy_path11 = ['d:\speech\gina11\simple\' version '\']; ms_path11 = ['d:\speech\murray11\simple\' version '\']; for i = 1:20 if (i~=9) & (i~=18), %skip 9 and 18 % load originals gy_orig = wavread(['S' sprintf('%i',i) 'gy22.wav']); ms_orig = wavread(['S' sprintf('%i',i) 'ms22.wav']); % normalize orignals gy_orig2 = gy_orig/max(abs(gy_orig)); ms_orig2 = ms_orig/max(abs(ms_orig)); %add echoes gy = addecho2wav(gy_orig, alpha, delay); ms = addecho2wav(ms_orig, alpha, delay); %normalize gy2 = gy/max(abs(gy)); wavwrite(gy2, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' version '.wav']); ms2 = ms/max(abs(ms)); wavwrite(ms2, 22050, [ms_path, 'S' sprintf('%i',i) 'ms22_' version '.wav']); %downsample for speech recognition gy11 = resample(gy2, 1,2); - 78 - wavwrite(gy11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_' version '.wav']); ms11 = resample(ms2, 1,2); wavwrite(ms11, 11025, [ms_path11, 'S' sprintf('%i',i) 'ms11_' version '.wav']); %find mse's gy_mse = mse(gy_orig2, ms_mse = mse(ms_orig2, %find snr's gy_snr = snr(gy_orig2, ms_snr = snr(ms_orig2, disp(ms_snr); gy2); ms2); gy2); ms2); gy_results = [gy_results [gy_mse; gy_snr]]; ms_results = [ms_results [ms_mse; ms_snr]]; end end gy_avg_mse = mean(gy_results(1,:)); gy_avg_snr = mean(gy_results(2,:)); ms_avg_mse = mean(ms_results(1,:)); ms_avg_snr = mean(ms_results(2,:)); gy_fid = fopen(['d:\speech\test_results\gy_simple' version '.txt'], 'w'); fprintf(gy_fid, '%6.3f %6.3f\n', gy_results); fprintf(gy_fid, '%s\n', 'gy average mse'); fprintf(gy_fid, '%6.3f\n', gy_avg_mse); fprintf(gy_fid, '%s\n', 'gy average snr'); fprintf(gy_fid, '%6.3f\n', gy_avg_snr); fclose(gy_fid); ms_fid = fopen(['d:\speech\test_results\ms_simple' version '.txt'], 'w'); fprintf(ms_fid, '%6.3f %6.3f\n', ms_results); fprintf(ms_fid, '%s\n', 'ms average mse'); fprintf(ms_fid, '%6.3f\n', ms_avg_mse); fprintf(ms_fid, '%s\n', 'ms average snr'); fprintf(ms_fid, '%6.3f\n', ms_avg_snr); fclose(ms_fid); B.3.3 Support Functions function f = addecho2wav(A, alpha, delay) %f = addecho2wav(A, alpha, delay) % A = wave vector % alpha = row vector of attenuations % delay = row vector of delays - 79 - sa = size(alpha); sd = size(delay); if ~(sa(1) == sd(1)) | ~(sa(2) ==sd(2)), error('attenuation and delay vectors are not the same size'); end n = sa(2); %number of echoes to add temp = transpose(A); for i=1:n, temp = temp + alpha(i)*[zeros(1, delay(i)) transpose(A(1:96000delay(i)))]; end f = transpose(temp); function [ap, mp] = allpass(a); %ap = allpass(A) is the all pass component of A %[ap, mp] = allpass(A) give both the all pass %and the min phase components [y, ym] = rceps(a); %ym = min phase component of a A = fft(a); Ym = fft(ym); Ap = A./Ym; %Fourier transform of the all pass component ap = ifft(Ap); mp = ym; function s = deconvolve(x,h) X = fft(x,192000); H = fft(h,192000); iH = 1./H; if sum(iH) == Inf, x2 = zeros(1, 192000); x2(1:96000) = x; s = deconv(x2,h); else, S = X./H; s_long = real(ifft(S)); s = s_long(1:96000); end - 80 - function D = delay(A, d) % D = delay(A,d) % shifts a vector A forward by d samples % add d zeros in front, chops off tail B = zeros(96000,1); B(1:d) = 0; B((d+1):96000) = A(1:(96000-d)); D = B; function f=findpeaks2(A, b, e, N, alpha) %f=rfindpeaks2(A, b, e, N) %A = cepstral domain vector, b=begin, e=end, N=frame size %finds large pos and neg spikes in A and zeros them out chunks=floor((e-b)/N); B = A(b:e); M = zeros(3,2); %get rid of 
positive peaks [M(1,1) M(1,2)] = max(B(1:N)); [M(2,1) M(2,2)] = max(B(N+1:2*N)); for i=2:chunks-1, [M(3,1) M(3,2)] = max(B((N*i+1):N*(i+1))); if abs(M(2,1)) > alpha*abs(mean([M(1,1) M(3,1)])) B((i-1)*N+M(2,2))=0; %disp(M); end M(1,:) = M(2,:); M(2,:) = M(3,:); end %get rid of neg peaks [M(1,1) M(1,2)] = min(B(1:N)); [M(2,1) M(2,2)] = min(B(N+1:2*N)); for i=2:chunks-1, [M(3,1) M(3,2)] = min(B((N*i+1):N*(i+1))); if abs(M(2,1)) > alpha*abs(mean([M(1,1) M(3,1)])) B((i-1)*N+M(2,2))=0; %disp(M); end M(1,:) = M(2,:); M(2,:) = M(3,:); end A(b:e)=B; f=A; - 81 - function f = mixdown(room) %f = mixdown(room) %adds up the input from three microphones and divide by three to prevent clipping %sub paths sgy_path = sms_path = sgy_path11 sms_path11 ['d:\speech\gina22\' room '\']; ['d:\speech\murray22\' room '']; = ['d:\speech\gina11\' room '\']; = ['d:\speech\murray11\' room '\']; %output paths gy_path = ['d:\speech\gina22\' room '\mixed\']; ms_path = ['d:\speech\murray22\' room '\mixed\']; gy_path11 = ['d:\speech\gina11\' room '\mixed\']; ms_path11 = ['d:\speech\murray11\' room '\mixed\']; for i = 1:20 gy_file1 = ['S' sprintf('%i',i) 'gy22_' room '_m1.wav']; gy_file2 = ['S' sprintf('%i',i) 'gy22_' room '_m2.wav']; gy_file3 = ['S' sprintf('%i',i) 'gy22_' room '_m3.wav']; ms_file1 = ['S' sprintf('%i',i) 'ms22_' room '_m1.wav']; ms_file2 = ['S' sprintf('%i',i) 'ms22_' room '_m2.wav']; ms_file3 = ['S' sprintf('%i',i) 'ms22_' room '_m3.wav']; gy1 = wavread(gy_file1); gy2 = wavread(gy_file2); gy3 = wavread(gy_file3); ms1 = wavread(ms_file1); ms2 = wavread(ms_file2); ms3 = wavread(ms_file3); gy = (gy1+gy2+gy3)/3; ms = (ms1+ms2+ms3)/3; wavwrite(gy, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' room '.wav']); wavwrite(ms, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_' room '.wav']); %downsample for speech recognition gy11 = resample(gy,1,2); ms11= resample(ms,1,2); wavwrite(gy11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_' room '.wav']); wavwrite(ms11, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_' room '.wav']); end - 82 - function m = mse(X, Y) D = X-Y; D2 = D.*D; m = sum(D2); function f=rfindpeaks2(A, b, e, N, alpha) %f=rfindpeaks2(A, b, e, N) %A = cepstral domain vector, b=begin, e=end, N=frame size %alpha = threshold level, smaller alpha means more stringent threshold %recursive version of findpeaks2 %finds large pos and neg spikes in A and zeros them out chunks=floor((e-b)/N); B = A(b:e); M = zeros(3,2); %get rid of positive peaks [M(1,1) M(1,2)] = max(B(1:N)); [M(2,1) M(2,2)] = max(B(N+1:2*N)); for i=2:chunks-1, [M(3,1) M(3,2)] = max(B((N*i+1):N*(i+1))); if abs(M(2,1)) > alpha*abs(mean([M(1,1) M(3,1)])) B((i-1)*N+M(2,2))=0; %disp(M); end M(1,:) = M(2,:); M(2,:) = M(3,:); end %get rid of neg peaks [M(1,1) M(1,2)] = min(B(1:N)); [M(2,1) M(2,2)] = min(B(N+1:2*N)); for i=2:chunks-1, [M(3,1) M(3,2)] = min(B((N*i+1):N*(i+1))); if abs(M(2,1)) > alpha*abs(mean([M(1,1) M(3,1)])) B((i-1)*N+M(2,2))=0; end M(1,:) = M(2,:); M(2,:) = M(3,:); end if sum(A(b:e) - B) ~=0, A(b:e) = B; f =rfindpeaks2(A,b,e,N,alpha); else A(b:e)=B; f=A; end %call itself again - 83 - function r = snr(s, x) %signal to noise ratio, where s = clean signal, and x = corrupted signal s = s-mean(s); %subtract DC component x = x - mean(x); S = sum(s.^2); %energy of s X = sum(x.^2); r = (S.^2)/(S.^2-X.^2); function w=wavplay2(X); %plays a wav file at a sampling rate of 22050 wavplay(X, 22050); - 84 - Appendix C Tables of Results C.1 Results for Simple Model Table C-1: Female Subject’s Table of Results alpha delay Algorithm Wrong benchmark Added Missing 
Total Percent improvement 3 0 2 5 0.5 0.5 0.5 5513 5513 5513 none MPD2 C2I 35 5 8 7 1 0 12 3 5 54 9 13 83.33 75.93 0.5 0.5 0.5 11025 11025 11025 none MPD2 C2I 39 6 8 20 0 2 9 2 7 68 8 17 88.24 75.00 0.5 0.5 0.5 22050 22050 22050 none MPD2 C2I 14 5 6 48 2 11 6 3 5 68 10 22 85.29 67.65 0.25 0.25 0.25 11025 11025 11025 none MPD2 C2I 15 6 3 20 1 3 5 4 3 40 11 9 72.50 77.50 - 85 - alpha delay Algorithm Wrong Added Missing Total Percent improvement 0.5 0.5 0.5 11025 11025 11025 none MPD2 C2I 39 6 8 20 0 2 9 2 7 68 8 17 88.24 75.00 0.75 0.75 0.75 11025 11025 11025 none MPD2 C2I 39 9 20 26 1 16 9 7 9 74 17 45 77.03 39.19 Table C-2: Male Subject’s Tables for Simple Model alpha delay Algorithm Wrong benchmark Added Missing Total Percent Improvement 6 0 6 12 0.5 0.5 0.5 5513 5513 5513 none MPD2 C2I 14 8 7 8 1 2 14 9 9 36 18 18 75.00 75.00 0.5 0.5 0.5 11025 11025 11025 none MPD2 C2I 39 6 8 20 0 2 9 2 7 68 8 17 107.14 91.07 0.5 0.5 0.5 22050 22050 22050 none MPD2 C2I 7 8 7 46 2 23 11 12 10 64 22 40 80.77 46.15 0.25 0.25 0.25 11025 11025 11025 none MPD2 C2I 13 6 7 17 1 0 9 9 7 39 16 14 85.19 92.59 0.5 0.5 0.5 11025 11025 11025 none MPD2 C2I 39 6 8 20 0 2 9 2 7 68 8 17 107.14 91.07 0.75 0.75 0.75 11025 11025 11025 none MPD2 C2I 40 11 28 18 1 29 10 13 17 68 25 74 76.79 -10.71 - 86 - C.2 Tables for Complex Model Signals with One Microphone Table C-3: Female Subject Signal Environment Algorithm Wrong Missing Added Total Percent Improvement benchmark N/A 3 0 2 5 Low Echo, m1 Low Echo, m1 Low Echo, m1 none MPD2 C2I 16 10 9 1 0 4 7 5 0 24 15 13 47.37 57.89 Medium Echo,m1 Medium Echo,m1 Medium Echo,m1 none MPD2 C2I 22 26 27 1 1 1 22 20 19 45 47 47 -5.00 -5.00 High Echo,m1 High Echo,m1 High Echo,m1 none MPD2 C2I 74 76 76 1 3 3 37 35 35 112 114 114 -1.87 -1.87 Table C-4: Male Subject Signal Environment Algorithm Wrong Missing Added Total Percent Improvement benchmark N/A 6 0 6 12 Low Echo,m1 Low Echo,m1 Low Echo,m1 none MPD2 C2I 11 12 12 0 0 0 14 16 16 25 28 28 -23.08 -23.08 Medium Echo,m1 Medium Echo,m1 Medium Echo,m1 none MPD2 C2I 34 35 35 0 0 0 38 37 38 72 72 73 0.00 -1.67 High Echo,m1 High Echo,m1 High Echo,m1 none MPD2 C2I 53 57 55 1 0 0 56 52 54 110 109 109 1.02 1.02 - 87 - C.3 Tables for Complex Signals with Three Microphones Table C-5: Female Subject Signal Environment Algorithm Wrong Missing Added Total Percent Improvement benchmark N/A 3 0 2 5 Low Echo Low Echo Low Echo Low Echo Low Echo Low Echo Low Echo none DSA SCP MPDS MPDS2 MPDS3 C2Is 17 13 15 18 17 18 17 0 0 0 0 0 0 0 11 6 7 9 8 9 8 28 19 22 27 25 27 25 39.13 26.09 4.35 13.04 4.35 13.04 Medium Echo Medium Echo Medium Echo Medium Echo Medium Echo Medium Echo Medium Echo none DSA SCP MPDS MPDS2 MPDS3 C2Is 39 37 38 39 37 38 36 6 4 4 4 4 4 4 25 27 26 25 25 25 25 70 68 68 68 66 67 65 3.08 3.08 3.08 6.15 4.62 7.69 High Echo High Echo High Echo High Echo High Echo High Echo High Echo none DSA SCP MPDS MPDS2 MPDS3 C2Is 78 72 65 59 61 63 73 1 1 0 0 0 0 1 35 38 52 58 53 53 38 114 111 117 117 114 116 112 2.75 -2.75 -2.75 0.00 -1.83 1.83 - 88 - Table C-6: Male Subject Signal Environment Algorithm Wrong Missing Added Total Percent Improvement benchmark N/A 6 0 6 12 Low Echo Low Echo Low Echo Low Echo Low Echo Low Echo Low Echo none DSA SCP MPDS MPDS2 MPDS3 C2Is 16 12 12 15 13 14 12 0 0 1 0 0 0 0 19 16 17 17 16 18 16 35 28 30 32 29 32 28 30.43 21.74 13.04 26.09 13.04 30.43 Medium Echo Medium Echo Medium Echo Medium Echo Medium Echo Medium Echo Medium Echo none DSA SCP MPDS MPDS2 MPDS3 C2Is 41 38 41 43 43 44 41 0 1 0 0 0 0 1 35 33 37 34 36 35 34 76 72 78 77 79 
79 76 6.25 -3.13 -1.56 -4.69 -4.69 0.00 High Echo High Echo High Echo High Echo High Echo High Echo High Echo none DSA SCP MPDS MPDS2 MPDS3 C2Is 53 50 63 62 63 64 69 2 0 0 1 0 1 0 56 62 50 52 51 51 43 111 112 113 115 114 116 112 -1.01 -2.02 -4.04 -3.03 -5.05 -1.01 - 89 - C.4 Different Training Environments Table C-7: “Clean” training environment Signal environment Processing Algorithm clean simple echo, 1 mic simple echo, 1 mic simple echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic N/A none MPD C2I none MPD C2I none DSA SCP MPDS MPDS2 MPDS3 C2Is Wrong Added Missing Total Percent Improvement 3 39 6 8 16 10 9 17 13 15 18 17 18 17 0 20 0 2 1 0 4 0 0 0 0 0 0 0 2 9 2 3 7 5 0 11 6 7 9 8 9 8 5 68 8 13 24 15 13 28 19 22 27 25 27 25 95.24 87.30 47.37 57.89 39.13 26.09 4.35 13.04 4.35 13.04 Table C-8: “Clean, mobile” training environment Signal environment Processing Algorithm clean simple echo, 1 mic simple echo, 1 mic simple echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic N/A none MPD C2I none MPD C2Is none DSA SCP MPDS MPDS2 MPDS3 C2Is Wrong Added Missing Total Percent Improvement 15 40 8 16 9 10 9 15 11 14 17 17 17 16 - 90 - 0 18 0 2 0 0 0 0 0 0 0 0 0 0 2 8 4 5 7 8 8 10 7 9 10 10 10 6 17 66 12 23 16 18 17 25 18 23 27 27 27 22 110.20 87.76 200.00 100.00 87.50 25.00 -25.00 -25.00 -25.00 37.50 Table C-9: “Simple Echo, 1 Microphone” training environment Signal environment Processing Algorithm clean simple echo, 1 mic simple echo, 1 mic simple echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic N/A none MPD C2I none MPD C2Is none DSA SCP MPDS MPDS2 MPDS3 C2Is Wrong Added Missing Total Percent Improvement 11 59 23 30 22 24 23 38 26 25 27 27 30 26 0 20 1 8 1 2 2 1 2 3 1 2 3 3 4 12 5 6 10 11 11 20 9 8 16 15 14 10 15 91 29 44 33 37 36 59 37 36 44 44 47 39 81.58 61.84 -22.22 -16.67 50.00 52.27 34.09 34.09 27.27 45.45 Table C-10: “Simple echo, 1 Microphone, MPD2” training environment Signal environment Processing Algorithm clean simple echo, 1 mic simple echo, 1 mic simple echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic N/A none MPD C2I none MPD C2Is none DSA SCP MPDS MPDS2 MPDS3 C2Is Wrong Added Missing Total Percent Improvement 9 50 16 23 17 16 16 29 18 21 25 22 25 21 - 91 - 0 15 0 1 0 0 0 1 0 0 0 0 0 0 5 15 7 9 15 10 11 17 7 11 14 12 14 10 14 80 23 33 32 26 27 47 25 32 39 34 39 31 86.36 71.21 33.33 27.78 66.67 45.45 24.24 39.39 24.24 48.48 Table C-11: “Simple Echo, 1 Microphone, C2I” training environment Signal environment Processing Algorithm clean simple echo, 1 mic simple echo, 1 mic simple echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. 
echo, 3 mic N/A none MPD C2I none MPD C2Is none DSA SCP MPDS MPDS2 MPDS3 C2Is Wrong Added Missing Total Percent Improvement 8 45 13 22 19 19 19 29 22 33 30 29 29 30 0 15 0 1 1 0 0 2 0 0 1 0 1 1 3 22 4 8 11 7 7 12 5 23 19 21 20 17 11 82 17 31 31 26 26 43 27 56 50 50 50 48 91.55 71.83 25.00 25.00 50.00 -40.63 -21.88 -21.88 -21.88 -15.63 Table C-12: “Multiple Echo, 1 Microphone” training environment Signal environment Processing Algorithm clean simple echo, 1 mic simple echo, 1 mic simple echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic N/A none MPD C2I none MPD C2Is none DSA SCP MPDS MPDS2 MPDS3 C2Is Wrong Added Missing Total Percent Improvement 17 43 19 23 16 15 17 22 21 20 20 19 21 19 - 92 - 0 24 0 5 0 0 0 2 1 1 1 1 1 1 6 15 7 10 8 9 10 9 9 7 8 8 8 8 23 82 26 38 24 24 27 33 31 28 29 28 30 28 94.92 74.58 0.00 -300.00 20.00 50.00 40.00 50.00 30.00 50.00 Table C-13: “Multiple Echo, 1 Microphone, MPD2” training environment Signal environment Processing Algorithm clean simple echo, 1 mic simple echo, 1 mic simple echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic N/A none MPD C2I none MPD C2Is none DSA SCP MPDS MPDS2 MPDS3 C2Is Wrong Added Missing Total Percent Improvement 18 59 19 25 17 18 17 21 18 18 19 17 19 17 1 24 1 5 1 1 1 2 3 2 2 2 2 2 7 16 7 11 12 10 11 10 10 10 9 9 9 8 26 99 27 41 30 29 29 33 31 30 30 28 30 27 98.63 79.45 25.00 25.00 28.57 42.86 42.86 71.43 42.86 85.71 Table C-14: “Multiple Echo, 1 Microphone, C2I” training environment Signal environment Processing Algorithm clean simple echo, 1 mic simple echo, 1 mic simple echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic N/A none MPD C2I none MPD C2Is none DSA SCP MPDS MPDS2 MPDS3 C2Is Wrong Added Missing Total Percent Improvement 18 60 19 24 19 17 17 21 20 16 18 16 18 18 - 93 - 1 24 1 5 2 1 1 2 2 2 2 2 2 2 7 19 7 10 10 10 10 10 9 10 9 9 9 9 26 103 27 39 31 28 28 33 31 28 29 27 29 29 98.70 83.12 60.00 60.00 28.57 71.43 57.14 85.71 57.14 57.14 Table C-15: “Multiple Echo, 3 Microphone” training environment Signal environment Processing Algorithm clean simple echo, 1 mic simple echo, 1 mic simple echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic N/A none MPD C2I none MPD C2Is none DSA SCP MPDS MPDS2 MPDS3 C2Is Wrong Added Missing Total Percent Improvement 17 67 21 27 25 21 21 23 22 25 24 23 26 23 0 22 1 4 1 1 1 2 1 1 1 1 1 1 6 13 6 9 9 8 8 11 10 7 7 7 7 10 23 102 28 40 35 30 30 36 33 33 32 31 34 34 93.67 78.48 41.67 41.67 23.08 23.08 30.77 38.46 15.38 15.38 Table C-16: “Multiple Echo, 3 Microphone, C2Is” training environment Signal environment Processing Algorithm clean simple echo, 1 mic simple echo, 1 mic simple echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 1 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. echo, 3 mic mult. 
Table C-16: “Multiple Echo, 3 Microphone, C2Is” training environment

Signal environment    Processing Algorithm    Wrong   Added   Missing   Total   Percent Improvement
clean                 N/A                      13      1        6        20
simple echo, 1 mic    none                     59     17       14        90
simple echo, 1 mic    MPD                      19      1        7        27     90.00
simple echo, 1 mic    C2I                      28      4       11        43     67.14
mult. echo, 1 mic     none                     20      1        9        30
mult. echo, 1 mic     MPD                      19      1       10        30      0.00
mult. echo, 1 mic     C2Is                     19      1       10        30      0.00
mult. echo, 3 mic     none                     24      2       10        36
mult. echo, 3 mic     DSA                      19      1       11        31     31.25
mult. echo, 3 mic     SCP                      18      1       11        30     37.50
mult. echo, 3 mic     MPDS                     17      1       10        28     50.00
mult. echo, 3 mic     MPDS2                    18      1       11        30     37.50
mult. echo, 3 mic     MPDS3                    19      1       11        31     31.25
mult. echo, 3 mic     C2Is                     20      1       11        32     25.00

Table C-17: “Multiple Echo, 3 Microphone, DSA” training environment

Signal environment    Processing Algorithm    Wrong   Added   Missing   Total   Percent Improvement
clean                 N/A                      14      1        6        21
simple echo, 1 mic    none                     56     24       16        96
simple echo, 1 mic    MPD                       9      1        6        16    106.67
simple echo, 1 mic    C2I                      22      5       11        38     77.33
mult. echo, 1 mic     none                     16      1       10        27
mult. echo, 1 mic     MPD                      14      1        9        24     50.00
mult. echo, 1 mic     C2Is                     16      1        9        26     16.67
mult. echo, 3 mic     none                     25      2       11        38
mult. echo, 3 mic     DSA                      17      1       11        29     52.94
mult. echo, 3 mic     SCP                      15      1       11        27     64.71
mult. echo, 3 mic     MPDS                     16      1       12        29     52.94
mult. echo, 3 mic     MPDS2                    16      1       12        29     52.94
mult. echo, 3 mic     MPDS3                    16      1       12        29     52.94
mult. echo, 3 mic     C2Is                     16      1       12        29     52.94

Table C-18: “Multiple Echo, 3 Microphone, MPDs” training environment

Signal environment    Processing Algorithm    Wrong   Added   Missing   Total   Percent Improvement
clean                 N/A                      19      1        7        27
simple echo, 1 mic    none                     54     17       16        87
simple echo, 1 mic    MPD                      20      1       11        32     91.67
simple echo, 1 mic    C2I                      26      5       12        43     73.33
mult. echo, 1 mic     none                     23      1       11        35
mult. echo, 1 mic     MPD                      23      1       11        35      0.00
mult. echo, 1 mic     C2Is                     23      1       11        35      0.00
mult. echo, 3 mic     none                     22      2       10        34
mult. echo, 3 mic     DSA                      26      1        9        36    -28.57
mult. echo, 3 mic     SCP                      22      1        8        31     42.86
mult. echo, 3 mic     MPDS                     26      1       10        37    -42.86
mult. echo, 3 mic     MPDS2                    22      1        8        31     42.86
mult. echo, 3 mic     MPDS3                    22      1        8        31     42.86
mult. echo, 3 mic     C2Is                     23      1        9        33     14.29

Table C-19: “Multiple Echo, 3 Microphone, MPDs2” training environment

Signal environment    Processing Algorithm    Wrong   Added   Missing   Total   Percent Improvement
clean                 N/A                      15      1        9        25
simple echo, 1 mic    none                     56     24       16        96
simple echo, 1 mic    MPD                      25      1        9        35     85.92
simple echo, 1 mic    C2I                      30      4       11        45     71.83
mult. echo, 1 mic     none                     24      1       11        36
mult. echo, 1 mic     MPD                      23      1       12        36      0.00
mult. echo, 1 mic     C2Is                     24      1       13        38    -18.18
mult. echo, 3 mic     none                     28      1       11        40
mult. echo, 3 mic     DSA                      26      2       10        38     13.33
mult. echo, 3 mic     SCP                      27      2       12        41     -6.67
mult. echo, 3 mic     MPDS                     27      1       13        41     -6.67
mult. echo, 3 mic     MPDS2                    29      1       12        42    -13.33
mult. echo, 3 mic     MPDS3                    28      1       11        40      0.00
mult. echo, 3 mic     C2Is                     29      1       13        43    -20.00

Table C-20: “Multiple Echo, 3 Microphone, SCP” training environment

Signal environment    Processing Algorithm    Wrong   Added   Missing   Total   Percent Improvement
clean                 N/A                      18      1        6        25
simple echo, 1 mic    none                     59     18       14        91
simple echo, 1 mic    MPD                      22      1       11        34     86.36
simple echo, 1 mic    C2I                      34      1       12        47     66.67
mult. echo, 1 mic     none                     24      1       13        38
mult. echo, 1 mic     MPD                      23      1       15        39     -7.69
mult. echo, 1 mic     C2Is                     23      1       14        38      0.00
mult. echo, 3 mic     none                     29      3        8        40
mult. echo, 3 mic     DSA                      27      2       11        40      0.00
mult. echo, 3 mic     SCP                      27      2       14        43    -20.00
mult. echo, 3 mic     MPDS                     27      2       13        42    -13.33
mult. echo, 3 mic     MPDS2                    26      2       13        41     -6.67
mult. echo, 3 mic     MPDS3                    27      2       13        42    -13.33
mult. echo, 3 mic     C2Is                     27      2       12        41     -6.67
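The Percent Improvement figures in the tables above are consistent with the error-reduction ratio 100 × (T_none − T_proc) / (T_none − T_clean), where T_none is the total error count for the unprocessed echoed signal, T_proc is the total after the algorithm under test, and T_clean is the total for the clean recording in the same table. For example, the MPD row for the simple echo, 1 microphone case in Table C-7 gives 100 × (68 − 8) / (68 − 5) = 95.24. The short Python sketch below restates this calculation; it is illustrative only, and the function and argument names are placeholders rather than identifiers from the thesis software.

# Illustrative sketch of the Percent Improvement metric tabulated above.
# The names percent_improvement, total_unprocessed, total_processed, and
# total_clean are placeholders, not taken from the original processing scripts.

def percent_improvement(total_unprocessed, total_processed, total_clean):
    """Error reduction achieved by an algorithm, as a percentage of the gap
    between the unprocessed echoed signal and the clean reference signal.
    All three arguments are total recognition-error counts
    (wrong + added + missing)."""
    return 100.0 * (total_unprocessed - total_processed) / (total_unprocessed - total_clean)

# Table C-7, simple echo with 1 microphone: none = 68, MPD = 8, clean = 5
print(round(percent_improvement(68, 8, 5), 2))   # 95.24

Because the denominator is the difference between the unprocessed and clean totals, the metric can exceed 100 percent or swing sharply when that difference is only one or two errors, as in the -300.00 entry of Table C-12.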
References

[1] T. G. Stockham, T. M. Cannon, and R. B. Ingebretsen, “Blind Deconvolution through Digital Signal Processing”, Proceedings of the IEEE, v 63, n 4, pp. 678-692, Apr. 1975.

[2] A. P. Petropulu and S. Subramaniam, “Cepstrum Based Deconvolution for Speech Dereverberation”, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. I-9-12, Apr. 1994.

[3] S. Affes and Y. Grenier, “A Signal Subspace Tracking Algorithm for Microphone Array Processing of Speech”, IEEE Transactions on Speech and Audio Processing, v 5, n 5, pp. 425-437, Sept. 1997.

[4] M. A. Casey, W. G. Gardner, and S. Basu, “Vision Steered Beam-forming and Transaural Rendering for the Artificial Life Interactive Environment (ALIVE)”, Proceedings of the 99th Convention of the Audio Engineering Society (AES), 1995.

[5] P. Maes, T. Darrell, B. Blumberg, and A. Pentland, “The ALIVE System: Full-body Interaction with Autonomous Agents”, Proceedings of the Computer Animation Conference, Switzerland, IEEE Press, 1995.

[6] J. Flanagan, “Autodirective Sound Capture: Towards Smarter Conference Rooms”, IEEE Intelligent Systems, March/April 1999.

[7] A. Westner, “Object-Based Audio Capture”, Master’s thesis, Media Arts and Sciences Program, Massachusetts Institute of Technology, Feb. 1999.

[8] P. Clarkson, Optimal and Adaptive Signal Processing, CRC Press, Inc., 1993.

[9] Q. Liu, B. Champagne, and P. Kabal, “Room Speech Dereverberation via Minimum Phase and All-pass Component Processing of Multi-microphone Signals”, IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing – Proceedings, pp. 571-574, May 1995.

[10] A. Oppenheim and R. Schafer, Discrete-Time Signal Processing, Prentice Hall, 1989.

[11] J. Allen and D. A. Berkley, “Image Method for Efficiently Simulating Small-room Acoustics”, Journal of the Acoustical Society of America, v 65, n 4, pp. 943-950, April 1979.

[12] B. P. Bogert, M. J. R. Healy, and J. W. Tukey, “The quefrency alanysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking”, Proceedings of the Symposium on Time Series Analysis, Chapter 15, pp. 209-243, Wiley, New York, 1963.