
3D Echo Cancellation in a Home Environment
by
Gina F. Yip
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
February 6, 2001
Copyright 2001 Gina F. Yip. All rights reserved.
The author hereby grants to M.I.T. permission to reproduce and
distribute publicly paper and electronic copies of this thesis
and to grant others the right to do so.
Author__________________________________________________________________
Department of Electrical Engineering and Computer Science
February 6, 2001
Certified by______________________________________________________________
David L. Waring
VIA Company Supervisor
Telcordia Technologies
Certified by______________________________________________________________
David H. Staelin
Thesis Supervisor
Accepted by_____________________________________________________________
Arthur C. Smith
Chairman, Department Committee on Graduate Theses
3D Echo Cancellation in a Home Environment
by
Gina F. Yip
Submitted to the
Department of Electrical Engineering and Computer Science
February 6, 2001
In Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
ABSTRACT
This thesis describes the work done to research, implement, and compare various
algorithms for the cancellation of echoes in a home environment, where the room impulse
response is unknown and variable. The general problem, where the speaker’s movements
are completely unrestricted, is a very hard one, and research in this area has only begun in
the last several years. Therefore, this thesis addresses a simplified version of the
problem, where the impulse response of the multipath environment is assumed to be
stationary within the duration of a verbal command. Given this assumption, which is
reasonable for most situations, algorithms based on the complex cepstrum,
autocorrelation, and delay and sum methods of echo cancellation were chosen and
developed for the study.
Many simulation tests were done to determine the behavior of the algorithms
under different echo environments. The test signals were based on the simple delay and
attenuation echo model with one microphone, and on a more realistic echo model,
generated by the Cool Edit Pro software, with one or three microphones. The
performance metrics were the number of errors and the percent of improvement in speech
recognition by Dragon Systems’ Naturally Speaking software. The results showed vast
improvement for the cepstral domain methods on the simple echo signals, but the
numbers were mixed for the complex model, one microphone cases. However, with three
microphones, the delay and sum algorithm showed consistent improvement. Given that
research in this specific area of 3D echo cancellation in a home environment, where 3D
refers to the moving speech source, is still in its early stage, the results are encouraging.
VIA Company Supervisor: David L. Waring
Title: Director of Broadband Access & Premises Internetworking Group, Telcordia
Technologies
Thesis Supervisor: David H. Staelin
Title: Professor of Electrical Engineering & Computer Science, Assistant Director of
Lincoln Lab
Acknowledgements
Resounding thanks to my supervisor at Telcordia, Dave Waring, for being extremely
supportive in providing me everything I needed to complete the project.
Loud thanks to my thesis advisor, Professor David H. Staelin, for his technical advice and
guidance.
Thanks to my mentor, Craig Valenti, at Telcordia for helping me get the project off the
ground and for reading my thesis, and thanks to Murray Spiegel for his sound advice.
Also, thanks to Stefano Galli, Kevin Lu, Joanne Spino, Brenda Fields, and everyone else
at Telcordia who helped me along the way.
Thanks to Jason, my officemate and fellow 6A intern, for being my sounding board and
lunch buddy.
A shout of thanks to my friends, who kept me sane during these long, quiet months in
Morristown, NJ: Anne, Jenny, Linda, Lucy, Nkechi, Teresa, Xixi, and Yu.
Finally, deep gratitude to my parents for their love, support, and sacrifices through the
years!
Table of Contents
ABSTRACT................................................................................................................................................... 2
ACKNOWLEDGEMENTS ......................................................................................................................... 3
TABLE OF CONTENTS ............................................................................................................................. 4
LIST OF FIGURES ...................................................................................................................................... 6
CHAPTER 1.................................................................................................................................................. 8
1.1 HOME NETWORKING ......................................................................................................................... 8
1.1.1 Ideal Home Networking: Smart Houses ...................................................................... 8
1.1.2 Problems ..................................................................................................................... 8
1.2 RELATED WORK ................................................................................................................ 9
1.2.1 Visual Tracking by MIT Media Lab ........................................................................... 10
1.2.2 Array Processing ........................................................................................................ 10
1.2.3 Blind Source Separation and Deconvolution (BSSD) ................................................ 10
1.2.4 Adaptive Processing ................................................................................................... 11
1.2.5 Simpler Techniques .................................................................................................... 11
1.3 SCOPE OF THESIS ............................................................................................................................. 11
1.4 STRUCTURE OF THESIS .................................................................................................................... 12
CHAPTER 2................................................................................................................................................ 13
2.1 MAIN ALGORITHMS......................................................................................................................... 13
2.1.1 MPD ........................................................................................................................... 13
2.1.2 C2I ............................................................................................................................. 17
2.1.3 DSA ............................................................................................................................ 22
2.2 ACTUAL METHODS IMPLEMENTED .................................................................................................. 22
CHAPTER 3................................................................................................................................................ 24
3.1 BASIC ECHO MODEL ....................................................................................................... 24
3.2 COMPLEX ECHO ENVIRONMENT SIMULATION ................................................................. 26
CHAPTER 4................................................................................................................................................ 28
4.1 GOALS ............................................................................................................................................. 28
4.2 SPEECH DATA USED ........................................................................................................................ 28
4.3 METHODS ........................................................................................................................................ 30
4.4 RESULTS .......................................................................................................................................... 32
4.4.1 Simple Echo Environments ........................................................................................ 32
4.4.2 Complex Echoes, One Microphone ............................................................................ 37
4.4.3 Complex Echoes, Three Microphones ........................................................................ 41
4.4.4 Different Training Environments ............................................................................... 45
CHAPTER 5................................................................................................................................................ 50
5.1 CONCLUSIONS ................................................................................................................................. 50
5.2 FUTURE WORK ................................................................................................................................ 51
5.2.1 Testing in Real Echo Environments ........................................................................... 51
5.2.2 Types of Microphones ................................................................................................ 52
5.2.3 Microphone Placement ............................................................................................... 52
5.2.4 Real Time ................................................................................................................... 52
5.2.5 Continual or Rapid Speaker Movement ..................................................................... 52
5.2.6 Multiple Speakers ....................................................................................................... 53
5.3 FINAL THOUGHTS ............................................................................................................................ 53
APPENDIX A.............................................................................................................................................. 54
APPENDIX B .............................................................................................................................................. 58
B.1 TEST FUNCTIONS ............................................................................................................................... 58
B.2 SUPPORT FUNCTIONS ........................................................................................................................ 59
B.3 SOURCE CODE ................................................................................................................................... 62
B.3.1 Main Algorithms ....................................................................................................................... 62
B.3.2 Test Functions........................................................................................................................... 68
B.3.3 Support Functions...................................................................................................................... 79
APPENDIX C.............................................................................................................................................. 85
C.1 RESULTS FOR SIMPLE MODEL ........................................................................................... 85
C.2 TABLES FOR COMPLEX MODEL SIGNALS WITH ONE MICROPHONE ................................... 87
C.3 TABLES FOR COMPLEX SIGNALS WITH THREE MICROPHONES .......................................... 88
C.4 DIFFERENT TRAINING ENVIRONMENTS ............................................................................. 90
REFERENCES............................................................................................................................................ 97
List of Figures
Figure 2-1: Complex cepstrum of the min-phase component of a signal with an echo at
delay = 0.5s, attenuation = 0.5 .................................................................................. 14
Figure 2-2: Zoomed in version of Figure 2-1................................................................... 14
Figure 2-3: Block diagram of the MPD algorithm........................................................... 15
Figure 2-4: Complex cepstrum from Figure 2-1, after the spikes were taken out, using
MPD .......................................................................................................................... 16
Figure 2-5: The spikes that were detected and taken out by MPD .................................. 16
Figure 2-6: Block diagram for C2I algorithm .................................................................. 18
Figure 2-7: Autocorrelation of the original clean signal .................................................. 19
Figure 2-8: Autocorrelation of the signal with an echo at delay = 0.5s, attenuation = 0.5 ..............
Figure 2-9: Autocorrelation of the resultant signal after processing the reverberant signal
with C2I..................................................................................................................... 20
Figure 2-10: Impulse response of an echo at delay = 0.5s, attenuation = 0.5 .................. 21
Figure 2-11: Impulse response estimated by C2I............................................................. 21
Figure 3-1: Simple model of an echo as a reflection that is a delayed copy of the original
signal ......................................................................................................................... 24
Figure 3-2: Screen shot of the 3-D Echo Chamber menu in Cool Edit Pro 1.2 ............... 26
Figure 4-1: Female subject’s breakdown of errors for varying delays, with attenuation
held constant at 0.5.................................................................................................... 32
Figure 4-2: Male subject’s breakdown of errors for varying delays, with attenuation held
constant at 0.5............................................................................................................ 33
Figure 4-3: Female subject’s breakdown of errors for varying attenuation factors, with
delay held constant at 11025 samples (0.5 seconds)................................................. 34
Figure 4-4: Male subject’s breakdown of errors for varying attenuation factors, with
delay held constant at 11025 samples (0.5 seconds)................................................. 35
Figure 4-5: Percent improvement as a function of delay and of attenuation for male and
female subjects .......................................................................................................... 36
Figure 4-6: Female subject’s breakdown of errors for complex, one microphone signals ..............
Figure 4-7: Male subject’s breakdown of errors for complex, one microphone signals.. 38
Figure 4-8: Percent improvement vs. signal environment, female subject ...................... 39
Figure 4-9: Percent improvement vs. signal environment, male subject ......................... 40
Figure 4-10: Female subject’s breakdown of errors for complex, multiple microphone
signals........................................................................................................................ 41
Figure 4-11: Male subject’s breakdown of errors for complex, multiple microphone
signals........................................................................................................................ 42
Figure 4-12: Percent improvement vs. echo environment, female subject ...................... 43
Figure 4-13: Percent improvement vs. echo environment, male subject ......................... 44
Figure 4-14: How C2I and MPD2 perform on simple echo signals under different
training environments................................................................................................ 46
Figure 4-15: How C2I and MPD2 perform on complex reverberation, one microphone
signals under different training environments........................................................... 47
Figure 4-16: How C2Is, DSA, MPDs, MPDs2, MPDs3, SCP perform on complex
reverberation, multi-microphone signals under different training environments ..... 49
Figure A-1: Block diagram of the complex cepstrum...................................................... 54
Chapter 1
Introduction
1.1 Home Networking
Home networking can refer to anything from simply having a few interconnected
computers in a house, to having appliances that are wired to the Internet, to having fully
connected "smart houses.” The last definition is the one used in this thesis.
1.1.1 Ideal Home Networking: Smart Houses
As the digital revolution rages on, the notion of smart houses is no longer just a
science fiction writer’s creation. These houses are computerized and networked to
receive and execute verbal commands, for example, to open the door, turn on the lights, or
turn on appliances. Ideally, microphones are placed throughout the house, and the
homeowner is free to move about and speak naturally, without having to focus his speech
in any particular direction or being encumbered by handheld or otherwise attached
microphones. However, many problems must be solved first, before science fiction
becomes reality.
1.1.2 Problems
Specifically, speech recognition is crucial to the success of home networking,
since home security, personal safety, and the overall system’s effectiveness are all
affected by this component’s ability to decode the speech input, recognize commands,
and distinguish between different people’s voices. However, the performance of current
speech recognition technology is drastically degraded by distance from the
microphone(s), background noise, and room reverberation.
Therefore, to increase the speed and accuracy of the speech recognition process, a
pre-filtering operation should be used to adjust gain, eliminate noise, and cancel echoes.
Of these desired functions, echo cancellation will be one of the hardest to design. Hence,
the topic of this master’s thesis research is providing “clean” speech to the voice
recognition engine by canceling the 3D echoes that are produced when a person is
speaking and moving about in a home environment.
1.2 Related Work
It is true that much work has been done on echo cancellation. One especially
famous project is Stockham et al.’s restoration of Caruso’s singing voice from old
phonographic recordings [1]. However, there are additional factors in the home
environment that complicate matters. For instance, different objects and materials in the
house absorb and reflect sound waves differently, and many of these objects are not
permanent, or at least, they are not always placed in the same location. Additionally, the
processing must be done in real time (or pseudo real time), so speed and efficiency,
which were less crucial in the Caruso project, need to be considered. For example, in the
Caruso project, the researchers used a modern recording of the same song to estimate the
impulse response of the original recording environment, but this is impractical for the
task at hand. Finally, when the source of the signal is moving around, there is a Doppler
Effect, and the system must either track the source’s location to accurately estimate the
multipath echoes, adapt to the changing location of the source, or work independently of
the source’s location.
Therefore, while an overwhelming amount of work has been done on removing
echoes, very few methods actually address the problem of unknown and changing source
locations. For instance, in recent years, many solutions for applications such as hands-
free telephony and smart cars have been published [2], [3], but in all of these cases, the
speaker does not move very much, and the general direction of the speaker remains
relatively constant.
1.2.1 Visual Tracking by MIT Media Lab
One method, proposed by the MIT Media Lab, addresses the tracking problem
visually [4]. This solution uses cameras and a software program called Pfinder [5] to
track human forms and steer the microphone beam accordingly. However, the use of
video cameras and image processing may be expensive—both computationally and
monetarily. Also, while people may be willing to have microphones in their houses, they
may still be uncomfortable with the possible violations of privacy due to having cameras
in their homes.
1.2.2 Array Processing
In addition, there has been a lot of research on using large microphone arrays to
do beamforming [6]. However, these approaches require anywhere from tens to hundreds
of microphones, which can be very expensive, especially for private homes with multiple
rooms. Also, the math becomes very complicated, so processing speed and processing
power may become issues.
1.2.3 Blind Source Separation and Deconvolution (BSSD)
Another MIT student, Alex Westner, examined in his master’s thesis ways to
separate audio mixtures using BSSD algorithms. These algorithms were adaptive and
based on higher order statistics. Although his project focused on ways to separate
multiple speech sources, it had the potential of shedding some light on the echo
cancellation problem at hand. After all, one way to view echo cancellation is the
deconvolution of an unknown room impulse response from a reverberant signal (also
known as blind deconvolution). Also, the original speaker and the echoes could be
viewed as multiple sources. However, further reading revealed that the BSSD algorithms
assumed that the sources were statistically independent, which is not the case for echoes,
since echoes are generally attenuated and delayed copies of the original source. Also,
Westner found that even a small amount of reverberation severely impairs the
performance of these algorithms [7].
1.2.4 Adaptive Processing
Adaptive processing algorithms are very popular for noise cancellation, though
they are used sometimes for echo cancellation as well. However, since this project
focuses specifically on the context of home networking, it is reasonable to assume that
utterances will tend to be limited to a few seconds. (e.g. “Close the refrigerator door.” or
“Turn off the air conditioner.”) Therefore, the algorithms (normally iterative or
recursive) are not likely to converge within the duration of the signals [8].
1.2.5 Simpler Techniques
Therefore, this thesis will focus on simpler, classical approaches, such as
cancellation in the cepstral domain, estimating the multipath impulse response through
the reverberant signal’s autocorrelation, and for the multiple microphone case, delaying
and summing the signals (also known as cross spectra processing or delay and sum
beamforming). In addition, the first two methods are combined with the third when there
are multiple microphones, and the multi-microphone cepstral domain processing case is
based on work done by Liu, Champagne, and Kabal in [9].
1.3 Scope of Thesis
Based on the background research described above, it seems that even with large
arrays or highly complex algorithms, developing a system that effectively removes
echoes in a highly variable environment remains extremely challenging. However, there
are some assumptions that can be made, based on the nature of home networking, which
will better define and simplify the problem.
As suggested previously, the echo environment is not stationary (i.e. objects and
speakers are not in fixed locations), so the algorithms cannot assume any fixed impulse
responses. This rules out predetermining the room impulse response by sending out a
known signal and recording the signal that reaches the microphone.
However, a key assumption, as mentioned in Section 1.2.4, is that utterances will
tend to be short, so that the multipath environment can be considered stationary within
the duration of an utterance. In other words, the person is not moving very fast while
speaking. Note, though, that “movement” refers to any change in position, including
turning one’s head. Change in the direction that the speaker faces will alter the multipath
more drastically than other forms of movement. Therefore, in order for the stationary
assumption to hold, the speaker must keep his head motionless while uttering a
command.
Another assumption is that the detection of silence is possible, which is valid,
since most speech recognition software programs already have this feature. As a result,
pauses can be used to separate utterances.
Therefore, the purpose of this thesis is to develop, simulate, and compare echo
cancellation algorithms in the context of smart houses. There are many other issues, such
as dealing with multiple simultaneous speakers or external speech sources (e.g.,
televisions and radios). However, these problems are beyond the scope of this thesis.
1.4 Structure of Thesis
• Chapter 1 gives background information, motivation, and an overview of the
problem, as well as defining the scope of the thesis.
• Chapter 2 describes the algorithms that were chosen and implemented.
• Chapter 3 describes the echo environments and how they were simulated.
• Chapter 4 explains the experiments that were set up and run, and the various metrics
used to compare the different algorithms and methods.
• Chapter 5 gives conclusions and suggests future work to be done in this area.
Chapter 2
Methods
2.1 Main Algorithms
The actual methods implemented are combinations of three basic ideas: MPD
(Min-phase Peak Detection), C2I (Correlation to Impulse), and DSA (Delay, Sum,
Average).
2.1.1 MPD
This algorithm is based on the observation by Kabal et al. in [9] that in the
cepstral domain, the minimum-phase∗ component of a reverberant speech signal shows
distinct decaying spikes at times that are multiples of each echo’s delay. For instance, if
there is an echo that begins at t = 0.5s, then there will be noticeable impulses at t = 0.5n,
where n = 1,2,3… The height of these impulses depends on the echo intensity. Please
see Appendix A for a detailed explanation of the complex cepstrum, why the echoes
show up as spikes, and how zeroing them out results in canceling out the echoes.
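To make the idea concrete, the following is a brief sketch of the argument detailed in
Appendix A, for the case of a single echo. If the reverberant signal is

x[n] = s[n] + α • s[n − D],    0 < α < 1,

then its z-transform factors as X(z) = S(z) • (1 + α • z^−D). The complex logarithm turns
this product into a sum,

log X(z) = log S(z) + α • z^−D − (α²/2) • z^−2D + (α³/3) • z^−3D − … ,

so the complex cepstrum of x[n] is the cepstrum of s[n] plus decaying spikes of height
±α^k/k at n = kD, k = 1, 2, 3, … Zeroing those spikes removes the log(1 + α • z^−D) term,
which is equivalent to dividing X(z) by the echo factor and therefore cancels the echo.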
The following figures show the complex cepstrum of the minimum phase
component of a signal with an echo at a delay of 0.5s and attenuated by 0.5:
∗ A signal is said to be minimum phase if its z-transform contains no poles or zeros outside the unit circle.
Figure 2-1: Complex cepstrum of the min-phase component of a signal with an echo at delay = 0.5s, attenuation = 0.5
Figure 2-2: Zoomed in version of Figure 2-1
Given the above characterizations, the MPD algorithm works as follows:
1) decompose signal into its all-pass (ap) and minimum-phase (mp) components
[10]
2) take the complex cepstrum of the mp component (cm)
3) put cm through a comb filter (done by a method called rfindpeaks2, which
detects impulsive values and zeros them out)
4) take the inverse complex cepstrum of the altered cm
5) recombine with the all-pass component
Here’s the algorithm in block diagram form:
Figure 2-3: Block diagram of the MPD algorithm (x[n] → all-pass / minimum-phase decomposition → mp[n] → complex cepstrum → cm[n] → comb filter → cm’[n] → inverse complex cepstrum → mp’[n] → recombined with ap[n] → s’[n])
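As a rough illustration of steps 2 through 4, the following Matlab fragment applies the
cepstral comb-filter idea to a synthetic single-echo signal. It is only a sketch and makes
several simplifying assumptions: it works on the reverberant signal directly rather than on
its minimum-phase component, it zeroes spikes at a known echo delay instead of detecting
them with rfindpeaks2, and it relies on the Signal Processing Toolbox functions cceps and
icceps.

    % Illustrative sketch only (not the thesis code): cepstral comb filtering
    % of a synthetic signal containing one echo at a known delay.
    fs = 22050;                               % sampling rate used throughout the thesis
    d  = round(0.5 * fs);                     % echo delay of 0.5 s, in samples
    a  = 0.5;                                 % echo attenuation
    s  = randn(1, 4 * fs);                    % stand-in for a clean speech vector
    x  = s + a * [zeros(1, d), s(1:end-d)];   % reverberant signal with one echo
    [c, nd] = cceps(x);                       % complex cepstrum (nd = delay added by cceps)
    c(1 + d : d : end) = 0;                   % zero the spikes at multiples of the echo delay
    y  = icceps(c, nd);                       % inverse complex cepstrum: echo largely removed

In the thesis itself, the zeroing step is performed by rfindpeaks2 on the cepstrum of the
minimum-phase component, so the spike locations do not need to be known in advance.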
The following figures show the result of applying the algorithm on a signal with a
simple echo at t = 0.5s, attenuation = 0.5:
Figure 2-4: Complex cepstrum from Figure 2-1, after the spikes were taken out, using MPD
Figure 2-5: The spikes that were detected and taken out by MPD
From Figures 2-4 and 2-5, it is clear that this method should work very well for
simple echoes. However, when signals from complex and highly reverberant
environments are used, the ambience effect will not be removed. This is because rather
than having discrete echoes only, there are also echoes that are so closely spaced that they
are perceived as a single decaying sound. These closely spaced echoes cannot be
distinguished from the original signal content in the cepstral domain.
2.1.2 C2I
This algorithm takes advantage of the observation that the autocorrelation
function of the reverberant signal will have peaks at the echo delay(s). Therefore, the
autocorrelation can be used to estimate the multipath impulse response. The following
algorithm was used:
Let:
x[n] = reverberant signal of length N
Rx[n] = autocorrelation of x (xcorr(x(n)) in Matlab)
1) Find Rx[n].
2) Since Rx[n] is symmetric, with length 2N−1, where N = length of x[n], and
the maximum of Rx is at Rx[N] (this peak corresponds to lag 0; Matlab simply
indexes the output starting at 1), it is only necessary to look at Rx[N:2N−1].
Therefore, let Rx2[n] = Rx[N:2N−1].
3) Use the findpeaks2 method (similar to rfindpeaks from MPD) to find the
spikes in Rx2[n], which make up a scaled version of the estimated impulse
response, h’[n].
4) To actually get h’[n], the spikes are normalized, such that h’[0] = 1.
5) The estimated original signal, s’[n], is found by IFFT(X[k]/H’[k]), where
X[k] = FFT(x[n]) and H’[k] = FFT(h’[n])*.
In Figure 2-6, the algorithm is translated into block diagram form.
Figure 2-6: Block diagram for C2I algorithm (x[n] → xcorr → Rx[n] → estimate impulse response → h’[n] → FFT → H’[k] → invert to 1/H’[k]; multiply by X[k] = FFT(x[n]) → S’[k] → IFFT → s’[n])
It is important to note that it is possible for H’[k] to include samples with the
value zero. In this case, the algorithm would not work, due to the inversion step. Instead,
direct deconvolution in the time domain is done using the deconv(x, h’) command in
Matlab. However, this takes considerably longer (minutes, as compared to seconds when
using the frequency domain method).
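The following Matlab fragment is a compact sketch of the steps above. The 20% threshold
used to pick the spikes is an assumption made here for illustration (the thesis’s findpeaks2
performs an actual peak search), and the deconv fallback for the case of zeros in H’[k] is
omitted.

    % Illustrative sketch of C2I (not the thesis implementation).
    function s_est = c2i_sketch(x)
        x  = x(:).';                         % work with a row vector
        N  = length(x);
        Rx = xcorr(x);                       % autocorrelation; zero lag is at index N
        R  = Rx(N:end);                      % keep the non-negative lags only
        h  = zeros(1, N);                    % estimated impulse response
        keep = R > 0.2 * R(1);               % assumed threshold standing in for findpeaks2
        h(keep) = R(keep);
        h  = h / h(1);                       % normalize so that h'[0] = 1
        H  = fft(h, N);                      % if any(H == 0), fall back to deconv(x, h)
        s_est = real(ifft(fft(x) ./ H));     % frequency-domain inverse filtering
    end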
Figures 2-7 to 2-11 show how this algorithm works in a simple case, where the
echo attenuation (alpha) is 0.5, and the echo delay is 0.5 seconds (or 11025 samples).
* The FFT (Fast Fourier Transform) is an algorithm for computing the DFT (Discrete Fourier Transform).
The DFT is made up of samples of the DTFT (Discrete Time Fourier Transform), which is a continuous
function. For a discrete time domain signal x[n], its “Fourier Transform” generally refers to its DTFT,
which is expressed as either X(w) or X(e^jw), while its DFT is generally expressed as X[k].
Figure 2-7: Autocorrelation of the original clean signal
Figure 2-8: Autocorrelation of the signal with an echo at delay = 0.5s, attenuation = 0.5
Figure 2-9: Autocorrelation of the resultant signal after processing the reverberant signal with C2I
The following figures show the actual impulse response and the one estimated by
C2I, respectively:
Figure 2-10: Impulse response of an echo at delay = 0.5s, attenuation = 0.5 (a unit impulse at n = 0 and a spike of height 0.5 at n = 11025 samples, i.e., 0.5 s)
Figure 2-11: Impulse response estimated by C2I (the echo spike is estimated at about 0.4038 instead of 0.5)
This algorithm is not likely to be as good as MPD is at eliminating simple echoes,
because the estimated impulse response is not exact, even for the most basic case, as
illustrated by Figure 2-11. Meanwhile, as illustrated by Figures 2-4 and 2-5, the MPD
algorithm can very effectively detect all of the spikes for a simple case. However, it is
harder to predict how C2I will perform in a complex environment, so it is still worthwhile
to consider this algorithm.
2.1.3 DSA
When there are multiple microphones, speech will generally reach the different
microphones at different times. Therefore, to combine these signals, it is important to
line the signals up first. This can be accomplished through finding the cross correlation
between two signals and finding the maximum of the cross correlation function.
Knowing the location of the maximum will then allow the relative delay to be calculated,
and the signals can be lined up and added. For more than two microphones, the first two
signals are lined up and summed, and that sum is used to line up with and add to the third
signal, and so on. The sum is then divided by the number of input signals, thereby
yielding the average.
By lining the signals up and taking the average, the original speech signal adds
constructively, while the echoes are generally attenuated.
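A minimal two-microphone version of this procedure might look as follows in Matlab (the
function and variable names are illustrative; for three or more microphones the alignment
and summing are simply repeated cumulatively, as described above).

    % Illustrative two-microphone DSA sketch (not the thesis code).
    function y = dsa_sketch(x1, x2)
        x1 = x1(:).';  x2 = x2(:).';          % equal-length row vectors assumed
        N  = length(x1);
        [~, imax] = max(xcorr(x1, x2));       % location of the cross-correlation peak
        d  = imax - N;                        % relative delay of x1 with respect to x2
        if d > 0
            x2 = [zeros(1, d), x2(1:end-d)];  % delay x2 so the direct paths line up
        elseif d < 0
            x1 = [zeros(1, -d), x1(1:end+d)]; % otherwise delay x1
        end
        y = (x1 + x2) / 2;                    % average: the direct path adds coherently
    end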
2.2 Actual Methods Implemented
The algorithms mentioned above can be combined in various ways, especially
when there are multiple microphones, because the averaging can be done at different
stages. The following is a list of the main methods that have been coded and tested:
• mpd2(v) – takes in the sound vector v and performs the basic MPD algorithm
• c2i(v) – takes in the sound vector v and performs the C2I algorithm
• mpds(m) – takes in the matrix m (whose columns are sound vectors), averages the
all-pass components using the DSA algorithm, takes a normal average (without lining
up) of the min-phase components, and does MPD (steps 2-5)
• mpds2(m) – takes in the matrix m, averages the all-pass components using the DSA
algorithm, takes the complex cepstrum of each min-phase component and eliminates its
impulses, averages the resulting cepstra, and then takes the inverse cepstrum and recombines
• mpds3(m) – similar to the previous two, except that the averaging of the min-phase
components takes place after the inverse cepstrum has been taken for each processed
signal
• scp(m) – Spatial Cepstral Processing – takes in the matrix m, does DSA on the
all-pass components, averages the min-phase components in the cepstral domain (no
peak detection), and recombines
• dsa2(m) – takes in the matrix m and does plain DSA (i.e., without separating the
all-pass and min-phase components)
• c2is(m) – takes in the matrix m, applies the C2I algorithm to each column vector, and
then does DSA averaging on the resultant vectors
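As a hypothetical illustration of how these functions might be invoked (x1, x2, and x3
stand for three equal-length recordings of the same utterance from three microphones; the
variable names are not from the thesis):

    % Hypothetical usage of the methods listed above.
    m = [x1(:), x2(:), x3(:)];    % matrix whose columns are sound vectors
    y_dsa = dsa2(m);              % plain delay, sum, and average
    y_scp = scp(m);               % spatial cepstral processing
    y_mpd = mpd2(x1(:));          % single-microphone MPD on one channel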
Of course, other functions have also been coded in Matlab in order to support
these methods. A comprehensive list of all the methods and their respective descriptions
will be included as Appendix B.
Chapter 3
Simulation of Multipath Environments
This chapter will explain the echo models and simulations. The actual speech corpora
generated are described in detail in the Experiments section of the next chapter.
3.1 Basic Echo Model
The most basic echo model treats the echo as a copy of the original signal that is reflected
off a surface, and is therefore delayed and attenuated relative to the “direct path” copy.
Figure 3-1 illustrates this model.
Figure 3-1: Simple model of an echo as a reflection that is a delayed copy of the original signal (a source reaching a microphone via a direct path and via a reflection)
The two main parameters for each echo are the delay and the attenuation. To
generate sound files based on this echo model, a Matlab function, addecho2wav(A, alpha,
delay), was implemented. A is the original signal, represented as a vector with values
between -1 and 1, inclusive. Alpha is a row vector whose elements indicate the
attenuation of each echo, and delay is a row vector of the corresponding delays.
Therefore, this function can add multiple echoes.
In general, a reverberant signal is represented as a convolution (denoted by *)
between the original signal and the room impulse response:
x[n] = s[n] * h[n]
(3.1)
For the simple model, the form of the impulse response can be generalized as
h[n] = δ[n] + α1 • δ[n-delay1] + α2 • δ[n-delay2] + … + αN • δ[n-delayN],    (3.2)
where N is the number of echoes, δ[n] is the unit impulse function, and “•” denotes
multiplication. Given 3.2, the reverberant signal can also be expressed as follows:
x[n] = s[n] + α1 • s[n-delay1] + α2 • s[n-delay2] + … + αN • s[n-delayN]    (3.3)
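For concreteness, a function in the spirit of addecho2wav might be sketched as follows.
This is illustrative only; the thesis’s actual implementation is listed in Appendix B, and the
final rescaling step is an assumption made here to keep sample values within [−1, 1].

    % Illustrative sketch of Eq. (3.3); not the thesis's addecho2wav.
    function x = addecho_sketch(A, alpha, delay)
        % A     : clean signal with values in [-1, 1]
        % alpha : row vector of echo attenuations
        % delay : row vector of the corresponding delays, in samples
        A = A(:).';
        x = A;
        for k = 1:length(alpha)
            d = delay(k);
            x = x + alpha(k) * [zeros(1, d), A(1:end-d)];   % add one delayed, attenuated copy
        end
        x = x / max(abs(x));      % rescale to avoid clipping when writing a .wav file
    end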
This is considered a simple model, because it does not take a lot of other factors
into consideration, such as room size, damping ratios of different surfaces, and
positioning of microphones and sound sources. The next section will discuss how to
simulate more realistic reverberant signals.
3.2 Complex Echo Environment Simulation
A well-known mathematical model for simulating room reverberation is the
Image Method [11]. Instead of actually implementing this method to simulate
reverberant data, a popular audio editing software program, Cool Edit Pro 1.2, was used.
This powerful (and fast) tool includes functions such as filtering, noise reduction, and 3D
echo simulation, as well as multi-track mixing.
The following figure is a screen shot of the 3D Echo Chamber menu:
Figure 3-2: Screen shot of the 3-D Echo Chamber menu in Cool Edit Pro 1.2
As Figure 3-2 shows, this feature allows the specification of the room dimensions,
speaker and microphone locations, damping factors of the room’s surfaces, number of
echoes, etc.
However, while Cool Edit Pro generates a fairly realistic reverberant signal, the
software does have some limitations. For instance, it assumes that the speech source is a
point source (i.e. speech radiates equally in all directions), which is not true, because the
direction a person is facing affects the signal that will be received by the microphone.
Also, the software does not allow the user to specify which type of microphone is being
used. An omni-directional microphone is assumed, which, as the name suggests, picks
up sound from all directions with equal gain. Other types of microphones with different
beam patterns are available, and they may be more practical for the room environment.
Nevertheless, it is still possible to evaluate and compare the effectiveness of
various echo cancellation algorithms, despite the points mentioned above. For instance,
while the use of different microphones may improve the signal to noise ratios, it should
not affect how well one algorithm performs relative to another algorithm. The same is
true for having a directional source, which means that the signal content will be lower at
some microphones. Therefore, these factors may affect the overall performance of the
speech recognition system, but not the relative performance of the algorithms.
Chapter 4
Experiments and Results
4.1 Goals
The experiments described in this chapter were designed to answer the following
questions:
1) Under the simple echo model and using one microphone, how are C2I and
MPD2 affected by echo attenuation (intensity) and by echo delay?
2) Under the complex echo model and using one microphone, how do C2I and
MPD2 perform in low, medium, and high echo environments?
3) Under the complex echo model and using multiple (three) microphones, how
do C2Is, DSA, MPDs, MPDs2, MPDs3, and SCP perform in low, medium,
and high echo environments?
4) How does the training environment affect the algorithms’ performance in the
above cases?
4.2 Speech Data Used
The clean speech corpora were recorded using a low noise, directional
microphone (AKG D3900) connected to a pre-amp (Symetrix SX202), which feeds the
signals into the embedded ESS sound card of a Compaq desktop PC, thereby creating
digital sound files in the .wav format. The software used for the recordings is Cool Edit
Pro. The sampling rate is 22.05 kHz, and each clip has 96000 samples, which translates to
about 4.3 seconds in length.
Each wave file contains one sentence that ranges from five to nine words long.
There are sixteen such sentences used for testing, and there were two speakers: one male
and one female.
These clean signals were then digitally processed to add different levels of
echoes. For the simple echo model described in Section 3.1, echoes of varying
attenuation factors and delays were added. Specifically, the (attenuation, delay) pairs are
(0.25, 11025), (0.50, 11025), (0.75, 11025), (0.5, 5513), and (0.5, 22050), where
attenuation is a scalar, and delay is in number of samples. For the complex model, there
are many variables and an infinite number of combinations of the different parameters.
Therefore, in the interest of time, the test environments are simplified as low, medium,
and high echo cases. The following chart specifies the parameters of the different
environments:
Table 4-1: Parameters for the different echo environments

                                          Low Echo           Medium Echo          High Echo
Room Size (ft)                            25 x 25 x 10       50 x 50 x 10         50 x 50 x 10
Source Coordinates (ft)                   (12.5, 12.5, 6)    (25, 25, 5)          (25, 25, 5)
Mic1 Coordinates (ft)                     (25, 25, 5)        (15, 35, 5)          (15, 35, 5)
Mic2 Coordinates (ft)                     (0.01, 25, 5)      (25, 35, 5)          (25, 35, 5)
Mic3 Coordinates (ft)                     (12.5, .01, 5)     (40, 35, 5)          (25, 40, 5)
Number of Echoes                          20                 350                  1200
Surface Reflectivities
(Floor, Ceiling, Walls)                   (0.2, 0.7, 0.7)    (0.85, 0.85, 0.85)   (1, 1, 1)
4.3 Methods
The metrics for measuring the effectiveness of the algorithms are the number of
errors in speech recognition by Dragon Systems’ Naturally Speaking software and the
percent of improvement in recognition.
The number of errors is broken down into the number of misrecognized (wrong)
words, added words, and missing words. For example:
Original: the little blankets lay around on the floor
Recognized: the little like racing lay around onward or
“Like,” “onward,” and “or” are counted as wrong for “blankets,” “on,” and “the.”
“Racing” is counted as an added word, and there is also a missing word at the end, since
the original sentence had three words after “around,” but the recognized result only had
two. These errors were counted manually and tallied for each test case.
The percent improvement was also calculated for each algorithm in each test case.
Percent improvement is defined as follows:
% Improvement ≡ 100 × (# of Errors for Unprocessed − # of Errors for Processed) / (# of Errors for Unprocessed − # of Errors for Clean)    (4.1)
While Unprocessed is signal environment specific, Clean is training environment
specific. For instance, to determine the % Improvement of C2I on one microphone,
complex, low echo signals, the number of errors for the unprocessed, one microphone,
complex, low echo signals is used. On the other hand, Clean remains the same for all test
cases within the same training environment.
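As a purely hypothetical numerical example of Equation 4.1: if the unprocessed signals in
some environment produce 40 recognition errors, the signals processed by a given algorithm
produce 25, and the clean signals produce 10, then the improvement is
100 × (40 − 25) / (40 − 10) = 50%.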
The comprehensive tables of these results are included as Appendix C.
Meanwhile, the figures in the following section summarize the findings from the trials.
Other metrics, such as the mean square error (MSE) relative to the clean signal,
and the signal to noise ratio (SNR), were also considered. However, due to delays in the
reverberant signal, the MSE will not provide a good measure of how much of the echoes
have been cancelled by the algorithms. The SNR is also inappropriate, because the
algorithms adjust the gains of the signals to prevent clipping.*
SNR is normally defined as 10 log (signal power/noise power), which could be
calculated with the following formula:
10 log( Σn=1..N so²[n] / ( Σn=1..N xo²[n] − Σn=1..N so²[n] ) )    (4.2)
where so[n] and xo[n] are the non-DC-biased versions of the original clean signal s[n] and
the reverberant signal x[n], respectively. Mathematically, they are defined as
so[i] = s[i] – mean(s[n]), for i = 1…N
(4.3)
xo[i] = x[i] – mean(x[n]), for i = 1…N
(4.4)
The problem arises when x[n] is normalized to the maximum volume that does
not result in clipping, because the denominator is not really the noise power, since the
signal in x[n] has been either amplified or attenuated. However, even without the
normalization, there would be a problem with the SNR calculation, because the direct
path signal is attenuated in the complex model, so signal power is lost. Therefore, the
denominator could be negative, which would make the argument of the logarithm negative
and the expression undefined.
* Matlab represents the volume of a signal at each sample as a decimal between –1 and 1, inclusive. If a
sound vector contains a value outside of this range, then that sample is set to –1 or 1, depending on the
original value’s sign. This is commonly referred to as “clipping.”
4.4 Results
Each of the following sections addresses one of the four questions posed in
Section 4.1. The results are presented as numbers of errors, as well as percent
improvement.
4.4.1 Simple Echo Environments
The following results show the effects of varying delays and varying attenuations
on speech recognition performance.
Figure 4-1: Female subject’s breakdown of errors for varying delays (5513, 11025, and 22050 samples), with attenuation held constant at 0.5
Figure 4-2: Male subject’s breakdown of errors for varying delays (5513, 11025, and 22050 samples), with attenuation held constant at 0.5
Looking only at the unprocessed cases, one notices that the breakdown of errors is
similar for both the male and female subjects. Even though the total number of errors
decreases from d = 11025 to d = 22050 for the male subject, the following trends prevail:
- The number of words added increases with the delay interval.
- The number of misrecognized words increases at first, but then decreases.
- The number of missing words does not change much, although it does decrease a little,
as the delay increases.
These trends make sense, because the echo overlaps with the original signal, so
the longer the delay, the more “clean” speech appears at the beginning of the signal. This
accounts for the decrease in misrecognized words, when the delay becomes large.
However, the number of words added increases, because the signal duration becomes
longer. As for the missing words, they tend to be short words, such as “a,” “to,” and so
on, so there is no clear reason why delay should affect the number of missing words.
Figure 4-3: Female subject’s breakdown of errors for varying attenuation factors (0.25, 0.5, and 0.75), with delay held constant at 11025 samples
Figure 4-4: Male subject’s breakdown of errors for varying attenuation factors (0.25, 0.5, and 0.75), with delay held constant at 11025 samples
For constant delay and varying attenuation factors, the number of misrecognized
words tends to increase with the increasing attenuation factor (which actually means less
attenuation, or higher echo intensity), while the other types of errors are not affected as
much. This is consistent with the argument in the previous case. With attenuation being
the only variable, the number of misrecognized words increases as the echo intensity gets
stronger, because the words are more distorted. This can, but does not necessarily, cause
more added words, as one can see from the differences in the female and male cases.
Figure 4-5: Percent improvement as a function of delay and of attenuation for male and female subjects (four panels: constant attenuation with varying delay, and constant delay with varying attenuation, for each subject; curves show MPD2 and C2I)
The graphs in Figure 4-5 show the following trends:
- MPD2’s improvement increases first and then decreases as the delay or the attenuation increases.
- C2I’s performance deteriorates more quickly than MPD2’s.
- C2I is more sensitive to attenuation than it is to delay.
The first observation can be explained by the nature of the MPD2 algorithm.
Specifically, it has to do with how the echo’s spikes are detected. (Refer to Section
2.1.1.) When the echo intensity is low, the echo’s spikes are small, so they become
harder to detect. For small delays, the early spikes are also harder to detect, because the
original signal’s cepstral content has not decreased enough for the spikes to stand out.
These properties also explain why MPD2’s performance does not decrease as drastically
as C2I’s when the intensity or delay increases.
To explain the third observation that C2I is more sensitive to echo intensity, recall
that in Section 2.1.2, it was pointed out that there are errors in estimating the echo’s
intensity. Therefore, as the echo intensity increases, these errors become more
noticeable.
4.4.2 Complex Echoes, One Microphone
Figure 4-6: Female subject’s breakdown of errors for complex, one microphone signals
Figure 4-7: Male subject’s breakdown of errors for complex, one microphone signals
For the complex model, the breakdown of errors is fairly consistent between the
two subjects. However, there are two obvious differences from the simple model’s
errors. First, there are very few words added, and second, there are more words missing.
These observations show that the two models are indeed very different, which can
account for the poor performance of C2I and MPD2 in these cases.
Figure 4-8: Percent improvement vs. signal environment, female subject
Figure 4-9: Percent improvement vs. signal environment, male subject
Next, using the results for the complex model, percent improvement is plotted
against the different echo environments. The results for the male and female subjects are
very consistent for the medium and high echo environments. For both, the percent
improvement was slightly greater for the high echo case than the medium echo case.
This can be explained by the observation that the signals in the high echo environment
are so distorted that it is even hard for humans to understand them. Therefore, they have
less room for “negative improvement,” which occurs when the processed signal has more
recognition errors than the unprocessed signal. One possible source of the extra errors is
the rounding off that takes place when transforming a signal to another domain and back.
In the low echo environment, the results are drastically different between the two
subjects, with a large positive percent improvement for the female, and a large negative
improvement for the male. Unfortunately, there is no obvious explanation for this.
4.4.3 Complex Echoes, Three Microphones
Figure 4-10: Female subject’s breakdown of errors for complex, multiple microphone signals
Figure 4-11: Male subject’s breakdown of errors for complex, multiple microphone signals
Figures 4-10 and 4-11 show that the breakdown of errors is fairly consistent with
the one microphone case’s data in the previous section. However, the number of errors is
higher for the unprocessed signals in the low and medium echo environments with three
microphones. This makes sense, because there are extra distortions that arise from
simply adding the signals from three microphones, without accounting for their relative
delays. Such a difference does not show up in the high echo environment, because as
mentioned in the previous section, the signals are already very distorted. Therefore, the
number of errors is already at a maximum.
Figure 4-12: Percent improvement vs. echo environment, female subject
Figure 4-13: Percent improvement vs. echo environment, male subject
In Figures 4-12 and 4-13, the percent improvement is plotted for the three
microphone, complex signals. The trends are more consistent between the two subjects,
compared to the one microphone, complex environments. In almost all of the cases (the
exception being the female subject in the medium echo environment), DSA has the highest
percent improvement. This
is somewhat surprising, since DSA is simply the delay and sum method, which means
that the extra work done in the other algorithms actually made the signals worse.
4.4.4 Different Training Environments
Most commercial speech recognition software programs, such as Dragon
Systems’ Naturally Speaking, are user specific and require an initial training session for
each user. This process allows the software to “learn” the characteristics of a user’s
speech, and it is generally accomplished by having the person read sentences, as
prompted by the program.
A different user had to be created for each of the 14 training environments.
However, the typical type of training described above could only be done for the “clean”
environment, unless an actual effects box, with the capabilities of adding different types
of echoes and performing the different algorithms, was built and put between the
microphone and the computer’s sound card. Of course, building such a device was not
feasible, given the nature and the timeframe of this project.
Hence, for all of the other training environments, the mobile training feature of
the software had to be used. Mobile training is intended for people who want to record
their dictations onto tape or digital recorders and later transfer their speech to the
computer for transcription by the software. Since the impulse response of a recorder is
likely to be different from that of a microphone, it is necessary to have a different training
process for mobile users, rather than to have them use the regular live training process
with a microphone and then try to transcribe speech from a recorder.
Mobile training generally involves recording about 20 minutes of speech, using a
script provided by the software, and then instructing the software to read the sound file.
In the following experiments, the training data was recorded and saved as a .wav file,
using the microphone setup that was described in Section 4.2. This file, without any
processing, was used for the “clean, mobile” training environment. The file was also
processed accordingly to create all of the other training environments.
In the following results, simple refers to d = 11025 samples, alpha = 0.5, and
complex refers to the low echo environment. These environments were chosen because
they were generally the ones under which the algorithms showed the highest percentages
of improvement. Some of the other environments may have been so adverse that no
discernible differences would appear under the different training environments.
[Bar chart: Effect of Training Environment on Performance of Algorithms for Simple, 1 Mic, Female Subject. Series: C2I, MPD2. Y-axis: Percent Improvement (0 to 120). X-axis: the 14 training environments (clean; clean, mobile; mult. echo, 1 mic; mult. echo, 1 mic, c2i; mult. echo, 1 mic, mpd2; mult. echo, 3 mic; mult. echo, 3 mic, c2is; mult. echo, 3 mic, dsa; mult. echo, 3 mic, mpds; mult. echo, 3 mic, mpds2; mult. echo, 3 mic, scp; simple echo, 1 mic; simple echo, 1 mic, c2i; simple echo, 1 mic, mpd2).]
Figure 4-14: How C2I and MPD2 perform on simple echo signals under different training environments
[Bar chart: Effect of Training Environment on Performance of Algorithms for Complex, 1 Mic, Female Subject. Series: C2I, MPD2. Y-axis: Percent Improvement (-400 to 300). X-axis: the same 14 training environments as in Figure 4-14.]
Figure 4-15: How C2I and MPD2 perform on complex reverberation, one microphone signals under different training environments
The original rationale for trying different training environments was
that an algorithm might perform better when the training data came from the same
environment, since the speech recognition software uses pattern matching to some extent
to identify sounds and words. However, this turns out not to be the case, as shown by
Figures 4-14 and 4-15.
The likely explanation for the results lies in the nature of the mobile training
process, because it does not allow the user to correct recognition errors. Therefore, when
the training data is unrecognizable to the software, the software is not actually learning
how specific words sound in a particular echo environment. One way to address this
problem is to do live training (and maybe even testing) in a real reverberant environment.
However, this still does not allow for the training of processed environments, which was
the main goal of this experiment. Another possibility would have been to use “pseudo-live”
training, where a tape player plays the desired training signals to a microphone,
thereby fooling the software into “thinking” that there is a real person speaking.
However, this too may not work: if the reverberant signal is so distorted that the software
is not satisfied with how a word sounds, it will keep asking for the word to be repeated
(or the word can be skipped). This process would also be extremely tedious with so many
training environments.
Incidentally, for one-microphone tests, “clean, mobile” seems to yield the highest
percentage of improvement for both of the algorithms. The fact that it does better than
the “clean” environment suggests that similarity between training environment and signal
environment does help. Namely, the similarity arises from the test files being transcribed
as .wav files, which is how “clean, mobile” was trained, versus using a microphone,
which was how “clean” was trained.
[Bar chart: Effect of Training Environment on Performance of Algorithms for Complex, Multi. Mic, Female Subject. Series: C2Is, DSA, MPDs, MPDs2, MPDs3, SCP. Y-axis: Percent Improvement (-60 to 100). X-axis: the same 14 training environments as in Figure 4-14.]
Figure 4-16: How C2Is, DSA, MPDs, MPDs2, MPDs3, SCP perform on complex reverberation, multi-microphone signals under different training environments
As with the one microphone cases, there is no correlation here between an
algorithm and an environment trained under the same algorithm. The DSA environment
yielded the most consistently high improvement percentages. While others had higher
improvement for certain algorithms, they also had lower minimums. Interestingly, DSA
also had the best performance in most of the training environments, though SCP had the
highest improvement percentage in the DSA training environment. However, the reason
behind this relationship is not obvious at this point.
Chapter 5
Conclusions and Future Directions
5.1 Conclusions
The goal of this project was to research, develop, and compare algorithms for
echo cancellation in the home environment. During the course of this project, it became
obvious that the problem of echo cancellation with unknown and variable room impulse
responses is a very general and hard one. However, with some practical assumptions that
simplified the problem, it was possible to identify some promising algorithms to
implement and test.
After performing many tests and examining the results, the following
observations can be made:
- The complex, realistic reverberation model is very different from the simple echo model, and the algorithms that work well in the simple case do not carry over very well to the complex model.

- Having multiple microphones is an effective way to improve speech, but if the echo level is very high, nothing is effective. However, for most rooms, the surface reflectivities will not be as great as those used in the high (or even medium) echo environments.

- Different algorithms work better under different environments. Therefore, it may be feasible to implement a system that can choose among a number of algorithms, as well as arguments to their functions, based on user input on the room parameters; a rough sketch of such a selection layer follows this list.
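The sketch below is purely hypothetical (nothing like it was implemented for this thesis); the highEcho flag is a made-up stand-in for user-supplied room parameters, and the branch choices simply echo the observations above.

    % Hypothetical dispatcher over the algorithms listed in Appendix B (not thesis code).
    % M holds one microphone signal per column; highEcho is an assumed user-supplied
    % flag standing in for more detailed room parameters.
    function out = choose_algorithm(M, highEcho)
        nMics = size(M, 2);
        if nMics >= 3 && ~highEcho
            out = dsa2(M);        % delay and sum: most consistent in the three-microphone tests
        elseif nMics >= 3
            out = c2is(M);        % one of the multi-microphone alternatives
        else
            out = mpd2(M(:,1));   % single microphone: minimum phase peak detection
        end
    end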
It is important to realize that while echo cancellation is a very general area, and
much work has been done in this field in the last several decades, the efforts on room
dereverberation in the context of smart houses are still relatively new. The idea of using
very few microphones, as opposed to large arrays, is an even more novel approach.
Therefore, the research presented in this thesis is still at a very early stage. Although
some of the results are mixed, some of them—especially in the three microphone cases—
are also very encouraging.
Putting issues of cost aside, the results may seem to suggest that using many
microphones would solve the problem. However, the results presented in [6] show that
even with 46 microphones, the word recognition error rate was slightly over 50%. Note
that the test environments and methods were different from those of this thesis, so there is
no way to compare the relative performances. The point here is that using many
microphones alone would not solve the problem at hand. A lot more needs to be done,
and the next section addresses some of these open areas.
5.2 Future Work
This section raises and reiterates some issues that are related to echo cancellation
applied to the problem of speech recognition in home networking. However, it is by no
means a complete analysis of the requirements for making smart houses a reality.
5.2.1 Testing in Real Echo Environments
Although Cool Edit Pro does a good job of simulating room echo environments, it
does have certain limitations, as mentioned in Section 3.2. Also, it is hard to specify
some of the parameters, such as surface reflectivity, in order to model a realistic room.
Therefore, while using simulations is efficient for this initial study, ultimately, it will be
necessary to test the echo cancellation algorithms in real rooms.
5.2.2 Types of Microphones
Also mentioned in Section 3.2 is the fact that Cool Edit Pro’s simulation is based
on omni-directional microphones. Other microphone types may be better suited to the
overall task of speech recognition in the home environment. A good guide to
microphones can be found at http://www.audio-technica.com/guide/type/index.html.
5.2.3 Microphone Placement
Optimal sensor placement is another large area of study, and it takes into
consideration the acoustic characteristics of a room. Also, for smart houses, the
placement of the microphone(s) depends on the layout of the room and the objects in it,
as well as the likelihood of people facing certain directions.
5.2.4 Real Time
For smart houses to be practical, the echo cancellation system has to work in real
time. The work in this thesis was done using Matlab v5 on a Windows NT, Pentium III
450 MHz, 128 MB RAM system, with mainly the echo cancellation capabilities of the
algorithms, rather than speed and efficiency, in mind. The next step may be to improve
and optimize the algorithms, translate them to DSP assembly code, and run them on a
DSP processor.
On a related note, other classes of algorithms that are not practical under the
current development platform, such as adaptive processing, may also be considered, if a
real time development platform is used.
5.2.5 Continual or Rapid Speaker Movement
Although it is not likely that a person will move very much while giving a
command to the house, in the ideal vision of smart houses it is desirable that there will be
no restrictions on the person’s movements. In a more immediate sense, it is also true that
even if the person moves a little, the multipath echoes change, so that the current
assumptions, though valid, are not perfect. Therefore, it will be worthwhile to explore
methods of quickly tracking the speaker’s movements.
5.2.6 Multiple Speakers
As mentioned in Section 1.2.3, this is yet another area of study (specifically,
BSSD) that is relevant to home networking. Although it is not directly dealing with echo
cancellation, it is necessary in order to make speech control of home networks realistic,
since undoubtedly, there will be more than one person speaking at some point.
5.3 Final Thoughts
After all is said and done, the fundamental question that remains is, “Will this
really work in practice?” The answer is, “It depends.” As mentioned before, the
performance of echo cancellation algorithms is very sensitive to the echo environment.
Therefore, while voice control of the home network may work very well in the living
room, it may not work nearly as well in the basement.
It is also important to realize that while echo cancellation, noise cancellation, and
other forms of speech enhancement are essential to successful speech recognition,
recognition errors can occur even on clean speech. Furthermore, developing a truly
“smart” speech interface that can understand humans beyond a limited vocabulary is
another great challenge in the field of artificial intelligence research. Therefore, while it
is reasonable to expect some basic functional form of smart houses to emerge in the near
future, the truly smart house (a la The Jetsons) is still a long way from becoming reality.
Appendix A
The Complex Cepstrum
Using the complex cepstrum to cancel out echoes is also known as homomorphic
deconvolution. In general, homomorphic systems are nonlinear in the classical sense, but
through the combination of different operations, they satisfy a generalization of the
principle of superposition [10]. The complex cepstrum, in particular, “changes”
convolution into addition, with the aid of the Fourier Transform and logarithms. The
following block diagram illustrates how the complex cepstrum of a signal is derived:
s[n] → DTFT → S(ω) → log → Ŝ(ω) → Inverse DTFT → ŝ[n]

Figure A-1: Block diagram of the complex cepstrum
The complex cepstrum ŝ[n] is therefore defined as IDTFT(log(DTFT(s[n]))).
To see why this changes convolution into addition, let’s look at a signal x[n] such that

x[n] = s[n] * h[n]     (A.1)

where * denotes convolution. Now, let’s follow through with the calculation of the complex cepstrum:

X(ω) = S(ω) · H(ω)     (A.2)

log(X(ω)) = log(S(ω)) + log(H(ω))     (A.3)

X̂(ω) = Ŝ(ω) + Ĥ(ω)     (A.4)

x̂[n] = ŝ[n] + ĥ[n]     (A.5)
The next step is to show that the spikes in x̂[n] do indeed belong to ĥ[n]. Let’s look at the impulse response of a simple echo with delay d and attenuation factor α (which may be positive or negative):

h[n] = δ[n] + α · δ[n−d]     (A.6)

Taking the Fourier Transform yields

H(ω) = 1 + α · e^(−jωd)     (A.7)

and taking the logarithm gives

Ĥ(ω) = log(1 + α · e^(−jωd))     (A.8)

Generally, the direct path signal will be greater than the echoes in amplitude, so it is valid to assume that |α| < 1. However, if this is not true because of some strange room configuration, the algorithm will still be okay, due to the minimum phase-all pass factorization, which ensures that all of the poles and zeros of the z-transform are inside the unit circle. Given |α| < 1, the right side of Equation A.8 can be expanded as the power series

$\hat{H}(\omega) = -\sum_{n=1}^{\infty} \frac{(-\alpha)^n}{n}\, e^{-j\omega d n}$     (A.9)

Since the DTFT is defined as

$X(\omega) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n},$     (A.10)

the change of variables m = dn yields

$\hat{H}(\omega) = -\sum_{m=d}^{\infty} \frac{(-\alpha)^{m/d}}{m/d}\, e^{-j\omega m},$     (A.11)

and it then follows that

$\hat{h}[m] = \begin{cases} -\dfrac{(-\alpha)^{m/d}}{m/d}, & m \ge d \\ 0, & m < d \end{cases}$     (A.12)

Finally, substituting back for m gives the following result:

$\hat{h}[dn] = \begin{cases} -\dfrac{(-\alpha)^{n}}{n}, & n \ge 1 \\ 0, & n < 1 \end{cases}$     (A.13)

where n ∈ ℤ, the set of all integers.
Therefore, ĥ[n] contains exponentially decaying impulses at every integer
multiple of d. By zeroing out these spikes and then taking the inverse complex cepstrum,
the result of applying the MPD algorithm is an estimated version of s[n]. For multiple
echoes, the math generally becomes much more complicated, but the presence of
impulses at multiples of each echo’s delay is still observed. For further discussions of the
complex cepstrum, refer to [10] and [12].
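As a quick numerical check of this result (a minimal sketch, not part of the thesis code; the delay, attenuation, and length below are arbitrary illustration values), the complex cepstrum of a simple-echo impulse response can be compared against Equation A.13:

    d = 50;                          % echo delay in samples (illustrative value)
    alpha = 0.5;                     % echo attenuation
    h = zeros(1, 1024);
    h(1) = 1;
    h(d+1) = alpha;                  % h[n] = delta[n] + alpha*delta[n-d]
    hhat = cceps(h);                 % complex cepstrum (Signal Processing Toolbox)
    n = 1:4;
    measured  = hhat(n*d + 1);       % cepstral values at quefrencies d, 2d, 3d, 4d
    predicted = -((-alpha).^n)./n;   % spike sizes predicted by Equation A.13
    disp([measured; predicted]);     % the two rows should agree closely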
Appendix B
Matlab Functions
The functions coded can be broken down into three subcategories: test,
algorithm, and support. The algorithm functions were already listed and described in
Section 2.2. Note that through the course of the thesis work, many functions were coded
and later changed or discarded. This accounts for the numbers that appear at the end of
some of the function names. The first two subsections of this appendix give a high-level
explanation of the test and support functions, and the third section includes the source
code for all of the functions.
B.1 Test Functions
Test functions, as their name suggests, are used to automate testing. While they
test different functions, they all do basically the same things:
- open the appropriate clean original files
- open the unprocessed files (or create them in the simple model tests)
- create the appropriate output directories for the new files created/to be created
- process the unprocessed speech
- find the MSE between the original and the processed files
- find the SNR of the processed files*
- write the results to a text file
- write the processed speech to new files (also write the new unprocessed speech created in test_simple)

* Although the MSE and SNR were not used as metrics in the final analysis, they are still calculated by the test methods.
Here’s the list of the test functions; a hypothetical invocation sequence is sketched after the list:

• test_c2i(path, template)
  path = output directory path, excluding the “test_c2i” part
  template = filename template
  ex: if input files are *tr5_m1.wav, then the template is “tr5_m1” and output files are *tr5_m1_c2i.wav

• test_c2is(room)
  room = name of test room; for instance, “tr5” refers to the low echo room configuration

• test_mpd2(path, template)

• test_multi(room)
  Tests DSA, MPDs, MPDs2, MPDs3, SCP

• test_simple(alpha, delay)
  alpha = vector of attenuation(s) of the echo(s) to be added
  delay = vector of delay(s) of the echo(s) to be added
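The exact calls depend on the directory layout under d:\speech\ described above. As a purely illustrative sketch (the argument values below are assumptions, not a record of the actual test runs), a session that generates and processes one set of test signals might look like:

    test_simple([0.5], [11025]);                  % create simple-echo test files, alpha = 0.5, d = 11025 samples
    test_c2i('simple\0.50_11025', '0.50_11025');  % run C2I on the files test_simple just wrote
    test_multi('tr5');                            % run DSA, MPDs, MPDs2, MPDs3, SCP on the low echo room
    test_c2is('tr5');                             % run C2Is on the same three-microphone files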
B.2 Support Functions
• addecho2wav(A, alpha, delay)
  f = addecho2wav(A, alpha, delay)
  A = wave vector
  alpha = row vector of attenuations
  delay = row vector of delays
  Checks that alpha and delay are the same size, iterates through the alpha and delay vectors to create the echoes and add them to A, and returns the sum f.

• allpass(A)
  ap = allpass(A) returns only the all pass component of the vector A
  [ap, mp] = allpass(A) returns both the all pass and the minimum phase components of A

• deconvolve(x, h)
  s = deconvolve(x, h)
  x = unprocessed signal
  h = impulse response (actual or estimated)
  Assumes x = s*h and deconvolves h from x using FFT’s. If FFT(h) contains a sample with the value 0, the built-in Matlab function deconv(x, h) is called.

• delay(A, d)
  f = delay(A, d)
  A = a column vector
  d = delay factor
  Returns a version of A, delayed by d samples, through zero padding the first d samples.

• findpeaks2(A, b, e, N, alpha)
  f = findpeaks2(A, b, e, N, alpha)
  A = cepstral domain vector
  b = begin
  e = end
  N = frame size
  alpha = threshold factor
  Finds large positive and negative spikes in A(b:e) and zeros them out, returning the altered cepstrum. A(b:e) is cut up into consecutive frames of size N. At any given time, the maxima of three consecutive frames are compared. To get rid of positive peaks, if max(frame i) > alpha*mean(max(frame i-1), max(frame i+1)), then the value at the maximum of the middle frame is set to 0. A similar rule is used to get rid of the negative peaks. The process is iterated for i = 2 : (# of frames – 1). Used by C2I.

• mixdown(room)
  room = name of test room; for instance, “tr5” refers to the low echo room configuration
  Adds the inputs from the three microphones (generated by Cool Edit Pro), divides by three, and writes the new sound vector into a .wav file.

• mse(s, x)
  m = mse(s, x)
  s = original signal
  x = processed or unprocessed signal
  Takes the difference between s and x, squares the components of the difference vector, takes the sum of the vector, and returns the result as the mean squared error.

• rfindpeaks2(A, b, e, N, alpha)
  Recursive version of findpeaks2; continues to call itself until no peaks are detected. Used by MPD2.

• snr(s, x)
  f = snr(s, x)
  s = original signal
  x = processed or unprocessed signal
  Returns the signal to noise ratio of x, given that s is the clean version of it. The method is easily explained with the source code:

  s = s - mean(s);         %subtract DC components
  x = x - mean(x);
  S = sum(s.^2);           %energy of s
  X = sum(x.^2);           %energy of x
  r = (S.^2)/(S.^2-X.^2);  %signal to noise ratio, denominator is the noise energy

• wavplay2(v)
  v = sound vector with sampling rate of 22050 samples/second
  Calls wavplay(v, 22050). The default sampling rate for wavplay is 11025.
B.3 Source Code
B.3.1 Main Algorithms
function [s,h]=c2i(v)
% [s,h] = c2i(v)
% v = reverberant signal
% s = estimation of original, h = estimation of impulse response, based
% on peaks in the auto correlation of v
x = xcorr(v); %autocorrelation
x = x(96000:191999); %because of symmetry, don't need first half
y = findpeaks2(x, 1, 96000, 1000, 3); %get rid of spikes
z = x-y; %impulse response is the difference between the original
%and the one without the spikes
z(1) = x(1); %first peak was not taken out by rfindpeaks2
h = z/max(abs(z));
sp = deconvolve(v,h);
s = sp/max(abs(sp));
function f = c2is(M)
%c = c2is(M)
% performs c2i for multiple mic inputs
% M = matrix whose columns are sound vectors
% output is the dsa of each individually processed signal
[r c] = size(M); %r = # of rows, c = # of columns
if c>5,
warning('more than 5 sound vectors?');
end
C = zeros(r,c);
for i = 1:c,
s = c2i(M(:,i));
C(:,i) = s;
end
f = dsa2(C);
function f = dsa2(M)
% delay-sum-avg method
% m = matrix whose columns are sound vectors
[r c] = size(M); %r = # of rows, c = # of columns
if c>5,
warning('more than 5 sound vectors?');
end
S = M(:,1);
for i = 2:c,
T = M(:,i);
X = xcorr(S,T);
[x y] = max(X);
d = y-96000;
if d>0,
D = delay(S, d); %S is earlier than T if d>0
S = D+T;
elseif d<0,
D = delay(T,-d);
S = D+S;
else
S = S+T; %case for d=0
end
end
S = S/c;            %divide by number of columns
f = S/max(abs(S));  %normalize volume
function [m, old_cm, new_cm] = mpd2(s)
%m = mpd(s, delay) returns the processed signal
%[m, cm] = mpd(s, delay) returns the processed signal and
%the unprocessed cepstrum
%[m, cm, cm2] = mpd(s, delay)
%cm2 = processed cepstrum
%min phase echo removal method
%s = sound vector, delay = echo delay
[ap, mp] = allpass(s);
[cm nd] = cceps(mp);
old_cm = cm;
cm2 = rfindpeaks2(cm, 1000, 95000, 100,5);
new_cm = cm2;
mp2 = icceps(cm2, nd);
MP = fft(mp2);
AP = fft(ap);
Sf = MP.*AP;
s2 = ifft(Sf);
s2 = real(s2);
m = s2/max(abs(s2));
function f = mpds(M)
%f = mpds(M)
%takes in matrix M, whose columns are sound vectors
%takes the avg of the min phase components, transforms avg to cepstral
%domain and calls rfindpeaks2
%lines up and averages the all pass components
[r c] = size(M);
Mp = zeros(r,c);
Ap = zeros(r,c);
for i = 1:c,
[ap mp] = allpass(M(:,i));
Mp(:,i) = mp;
Ap(:,i) = ap;
end
amp = avg(Mp); %avg min phase
[cm nd] = cceps(amp);
cm2 = rfindpeaks2(cm, 1000, 95000, 75, 2);
amp2 = icceps(cm2, nd);
%%line up and average all pass components
S = Ap(:,1);
if c>1,
for i = 2:c,
T = Ap(:,i);
X = xcorr(S,T);
[x y] = max(X);
d = y-96000;
if d>0,
D = delay(S, d); %S is earlier than T if d>0
S = D+T;
elseif d<0,
D = delay(T,-d);
S = D+S;
else
S = S+T; %case for d=0
end
end
end
aap = S/c; %avg all pass component
%%reconstruct signal from aap and amp
MP = fft(amp2);
AP = fft(aap);
Sf = MP.*AP;
s2 = ifft(Sf);
s2 = real(s2);
f = s2/max(abs(s2));
function f = mpds2(M)
%f = mpds2(M)
%takes in matrix M, whose columns are sound vectors
%calls findpeaks2 on each cm, takes the average, and then convert back
%to the time domain
%lines up and averages the all pass components
[r c] = size(M);
Cm2 = zeros(r,c);
Ap = zeros(r,c);
Nd = zeros(1,c);
for i = 1:c,
[ap mp] = allpass(M(:,i));
[cm nd] = cceps(mp);
Nd(i) = nd;
cm2 = rfindpeaks2(cm, 1000, 95000, 100, 2);
Cm2(:,i) = cm2;
Ap(:,i) = ap;
end
acm2 = sum(Cm2,2)/c; %avg altered cepstral min phase
amp2 = icceps(acm2, median(Nd)); %inverse cepstrum
%%line up and average all pass components
S = Ap(:,1);
if c>1,
for i = 2:c,
T = Ap(:,i);
X = xcorr(S,T);
[x y] = max(X);
d = y-96000;
if d>0,
D = delay(S, d); %S is earlier than T if d>0
S = D+T;
elseif d<0,
D = delay(T,-d);
S = D+S;
else
S = S+T; %case for d=0
end
end
end
aap = S/c; %avg all pass component
%%reconstruct signal from aap and amp
MP = fft(amp2);
AP = fft(aap);
Sf = MP.*AP;
s2 = ifft(Sf);
s2 = real(s2);
f = s2/max(abs(s2)); %normalize to prevent clipping
function f = mpds3(M)
%f = mpds3(M)
%takes in matrix M, whose columns are sound vectors
%another variation of mpds
%findpeaks2 performed on individual cm's, but averaging is still done
%in the time domain
[r c] = size(M);
Mp2 = zeros(r,c);
Ap = zeros(r,c);
Nd = zeros(1,c);
for i = 1:c,
[ap mp] = allpass(M(:,i));
[cm nd] = cceps(mp);
cm2 = rfindpeaks2(cm, 1000, 95000, 100, 2);
mp2 = icceps(cm2,nd);
Mp2(:,i) = mp2;
Ap(:,i) = ap;
end
amp2 = sum(Mp2,2)/c; %avg altered min phase
%%line up and average all pass components
S = Ap(:,1);
if c>1,
for i = 2:c,
T = Ap(:,i);
X = xcorr(S,T);
[x y] = max(X);
d = y-96000;
if d>0,
D = delay(S, d); %S is earlier than T if d>0
S = D+T;
elseif d<0,
D = delay(T,-d);
S = D+S;
else
S = S+T; %case for d=0
end
end
end
aap = S/c; %avg all pass component
%%reconstruct signal from aap and amp
MP = fft(amp2);
AP = fft(aap);
Sf = MP.*AP;
s2 = ifft(Sf);
s2 = real(s2);
f = s2/max(abs(s2));
function m = scp(M)
% m = scp(M)
% takes in the matrix M, does DSA on the all-pass components, averages
% the min-phase components in the cepstral domain (no peak detection),
% and recombines
[r c] = size(M);
Mp = zeros(r,c);
Ap = zeros(r,c);
for i = 1:c,
[ap mp] = allpass(M(:,i));
Mp(:,i) = mp;
Ap(:,i) = ap;
end
Cm = zeros(r,c);
Nd = zeros(1,c);
for i=1:c,
[cm nd] = cceps(Mp(:,i));
Cm(:,i) = cm;
Nd(i) = nd;
end
acm = avg(Cm);                  %average the complex cepstrums
amp = icceps(acm, median(Nd));  %average min phase
%%line up and avg all pass component
S = Ap(:,1);
if c>1,
for i = 2:c,
T = Ap(:,i);
X = xcorr(S,T);
[x y] = max(X);
d = y-96000;
if d>0,
D = delay(S, d); %S is earlier than T if d>0
S = D+T;
elseif d<0,
D = delay(T,-d);
S = D+S;
else
S = S+T; %case for d=0
end
end
end
aap = S/c; %avg all pass component
MP = fft(amp);
AP = fft(aap);
Sf = MP.*AP;
s2 = ifft(Sf);
s2 = real(s2);
m = s2/max(abs(s2));
B.3.2 Test Functions
function f= test_c2i(path, template);
% test script for c2i
% path = output directory under d:\speech\{gina,murray}22\, excluding
the test_c2i part
% template = filename template
% ex: if input files are *tr5_m1.wav and output files are
*tr5_m1_c2i.wav
% then the template is tr5_m1
ms_results = [];
gy_results = [];
mkdir(['d:\speech\gina22\' path '\'],'test_c2i');
mkdir(['d:\speech\gina11\' path '\'],'test_c2i');
mkdir(['d:\speech\murray22\' path '\'],'test_c2i');
mkdir(['d:\speech\murray11\' path '\'],'test_c2i');
gy_path = ['d:\speech\gina22\' path '\test_c2i\'];
ms_path = ['d:\speech\murray22\' path '\test_c2i\'];
gy_path11 = ['d:\speech\gina11\' path '\test_c2i\'];
ms_path11 = ['d:\speech\murray11\' path '\test_c2i\'];
for i = 1:20
if (i~=9) & (i~=18), %skip 9 and 18
gy_file = ['S' sprintf('%i',i) 'gy22_' template '.wav'];
%disp(gy_file);
ms_file = ['S' sprintf('%i',i) 'ms22_' template '.wav'];
%disp(ms_file);
gy = wavread(gy_file);
ms = wavread(ms_file);
m1 = c2i(gy);
wavwrite(m1, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' template
'_c2i.wav']);
m2 = c2i(ms);
wavwrite(m2, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_' template
'_c2i.wav']);
%downsample for speech recognition
gy11 = resample(m1,1,2);
wavwrite(gy11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_'
template '_c2i.wav']);
ms11 = resample(m2,1,2);
wavwrite(ms11, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_'
template '_c2i.wav']);
% load originals
gy_orig = wavread(['S' sprintf('%i',i) 'gy22.wav']);
ms_orig = wavread(['S' sprintf('%i',i) 'ms22.wav']);
% normalize orignals
gy_orig2 = gy_orig/max(abs(gy_orig));
ms_orig2 = ms_orig/max(abs(ms_orig));
gy_mse = mse(gy_orig2, m1);
ms_mse = mse(ms_orig2, m2);
gy_snr = snr(gy_orig2, m1);
ms_snr = snr(ms_orig2, m2);
gy_results = [gy_results [gy_mse; gy_snr]];
ms_results = [ms_results [ms_mse; ms_snr]];
end
end
gy_avg_mse = mean(gy_results(1,:));
gy_avg_snr = mean(gy_results(2,:));
ms_avg_mse = mean(ms_results(1,:));
ms_avg_snr = mean(ms_results(2,:));
gy_fid = fopen(['d:\speech\test_results\gy_c2i_' template '.txt'],
'w');
fprintf(gy_fid, '%6.3f
%6.3f\n', gy_results);
fprintf(gy_fid, '%s\n', 'gy average mse');
fprintf(gy_fid, '%6.3f\n', gy_avg_mse);
fprintf(gy_fid, '%s\n', 'gy average snr');
fprintf(gy_fid, '%6.3f\n', gy_avg_snr);
fclose(gy_fid);
ms_fid = fopen(['d:\speech\test_results\ms_c2i_' template '.txt'],
'w');
fprintf(ms_fid, '%6.3f
%6.3f\n', ms_results);
fprintf(ms_fid, '%s\n', 'ms average mse');
fprintf(ms_fid, '%6.3f\n', ms_avg_mse);
fprintf(ms_fid, '%s\n', 'ms average snr');
fprintf(ms_fid, '%6.3f\n', ms_avg_snr);
fclose(ms_fid);
function f = test_c2is(room)
gy_mse_results = [];
gy_snr_results = [];
ms_mse_results = [];
ms_snr_results = [];
gy_path = ['d:\speech\gina22\' room '\test_multi\'];
ms_path = ['d:\speech\murray22\' room '\test_multi\'];
gy_path11 = ['d:\speech\gina11\' room '\test_multi\'];
ms_path11 = ['d:\speech\murray11\' room '\test_multi\'];
for i = 1:20
if (i~=9) & (i~=18), %skip 9 and 18
gy_file1 = ['S' sprintf('%i',i) 'gy22_' room '_m1.wav'];
gy_file2 = ['S' sprintf('%i',i) 'gy22_' room '_m2.wav'];
gy_file3 = ['S' sprintf('%i',i) 'gy22_' room '_m3.wav'];
ms_file1 = ['S' sprintf('%i',i) 'ms22_' room '_m1.wav'];
ms_file2 = ['S' sprintf('%i',i) 'ms22_' room '_m2.wav'];
ms_file3 = ['S' sprintf('%i',i) 'ms22_' room '_m3.wav'];
gy1 = wavread(gy_file1);
gy2 = wavread(gy_file2);
gy3 = wavread(gy_file3);
ms1 = wavread(ms_file1);
ms2 = wavread(ms_file2);
ms3 = wavread(ms_file3);
GY = [gy1 gy2 gy3];
MS = [ms1 ms2 ms3];
gy_c2is = c2is(GY);
wavwrite(gy_c2is, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_'
room '_c2is.wav']);
ms_c2is = c2is(MS);
wavwrite(ms_c2is, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_'
room '_c2is.wav']);
%resample for speech recognition
gy_c2is11 = resample(gy_c2is,1,2);
wavwrite(gy_c2is11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_'
room '_c2is.wav']);
ms_c2is11 = resample(ms_c2is,1,2);
wavwrite(ms_c2is11, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_'
room '_c2is.wav']);
% load originals
gy_orig = wavread(['S' sprintf('%i',i) 'gy22.wav']);
ms_orig = wavread(['S' sprintf('%i',i) 'ms22.wav']);
% normalize orignals
gy_orig2 = gy_orig/max(abs(gy_orig));
ms_orig2 = ms_orig/max(abs(ms_orig));
gy_mse_c2is = mse(gy_orig2, gy_c2is);
ms_mse_c2is = mse(ms_orig2, ms_c2is);
gy_snr_c2is = snr(gy_orig2, gy_c2is);
ms_snr_c2is = snr(ms_orig2, ms_c2is);
gy_mse_results = [gy_mse_results gy_mse_c2is]; %append new
results
gy_snr_results = [gy_snr_results gy_snr_c2is];
ms_mse_results = [ms_mse_results ms_mse_c2is];
ms_snr_results = [ms_snr_results ms_snr_c2is];
end
end
%get averages
gy_avg_mses = mean(gy_mse_results); %average across the row
gy_avg_snrs = mean(gy_snr_results);
ms_avg_mses = mean(ms_mse_results);
ms_avg_snrs = mean(ms_snr_results);
gy_fid = fopen(['d:\speech\test_results\gy_' room '_multi_c2is.txt'],
'w');
fprintf(gy_fid, '%s\n', 'MSE');
fprintf(gy_fid, '%s\n','c2is');
fprintf(gy_fid, '%6.3f\n', gy_mse_results);
fprintf(gy_fid, '%s\n', 'gy average mse');
fprintf(gy_fid, '%6.3f\n\n', gy_avg_mses);
fprintf(gy_fid, '%s\n','SNR');
fprintf(gy_fid, '%6.3f\n', gy_snr_results);
fprintf(gy_fid, '%s\n', 'gy average snr');
fprintf(gy_fid, '%6.3f\n', gy_avg_snrs);
fclose(gy_fid);
ms_fid = fopen(['d:\speech\test_results\ms_' room '_multi_c2is.txt'],
'w');
fprintf(ms_fid, '%s\n', 'MSE');
fprintf(ms_fid, '%s\n','c2is');
fprintf(ms_fid, '%6.3f\n', ms_mse_results);
fprintf(ms_fid, '%s\n', 'ms average mse');
fprintf(ms_fid, '%6.3f\n\n', ms_avg_mses);
fprintf(ms_fid, '%s\n', 'SNR');
fprintf(ms_fid, '%6.3f\n', ms_snr_results);
fprintf(ms_fid, '%s\n', 'ms average snr');
fprintf(ms_fid, '%6.3f\n', ms_avg_snrs);
fclose(ms_fid);
function f= test_mpd2(path, template);
% test script for mpd2
% path = output directory under d:\speech\{gina,murray}22\, excluding
% the test_mpd2 part
% template = filename template
% ex: if input files are *tr5_m1.wav and output files are
% *tr5_m1_mpd2.wav, then the template is tr5_m1
ms_results = [];
gy_results = [];
mkdir(['d:\speech\gina22\' path '\'],'test_mpd2');
mkdir(['d:\speech\gina11\' path '\'],'test_mpd2');
mkdir(['d:\speech\murray22\' path '\'],'test_mpd2');
mkdir(['d:\speech\murray11\' path '\'],'test_mpd2');
gy_path = ['d:\speech\gina22\' path '\test_mpd2\'];
ms_path = ['d:\speech\murray22\' path '\test_mpd2\'];
gy_path11 = ['d:\speech\gina11\' path '\test_mpd2\'];
ms_path11 = ['d:\speech\murray11\' path '\test_mpd2\'];
for i = 1:20
if (i~=9) & (i~=18), %skip 9 and 18
gy_file = ['S' sprintf('%i',i) 'gy22_' template '.wav'];
ms_file = ['S' sprintf('%i',i) 'ms22_' template '.wav'];
gy = wavread(gy_file);
ms = wavread(ms_file);
[m1 c1 d1] = mpd2(gy);
wavwrite(m1, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' template
'_mpd2.wav']);
[m2 c2 d2] = mpd2(ms);
wavwrite(m2, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_' template
'_mpd2.wav']);
%downsample for speech recognition
gy11 = resample(m1,1,2);
wavwrite(gy11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_'
template '_mpd2.wav']);
ms11 = resample(m2,1,2);
wavwrite(ms11, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_'
template '_mpd2.wav']);
% load originals
gy_orig = wavread(['S' sprintf('%i',i) 'gy22.wav']);
ms_orig = wavread(['S' sprintf('%i',i) 'ms22.wav']);
% normalize orignals
gy_orig2 = gy_orig/max(abs(gy_orig));
ms_orig2 = ms_orig/max(abs(ms_orig));
gy_mse = mse(gy_orig2, m1);
ms_mse = mse(ms_orig2, m2);
gy_snr = snr(gy_orig2, m1);
ms_snr = snr(ms_orig2, m2);
gy_results = [gy_results [gy_mse; gy_snr]];
ms_results = [ms_results [ms_mse; ms_snr]];
end
end
gy_avg_mse = mean(gy_results(1,:));
gy_avg_snr = mean(gy_results(2,:));
ms_avg_mse = mean(ms_results(1,:));
ms_avg_snr = mean(ms_results(2,:));
%disp(transpose(gy_results));
disp(gy_avg_mse);
disp(gy_avg_snr);
%disp(transpose(ms_results));
disp(ms_avg_mse);
disp(ms_avg_snr);
gy_fid = fopen(['d:\speech\test_results\gy_mpd2_' template '.txt'],
'w');
fprintf(gy_fid, '%6.3f
%6.3f\n', gy_results);
fprintf(gy_fid, '%s\n', 'gy average mse');
fprintf(gy_fid, '%6.3f\n', gy_avg_mse);
fprintf(gy_fid, '%s\n', 'gy average snr');
fprintf(gy_fid, '%6.3f\n', gy_avg_snr);
fclose(gy_fid);
ms_fid = fopen(['d:\speech\test_results\ms_mpd2_' template '.txt'],
'w');
fprintf(ms_fid, '%6.3f
%6.3f\n', ms_results);
fprintf(ms_fid, '%s\n', 'ms average mse');
fprintf(ms_fid, '%6.3f\n', ms_avg_mse);
fprintf(ms_fid, '%s\n', 'ms average snr');
fprintf(ms_fid, '%6.3f\n', ms_avg_snr);
fclose(ms_fid);
function f = test_multi(room)
%room = name of room
% tests mpds, mpds2, mpds3, dsa2, scp
gy_mse_results = [];
gy_snr_results = [];
ms_mse_results = [];
ms_snr_results = [];
gy_path = ['d:\speech\gina22\' room '\test_multi\'];
ms_path = ['d:\speech\murray22\' room '\test_multi\'];
gy_path11 = ['d:\speech\gina11\' room '\test_multi\'];
ms_path11 = ['d:\speech\murray11\' room '\test_multi\'];
for i = 1:20
if (i~=9) & (i~=18), %skip 9 and 18
gy_file1 = ['S' sprintf('%i',i) 'gy22_' room '_m1.wav'];
gy_file2 = ['S' sprintf('%i',i) 'gy22_' room '_m2.wav'];
gy_file3 = ['S' sprintf('%i',i) 'gy22_' room '_m3.wav'];
ms_file1 = ['S' sprintf('%i',i) 'ms22_' room '_m1.wav'];
ms_file2 = ['S' sprintf('%i',i) 'ms22_' room '_m2.wav'];
ms_file3 = ['S' sprintf('%i',i) 'ms22_' room '_m3.wav'];
gy1 = wavread(gy_file1);
gy2 = wavread(gy_file2);
gy3 = wavread(gy_file3);
ms1 = wavread(ms_file1);
ms2 = wavread(ms_file2);
ms3 = wavread(ms_file3);
GY = [gy1 gy2 gy3];
MS = [ms1 ms2 ms3];
gy_mpds = mpds(GY);
wavwrite(gy_mpds, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_'
room '_mpds.wav']);
gy_mpds2 = mpds2(GY);
wavwrite(gy_mpds2, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_'
room '_mpds2.wav']);
gy_mpds3 = mpds3(GY);
wavwrite(gy_mpds3, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_'
room '_mpds3.wav']);
gy_dsa2 = dsa2(GY);
wavwrite(gy_dsa2, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_'
room '_dsa2.wav']);
gy_scp = scp(GY);
wavwrite(gy_scp, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' room
'_scp.wav']);
ms_mpds = mpds(MS);
wavwrite(ms_mpds, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_'
room '_mpds.wav']);
ms_mpds2 = mpds2(MS);
wavwrite(ms_mpds2, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_'
room '_mpds2.wav']);
ms_mpds3 = mpds3(MS);
wavwrite(ms_mpds3, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_'
room '_mpds3.wav']);
ms_dsa2 = dsa2(MS);
wavwrite(ms_dsa2, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_'
room '_dsa2.wav']);
ms_scp = scp(MS);
wavwrite(ms_scp, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_' room
'_scp.wav']);
%resample for speech recognition
gy_mpds11 = resample(gy_mpds, 1,2);
wavwrite(gy_mpds11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_'
room '_mpds.wav']);
gy_mpds211 = resample(gy_mpds2, 1,2);
wavwrite(gy_mpds211, 11025, [gy_path11 'S' sprintf('%i',i)
'gy11_' room '_mpds2.wav']);
gy_mpds311 = resample(gy_mpds3, 1,2);
wavwrite(gy_mpds311, 11025, [gy_path11 'S' sprintf('%i',i)
'gy11_' room '_mpds3.wav']);
gy_dsa211 = resample(gy_dsa2, 1,2);
wavwrite(gy_dsa211, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_'
room '_dsa2.wav']);
gy_scp11 = resample(gy_scp, 1,2);
wavwrite(gy_scp11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_'
room '_scp.wav']);
ms_mpds11 = resample(ms_mpds, 1,2);
wavwrite(ms_mpds11, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_'
room '_mpds.wav']);
ms_mpds211 = resample(ms_mpds2, 1,2);
wavwrite(ms_mpds211, 11025, [ms_path11 'S' sprintf('%i',i)
'ms11_' room '_mpds2.wav']);
ms_mpds311 = resample(ms_mpds3, 1,2);
wavwrite(ms_mpds311, 11025, [ms_path11 'S' sprintf('%i',i)
'ms11_' room '_mpds3.wav']);
ms_dsa211 = resample(ms_dsa2, 1,2);
wavwrite(ms_dsa211, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_'
room '_dsa2.wav']);
ms_scp11 = resample(ms_scp, 1,2);
wavwrite(ms_scp11, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_'
room '_scp.wav']);
% load originals
gy_orig = wavread(['S' sprintf('%i',i) 'gy22.wav']);
ms_orig = wavread(['S' sprintf('%i',i) 'ms22.wav']);
% normalize orignals
gy_orig2 = gy_orig/max(abs(gy_orig));
ms_orig2 = ms_orig/max(abs(ms_orig));
gy_mse_mpds = mse(gy_orig2, gy_mpds);
gy_mse_mpds2 = mse(gy_orig2, gy_mpds2);
gy_mse_mpds3 = mse(gy_orig2, gy_mpds3);
gy_mse_dsa2 = mse(gy_orig2, gy_dsa2);
gy_mse_scp = mse(gy_orig2, gy_scp);
%column vector of gy_mse_*
gy_mse_v = [gy_mse_mpds; gy_mse_mpds2; gy_mse_mpds3; gy_mse_dsa2;
gy_mse_scp];
ms_mse_mpds = mse(ms_orig2, ms_mpds);
ms_mse_mpds2 = mse(ms_orig2, ms_mpds2);
ms_mse_mpds3 = mse(ms_orig2, ms_mpds3);
ms_mse_dsa2 = mse(ms_orig2, ms_dsa2);
ms_mse_scp = mse(ms_orig2, ms_scp);
ms_mse_v = [ms_mse_mpds; ms_mse_mpds2; ms_mse_mpds3; ms_mse_dsa2;
ms_mse_scp];
gy_snr_mpds = snr(gy_orig2, gy_mpds);
gy_snr_mpds2 = snr(gy_orig2, gy_mpds2);
gy_snr_mpds3 = snr(gy_orig2, gy_mpds3);
gy_snr_dsa2 = snr(gy_orig2, gy_dsa2);
gy_snr_scp = snr(gy_orig2, gy_scp);
gy_snr_v = [gy_snr_mpds; gy_snr_mpds2; gy_snr_mpds3; gy_snr_dsa2;
gy_snr_scp];
ms_snr_mpds = snr(ms_orig2, ms_mpds);
ms_snr_mpds2 = snr(ms_orig2, ms_mpds2);
ms_snr_mpds3 = snr(ms_orig2, ms_mpds3);
ms_snr_dsa2 = snr(ms_orig2, ms_dsa2);
ms_snr_scp = snr(ms_orig2, ms_scp);
ms_snr_v = [ms_snr_mpds; ms_snr_mpds2; ms_snr_mpds3; ms_snr_dsa2;
ms_snr_scp];
gy_mse_results = [gy_mse_results gy_mse_v]; %append new results
gy_snr_results = [gy_snr_results gy_snr_v];
ms_mse_results = [ms_mse_results ms_mse_v];
ms_snr_results = [ms_snr_results ms_snr_v];
end
end
%get averages
gy_avg_mses = zeros(5,1);
gy_avg_snrs = zeros(5,1);
ms_avg_mses = zeros(5,1);
ms_avg_snrs = zeros(5,1);
for i = 1:5
gy_avg_mses(i) = mean(gy_mse_results(i,:)); %average across the row
gy_avg_snrs(i) = mean(gy_snr_results(i,:));
ms_avg_mses(i) = mean(ms_mse_results(i,:));
ms_avg_snrs(i) = mean(ms_snr_results(i,:));
end
gy_fid = fopen(['d:\speech\test_results\gy_' room '_multi.txt'], 'w');
fprintf(gy_fid, '%s\n', 'MSE');
fprintf(gy_fid, '%s\n', 'mpds    mpds2    mpds3    dsa2    scp');
fprintf(gy_fid, '%6.3f  %6.3f  %6.3f  %6.3f  %6.3f\n', gy_mse_results);
fprintf(gy_fid, '%s\n', 'gy average mse');
fprintf(gy_fid, '%6.3f  %6.3f  %6.3f  %6.3f  %6.3f\n\n', gy_avg_mses);
fprintf(gy_fid, '%6.3f  %6.3f  %6.3f  %6.3f  %6.3f\n', gy_snr_results);
fprintf(gy_fid, '%s\n', 'gy average snr');
fprintf(gy_fid, '%6.3f  %6.3f  %6.3f  %6.3f  %6.3f\n', gy_avg_snrs);
fclose(gy_fid);

ms_fid = fopen(['d:\speech\test_results\ms_' room '_multi.txt'], 'w');
fprintf(ms_fid, '%s\n', 'MSE');
fprintf(ms_fid, '%s\n', 'mpds    mpds2    mpds3    dsa2    scp');
fprintf(ms_fid, '%6.3f  %6.3f  %6.3f  %6.3f  %6.3f\n', ms_mse_results);
fprintf(ms_fid, '%s\n', 'ms average mse');
fprintf(ms_fid, '%6.3f  %6.3f  %6.3f  %6.3f  %6.3f\n\n', ms_avg_mses);
fprintf(ms_fid, '%6.3f  %6.3f  %6.3f  %6.3f  %6.3f\n', ms_snr_results);
fprintf(ms_fid, '%s\n', 'ms average snr');
fprintf(ms_fid, '%6.3f  %6.3f  %6.3f  %6.3f  %6.3f\n', ms_avg_snrs);
fclose(ms_fid);
function f= test_simple(alpha, delay)
%test simple echo signals
ms_results = [];
gy_results = [];
version = [sprintf('%2.2f', alpha) '_' sprintf('%i',delay)];
mkdir('d:\speech\gina22\simple\', version);
mkdir('d:\speech\murray22\simple\', version);
mkdir('d:\speech\gina11\simple\', version);
mkdir('d:\speech\murray11\simple\', version);
gy_path = ['d:\speech\gina22\simple\' version '\'];
%disp(gy_path);
ms_path = ['d:\speech\murray22\simple\' version '\'];
gy_path11 = ['d:\speech\gina11\simple\' version '\'];
ms_path11 = ['d:\speech\murray11\simple\' version '\'];
for i = 1:20
if (i~=9) & (i~=18), %skip 9 and 18
% load originals
gy_orig = wavread(['S' sprintf('%i',i) 'gy22.wav']);
ms_orig = wavread(['S' sprintf('%i',i) 'ms22.wav']);
% normalize orignals
gy_orig2 = gy_orig/max(abs(gy_orig));
ms_orig2 = ms_orig/max(abs(ms_orig));
%add echoes
gy = addecho2wav(gy_orig, alpha, delay);
ms = addecho2wav(ms_orig, alpha, delay);
%normalize
gy2 = gy/max(abs(gy));
wavwrite(gy2, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' version
'.wav']);
ms2 = ms/max(abs(ms));
wavwrite(ms2, 22050, [ms_path, 'S' sprintf('%i',i) 'ms22_'
version '.wav']);
%downsample for speech recognition
gy11 = resample(gy2, 1,2);
wavwrite(gy11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_'
version '.wav']);
ms11 = resample(ms2, 1,2);
wavwrite(ms11, 11025, [ms_path11, 'S' sprintf('%i',i) 'ms11_'
version '.wav']);
%find mse's
gy_mse = mse(gy_orig2, gy2);
ms_mse = mse(ms_orig2, ms2);
%find snr's
gy_snr = snr(gy_orig2, gy2);
ms_snr = snr(ms_orig2, ms2);
disp(ms_snr);
gy_results = [gy_results [gy_mse; gy_snr]];
ms_results = [ms_results [ms_mse; ms_snr]];
end
end
gy_avg_mse = mean(gy_results(1,:));
gy_avg_snr = mean(gy_results(2,:));
ms_avg_mse = mean(ms_results(1,:));
ms_avg_snr = mean(ms_results(2,:));
gy_fid = fopen(['d:\speech\test_results\gy_simple' version '.txt'],
'w');
fprintf(gy_fid, '%6.3f
%6.3f\n', gy_results);
fprintf(gy_fid, '%s\n', 'gy average mse');
fprintf(gy_fid, '%6.3f\n', gy_avg_mse);
fprintf(gy_fid, '%s\n', 'gy average snr');
fprintf(gy_fid, '%6.3f\n', gy_avg_snr);
fclose(gy_fid);
ms_fid = fopen(['d:\speech\test_results\ms_simple' version '.txt'],
'w');
fprintf(ms_fid, '%6.3f
%6.3f\n', ms_results);
fprintf(ms_fid, '%s\n', 'ms average mse');
fprintf(ms_fid, '%6.3f\n', ms_avg_mse);
fprintf(ms_fid, '%s\n', 'ms average snr');
fprintf(ms_fid, '%6.3f\n', ms_avg_snr);
fclose(ms_fid);
B.3.3 Support Functions
function f = addecho2wav(A, alpha, delay)
%f = addecho2wav(A, alpha, delay)
% A = wave vector
% alpha = row vector of attenuations
% delay = row vector of delays
sa = size(alpha);
sd = size(delay);
if ~(sa(1) == sd(1)) | ~(sa(2) ==sd(2)),
error('attenuation and delay vectors are not the same size');
end
n = sa(2); %number of echoes to add
temp = transpose(A);
for i=1:n,
    temp = temp + alpha(i)*[zeros(1, delay(i)) transpose(A(1:96000-delay(i)))];
end
f = transpose(temp);
function [ap, mp] = allpass(a);
%ap = allpass(A) is the all pass component of A
%[ap, mp] = allpass(A) give both the all pass
%and the min phase components
[y, ym] = rceps(a); %ym = min phase component of a
A = fft(a);
Ym = fft(ym);
Ap = A./Ym;
%Fourier transform of the all pass component
ap = ifft(Ap);
mp = ym;
function s = deconvolve(x,h)
X = fft(x,192000);
H = fft(h,192000);
iH = 1./H;
if sum(iH) == Inf,
x2 = zeros(1, 192000);
x2(1:96000) = x;
s = deconv(x2,h);
else,
S = X./H;
s_long = real(ifft(S));
s = s_long(1:96000);
end
function D = delay(A, d)
% D = delay(A,d)
% shifts a vector A forward by d samples
% add d zeros in front, chops off tail
B = zeros(96000,1);
B(1:d) = 0;
B((d+1):96000) = A(1:(96000-d));
D = B;
function f=findpeaks2(A, b, e, N, alpha)
%f=findpeaks2(A, b, e, N, alpha)
%A = cepstral domain vector, b=begin, e=end, N=frame size
%finds large pos and neg spikes in A and zeros them out
chunks=floor((e-b)/N);
B = A(b:e);
M = zeros(3,2);
%get rid of positive peaks
[M(1,1) M(1,2)] = max(B(1:N));
[M(2,1) M(2,2)] = max(B(N+1:2*N));
for i=2:chunks-1,
[M(3,1) M(3,2)] = max(B((N*i+1):N*(i+1)));
if abs(M(2,1)) > alpha*abs(mean([M(1,1) M(3,1)]))
B((i-1)*N+M(2,2))=0;
%disp(M);
end
M(1,:) = M(2,:);
M(2,:) = M(3,:);
end
%get rid of neg peaks
[M(1,1) M(1,2)] = min(B(1:N));
[M(2,1) M(2,2)] = min(B(N+1:2*N));
for i=2:chunks-1,
[M(3,1) M(3,2)] = min(B((N*i+1):N*(i+1)));
if abs(M(2,1)) > alpha*abs(mean([M(1,1) M(3,1)]))
B((i-1)*N+M(2,2))=0;
%disp(M);
end
M(1,:) = M(2,:);
M(2,:) = M(3,:);
end
A(b:e)=B;
f=A;
function f = mixdown(room)
%f = mixdown(room)
%adds up the input from three microphones and divide by three to
prevent clipping
%sub paths
sgy_path = ['d:\speech\gina22\' room '\'];
sms_path = ['d:\speech\murray22\' room '\'];
sgy_path11 = ['d:\speech\gina11\' room '\'];
sms_path11 = ['d:\speech\murray11\' room '\'];
%output paths
gy_path = ['d:\speech\gina22\' room '\mixed\'];
ms_path = ['d:\speech\murray22\' room '\mixed\'];
gy_path11 = ['d:\speech\gina11\' room '\mixed\'];
ms_path11 = ['d:\speech\murray11\' room '\mixed\'];
for i = 1:20
gy_file1 = ['S' sprintf('%i',i) 'gy22_' room '_m1.wav'];
gy_file2 = ['S' sprintf('%i',i) 'gy22_' room '_m2.wav'];
gy_file3 = ['S' sprintf('%i',i) 'gy22_' room '_m3.wav'];
ms_file1 = ['S' sprintf('%i',i) 'ms22_' room '_m1.wav'];
ms_file2 = ['S' sprintf('%i',i) 'ms22_' room '_m2.wav'];
ms_file3 = ['S' sprintf('%i',i) 'ms22_' room '_m3.wav'];
gy1 = wavread(gy_file1);
gy2 = wavread(gy_file2);
gy3 = wavread(gy_file3);
ms1 = wavread(ms_file1);
ms2 = wavread(ms_file2);
ms3 = wavread(ms_file3);
gy = (gy1+gy2+gy3)/3;
ms = (ms1+ms2+ms3)/3;
wavwrite(gy, 22050, [gy_path 'S' sprintf('%i',i) 'gy22_' room
'.wav']);
wavwrite(ms, 22050, [ms_path 'S' sprintf('%i',i) 'ms22_' room
'.wav']);
%downsample for speech recognition
gy11 = resample(gy,1,2);
ms11= resample(ms,1,2);
wavwrite(gy11, 11025, [gy_path11 'S' sprintf('%i',i) 'gy11_' room
'.wav']);
wavwrite(ms11, 11025, [ms_path11 'S' sprintf('%i',i) 'ms11_' room
'.wav']);
end
function m = mse(X, Y)
D = X-Y;
D2 = D.*D;
m = sum(D2);
function f=rfindpeaks2(A, b, e, N, alpha)
%f=rfindpeaks2(A, b, e, N)
%A = cepstral domain vector, b=begin, e=end, N=frame size
%alpha = threshold level, smaller alpha means more stringent threshold
%recursive version of findpeaks2
%finds large pos and neg spikes in A and zeros them out
chunks=floor((e-b)/N);
B = A(b:e);
M = zeros(3,2);
%get rid of positive peaks
[M(1,1) M(1,2)] = max(B(1:N));
[M(2,1) M(2,2)] = max(B(N+1:2*N));
for i=2:chunks-1,
[M(3,1) M(3,2)] = max(B((N*i+1):N*(i+1)));
if abs(M(2,1)) > alpha*abs(mean([M(1,1) M(3,1)]))
B((i-1)*N+M(2,2))=0;
%disp(M);
end
M(1,:) = M(2,:);
M(2,:) = M(3,:);
end
%get rid of neg peaks
[M(1,1) M(1,2)] = min(B(1:N));
[M(2,1) M(2,2)] = min(B(N+1:2*N));
for i=2:chunks-1,
[M(3,1) M(3,2)] = min(B((N*i+1):N*(i+1)));
if abs(M(2,1)) > alpha*abs(mean([M(1,1) M(3,1)]))
B((i-1)*N+M(2,2))=0;
end
M(1,:) = M(2,:);
M(2,:) = M(3,:);
end
if sum(A(b:e) - B) ~=0,
A(b:e) = B;
f =rfindpeaks2(A,b,e,N,alpha);
else
A(b:e)=B;
f=A;
end
%call itself again
function r = snr(s, x)
%signal to noise ratio, where s = clean signal, and x = corrupted
signal
s = s-mean(s); %subtract DC component
x = x - mean(x);
S = sum(s.^2); %energy of s
X = sum(x.^2);
r = (S.^2)/(S.^2-X.^2);
function w=wavplay2(X);
%plays a wav file at a sampling rate of 22050
wavplay(X, 22050);
Appendix C
Tables of Results
C.1 Results for Simple Model
Table C-1: Female Subject’s Table of Results

alpha | delay | Algorithm | Wrong | Added | Missing | Total | Percent improvement
      |       | benchmark | 3  | 0  | 2  | 5  |
0.5   | 5513  | none      | 35 | 7  | 12 | 54 |
0.5   | 5513  | MPD2      | 5  | 1  | 3  | 9  | 83.33
0.5   | 5513  | C2I       | 8  | 0  | 5  | 13 | 75.93
0.5   | 11025 | none      | 39 | 20 | 9  | 68 |
0.5   | 11025 | MPD2      | 6  | 0  | 2  | 8  | 88.24
0.5   | 11025 | C2I       | 8  | 2  | 7  | 17 | 75.00
0.5   | 22050 | none      | 14 | 48 | 6  | 68 |
0.5   | 22050 | MPD2      | 5  | 2  | 3  | 10 | 85.29
0.5   | 22050 | C2I       | 6  | 11 | 5  | 22 | 67.65
0.25  | 11025 | none      | 15 | 20 | 5  | 40 |
0.25  | 11025 | MPD2      | 6  | 1  | 4  | 11 | 72.50
0.25  | 11025 | C2I       | 3  | 3  | 3  | 9  | 77.50
0.5   | 11025 | none      | 39 | 20 | 9  | 68 |
0.5   | 11025 | MPD2      | 6  | 0  | 2  | 8  | 88.24
0.5   | 11025 | C2I       | 8  | 2  | 7  | 17 | 75.00
0.75  | 11025 | none      | 39 | 26 | 9  | 74 |
0.75  | 11025 | MPD2      | 9  | 1  | 7  | 17 | 77.03
0.75  | 11025 | C2I       | 20 | 16 | 9  | 45 | 39.19
Table C-2: Male Subject’s Table for the Simple Model

alpha | delay | Algorithm | Wrong | Added | Missing | Total | Percent improvement
      |       | benchmark | 6  | 0  | 6  | 12 |
0.5   | 5513  | none      | 14 | 8  | 14 | 36 |
0.5   | 5513  | MPD2      | 8  | 1  | 9  | 18 | 75.00
0.5   | 5513  | C2I       | 7  | 2  | 9  | 18 | 75.00
0.5   | 11025 | none      | 39 | 20 | 9  | 68 |
0.5   | 11025 | MPD2      | 6  | 0  | 2  | 8  | 107.14
0.5   | 11025 | C2I       | 8  | 2  | 7  | 17 | 91.07
0.5   | 22050 | none      | 7  | 46 | 11 | 64 |
0.5   | 22050 | MPD2      | 8  | 2  | 12 | 22 | 80.77
0.5   | 22050 | C2I       | 7  | 23 | 10 | 40 | 46.15
0.25  | 11025 | none      | 13 | 17 | 9  | 39 |
0.25  | 11025 | MPD2      | 6  | 1  | 9  | 16 | 85.19
0.25  | 11025 | C2I       | 7  | 0  | 7  | 14 | 92.59
0.5   | 11025 | none      | 39 | 20 | 9  | 68 |
0.5   | 11025 | MPD2      | 6  | 0  | 2  | 8  | 107.14
0.5   | 11025 | C2I       | 8  | 2  | 7  | 17 | 91.07
0.75  | 11025 | none      | 40 | 18 | 10 | 68 |
0.75  | 11025 | MPD2      | 11 | 1  | 13 | 25 | 76.79
0.75  | 11025 | C2I       | 28 | 29 | 17 | 74 | -10.71
C.2 Tables for Complex Model Signals with One Microphone
Table C-3: Female Subject

Signal Environment | Algorithm | Wrong | Missing | Added | Total | Percent Improvement
benchmark        | N/A  | 3  | 0 | 2  | 5   |
Low Echo, m1     | none | 16 | 1 | 7  | 24  |
Low Echo, m1     | MPD2 | 10 | 0 | 5  | 15  | 47.37
Low Echo, m1     | C2I  | 9  | 4 | 0  | 13  | 57.89
Medium Echo, m1  | none | 22 | 1 | 22 | 45  |
Medium Echo, m1  | MPD2 | 26 | 1 | 20 | 47  | -5.00
Medium Echo, m1  | C2I  | 27 | 1 | 19 | 47  | -5.00
High Echo, m1    | none | 74 | 1 | 37 | 112 |
High Echo, m1    | MPD2 | 76 | 3 | 35 | 114 | -1.87
High Echo, m1    | C2I  | 76 | 3 | 35 | 114 | -1.87
Table C-4: Male Subject

Signal Environment | Algorithm | Wrong | Missing | Added | Total | Percent Improvement
benchmark        | N/A  | 6  | 0 | 6  | 12  |
Low Echo, m1     | none | 11 | 0 | 14 | 25  |
Low Echo, m1     | MPD2 | 12 | 0 | 16 | 28  | -23.08
Low Echo, m1     | C2I  | 12 | 0 | 16 | 28  | -23.08
Medium Echo, m1  | none | 34 | 0 | 38 | 72  |
Medium Echo, m1  | MPD2 | 35 | 0 | 37 | 72  | 0.00
Medium Echo, m1  | C2I  | 35 | 0 | 38 | 73  | -1.67
High Echo, m1    | none | 53 | 1 | 56 | 110 |
High Echo, m1    | MPD2 | 57 | 0 | 52 | 109 | 1.02
High Echo, m1    | C2I  | 55 | 0 | 54 | 109 | 1.02
C.3 Tables for Complex Signals with Three Microphones
Table C-5: Female Subject

Signal Environment | Algorithm | Wrong | Missing | Added | Total | Percent Improvement
benchmark    | N/A    | 3  | 0 | 2  | 5   |
Low Echo     | none   | 17 | 0 | 11 | 28  |
Low Echo     | DSA    | 13 | 0 | 6  | 19  | 39.13
Low Echo     | SCP    | 15 | 0 | 7  | 22  | 26.09
Low Echo     | MPDS   | 18 | 0 | 9  | 27  | 4.35
Low Echo     | MPDS2  | 17 | 0 | 8  | 25  | 13.04
Low Echo     | MPDS3  | 18 | 0 | 9  | 27  | 4.35
Low Echo     | C2Is   | 17 | 0 | 8  | 25  | 13.04
Medium Echo  | none   | 39 | 6 | 25 | 70  |
Medium Echo  | DSA    | 37 | 4 | 27 | 68  | 3.08
Medium Echo  | SCP    | 38 | 4 | 26 | 68  | 3.08
Medium Echo  | MPDS   | 39 | 4 | 25 | 68  | 3.08
Medium Echo  | MPDS2  | 37 | 4 | 25 | 66  | 6.15
Medium Echo  | MPDS3  | 38 | 4 | 25 | 67  | 4.62
Medium Echo  | C2Is   | 36 | 4 | 25 | 65  | 7.69
High Echo    | none   | 78 | 1 | 35 | 114 |
High Echo    | DSA    | 72 | 1 | 38 | 111 | 2.75
High Echo    | SCP    | 65 | 0 | 52 | 117 | -2.75
High Echo    | MPDS   | 59 | 0 | 58 | 117 | -2.75
High Echo    | MPDS2  | 61 | 0 | 53 | 114 | 0.00
High Echo    | MPDS3  | 63 | 0 | 53 | 116 | -1.83
High Echo    | C2Is   | 73 | 1 | 38 | 112 | 1.83
Table C-6: Male Subject

Signal Environment | Algorithm | Wrong | Missing | Added | Total | Percent Improvement
benchmark    | N/A    | 6  | 0 | 6  | 12  |
Low Echo     | none   | 16 | 0 | 19 | 35  |
Low Echo     | DSA    | 12 | 0 | 16 | 28  | 30.43
Low Echo     | SCP    | 12 | 1 | 17 | 30  | 21.74
Low Echo     | MPDS   | 15 | 0 | 17 | 32  | 13.04
Low Echo     | MPDS2  | 13 | 0 | 16 | 29  | 26.09
Low Echo     | MPDS3  | 14 | 0 | 18 | 32  | 13.04
Low Echo     | C2Is   | 12 | 0 | 16 | 28  | 30.43
Medium Echo  | none   | 41 | 0 | 35 | 76  |
Medium Echo  | DSA    | 38 | 1 | 33 | 72  | 6.25
Medium Echo  | SCP    | 41 | 0 | 37 | 78  | -3.13
Medium Echo  | MPDS   | 43 | 0 | 34 | 77  | -1.56
Medium Echo  | MPDS2  | 43 | 0 | 36 | 79  | -4.69
Medium Echo  | MPDS3  | 44 | 0 | 35 | 79  | -4.69
Medium Echo  | C2Is   | 41 | 1 | 34 | 76  | 0.00
High Echo    | none   | 53 | 2 | 56 | 111 |
High Echo    | DSA    | 50 | 0 | 62 | 112 | -1.01
High Echo    | SCP    | 63 | 0 | 50 | 113 | -2.02
High Echo    | MPDS   | 62 | 1 | 52 | 115 | -4.04
High Echo    | MPDS2  | 63 | 0 | 51 | 114 | -3.03
High Echo    | MPDS3  | 64 | 1 | 51 | 116 | -5.05
High Echo    | C2Is   | 69 | 0 | 43 | 112 | -1.01
C.4 Different Training Environments
Table C-7: “Clean” training environment

Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean              | N/A    | 3  | 0  | 2  | 5  |
simple echo, 1 mic | none   | 39 | 20 | 9  | 68 |
simple echo, 1 mic | MPD    | 6  | 0  | 2  | 8  | 95.24
simple echo, 1 mic | C2I    | 8  | 2  | 3  | 13 | 87.30
mult. echo, 1 mic  | none   | 16 | 1  | 7  | 24 |
mult. echo, 1 mic  | MPD    | 10 | 0  | 5  | 15 | 47.37
mult. echo, 1 mic  | C2I    | 9  | 4  | 0  | 13 | 57.89
mult. echo, 3 mic  | none   | 17 | 0  | 11 | 28 |
mult. echo, 3 mic  | DSA    | 13 | 0  | 6  | 19 | 39.13
mult. echo, 3 mic  | SCP    | 15 | 0  | 7  | 22 | 26.09
mult. echo, 3 mic  | MPDS   | 18 | 0  | 9  | 27 | 4.35
mult. echo, 3 mic  | MPDS2  | 17 | 0  | 8  | 25 | 13.04
mult. echo, 3 mic  | MPDS3  | 18 | 0  | 9  | 27 | 4.35
mult. echo, 3 mic  | C2Is   | 17 | 0  | 8  | 25 | 13.04
Table C-8: “Clean, mobile” training environment

Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean              | N/A    | 15 | 0  | 2  | 17 |
simple echo, 1 mic | none   | 40 | 18 | 8  | 66 |
simple echo, 1 mic | MPD    | 8  | 0  | 4  | 12 | 110.20
simple echo, 1 mic | C2I    | 16 | 2  | 5  | 23 | 87.76
mult. echo, 1 mic  | none   | 9  | 0  | 7  | 16 |
mult. echo, 1 mic  | MPD    | 10 | 0  | 8  | 18 | 200.00
mult. echo, 1 mic  | C2Is   | 9  | 0  | 8  | 17 | 100.00
mult. echo, 3 mic  | none   | 15 | 0  | 10 | 25 |
mult. echo, 3 mic  | DSA    | 11 | 0  | 7  | 18 | 87.50
mult. echo, 3 mic  | SCP    | 14 | 0  | 9  | 23 | 25.00
mult. echo, 3 mic  | MPDS   | 17 | 0  | 10 | 27 | -25.00
mult. echo, 3 mic  | MPDS2  | 17 | 0  | 10 | 27 | -25.00
mult. echo, 3 mic  | MPDS3  | 17 | 0  | 10 | 27 | -25.00
mult. echo, 3 mic  | C2Is   | 16 | 0  | 6  | 22 | 37.50
Table C-9: “Simple Echo, 1 Microphone” training environment

Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean              | N/A    | 11 | 0  | 4  | 15 |
simple echo, 1 mic | none   | 59 | 20 | 12 | 91 |
simple echo, 1 mic | MPD    | 23 | 1  | 5  | 29 | 81.58
simple echo, 1 mic | C2I    | 30 | 8  | 6  | 44 | 61.84
mult. echo, 1 mic  | none   | 22 | 1  | 10 | 33 |
mult. echo, 1 mic  | MPD    | 24 | 2  | 11 | 37 | -22.22
mult. echo, 1 mic  | C2Is   | 23 | 2  | 11 | 36 | -16.67
mult. echo, 3 mic  | none   | 38 | 1  | 20 | 59 |
mult. echo, 3 mic  | DSA    | 26 | 2  | 9  | 37 | 50.00
mult. echo, 3 mic  | SCP    | 25 | 3  | 8  | 36 | 52.27
mult. echo, 3 mic  | MPDS   | 27 | 1  | 16 | 44 | 34.09
mult. echo, 3 mic  | MPDS2  | 27 | 2  | 15 | 44 | 34.09
mult. echo, 3 mic  | MPDS3  | 30 | 3  | 14 | 47 | 27.27
mult. echo, 3 mic  | C2Is   | 26 | 3  | 10 | 39 | 45.45
Table C-10: “Simple echo, 1 Microphone, MPD2” training environment

Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean              | N/A    | 9  | 0  | 5  | 14 |
simple echo, 1 mic | none   | 50 | 15 | 15 | 80 |
simple echo, 1 mic | MPD    | 16 | 0  | 7  | 23 | 86.36
simple echo, 1 mic | C2I    | 23 | 1  | 9  | 33 | 71.21
mult. echo, 1 mic  | none   | 17 | 0  | 15 | 32 |
mult. echo, 1 mic  | MPD    | 16 | 0  | 10 | 26 | 33.33
mult. echo, 1 mic  | C2Is   | 16 | 0  | 11 | 27 | 27.78
mult. echo, 3 mic  | none   | 29 | 1  | 17 | 47 |
mult. echo, 3 mic  | DSA    | 18 | 0  | 7  | 25 | 66.67
mult. echo, 3 mic  | SCP    | 21 | 0  | 11 | 32 | 45.45
mult. echo, 3 mic  | MPDS   | 25 | 0  | 14 | 39 | 24.24
mult. echo, 3 mic  | MPDS2  | 22 | 0  | 12 | 34 | 39.39
mult. echo, 3 mic  | MPDS3  | 25 | 0  | 14 | 39 | 24.24
mult. echo, 3 mic  | C2Is   | 21 | 0  | 10 | 31 | 48.48
Table C-11: “Simple Echo, 1 Microphone, C2I” training environment

Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean              | N/A    | 8  | 0  | 3  | 11 |
simple echo, 1 mic | none   | 45 | 15 | 22 | 82 |
simple echo, 1 mic | MPD    | 13 | 0  | 4  | 17 | 91.55
simple echo, 1 mic | C2I    | 22 | 1  | 8  | 31 | 71.83
mult. echo, 1 mic  | none   | 19 | 1  | 11 | 31 |
mult. echo, 1 mic  | MPD    | 19 | 0  | 7  | 26 | 25.00
mult. echo, 1 mic  | C2Is   | 19 | 0  | 7  | 26 | 25.00
mult. echo, 3 mic  | none   | 29 | 2  | 12 | 43 |
mult. echo, 3 mic  | DSA    | 22 | 0  | 5  | 27 | 50.00
mult. echo, 3 mic  | SCP    | 33 | 0  | 23 | 56 | -40.63
mult. echo, 3 mic  | MPDS   | 30 | 1  | 19 | 50 | -21.88
mult. echo, 3 mic  | MPDS2  | 29 | 0  | 21 | 50 | -21.88
mult. echo, 3 mic  | MPDS3  | 29 | 1  | 20 | 50 | -21.88
mult. echo, 3 mic  | C2Is   | 30 | 1  | 17 | 48 | -15.63
Table C-12: “Multiple Echo, 1 Microphone” training environment

Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean              | N/A    | 17 | 0  | 6  | 23 |
simple echo, 1 mic | none   | 43 | 24 | 15 | 82 |
simple echo, 1 mic | MPD    | 19 | 0  | 7  | 26 | 94.92
simple echo, 1 mic | C2I    | 23 | 5  | 10 | 38 | 74.58
mult. echo, 1 mic  | none   | 16 | 0  | 8  | 24 |
mult. echo, 1 mic  | MPD    | 15 | 0  | 9  | 24 | 0.00
mult. echo, 1 mic  | C2Is   | 17 | 0  | 10 | 27 | -300.00
mult. echo, 3 mic  | none   | 22 | 2  | 9  | 33 |
mult. echo, 3 mic  | DSA    | 21 | 1  | 9  | 31 | 20.00
mult. echo, 3 mic  | SCP    | 20 | 1  | 7  | 28 | 50.00
mult. echo, 3 mic  | MPDS   | 20 | 1  | 8  | 29 | 40.00
mult. echo, 3 mic  | MPDS2  | 19 | 1  | 8  | 28 | 50.00
mult. echo, 3 mic  | MPDS3  | 21 | 1  | 8  | 30 | 30.00
mult. echo, 3 mic  | C2Is   | 19 | 1  | 8  | 28 | 50.00
Table C-13: “Multiple Echo, 1 Microphone, MPD2” training environment

Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean              | N/A    | 18 | 1  | 7  | 26 |
simple echo, 1 mic | none   | 59 | 24 | 16 | 99 |
simple echo, 1 mic | MPD    | 19 | 1  | 7  | 27 | 98.63
simple echo, 1 mic | C2I    | 25 | 5  | 11 | 41 | 79.45
mult. echo, 1 mic  | none   | 17 | 1  | 12 | 30 |
mult. echo, 1 mic  | MPD    | 18 | 1  | 10 | 29 | 25.00
mult. echo, 1 mic  | C2Is   | 17 | 1  | 11 | 29 | 25.00
mult. echo, 3 mic  | none   | 21 | 2  | 10 | 33 |
mult. echo, 3 mic  | DSA    | 18 | 3  | 10 | 31 | 28.57
mult. echo, 3 mic  | SCP    | 18 | 2  | 10 | 30 | 42.86
mult. echo, 3 mic  | MPDS   | 19 | 2  | 9  | 30 | 42.86
mult. echo, 3 mic  | MPDS2  | 17 | 2  | 9  | 28 | 71.43
mult. echo, 3 mic  | MPDS3  | 19 | 2  | 9  | 30 | 42.86
mult. echo, 3 mic  | C2Is   | 17 | 2  | 8  | 27 | 85.71
Table C-14: “Multiple Echo, 1 Microphone, C2I” training environment
Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean | N/A | 18 | 1 | 7 | 26 | --
simple echo, 1 mic | none | 60 | 24 | 19 | 103 | --
simple echo, 1 mic | MPD | 19 | 1 | 7 | 27 | 98.70
simple echo, 1 mic | C2I | 24 | 5 | 10 | 39 | 83.12
mult. echo, 1 mic | none | 19 | 2 | 10 | 31 | --
mult. echo, 1 mic | MPD | 17 | 1 | 10 | 28 | 60.00
mult. echo, 1 mic | C2Is | 17 | 1 | 10 | 28 | 60.00
mult. echo, 3 mic | none | 21 | 2 | 10 | 33 | --
mult. echo, 3 mic | DSA | 20 | 2 | 9 | 31 | 28.57
mult. echo, 3 mic | SCP | 16 | 2 | 10 | 28 | 71.43
mult. echo, 3 mic | MPDS | 18 | 2 | 9 | 29 | 57.14
mult. echo, 3 mic | MPDS2 | 16 | 2 | 9 | 27 | 85.71
mult. echo, 3 mic | MPDS3 | 18 | 2 | 9 | 29 | 57.14
mult. echo, 3 mic | C2Is | 18 | 2 | 9 | 29 | 57.14
Table C-15: “Multiple Echo, 3 Microphone” training environment
Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean | N/A | 17 | 0 | 6 | 23 | --
simple echo, 1 mic | none | 67 | 22 | 13 | 102 | --
simple echo, 1 mic | MPD | 21 | 1 | 6 | 28 | 93.67
simple echo, 1 mic | C2I | 27 | 4 | 9 | 40 | 78.48
mult. echo, 1 mic | none | 25 | 1 | 9 | 35 | --
mult. echo, 1 mic | MPD | 21 | 1 | 8 | 30 | 41.67
mult. echo, 1 mic | C2Is | 21 | 1 | 8 | 30 | 41.67
mult. echo, 3 mic | none | 23 | 2 | 11 | 36 | --
mult. echo, 3 mic | DSA | 22 | 1 | 10 | 33 | 23.08
mult. echo, 3 mic | SCP | 25 | 1 | 7 | 33 | 23.08
mult. echo, 3 mic | MPDS | 24 | 1 | 7 | 32 | 30.77
mult. echo, 3 mic | MPDS2 | 23 | 1 | 7 | 31 | 38.46
mult. echo, 3 mic | MPDS3 | 26 | 1 | 7 | 34 | 15.38
mult. echo, 3 mic | C2Is | 23 | 1 | 10 | 34 | 15.38
Table C-16: “Multiple Echo, 3 Microphone, C2Is” training environment
Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean | N/A | 13 | 1 | 6 | 20 | --
simple echo, 1 mic | none | 59 | 17 | 14 | 90 | --
simple echo, 1 mic | MPD | 19 | 1 | 7 | 27 | 90.00
simple echo, 1 mic | C2I | 28 | 4 | 11 | 43 | 67.14
mult. echo, 1 mic | none | 20 | 1 | 9 | 30 | --
mult. echo, 1 mic | MPD | 19 | 1 | 10 | 30 | 0.00
mult. echo, 1 mic | C2Is | 19 | 1 | 10 | 30 | 0.00
mult. echo, 3 mic | none | 24 | 2 | 10 | 36 | --
mult. echo, 3 mic | DSA | 19 | 1 | 11 | 31 | 31.25
mult. echo, 3 mic | SCP | 18 | 1 | 11 | 30 | 37.50
mult. echo, 3 mic | MPDS | 17 | 1 | 10 | 28 | 50.00
mult. echo, 3 mic | MPDS2 | 18 | 1 | 11 | 30 | 37.50
mult. echo, 3 mic | MPDS3 | 19 | 1 | 11 | 31 | 31.25
mult. echo, 3 mic | C2Is | 20 | 1 | 11 | 32 | 25.00
Table C-17: “Multiple Echo, 3 Microphone, DSA” training environment
Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean | N/A | 14 | 1 | 6 | 21 | --
simple echo, 1 mic | none | 56 | 24 | 16 | 96 | --
simple echo, 1 mic | MPD | 9 | 1 | 6 | 16 | 106.67
simple echo, 1 mic | C2I | 22 | 5 | 11 | 38 | 77.33
mult. echo, 1 mic | none | 16 | 1 | 10 | 27 | --
mult. echo, 1 mic | MPD | 14 | 1 | 9 | 24 | 50.00
mult. echo, 1 mic | C2Is | 16 | 1 | 9 | 26 | 16.67
mult. echo, 3 mic | none | 25 | 2 | 11 | 38 | --
mult. echo, 3 mic | DSA | 17 | 1 | 11 | 29 | 52.94
mult. echo, 3 mic | SCP | 15 | 1 | 11 | 27 | 64.71
mult. echo, 3 mic | MPDS | 16 | 1 | 12 | 29 | 52.94
mult. echo, 3 mic | MPDS2 | 16 | 1 | 12 | 29 | 52.94
mult. echo, 3 mic | MPDS3 | 16 | 1 | 12 | 29 | 52.94
mult. echo, 3 mic | C2Is | 16 | 1 | 12 | 29 | 52.94
Table C-18: “Multiple Echo, 3 Microphone, MPDs” training environment
Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean | N/A | 19 | 1 | 7 | 27 | --
simple echo, 1 mic | none | 54 | 17 | 16 | 87 | --
simple echo, 1 mic | MPD | 20 | 1 | 11 | 32 | 91.67
simple echo, 1 mic | C2I | 26 | 5 | 12 | 43 | 73.33
mult. echo, 1 mic | none | 23 | 1 | 11 | 35 | --
mult. echo, 1 mic | MPD | 23 | 1 | 11 | 35 | 0.00
mult. echo, 1 mic | C2Is | 23 | 1 | 11 | 35 | 0.00
mult. echo, 3 mic | none | 22 | 2 | 10 | 34 | --
mult. echo, 3 mic | DSA | 26 | 1 | 9 | 36 | -28.57
mult. echo, 3 mic | SCP | 22 | 1 | 8 | 31 | 42.86
mult. echo, 3 mic | MPDS | 26 | 1 | 10 | 37 | -42.86
mult. echo, 3 mic | MPDS2 | 22 | 1 | 8 | 31 | 42.86
mult. echo, 3 mic | MPDS3 | 22 | 1 | 8 | 31 | 42.86
mult. echo, 3 mic | C2Is | 23 | 1 | 9 | 33 | 14.29
Table C-19: “Multiple Echo, 3 Microphone, MPDs2” training environment
Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean | N/A | 15 | 1 | 9 | 25 | --
simple echo, 1 mic | none | 56 | 24 | 16 | 96 | --
simple echo, 1 mic | MPD | 25 | 1 | 9 | 35 | 85.92
simple echo, 1 mic | C2I | 30 | 4 | 11 | 45 | 71.83
mult. echo, 1 mic | none | 24 | 1 | 11 | 36 | --
mult. echo, 1 mic | MPD | 23 | 1 | 12 | 36 | 0.00
mult. echo, 1 mic | C2Is | 24 | 1 | 13 | 38 | -18.18
mult. echo, 3 mic | none | 28 | 1 | 11 | 40 | --
mult. echo, 3 mic | DSA | 26 | 2 | 10 | 38 | 13.33
mult. echo, 3 mic | SCP | 27 | 2 | 12 | 41 | -6.67
mult. echo, 3 mic | MPDS | 27 | 1 | 13 | 41 | -6.67
mult. echo, 3 mic | MPDS2 | 29 | 1 | 12 | 42 | -13.33
mult. echo, 3 mic | MPDS3 | 28 | 1 | 11 | 40 | 0.00
mult. echo, 3 mic | C2Is | 29 | 1 | 13 | 43 | -20.00
Table C-20: “Multiple Echo, 3 Microphone, SCP” training environment
Signal environment | Processing Algorithm | Wrong | Added | Missing | Total | Percent Improvement
clean | N/A | 18 | 1 | 6 | 25 | --
simple echo, 1 mic | none | 59 | 18 | 14 | 91 | --
simple echo, 1 mic | MPD | 22 | 1 | 11 | 34 | 86.36
simple echo, 1 mic | C2I | 34 | 1 | 12 | 47 | 66.67
mult. echo, 1 mic | none | 24 | 1 | 13 | 38 | --
mult. echo, 1 mic | MPD | 23 | 1 | 15 | 39 | -7.69
mult. echo, 1 mic | C2Is | 23 | 1 | 14 | 38 | 0.00
mult. echo, 3 mic | none | 29 | 3 | 8 | 40 | --
mult. echo, 3 mic | DSA | 27 | 2 | 11 | 40 | 0.00
mult. echo, 3 mic | SCP | 27 | 2 | 14 | 43 | -20.00
mult. echo, 3 mic | MPDS | 27 | 2 | 13 | 42 | -13.33
mult. echo, 3 mic | MPDS2 | 26 | 2 | 13 | 41 | -6.67
mult. echo, 3 mic | MPDS3 | 27 | 2 | 13 | 42 | -13.33
mult. echo, 3 mic | C2Is | 27 | 2 | 12 | 41 | -6.67
References
[1] T. G. Stockham, T. M. Cannon, and R. B. Ingebretsen, “Blind Deconvolution through
Digital Signal Processing”, Proceedings of the IEEE, v 63, n 4, pp. 678-692, Apr. 1975.
[2] A. P. Petropulu and S. Subramaniam, “Cepstrum Based Deconvolution for Speech
Dereverberation”, Proceedings - ICASSP, IEEE International Conference on Acoustics,
Speech and Signal Processing, pp. I-9-12, Apr. 1994.
[3] S. Affes and Y. Grenier, “A Signal Subspace Tracking Algorithm for Microphone
Array Processing of Speech”, IEEE Transactions on Speech and Audio Processing, v 5, n
5, pp. 425-437, Sept. 1997.
[4] M. A. Casey, W. G. Gardner, and S. Basu, “Vision Steered Beam-forming and
Transaural Rendering for the Artificial Life Interactive Environment (ALIVE)”,
Proceedings of the 99th Convention of the Audio Engineering Society (AES), 1995.
[5] P. Maes, T. Darrell, B. Blumberg, and A. Pentland, “The ALIVE System: Full-body
Interaction with Autonomous Agents”, Proceedings of the Computer Animation
Conference, Switzerland, IEEE Press, 1995.
[6] J. Flanagan, “Autodirective Sound Capture: Towards Smarter Conference Rooms”,
IEEE Intelligent Systems, March/April 1999.
[7] A. Westner, “Object-Based Audio Capture”, Master’s thesis, Media Arts and Sciences
Program at the Massachusetts Institute of Technology, Feb. 1999.
[8] P. Clarkson, Optimal and Adaptive Signal Processing, CRC Press, Inc., 1993.
[9] Q. Liu, B. Champagne, and P. Kabal, “Room Speech Dereverberation via Minimum
Phase and All-pass Component Processing of Multi-microphone Signals”, IEEE Pacific
Rim Conference on Communications, Computers, and Signal Processing – Proceedings,
pp. 571-574, May 1995.
[10] A. Oppenheim and R. Schafer, Discrete-Time Signal Processing, Prentice Hall,
1989.
[11] J. Allen and D. Berkley, “Image Method for Efficiently Simulating Small-room
Acoustics”, Journal of the Acoustical Society of America, v 65, n 4, pp. 943-950, Apr.
1979.
[12] B. P. Bogert, M. J. R. Healy, and J. W. Tukey, “The Quefrency Alanysis of Time
Series for Echoes: Cepstrum, Pseudo-Autocovariance, Cross-Cepstrum and Saphe
Cracking”, Proceedings of the Symposium on Time Series Analysis, Chapter 15, pp.
209-243, Wiley, New York, 1963.