A Visually-Guided Microphone Array for
Automatic Speech Transcription
by
Robert Eiichi Irie
S.B., Engineering Science
Harvard University (1993)
S.M., Electrical Engineering and Computer Science
Massachusetts Institute of Technology (1995)
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
DOCTOR of PHILOSOPHY
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2000
© 2000 Massachusetts Institute of Technology
All rights reserved
Signature of Author
Department of Electrical Engineering and Computer Science
October 2, 2000
Certified by
Rodney A. Brooks
Fujitsu Professor of Computer Science and Engineering
Thesis Supervisor
Accepted by
Arthur C. Smith
Chairman, Departmental Committee on Graduate Studies
A Visually-Guided Microphone Array for Speech Transcription
by
Robert E. Irie
Submitted to the Department of Electrical Engineering and Computer Science on
October 6, 2000, in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy in Electrical Engineering and Computer Science
ABSTRACT
An integrated, modular real-time microphone array system has been
implemented to detect, track and extract speech from a person in a
realistic office environment. Multimodal integration, whereby audio
and visual information are used together to detect and track the
speaker, is examined to determine comparative advantages over
unimodal processing. An extensive quantitative comparison is also
performed on a number of system variables (linear/compound arrays,
interpolation, audio/visual tracking, etc.) to determine the system
configuration that represents the best compromise between
performance, robustness, and complexity. Given a fixed number of
microphone elements, the compound array, with a broader frequency
response but a coarser spatial resolution, has been determined to have
a slight performance advantage in the currently implemented system
over the linear array.
Thesis Supervisor: Rodney A. Brooks
Title: Fujitsu Professor of Computer Science and Engineering
Acknowledgments
I would like to thank my advisor, Prof. Rodney Brooks, for giving me the freedom to pursue my own avenues of research, and for being patient when some paths turned out to be dead ends. Under his tutelage I have learned to think independently and to motivate myself. I would also like to thank the rest of my committee, who have helped
with technical as well as overall advice.
Being part of the Cog Group was always an enriching and exciting experience. I
have had numerous thought-provoking conversations with everyone, but especially with
Charlie Kemp, Cynthia Ferrell, and Matthew Marjanovic. I must also thank Brian, Juan,
Naoki, Junji, Takanori, Kazuyoshi, and all other members, past and present.
Finally, I would like to thank Shiho Kobayashi, the warmest and most stimulating
person that I have had the good fortune to meet. She has been very supportive during times
of much stress and confusion and I owe the completion of this dissertation to her.
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND
2.1 MICROPHONE ARRAYS
2.1.1 Source Location
2.1.2 Sound Acquisition
2.2 MULTIMODAL INTEGRATION
2.3 PROBLEM DEFINITION
CHAPTER 3 DESIGN
3.1 MICROPHONE ARRAY CONFIGURATION
3.1.1 Array Response
3.1.2 Spatial Aliasing
3.1.3 Beamwidth Variations
3.1.4 Sensor Placement and Beam Patterns
3.2 BEAM GUIDING
3.2.1 Detection
3.2.2 Tracking
CHAPTER 4 IMPLEMENTATION
4.1 GENERAL SYSTEM OVERVIEW
4.1.1 Hardware Components
4.1.2 Software Components
4.1.3 System Architecture
4.2 AUDIO PROCESSING
4.2.1 Beamformer
4.2.2 Audio Localizer
4.3 VISUAL PROCESSING
4.4 TRACKER
CHAPTER 5 PROCEDURE
5.1 EXPERIMENTAL VARIABLES
5.1.1 System Configuration
5.1.2 Trial Condition
5.2 MEASUREMENTS
5.2.1 SNR
5.2.2 Position Estimates
5.2.3 Word Error Rate
5.2.4 Method and Controls
5.3 EXPERIMENTAL SETUP
CHAPTER 6 RESULTS
6.1 INTRODUCTION
6.2 STATIC CONDITION
6.2.1 Signal-to-Noise Ratio
6.2.2 Localization Output
6.2.3 Tracker Output
6.2.4 WER Data
6.2.5 Summary
6.3 DYNAMIC CONDITION
6.3.1 Tracker Output
6.3.2 WER Data
6.3.3 Summary
6.4 OVERALL SUMMARY
6.5 ADDITIONAL/FUTURE WORK
CHAPTER 7 CONCLUSION
APPENDIX A SPEECH SPECTROGRAMS
A.1 CONTROLS
A.2 SINGLE ARRAY CONFIGURATIONS
A.3 MULTIPLE ARRAY CONFIGURATIONS
APPENDIX B SPEECH SETS AND SAMPLE RESULTS
B.1 ACTUAL TEXT
B.1.1 Trained Set
B.1.2 Untrained Set
B.2 HEADSET (CLOSE) MICROPHONE DATA SET
B.2.1 Trained Set Results
B.2.2 Untrained Set Results
B.3 SINGLE ELEMENT
B.3.1 Trained Set
B.3.2 Untrained Set
B.4 LINEAR ARRAY, ON-BEAM ANGLE=0
B.4.1 Trained Set
B.4.2 Untrained Set
B.5 MULTIARRAY, ON-BEAM ANGLE=0
B.5.1 Trained Set
B.5.2 Untrained Set
APPENDIX C 16 ELEMENT ARRAY DESIGN
REFERENCES
Chapter 1 Introduction
As computation becomes more powerful and less expensive, there have been
efforts to make the workspace environment more intelligent and the interaction between humans
and computer systems more natural [1]. One of the most natural means of communication is
speech, and a common task is to transcribe a person's dictated speech. Speech recognition
technology has progressed sufficiently that it is now possible to automate transcription of dictation
with a reasonable degree of accuracy using commercial solutions [2].
One particular scenario under consideration is an intelligent examination room, where it is
desirable for a physician to make an oral examination report of a patient. Current systems require
the physician to be seated by the transcription device (either a computer or a phone) or to carry a
wireless microphone that is cumbersome and requires periodic maintenance. One possible solution
is to embed in the physical room an intelligent system that is able to track and capture the
physician's speech and send it to the speech recognition software for transcription. An advantage of
this solution is that it requires no extra or special action to be performed by the physician.
The disadvantage is the added complexity and the additional sources of error in the speech
recognition process. Allowing the speaker to roam around the office freely forces the system to
handle background acoustic noise and to take into account his/her motion; current speech
recognition technology requires the use of a microphone that is placed close to the speaker to avoid
such issues. To counteract noise sources and to localize sound capture over a specified spatial
region, arrays of microphone elements are often used. By appropriately delaying and summing the
output of multiple microphones, signals coming from a desired direction are coherently combined
and have improved signal to noise ratio (SNR) while signals from other directions are incoherently
combined and attenuated. One can imagine a beam of spatial selectivity that can be digitally
formed and steered by adjusting delays.
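As a rough illustration of this delay-and-sum idea (a minimal sketch, not the DSP implementation described later in this thesis; the array geometry, sampling rate, and integral-sample delays below are assumptions chosen for the example):

    import numpy as np

    def delay_and_sum(channels, fs, spacing_m, steer_deg, c=343.0):
        """Steer a uniform linear array by delaying each channel and averaging.

        channels: (num_mics, num_samples) array of simultaneously sampled signals.
        Delays are rounded to whole samples; wrap-around edge effects are ignored.
        """
        num_mics, num_samples = channels.shape
        tau = spacing_m * np.sin(np.radians(steer_deg)) / c   # per-element delay, seconds
        out = np.zeros(num_samples)
        for m in range(num_mics):
            shift = int(round(m * tau * fs))                  # integral sample delay for mic m
            out += np.roll(channels[m], -shift)
        return out / num_mics                                 # normalize by the element count

    # Example: 8 microphones, 6 cm spacing, 22.05 kHz sampling, beam steered to 15 degrees.
    mics = np.random.default_rng(0).standard_normal((8, 22050))
    beam = delay_and_sum(mics, fs=22050, spacing_m=0.06, steer_deg=15.0)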
The final requirement for a microphone array is the automatic steering of the beam. Most
arrays use audio-only techniques, which are either computationally expensive or prone to errors
induced by acoustic noise. We introduce an additional modality, visual information, to guide the
beam to the desired location. Our hypothesis is that a multimodal sensor system will be able to
track people in a noisy, realistic environment and transcribe their speech with better performance
and robustness than a unimodal system. We also seek to determine if such an integration of
modalities will allow simpler, less computationally intensive components to be used in real time.
The organization of the rest of this thesis is as follows: Chapter 2 provides background
information including past work on microphone arrays and multimodal integration. It includes a
more thorough formulation of the problem and the expected contributions of this project. Chapter
3 outlines major design considerations and the proposed solutions. Chapter 4 discusses actual
implementation details and issues and describes the currently implemented system. Chapter 5
outlines the experimental procedures used to test the performance of the system based on several
well-defined controls. Chapter 6 presents the results of the experiments, as well as an analysis of
the relative merits of various system parameters. Finally, Chapter 7 concludes with a discussion of
the impact of this project and extensions for future work.
Chapter 2 Background
2.1 Microphone Arrays
Array signal processing is a well-developed field, and much of the theoretical foundation
of microphone arrays and target trackers is based on narrowband radar signal processing [3].
Electronically steered microphone array systems have been extensively developed since the early
1980s. They range from simple linear arrays with tens of sensors to complex two and three
dimensional systems with hundreds of elements [4]. Regardless of size and complexity, all
sound/speech capturing systems need to perform two basic functions, locating the sound source of
interest and then acquiring the actual sound signal [5].
2.1.1 Source Location
Most sound source location methods fall into one of two categories, time delay of arrival
(TDOA) estimation and power scanning. The former determines source direction by estimating the
time delay of signals arriving at two or more elements; microphones located closer to the sound
source will receive the signal before those farther away. TDOA estimation provides accurate
estimates of source location, but is sensitive to reflections and multiple sound sources. In this
project a simple TDOA estimator will be supplemented with a visual localization system to provide
robust source location estimates. Power scanning usually involves forming multiple beams that are
spatially fixed; the beam with the highest energy output is then selected [6]. While power scanning
is conceptually simple and easy to implement, it requires huge amounts of computation for all but
the coarsest of spatial resolution, as every possible spatial location of interest must be represented
by its own beam.
2.1.2 Sound Acquisition
The classical method for sound acquisition is the delay-and-sum beamformer, and will be
discussed in depth in Section 3.1. Numerous modifications of this basic method have been
proposed and include matched filtering, reflection subtraction, and adaptive filtering. All these
methods attempt to improve performance by more actively handling various acoustic noise sources
such as reverberation and interfering signals. They rely on noise modeling and require simplified
assumptions of the noise source and acoustic enclosure (i.e., the room) [5].
2.2 Multimodal Integration
Visually guided beamforming has been examined before; Bub et al. use a linear non-adaptive array of 15 elements and a detection-based source location scheme (refer to Section 3.2.2) [7]. Using a gating mechanism, either visual or sound localization information was used to
guide the beam, but not both simultaneously. It was shown that recognition rates for a single
speaker in background and competing noise were significantly higher for the visual localization
case.
Vision and audition are complementary modalities that together provide a richer sensory
space than either alone. Fundamental differences in respective signal source and transmission
characteristics between the two modalities account for their complementary nature. In audition, the
information source (the audible object or event) and signal (sound wave) source are often one and
the same, whereas in vision the signal (light) source is usually separate from the information source
(the visible object). Furthermore, most visible objects of interest are spatially localized and are
relatively static, while perceived sounds are usually the result of transient changes and are thus
more dynamic in nature and require more care in temporal processing [8].
Noise sources in one domain can be more easily handled or filtered in the other domain.
For example, while audio-based detection routines are sensitive to sound reflections, visual
routines are unaffected. Also, advantages in one modality can overcome deficiencies in others.
Visual localization can be precise, but is limited by camera optics to the field of view. Sound
localization in general provides much coarser spatial resolution, but is useful in a larger spatial
region.
Research in machine vision and audition has progressed enough separately that the integration of the two modalities is now being examined, though most such integration still
involves limited, task-specific applications [9]. Most integration work being performed has been in
the context of human-computer interaction (HCI), which seeks to provide more natural interfaces
to computers by emulating basic human communication and interaction, including speech, hand
gestures, facial movement, etc. In particular, substantial work has been done in using visual cues to
improve automatic speech recognition. Image sequences of the region around the mouth of a
speaker are analyzed, with size and shape parameters of the oral cavity extracted to help
disambiguate similar sounding phonemes. The integration of audio and visual information can
occur at a high level, in which recognition is performed independently in both domains and then
compared [10], or at a lower level, with a combined audio-visual feature vector feeding, for
example, a neural network [11]. Performance of integrated recognizers has regularly been greater
than that of unimodal ones. Previous work involving multimodal sensory integration at the AI Lab was performed on the humanoid robot Cog and prototyping head platforms. A multimodal self-calibrating azimuthal localization and orientation system [12] and a multimodal event (hand clapping) detector were implemented [13].
All such work seeks to establish some sort of biological relevance; there is ample neurophysiological evidence that multimodal integration occurs at many different levels in animals, including birds, reptiles, and mammals. The integration can happen at the neuronal level, where a single neuron can be sensitive to both visual and audio stimuli, or at a more abstract level of spatial
and motor maps [14]. The area of the brain best understood in terms of multimodal representation
and interaction is the optic tectum (superior colliculus in mammals), which is a layered midbrain
nucleus involved in stimulus localization, gaze orientation, and attention [15].
Localization is a key problem that must be solved in many animals for survival. It comes as
no surprise therefore that the problems such animals face are the same ones that had to be solved
for this project, which relies on accurate localization for good tracking performance. Sound
localization is a much more difficult problem than visual localization, since acoustic stimuli are not
spatially mapped to the sensors used in the former (microphones or ears); thus some form of
computation is necessary for both engineered and biological systems so that the localization cues
from the sensors (e.g., time and intensity differences) can be extracted from the set of one-dimensional acoustic signals. On the other hand, visual localization is much easier since the sensors
used (cameras or eyes) are already spatially organized (CCD array or retina) in such a way that
visual stimuli are mapped to sensor space; the image of the stimulus appears on a corresponding
location in the array or the retina. This difference in representation (computation vs. sensor space
spatial organization) requires that there be some form of normalization of coordinate frames when
integrating both types of localization information. In many animals one modality, usually vision in
mammals, dominates over the other (usually audition) in actually determining stimulus location
[14]. As will be reported later in this thesis, in the currently implemented system visual localization
also dominates.
Of course, most engineering approaches use what we know about the neurophysiology of
multimodal integration only as an inspiration. For example, the representations of the visual and
auditory space are not merely superimposed in the superior colliculus; they are integrated in a
nonlinear manner by bimodal neurons to yield a unified representation of stimulus locations [16];
most engineering implementations combine audio-visual information linearly. Also in biological
systems, the integration occurs in multiple locations at different levels of abstraction. Our system
integrates at a much higher level, and at only one point.
2.3 Problem Definition
Our primary goal was to design and implement a sound capture system that is capable of
extracting the speech of a single person of interest in a normal examination or office room
environment. To be useful, the system must track, beamform, and perform speech recognition in a
timely manner. One of the key design goals was therefore real-time operation.
For any system designed to run in real-time, various compromises must be made in terms
of computational cost, complexity, robustness, and performance. To be able to perform such
optimizations, a modular system of easily modifiable and interchangeable components is necessary.
This allows different types of algorithms to be tested. An added advantage is that the modules may
be distributed across different processors.
The following three components will be examined in this thesis:
* Source Location Detection - The benefits of multimodality when applied to guiding the beam have been examined. The hypothesis is that the integration of visual and audio detection routines will be robust to various error sources (acoustic reverberations, cluttered visual environment). Specifically, visual localization combined with sound localization can be used to determine candidates for tracking. Visual localization is performed using a combination of motion analysis and color matching. Sound localization is performed using a TDOA estimator.
* Microphone Array Configuration - The simplest configuration for a collection of microphone elements is a linear array, where all microphones are spaced equally apart. As will be seen, this has a less than optimal frequency and spatial response. A better configuration is a compound or nested array that consists of several linear subarrays. See Section 3.1.4.
* Tracking - Simple detection methods to guide the beam may be insufficient for any realistic operating environment. A simple tracking mechanism that follows a single speaker around the room and takes into account source motion has been implemented and tested. See Section 3.2.2.
A totally general-purpose person identification, tracking, and speech capture system is
beyond the scope of this project. This thesis presents a solution for focusing on a particular person
and tracking and extracting only his/her speech. In limiting the scope of the solution, some
assumptions must be made concerning the nature of the interaction, and will be discussed in
Section 3.2.
Chapter 3 Design
This chapter describes in more detail the three system components listed in the previous
chapter. The high-level design issues and decisions as well as some theoretical grounding are
discussed; for actual details in the implementation of the system components, see Chapter 4.
The response of the simple delay-and-sum beamformer, shown in Figure 1, is first derived,
and the related design issues discussed. In the analysis that follows, a plane wave approximation of
the sound signal is assumed.
Figure 1: Simple delay-and-sum beamformer block diagram. The output of the beamformer is normalized by dividing the output of the summer by the number of microphone elements.
3.1 Microphone Array Configuration
Figure 2: Geometrical analysis of incoming acoustic plane wave.
3.1.1 Array Response
From Figure 2, we see that the interelement delay τ, assuming equally spaced microphone elements, is given by the equation

    τ = d·sin(θ) / c        (1)

where c is the speed of sound and θ is the incident angle of the incoming plane wave. Using complex exponentials to denote time delays, and normalizing the amplitude of the incoming signal to one, the total response of an N element array (where N is even) can be expressed as

    H(ω, θ) = Σ_{n=-N/2}^{N/2} a_n·e^{-jωnd·sin(θ)/c}        (2)

where ω is the temporal frequency variable associated with the Fourier series and θ is as above. It is clear that with the appropriate choice of coefficients a_n, the time delays associated with the incoming wavefront can be taken into account. Substituting

    a_n = â_n·e^{jωnd·sin(θ_0)/c}        (3)

into Equation (2) results in a generalized, steerable microphone array response:

    H(ω, θ, θ_0) = Σ_{n=-N/2}^{N/2} â_n·e^{-jωnd·(sin(θ) - sin(θ_0))/c}        (4)

The parameter θ_0 is the beamforming, or steering, angle. By modifying this variable, the angle at which the microphone array has the maximal response (the 'beam') to incident sound waves can be changed. When the steering angle equals the incident angle of the incoming wavefront, the complex exponential term becomes unity and drops out. The scaling and delay components in Figure 1 correspond to â_n and e^{jωnd·sin(θ_0)/c}, respectively.
It is useful to note that Equation (4) is similar in form to the discrete-time Fourier transform, given by the equation

    H(ω_s) = Σ_n a[n]·e^{-jω_s·n}        (5)

where ω_s, the temporal frequency variable, should not be confused with ω, the frequency of the incoming wave. The analogy between the array response and the DTFT is useful since it means microphone array design is equivalent to FIR filter design. Letting k = ω/c, we have the mapping ω_s → kd·(sin(θ) - sin(θ_0)). To obtain the array response as a function of θ, we use the transformation θ = sin⁻¹(ω_s/(kd) + sin(θ_0)).
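As a numerical check on Equation (4), the theoretical response of a uniformly weighted array can be evaluated directly. The sketch below is illustrative only; it is not the code used to generate the figures here, the element indexing is taken symmetric about the array center, and the speed of sound is an assumed constant:

    import numpy as np

    def array_response(freq_hz, theta_deg, steer_deg, num_mics=8, d=0.06, c=343.0):
        """Magnitude of the steered array response (Eq. 4) with unity weights."""
        w = 2.0 * np.pi * freq_hz
        theta = np.atleast_1d(np.radians(theta_deg)).astype(float)
        theta0 = np.radians(steer_deg)
        n = np.arange(num_mics) - (num_mics - 1) / 2.0       # element indices about the center
        phase = -1j * w * d / c * np.outer(np.sin(theta) - np.sin(theta0), n)
        return np.abs(np.exp(phase).sum(axis=1)) / num_mics  # normalized as in Figure 1

    # Example: 8-element, 6 cm array at 1.6 kHz, steered to 30 degrees.
    angles = np.linspace(-90.0, 90.0, 361)
    response = array_response(1600.0, angles, steer_deg=30.0)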
Figure 3 shows the theoretical response of an eight element linear array, center and off-center steered (θ_0 = 0° and θ_0 = 30°, respectively) with uniformly weighted (unity) array coefficients a_n and equally spaced microphones (d = .06 m). The microphone array would be located on the 90°-270° axis, with the normal corresponding to 0°. The response is symmetrical about the array axis, but in this application the array would be mounted on a wall and the other half plane is not of interest. Note that the response is a function of both signal frequency and interelement
distance. This raises two issues in the design and implementation of the array, spatial aliasing and
variations in the beamwidth.
3.1.2 Spatial Aliasing
From basic signal processing theory, we know that sampling a continuous domain is
subject to aliasing unless certain constraints on the sampling rate are met. Aliasing refers to the
phenomenon whereby unwanted frequencies are mapped into the frequency band of interest due to
improper sampling [17]. In dealing with discrete element digital microphone arrays, in addition to temporal aliasing, we must be aware of aliasing that may occur due to the spatial sampling of the waveform using a discrete number of microphones.
To avoid spatial aliasing, we use the metaphor of temporal sampling to determine the constraints in spatial sampling, in this case involving the interelement spacing d. In the temporal case, to avoid aliasing we require |ω_s| ≤ π. Equivalently, to avoid spatial aliasing, kd·(sin(θ) - sin(θ_0)) ≤ π. Substituting k = 2π/λ, we get the inequality d ≤ λ_min/2, where λ_min is the wavelength of the highest frequency component of the incident plane wave.
We can thus conclude that the higher the frequency (and thus the smaller the wavelength)
we are interested in capturing, the smaller the interelement spacing must be.
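As a small worked example of this constraint (assuming a speed of sound of 343 m/s, which is an assumption made for this illustration rather than a figure taken from the text):

    C = 343.0  # assumed speed of sound, m/s

    def max_alias_free_freq(d_m):
        """Highest frequency (Hz) satisfying d <= lambda_min / 2 for spacing d_m."""
        return C / (2.0 * d_m)

    def max_spacing(f_hz):
        """Largest interelement spacing (m) that avoids spatial aliasing up to f_hz."""
        return C / (2.0 * f_hz)

    # 6 cm spacing tolerates roughly 2.9 kHz; capturing up to 8 kHz needs roughly 2.1 cm spacing.
    print(max_alias_free_freq(0.06), max_spacing(8000.0))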
3.1.3 Beamwidth Variations
As mentioned in Section 3.1.1, the array response is also a function of the frequency of the waveform. The direct consequence of this is that the beamwidth, defined here to be the angular separation of the nulls bounding the main lobe of the beam, is dependent on the frequency of the incoming signal. Setting Equation (2) to zero, with a_n again uniformly unity, and solving for θ corresponding to the main lobe nulls, we obtain an expression for the beamwidth,

    BW = 2·sin⁻¹(2πc / (Nωd))        (6)
The beamwidth therefore increases for lower frequencies as well as smaller interelement
spacing; both lead to lower spatial resolution. In other terms, broadband signals such as speech that
are not exactly on-axis will experience frequency dependent filtering by a single linear
beamformer.
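Equation (6) can be evaluated directly to see this frequency dependence. The following sketch assumes an eight element, 6 cm array and a speed of sound of 343 m/s:

    import numpy as np

    def beamwidth_deg(freq_hz, num_mics=8, d=0.06, c=343.0):
        """Null-to-null main lobe width from Eq. (6): BW = 2*asin(2*pi*c / (N*omega*d))."""
        omega = 2.0 * np.pi * freq_hz
        arg = 2.0 * np.pi * c / (num_mics * omega * d)
        if arg > 1.0:
            return float("nan")   # the main lobe has no nulls at this frequency and spacing
        return float(np.degrees(2.0 * np.arcsin(arg)))

    # Broadband speech sees very different widths: roughly 127 degrees at 800 Hz
    # versus roughly 26 degrees at 3.2 kHz for the assumed geometry.
    for f in (800.0, 1600.0, 3200.0):
        print(f, beamwidth_deg(f))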
3.1.4 Sensor Placement and Beam Patterns
From the above discussion, it is clear that some care must be taken in the design of the
array in terms of microphone placement. Spatial aliasing and beamwidth variation considerations
require the opposing design goals of smaller interelement distance d for capturing higher
frequencies without aliasing and larger d for smaller beamwidth or higher spatial resolution for low
signal frequencies. A compromise can be made with a linearly spaced array capable of moderate
frequency bandwidth and spatial beamwidth. A better solution is a compound array composed of
subarrays, each with different microphone spacing and specifically designed for a particular
frequency range; beamwidth variation is lessened across a broad frequency range [18].
A linearly spaced array (referred to as the linear array) and a compound array (multiarray)
composed of three subarrays have been implemented (see Section 4.2) and compared. Figure 3
shows the theoretical beam patterns for an eight element linear array with an inter-element spacing
of 6cm. Figure 4 shows the configuration of an eight element multiarray, and Table 1 lists the
relevant characteristics of the subarrays. Figures 5 through 7 show the theoretical responses for each
subarray. Note that as expected, low frequency spatial resolution is poor for the small spacing, high
frequency subarray and there is substantial high frequency aliasing for the large spacing, low
frequency subarray. As a control for relative performance measurement between the two types of
arrays, the number of elements for each are held constant at eight.
A compound microphone array requires slightly different processing than the simple linear
delay-and-sum beamformer. Figure 8 is the modified block diagram. The signal redirector reuses
signals from some of the microphone elements and feeds them to multiple sub-arrays. Bandpass
filters isolate the desired frequency range for each subarray. Sixth-order elliptic (IIR) filters were
chosen (see Figure 9 for frequency response curves) to provide a combination of sharp edge-cutoff
characteristics and low computational requirements [6]. Interpolators allow non-integral sample
delay shifts for more possible angular positions of the beam (See Section 4.2.1.1).
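A compound-array pipeline of the kind just described (route shared channels to each subarray, band-limit, beamform, and sum) can be sketched as follows. The channel groupings, filter edges, and filter design call are assumptions for illustration; they stand in for the actual subarray assignments of Figure 4 and the DSP filters used in this system:

    import numpy as np
    from scipy.signal import ellip, sosfilt

    FS = 22050.0   # sampling rate, Hz
    C = 343.0      # assumed speed of sound, m/s

    # Illustrative channel groupings (indices into the 8 channels), spacings, and passbands.
    SUBARRAYS = [
        {"mics": [2, 3, 4, 5], "d": 0.02, "band": (2875.0, 8000.0)},   # high frequency subarray
        {"mics": [1, 3, 4, 6], "d": 0.06, "band": (958.0, 2875.0)},    # mid frequency subarray
        {"mics": [0, 3, 4, 7], "d": 0.18, "band": (60.0, 958.0)},      # low frequency subarray
    ]

    def steer_and_sum(x, d, steer_deg):
        """Simple integral-delay beamformer over the rows of x for one subarray."""
        tau = d * np.sin(np.radians(steer_deg)) / C
        out = np.zeros(x.shape[1])
        for m in range(x.shape[0]):
            out += np.roll(x[m], -int(round(m * tau * FS)))
        return out / x.shape[0]

    def compound_beamform(channels, steer_deg):
        """Bandpass each subarray with a 6th-order elliptic filter, beamform, and sum."""
        total = np.zeros(channels.shape[1])
        for sub in SUBARRAYS:
            lo, hi = (f / (FS / 2.0) for f in sub["band"])
            sos = ellip(6, 0.5, 50.0, [lo, hi], btype="bandpass", output="sos")
            band_limited = sosfilt(sos, channels[sub["mics"]], axis=1)
            total += steer_and_sum(band_limited, sub["d"], steer_deg)
        return total

In this sketch the shared channel indices play the role of the signal redirector of Figure 8, and the per-subarray elliptic sections play the role of its bandpass filters.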
Figure 3: Eight element linear array beam pattern. The array is located on the 90-270 degree axis, with the normal corresponding to 0 degrees. The left-half plane is not relevant. The array consists of 8 elements with .06 m interelement spacing. Frequencies are, from left to right, 400 Hz, 800 Hz, 1.6 kHz, 3.2 kHz. The top row shows a beam centered at 0 degrees, with the bottom row at 30 degrees.
Figure 4: Microphone placement for an 8 element compound array. d1, d2, d3 correspond to the interelement spacing of sub-arrays 1, 2, and 3, respectively.

Subarray    Interelement spacing    Frequency Range (Max Frequency)
A1          d1 = .02 m              High: 8.625 kHz (~8 kHz)
A2          d2 = .06 m              Mid: 2.875 kHz (~3 kHz)
A3          d3 = .18 m              Low: 958 Hz (~1 kHz)
Table 1: Eight element compound array configuration (see Figure 4). Each subarray has a corresponding frequency response at different ranges. Indicated values are the highest frequency each response can handle without aliasing.
Figure 5: High frequency subarray beam pattern. The array is located on the 90-270 degree axis, with the normal corresponding to 0 degrees. The left-half plane is not relevant. The subarray consists of 4 elements with .02 m interelement spacing. Frequencies are, from left to right, 500 Hz, 1 kHz, 2 kHz, 8 kHz. The top row shows a beam centered at 0 degrees, with the bottom row at 30 degrees.
Figure 6: Mid frequency subarray [4 elements, .06 m spacing] beam pattern. The top row shows a beam centered at 0 degrees, with the bottom row at 30 degrees.
Figure 7: Low frequency subarray [4 elements, .18 m spacing] beam pattern.
Figure 8: Block diagram of compound array. The eight channels in the array are directed into three subarrays, each with four channels. See Figure 4 for channel assignments.
Figure 9: Frequency response for bandpass elliptic filters for three subarrays (A1: highpass, A2: bandpass, A3: lowpass).
3.2 Beam Guiding
With the beam of the microphone array properly formed, it must be guided to the proper
angular location, in this case a speaking person. This is a very complex task, involving detecting
people, selecting a single person from whom to extract speech, and tracking that person as he/she
moves about the room. A complete and generalized solution to the target detection, identification,
and tracking problem is beyond the scope of this thesis, and is indeed an entire research focus in
itself [19]. Fortunately the constrained nature of the particular problem, and design decisions of the
array itself, allow various simplifications. As part of an intelligent examination or office room, the
system will be situated in a relatively small room with few people, as opposed to a large conference
hall or a highly occupied work area. As will be discussed in Section 4.2, the possible steering
angles will be limited to discrete positions. These factors simplify the detection task, since there
will be fewer candidates (usually only one or two) to process. The tracking task will be simpler and
more robust, as discrete angular positions will allow the tracker to be less sensitive to slight errors
in target location.
3.2.1 Detection
As mentioned in the background chapter, in most previous work with microphone arrays,
target detection and tracking are handled very simply, usually using only sound localization.
TDOA estimation utilizes the spatial separation of multiple microphones in much the same way as
beamforming in microphone arrays. The signals from two or more microphones are compared and
the interelement time delay τ (see Section 3.1) is estimated, which is equivalent to the sound source's direction. The comparison is usually in the form of a generalized cross-correlation function,

    τ_gcc = argmax_τ R_{x1,x2}(τ),

where x_1 and x_2 are the signals from two microphones and R_{x1,x2}(τ) = ∫_{-∞}^{∞} x_1(t + τ)·x_2(t) dt [20]. Refinements have been proposed to the objective function to minimize effects of reverberation [21].
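A minimal version of such a TDOA estimate is sketched below. It is an illustration only: it assumes two already-synchronized channels, uses a plain cross-correlation rather than a refined GCC weighting, and takes the speed of sound as an assumed constant:

    import numpy as np

    def tdoa_angle(x1, x2, fs, mic_spacing_m, c=343.0):
        """Estimate the arrival angle from the peak lag of the cross-correlation of two channels."""
        n = len(x1)
        corr = np.correlate(x1, x2, mode="full")         # correlation over lags -(n-1) .. (n-1)
        lag = int(np.argmax(corr)) - (n - 1)             # lag (in samples) at the peak
        tau = lag / fs                                   # interelement delay in seconds
        s = np.clip(tau * c / mic_spacing_m, -1.0, 1.0)  # sin(theta), clipped for safety
        return float(np.degrees(np.arcsin(s)))

    # Synthetic check: a random signal and a copy of it shifted by 3 samples.
    sig = np.random.default_rng(1).standard_normal(2048)
    estimated_angle = tdoa_angle(sig, np.roll(sig, 3), fs=22050, mic_spacing_m=0.12)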
One of the simplest yet still effective visual methods for detection is motion analysis using
thresholded image subtraction. The current captured image is subtracted pixel-wise from a
background image. Pixels that have changed intensities, usually corresponding to objects that have
moved, can then be detected; these pixels are then thresholded to a value of either 0 or 1 and the
result will be referred to as motion pixels. The underlying assumption for this process is that the
background does not change significantly over time and that the objects (people) are sufficiently
distinct from the background. The background image is composed of a running average of image
frames in the form

    I_b[n] = α·I_b[n-1] + (1 - α)·I[n]        (7)

with α (range 0-1.0) determining the relative weighting of the previous background I_b and the current image I [19]. Note that with a high α, this method can detect stationary objects that have recently moved, for example a
person who has entered but is currently sitting or standing still.
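A sketch of this thresholded-subtraction detector follows; the frame size, the threshold, and the value of α are placeholders rather than values taken from this system:

    import numpy as np

    def update_background(background, frame, alpha=0.95):
        """Running-average background per Eq. (7); a high alpha makes the background change slowly."""
        return alpha * background + (1.0 - alpha) * frame

    def motion_pixels(background, frame, threshold=25.0):
        """Binary motion mask: 1 where the current frame differs enough from the background."""
        return (np.abs(frame - background) > threshold).astype(np.uint8)

    # Toy usage on grayscale frames stored as float arrays.
    background = np.zeros((120, 160))
    frame = np.zeros((120, 160))
    frame[40:80, 60:100] = 200.0                 # a bright object enters the scene
    mask = motion_pixels(background, frame)      # detected (moved) pixels
    background = update_background(background, frame)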
3.2.2 Tracking
In a completely dynamic environment, there may be multiple objects or people
simultaneously speaking and moving. In the more constrained environment of an office or
examination room, usually one person is speaking at a time.¹ Depending on the intended
application, two possible modes of beam steering are possible. The beam may be guided to each
speaker location in turn using the above detection methods, with no explicit maintenance of
detected object state information. In this case sound localization information coupled with a sound
energy detector (to determine when there is actually speech being spoken) is most useful. A
detection-based method using only sound localization is the way beam guiding is handled in most
microphone array projects.
¹ Realistically, to handle simultaneous speech, a microphone array with more elements and greater spatial resolution is necessary.
As outlined in the problem statement, one particular person must be tracked as he/she
moves, regardless of background noise and even in the absence of speech, when there are no
acoustic cues of motion. The task of initially identifying this specific person will not be explored in
this thesis; speaker recognition [22], identifiable markers [1], and appearance-based searching [23]
are all possible options. Currently, the first object detected after a long period of visual and audio
inactivity will be tracked and will be referred to as the TO (tracked object).
One of the major issues in visual tracking is the correspondence problem; detected objects
in one image must be matched to objects in a successive image [24]. Color histogram matching is
often used in real-time trackers to find this correspondence [19]. Once the TO and its visual bounding box are determined, a color histogram of the image pixel values is constructed and used as a match template. In the next time frame, bounding boxes and histograms for each detected object are constructed and the intersection with the template computed. The normalized intersection of the test object histogram H^t and the match template histogram H^m is defined to be

    I(H^t, H^m) = Σ_{j=1}^{N} min(H^t_j, H^m_j) / Σ_{j=1}^{N} H^m_j,

where N is the number of bins in the histogram. The
detected object that has the highest normalized intersection value, above a threshold, will be
considered the current tracked object, and the template histogram will be updated. The benefits of
color histogram matching over more complex model based techniques include relative insensitivity
to changes in object scale, deformation, and occlusion [25]. It is also very computationally
efficient, and can be used in real-time tracking.
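The matching step can be written down directly from the normalized intersection above. The sketch below uses a fixed joint RGB binning and an illustrative acceptance threshold, both of which are assumptions:

    import numpy as np

    def color_histogram(pixels, bins=8):
        """Joint RGB histogram of an (N, 3) pixel array, flattened to one vector."""
        hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                                 range=[(0, 256)] * 3)
        return hist.ravel()

    def normalized_intersection(h_test, h_template):
        """I(H^t, H^m) = sum_j min(H^t_j, H^m_j) / sum_j H^m_j."""
        return np.minimum(h_test, h_template).sum() / h_template.sum()

    def best_match(template_hist, candidate_hists, threshold=0.5):
        """Index of the best-matching candidate, or None if every score falls below threshold."""
        scores = [normalized_intersection(h, template_hist) for h in candidate_hists]
        best = int(np.argmax(scores))
        return best if scores[best] >= threshold else None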
Additionally, a simple prediction-correction algorithm involving source motion estimates
can be used to narrow the search for a match [26]. The tracker estimates the velocity of the TO and
uses the estimate to predict the probable location in the next time frame. Detected objects in the vicinity of the predicted location will be compared first. Once the location of the TO in the next time frame is determined, the tracker corrects its velocity estimate.
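One way to sketch such a predict-correct loop is a constant-velocity model with a simple blending gain; this illustrates the idea rather than reproducing the thesis's exact tracker:

    class PredictCorrectTracker:
        """Track a 1-D (azimuthal) position with a constant-velocity prediction step."""

        def __init__(self, position, velocity=0.0, gain=0.5):
            self.position = position
            self.velocity = velocity
            self.gain = gain  # how strongly a new observation corrects the velocity estimate

        def predict(self, dt):
            """Probable location in the next time frame; used to narrow the match search."""
            return self.position + self.velocity * dt

        def correct(self, observed_position, dt):
            """Update position and velocity once the tracked object has been re-identified."""
            predicted = self.predict(dt)
            innovation = observed_position - predicted
            self.velocity += self.gain * innovation / dt
            self.position = observed_position

    # Usage: predict, search near the prediction, then correct with the matched location.
    tracker = PredictCorrectTracker(position=0.0)
    guess = tracker.predict(dt=0.1)
    tracker.correct(observed_position=2.0, dt=0.1)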
Finally, if the system is unable to maintain a track of the TO (when it does not appear
visually or aurally in the predicted location, and there are no suitable visual histogram matches), it
must acquire a new TO. In the presence of valid and ambiguous visual and audio location cues, it
picks the location indicated by the modality with the higher confidence level at the particular
location. Confidence in a particular modality (M) at a given location (L) is measured as a signal-to-noise ratio, SNR_{M,L}, of the beamformed output. In other words, if the visual detection module indicates loc_v as the location of a valid target and the audio detection module indicates a different loc_a, the tracker selects loc = argmax(SNR_{A,loc_a}, SNR_{V,loc_v}). A running table of SNR_{M,L} for M = {A, V} at every discrete location L is maintained and updated at every iteration of the tracker.
See Section 4.4 for more details.
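The arbitration rule just described reduces to a table lookup followed by a comparison; a small sketch (with a plain dictionary standing in for the running SNR table) follows:

    def pick_reacquisition_location(snr_table, loc_audio, loc_vision):
        """Choose between conflicting audio and visual cues by comparing stored SNR values.

        snr_table maps (modality, discrete location) -> running SNR in dB,
        e.g. snr_table[("A", 15.67)] for the audio modality at the 15.67-degree beam.
        """
        snr_audio = snr_table.get(("A", loc_audio), float("-inf"))
        snr_vision = snr_table.get(("V", loc_vision), float("-inf"))
        return loc_audio if snr_audio > snr_vision else loc_vision

    # Example with an assumed table over the discrete beam angles.
    table = {("A", 32.7): 6.5, ("V", 15.67): 9.1}
    new_location = pick_reacquisition_location(table, loc_audio=32.7, loc_vision=15.67)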
While the above techniques are useful for large, cluttered scenes, the tracking environment in this project is relatively simple: a small office room will require only a small number of people to be tracked. A sensibly mounted microphone array/camera unit will result in visual images containing mostly horizontal motion; only one object is of interest at any given time. Figure 10 is
the general dataflow diagram of the proposed system.
Figure 10: Data flow diagram. The dashed box indicates DSP code. All other software components are on the host or other PCs. Each component will be discussed in detail in Chapter 4.
Chapter 4 Implementation
In this chapter, the implemented system is presented. As seen in Figure 11, the array can
be configured to be a linear array or multiarray by simply rearranging the placement of
microphones. One color CCD camera is mounted on the centerline of the array to provide visual
information.
Figure 11: The integrated array. The CCD camera is mounted directly above the
microphones at the array center. The image is of the linear array
configuration. The multiarray configuration requires relocating some of
the microphone elements.
This chapter is organized as follows. The first section is a general system overview, and the
low-level system details and issues not directly related to the high-level algorithms are presented. A
detailed system architecture is also given. The remaining sections describe the implementation in
three major groupings, audio processing (including sound localization and beamforming), visual
processing (visual localization), and the tracker.
4.1 General System Overview
A few underlying principles were followed in the design and implementation of the
system. Computation is not only modular and multithreaded, but can be distributed across separate
platforms. Inter-module communication is asynchronous and queued, to make a single system
clock unnecessary and computation independent of individual platform speeds.
4.1.1 Hardware Components
Computation is split between the host PC, an add-on DSP board, and an optional additional PC. The actual microphone array hardware is straightforward; eight electret condenser microphones are connected to custom-built microphone preamplifiers. A more detailed description of the hardware is given in [27].² The CCD camera is connected to the Matrox Meteor frame-grabber board, which can be hosted on any PC.
The actual beamforming is performed in software on the Signalogic DSP32C-8 DSP board, a combined DSP engine³ and data acquisition system. It was chosen for the dual advantages of simultaneous multi-channel data acquisition and offloading beamforming calculations from the host. The board is capable of performing simultaneous A/D conversion on eight channels at a sampling rate of 22.05 kHz. The signals are filtered, delayed, and combined appropriately, and then passed on to the host PC. The host PC (and additional PC, if present) performs the sound and visual localization, tracking, and the actual speech recognition using the commercially available software ViaVoice. Other than the host/DSP interface, all software components are modular and network based; visual processing of the single camera can be performed on another computer. A modular system lends itself to easy expansion of components and features. Additional cameras and processing can be added in a straightforward manner. The DSP board itself can be supplemented with another board to allow more microphones to be added to the array. See Section 6.5 for possible future extensions.
² Joyce Lee worked extensively with the author in designing and implementing the actual hardware.
³ The DSP used is an 80 MHz AT&T DSP32C processor. It is very popular in the microphone array research community and has certain advantages, including easy A/D interfacing and seamless 8, 16, and 32 bit memory access, over other DSPs.
4.1.2 Software Components
Before discussing the actual software architecture details, some terms must be defined. A task refers to a host-based thread of execution that corresponds to a well-defined, self-contained operation. In the system there are four tasks, the Tracker, Audio Localizer, Vision Localizer, and
the DSP Beamformer Interface.⁴ An application refers to a platform-specific process that may
contain one or more tasks. Their only purpose is as a front-end wrapper so that the underlying OS
may invoke and interact with the tasks using a graphical user interface. A COM object is an
application that supports a standard programmatic interface for method invocation and property
access. COM, or Component Object Model, is a CORBA-like interface standard prevalent on
Windows platforms. DCOM (Distributed COM) is an RPC-based extension that provides remote
invocation and property access.
To provide some flexibility in the incorporation of additional computational resources, the
system was designed from the beginning to allow some simple manner of distribution of
computation. The logical boundary of separation is at the application level, as there are numerous
methods of inter-application (inter-process) communication. Since the Audio Localizer is closely
tied with both the Tracker and the Beamformer Interface Task, it makes sense to place them in a
single COM object; however, they can easily be separated if necessary. The Vision Localizer task
is placed in a separate COM object that can be executed on a different PC if desired. Applications
that support COM interface conventions can communicate with other COM objects on the same
computer, or objects on remote computers automatically and seamlessly, by employing DCOM as a
transparent mechanism for inter-application communication. Note that DCOM is not suitable for
low-level distributed computing (e.g., passing raw signal data as opposed to processed localization
estimates) due to overhead and bandwidth limitations, but is more than sufficient for high-level
message passing, as is the case in this system.
4.1.3 System Architecture
Figure 12 is the overall system architecture diagram, from the perspective of the hardware
and software components mentioned above. Each software component is explained in further detail
in the sections below.
⁴ The actual DSP-based beamformer routine is not considered here, as its execution is fixed to be on the DSP board.
Figure 12: System architecture diagram. Dashed boxes indicate application groupings, solid boxes indicate tasks. Double border boxes indicate hardware components. Dark lines depict hardware (bus or cable) connections.
4.2 Audio Processing
4.2.1 Beamformer
Figure 13 is the block diagram of the beamformer, the only task that is not running on the
host, but on the DSP board. It handles the simultaneous capturing of sound samples from the eight
channels, prefiltering, beamforming, and postprocessing.
Figure 13: Beamformer block diagram. This task resides on the DSP board. Dashed boxes are present only in the multi-array configuration. Phase is obtained from the Tracker task on the host PC. The two channel and beamformed output are sent back to the host PC to the Audio Localizer task.
4.2.1.1 Prefilter
Raw sound data from the eight channels goes through a series of prefiltering stages before being grouped together for beamforming. An automatic gain control (AGC) filter scales the
signals to normalize amplitude and to reduce the effects of distance from the speech source to the
microphone array. Fifteenth order FIR filters are then applied to compensate for individual
microphone characteristics so that each channel will have a uniform frequency response. (Refer to
[27] for a full discussion of the design of the compensation filters.) As was discussed in
Section 3.1.4, for the multiarray case, each subarray "shares" some microphone channels with
another. A channel grouper redirects the channels to the appropriate subarray.
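As an illustration of the gain-normalization idea described above, a very simple block-wise AGC can be sketched as follows (the target level and smoothing constant are assumptions, and the actual DSP prefilter may differ):

    import numpy as np

    def agc_block(samples, state_gain, target_rms=0.1, smoothing=0.9):
        """Scale one block toward a target RMS level, smoothing the gain between blocks."""
        rms = np.sqrt(np.mean(samples ** 2)) + 1e-12
        desired_gain = target_rms / rms
        gain = smoothing * state_gain + (1.0 - smoothing) * desired_gain
        return samples * gain, gain

    # Apply per channel, block by block, carrying the gain as state.
    block = 0.02 * np.random.default_rng(3).standard_normal(512)
    scaled, g = agc_block(block, state_gain=1.0)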
4.2.1.2 Interpolation
Since the incoming signal is being temporally as well as spatially sampled, arbitrary
beamforming angles are not available using the discretely spaced samples [28]. While it is possible
to obtain arbitrary angles using upsampling or interpolation techniques, it makes more sense to
select a few discrete angles that correspond to as many integral sample delays as possible. Discrete
beamforming also makes sense when the beamwidth is relatively large. Setting Equation (2) to
zero and again solving for θ corresponding to the main lobe nulls, we get a minimum possible beamwidth of 10 degrees for the eight element linear array and 65 degrees for the compound array⁵ [29].
To calculate the possible angular beam positions using discrete sample delays, we return to
Equation (1), this time substituting nT_SR for τ and mD for d. For our system T_SR, the sampling period, is 4.5 × 10⁻⁵ seconds⁶ and D, the smallest unit of interelement spacing, is .02 m.
⁵ The eight element compound array consists of 3 four element sub-arrays, which explains the larger beamwidth. However, with the current definition of the beamwidth, these values are a bit overstated. Another definition for the beamwidth is the angular separation of the two -3 dB points of the main lobe. With this definition the spatial resolution is much better.
⁶ Corresponding to a sampling rate of 22.05 kHz.
n     m=1 (d=.02 m)    m=3 (d=.06 m)    m=9 (d=.18 m)
0     0°               0°               0°
1     54.8°            15.67°           5.16°
2     -                32.7°            10.37°
3     -                54.8°            15.67°
4     -                -                21.1°
5     -                -                26.74°
6     -                -                32.7°
7     -                -                39.05°
8     -                -                46.05°
9     -                -                54.8°
10    -                -                64.16°
11    -                -                81.89°
Table 2: Angular beam positions. The high, mid, and low frequency subarrays of the compound array correspond to d=.02, .06, and .18 respectively. n is the integral sample delay. Highlighted entries in the original indicate the discrete angle positions chosen for the compound array (0°, ±15.67°, ±32.7°, ±54.8°). Interpolation is necessary for the high frequency subarray to obtain the other two positions. The linear eight element array is equivalent to the mid frequency subarray (second column).
Table 2 lists the possible angular positions given an integral sample delay (n) and
interelement separation (m). 0° corresponds to a dead-center beam perpendicular to the array. The highlighted angles (0°, ±15.67°, ±32.7°, and ±54.8°) are the chosen discrete positions for the beam in the compound array. The linear array corresponds to the middle column (d = .06). The beamwidth for the linear array is too small for full coverage, so interpolation is necessary. A factor of 3 upsampler will result in 8 (16 for the full 180° space) additional angles that are the non-highlighted angles in column 3 [30]. For computational reasons, a simple linear interpolator is
implemented. Others have used window-based FIR filters [6]; Figure 14 is a comparison of a
Hamming window based filter and the linear interpolator. The high, mid, and low frequency
subarrays of the compound array correspond to d=.02, .06, and .18 respectively. For the compound
array, interpolation is necessary only in the high frequency subarray. To obtain the other two
chosen angles (15.67' and 32.70) the same factor 3 interpolator can be used.
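The relationship between integral sample delays and beam angles can be illustrated with the short sketch below. It assumes a nominal speed of sound of 343 m/s, so the resulting angles are close to, but not exactly, the values in Table 2, which depend on the speed of sound actually used.

```python
import numpy as np

C = 343.0          # assumed speed of sound (m/s)
T_SR = 1 / 22050   # sampling period, about 4.5e-5 s
D = 0.02           # smallest unit of inter-element spacing (m)

def beam_angle(n, m, upsample=1):
    """Beam angle in degrees for an integral sample delay n between adjacent
    elements spaced m*D apart; upsample > 1 models interpolated (fractional) delays."""
    s = (n * T_SR * C) / (upsample * m * D)   # sin(theta) from tau = d*sin(theta)/c
    return float(np.degrees(np.arcsin(s))) if abs(s) <= 1 else None

# Approximate reproduction of the columns of Table 2:
for m in (1, 3, 9):
    angles = [beam_angle(n, m) for n in range(12)]
    print(f"m={m}:", ["%.2f" % a for a in angles if a is not None])
```

Passing upsample=3 for the d=.02 subarray reproduces the interpolated positions discussed above.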
Figure 15 shows the coverage area for the microphone array and the camera. Currently, a
wide-angle CCD camera with a field of view of 100 degrees is being used. The tracker will
integrate both visual and audio information in the area of overlapping coverage. Outside the field
of view of the camera, sound localization estimates will be the sole source of information.
[Plot: interpolator frequency response magnitude vs. frequency (0 to 12000 Hz); legend: Linear, Ideal, Hamming.]
Figure 14: Frequency responses for the linear and Hamming window based
interpolators, with the ideal response as a reference. Each interpolator
is used for a factor-3 upsampler.
[Polar plot: coverage areas; legend: Camera, Mid freq subarray (A2)/linear array, Low freq subarray (A3).]
Figure 15: Coverage area for microphone array and camera. Discrete radials
correspond to possible beamforming angles. Arcs denote visual field of
view: inner arc corresponds to current camera, outer arc to wide angle
camera.
4.2.1.3 Beamforming and Post-Processing
The actual beamforming is relatively straightforward, and is implemented exactly as
discussed in Section 3.1. For the multiarray case, sixth order elliptic bandpass filters provide some
post-processing that will allow the three separate subarrays to be summed together to form the
beamformed output.
In addition, two pre-processed channel signals, corresponding to the two central
microphones in subarray A2, are sent along to the host for the Audio Localizer task.
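As an illustration of the delay-and-sum operation itself, the summation over integrally delayed channels can be sketched as follows; this is only a sketch, not the actual beamformer running on the DSP board.

```python
import numpy as np

def delay_and_sum(channels, per_element_delays):
    """Minimal delay-and-sum beamformer sketch.

    channels           -- array of shape (n_mics, n_samples), already prefiltered
    per_element_delays -- integral sample delay applied to each channel to steer the beam
    """
    out = np.zeros(channels.shape[1])
    for ch, d in zip(channels, per_element_delays):
        out += np.roll(ch, int(d))   # wrap-around shift; a real system buffers across frames
    return out / len(per_element_delays)

# Steering an 8-element array to the position with inter-element delay n:
n = 1
steering_delays = [i * n for i in range(8)]
```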
4.2.2 Audio Localizer
The audio localizer (see Figure 16) is a simple TDOA estimator as described in Section
3.2.1. Two signal channels of the microphone array and the beamformed output are obtained from
the beamformer task, and are first preprocessed on a frame-by-frame basis by applying a sliding
Hamming window. The signal energy of each frame of the beamformed output is calculated
by computing the dot product of the sample values. A simple speech detector uses the running
average of signal energy and a thresholder to determine when there is an actual signal or just
background noise. In the absence of speech, the value of the signal energy of the beamformed
output is used to update a running value of the background noise energy (E(N)).
In the presence of speech, a cross-correlator is applied to the two channel frames and a
location estimate, corresponding to a peak in the cross-correlation output, is computed. The value
of the signal energy of the beamformed output (E(S+N)), containing both signal (S) and noise (N)
components, is used in conjunction with the running background noise energy to compute the SNR.
SNR is defined to be the ratio of the signal energy to the noise energy. Assuming that the signal
and noise components are independent, we get the following expression for the SNR in dB:
    SNR = 10 \log_{10} \left( \frac{E(S+N)}{E(N)} - 1 \right)
The sound localization estimate and SNR value are sent to the Tracker task.
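The per-frame processing described above can be sketched as follows; the function and parameter names are illustrative assumptions, and the sliding Hamming window and the speech detector are omitted for brevity.

```python
import numpy as np

def frame_energy(frame):
    """Signal energy of a frame as the dot product of the samples with themselves."""
    return float(np.dot(frame, frame))

def tdoa_estimate(ch1, ch2, max_lag):
    """Offset (in samples) of the cross-correlation peak between two windowed frames."""
    corr = np.correlate(ch1, ch2, mode="full")
    lags = np.arange(-len(ch2) + 1, len(ch1))
    valid = np.abs(lags) <= max_lag
    return int(lags[valid][np.argmax(corr[valid])])

def snr_db(beamformed_energy, noise_energy):
    """SNR in dB from E(S+N) and the running background-noise energy E(N),
    assuming independent signal and noise components (expression above)."""
    ratio = beamformed_energy / noise_energy - 1.0
    return 10 * np.log10(ratio) if ratio > 0 else float("-inf")
```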
[Block diagram: the two sound channels and the beamformed output are preprocessed; the beamformed output drives the energy computation, speech detection, and background noise estimate, yielding the SNR; the two channels feed the cross-correlation and localization stage; the SNR queue and sound localization distribution are passed to the Tracker.]
Figure 16: Audio Localizer block diagram. The sound channels and beamformed
output are obtained from the Beamformer Task on the DSP board.
4.3 Visual Processing
The visual localizer obtains images from the camera through the frame grabber board and
places them in a circular buffer. As described in Section 3.2.1, the images are used to update the
background image and create a motion image. The motion image is computed by taking the
absolute value of the difference of the current and background images, all computations being
performed on a per-pixel basis. The image is then thresholded to produce a bilevel image, and then
dilated to create regions of pixels corresponding to moving objects in the camera image. Dilation is
a standard morphological image operation that removes spurious holes in an aggregate collection of
pixels to create a uniform "blob" [31].
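A minimal sketch of this motion-image pipeline is shown below; the adaptation rate (alpha, playing the role of a in Equation (7)), the threshold value, and the number of dilation iterations are illustrative assumptions. The labeling step anticipates the connected components clustering described next.

```python
import numpy as np
from scipy import ndimage

def motion_blobs(frame, background, alpha=0.05, thresh=30, dilate_iters=2):
    """Background-subtraction motion detector on grayscale float frames.

    Returns the labeled blob image, the number of blobs, and the updated background.
    """
    background = (1 - alpha) * background + alpha * frame      # running background update
    motion = np.abs(frame - background)                        # per-pixel absolute difference
    bilevel = motion > thresh                                  # threshold to a bilevel image
    dilated = ndimage.binary_dilation(bilevel, iterations=dilate_iters)  # fill spurious holes
    labels, n = ndimage.label(dilated)                         # group adjacent pixels into blobs
    return labels, n, background
```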
A segmentation routine utilizes a connected components (CC) clustering algorithm to
further associate spatially localized blobs to form clusters. The CC algorithm simply associates all
blobs that are adjacent to each other into a single cluster. A standard k-means algorithm is also
employed to further segment the clusters into "object" candidates [32].
As was mentioned in Section 3.2.2, it is assumed that the desired target (referred to as the
Tracked Object, or TO) has been determined previously. If not, the first single object after a period
of inactivity is arbitrarily assigned to be the TO. The relevant information for the TO (position,
pixel velocity, bounding box, and color histogram), as well as that of other candidates, is
maintained and updated on a frame-by-frame basis. The color histogram of the TO is used as a
reference template to search among the current frame object candidates for the best match. The
search is narrowed by considering the past positions and pixel velocities of the TO to predict the
current location.
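The thesis does not specify the histogram comparison metric; as a hedged illustration only, the sketch below uses a coarse RGB histogram and histogram intersection to pick the best-matching candidate.

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Coarse, normalized RGB color histogram of an image patch (H x W x 3, uint8)."""
    hist, _ = np.histogramdd(patch.reshape(-1, 3).astype(float),
                             bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def best_match(reference_hist, candidate_patches):
    """Index of the candidate whose histogram intersection with the reference is largest."""
    scores = [np.minimum(reference_hist, color_histogram(p)).sum() for p in candidate_patches]
    return int(np.argmax(scores))
```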
Visual localization is accomplished by computing the centroid of all the constituent
clusters of the TO. In the current design of the system, a one dimensional microphone array is used,
resulting in azimuthal steering of the beam. Similarly, the camera is mounted centered and directly
on top of the array at approximately eye level. Thus with visual localization only the horizontal
component of the TO location needs to be determined.
7. It should be noted that the term object as used here is not related to that used in the image processing community, specifically in "object detection." No effort has been made to determine the identity or nature of the group of clusters. In this thesis, object detection refers to the process by which a group of clusters is aggregated into a single collection.
[Block diagram: frame grabber → rotating image buffer → background image → motion image → threshold → morphological operators (dilation) → segmentation (CC, k-means) → histogram matching against the reference histogram → Tracked Object (TO) and Detected Objects → localization → visual localization distribution (to Tracker).]
Figure 17: Visual Localizer block diagram. The segmentation routine
incorporates both a connected components (CC) and a k-means
segmentation algorithm. The Tracked Object and Detected Objects
boxes indicate stored state information obtained from the segmentation.
Likewise the Reference Histogram is a running color histogram of the
TO.
Figure 18 shows sample output of the currently implemented visual object detector. Here the
value of a from Equation (7) (in Section 3.2.1) is low, so the detector behaves essentially as a
motion detector. The vertical (green) line represents the output of the localizer, in image
coordinates. With the current setup each image is 160 by 120 pixels with 24 bit color. The red and
green boxes are the bounding boxes from the motion segmentation routine. The sub-image on the
right side is the current TO. Note that, even with only a single moving speaker, there are still
spurious motion pixels not associated with the target. These arise from, among other noise sources,
flickering lights.
Figure 18: Sample motion based localization output. The vertical line represents
the computed location. The middle picture is the output of the motion
segmentation routine. The rightmost picture is that of the currently
tracked object (TO).
4.4 Tracker
The tracker task (Figure 19) guides the microphone array beam based on the location
estimates from both the sound and vision localizers, using either or both modalities depending on the
existence of valid estimates and on a simple persistence model based on position and velocity. This
allows it to handle cases when the TO may temporarily be hidden from view but still speaking, or
when the TO is moving but not speaking. If there is inactivity for an extended period of time, the
state model is reset and a new TO is selected.
Localization estimates from both modalities are maintained in spatial distribution maps,
and a single location value is computed for each modality by finding peaks in the distributions
[12]. These maps have an integrating or "smoothening" effect on the raw localization estimates and
make the tracker more robust to spurious noise data points. If both modalities indicate the same
location, or if there is only one valid estimate, then that value is passed on to the Beamformer task.
If the two modalities indicate different locations, a position estimate from the persistence model,
which computes the likely location of the TO from the previously estimated location and the
current (azimuthal) target velocity estimate, is obtained and compared to the ambiguous location
estimates. If there is a match, that location estimate is sent to the Beamformer and the persistence
model is updated with the new location estimate and target velocity. If there is no match, then the
tracker has essentially lost track of the TO and a new tracked object must be obtained using the SNR
Map as described in Section 3.2.2. As will be discussed below, in all performed experiments the
tracker never lost track of the TO, so this feature was never used.
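The decision logic just described can be summarized in the sketch below. The tolerance, the preference for the video estimate when both cues agree, and the function and parameter names are illustrative assumptions; the persistence-model prediction is assumed to be computed elsewhere from past position and velocity.

```python
def fuse_estimates(audio_loc, video_loc, predicted, tol=1):
    """Minimal fusion rule for the tracker (locations in integral-sample-delay units).

    audio_loc / video_loc -- peak of the respective localization distribution,
                             or None when that modality has no valid estimate
    predicted             -- location predicted by the persistence model
    Returns (beam_location, on_track).
    """
    cues = [loc for loc in (audio_loc, video_loc) if loc is not None]
    if not cues:
        return predicted, True                   # nothing new; keep the beam where it is
    if len(cues) == 1 or abs(cues[0] - cues[1]) <= tol:
        return cues[0] if len(cues) == 1 else video_loc, True   # single or unambiguous cue
    # Ambiguous cues: fall back on the persistence model's prediction
    for loc in cues:
        if abs(loc - predicted) <= tol:
            return loc, True
    return predicted, False                      # lost track; a new TO must be selected
```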
[Block diagram: the sound localization distribution, SNR queue, and vision localization distribution feed the location estimator; single or unambiguous cues pass directly to phase conversion, while ambiguous cues are resolved by the persistence-model predictor and comparator (on track / lost track, with the SNR map); the resulting phase is sent to the DSP.]
Figure 19: Tracker block diagram. The two localization distributions are updated
asynchronously from the respective Localization tasks.
Chapter 5 Procedure
As outlined in Section 2.3, the major contribution of this thesis is the comparison of
various methods and techniques in visually-guided beamforming and tracking to determine the
optimum system configuration; the goal of the project is to improve overall system performance
given certain quantifiable constraints. This chapter will first discuss the system configurations and
trial conditions that are the experimental variables. A discussion on the various measures used to
evaluate performance will then be presented. Finally, the actual experimental setup and procedure
will be given.
5.1 Experimental Variables
5.1.1 System Configuration
Since there are several components to the entire system, evaluation of overall system
performance must begin with evaluating each component in as close to an isolated
configuration as possible. The three main components to be examined are the beamformer,
multimodal localizers (Video and Audio), and tracker.
The microphone array (and consequently, the beamformer task) can be operated in two
modes, a single linear array and a compound multiarray. The linear array provides a finer spatial
resolution, but has a limited frequency bandwidth. The multiarray has a wider, more uniform
frequency bandwidth, but has much coarser spatial selectivity.
Most microphone array systems use only sound localization to guide the beam. This
system has both audio and visual localizers, and their individual and relative performance must be
measured. The tracker can accept input from either localizer, or both. By comparing its
performance in these three configuration modes, we can determine which localizer results in better
overall performance; we expect the visual localizer to provide more stable and accurate position
information under most trial conditions. In addition, we need to determine if both localizers are
required for optimal performance.
Finally, a second microphone array was arranged to form a planar coverage area (as
opposed to a radial coverage area with a single array) to determine how much of an improvement
the addition of more elements (for a total of 16) will provide. Section 6.2.4.7 gives more details
about the planar multiple array configuration, and Appendix C includes a design for a single array
with 16 microphone elements.
5.1.2 Trial Condition
In addition to varying the system configuration, the experimental trial condition can be
varied by changing the stimulus presentation, which mainly involves changing the location of the
speaker or the microphone array beam. The variations can be classified in two broad categories,
static and dynamic. In the static condition, the speaker is located at a fixed position. In the dynamic
condition, the speaker roams around the room in an unstructured pattern for each trial. For either
condition, only one system configuration variable is changed at a time. This results in a large
amount of data collection, but is necessary to isolate the experimental variable.
By restricting the speaker's position, the static condition allows a broad range of
experiments. The speaker may be located at various angles off the dead center array normal, and
the beam itself may be guided directly at the speaker (on-beam condition) or away from it (off-beam).
On-beam angles may correspond to integral sample delays, or involve non-integral (interpolated)
delays.
5.2 Measurements
The primary measure used for overall system evaluation and local comparisons will be the
performance, or word error rate, of the commercial speech recognizer. Intermediate measures will
be utilized to compare performance improvements within each system component. For example,
SNR will be used to compare the outputs of the single linear array and the compound array with the
controls, the close-talking and one element microphones. Performance of the localizers will be
compared by their percentages of correct position estimates.
5.2.1 SNR
One direct measure of the effect of a filter (in this case, the beamformer) is the
improvement in the signal to noise ratio (SNR) of the input signal. SNR in dB is defined to be a
log-ratio of the signal and noise variances \sigma_s^2 and \sigma_n^2, respectively:

    SNR = 10 \log_{10} \left( \frac{\sigma_s^2}{\sigma_n^2} \right)          (8)

This expression is equivalent to that discussed in Section 4.2.2, with the variance of a
zero-centered waveform being equivalent to its signal energy.
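Equation (8) can be computed directly from recorded samples, as in the sketch below; the windowing scheme (consecutive rather than fully sliding windows) is an illustrative assumption.

```python
import numpy as np

def snr_db_from_waveform(speech, noise, win=None):
    """SNR in dB as the log-ratio of signal and noise variances (Equation 8).

    speech -- samples containing the utterance
    noise  -- a segment recorded before the utterance begins
    win    -- optional window length in samples; if given, the peak variance over
              consecutive windows is returned instead of the overall variance
    """
    noise_var = np.var(noise)
    if win is None or len(speech) <= win:
        sig_var = np.var(speech)
    else:
        sig_var = max(np.var(speech[i:i + win]) for i in range(0, len(speech) - win, win))
    return 10 * np.log10(sig_var / noise_var)
```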
5.2.2 Position Estimates
The tracker and localizer tasks maintain a log of detected or tracked positions of object
candidates on a per frame or per iteration basis. The visual and audio localizers perform
computations on every frame, where frame is defined to be an image for the former and 185ms of
sound samples for the latter. The tracker task computes the location of the TO each iteration, every
100 ms.
In the static trial condition, the location of the speaker is known and fixed. By speaking
and slightly moving in place, both the audio and video localizers will have valid position estimates.
A simple measure of the percentage of estimates that coincide with the known location can be used
to compare performance.
5.2.3 Word Error Rate
Performance of the speech recognition software ViaVoice is measured by word error rate,
defined to be the ratio of erroneous words to total words. Erroneous words are defined to be words
that are incorrectly added, omitted, or substituted by the speech recognition software [33].
We are concerned more with relative differences in WER between various configurations,
as opposed to absolute values; absolute levels of performance can be improved by more training,
better recognition software, etc.
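For reference, the WER definition corresponds to a standard word-level Levenshtein alignment, sketched below. The actual scoring in this work is performed with the NIST sctk toolkit (see Section 6.2.4), not this code.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / number of reference words,
    computed with a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edit cost aligning the first i reference words
    # to the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# word_error_rate("this is a test", "this is test ok")  -> 0.5 (one deletion, one insertion)
```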
5.2.4 Method and Controls
For each measure, a mechanism must be defined to identify and establish controls for the
component variables. Table 3 lists the methods for collecting data as well as the specific control
used for the measurement. To ensure that the same sound stimulus is present for both the control
and variable, and to allow off-line analysis, whenever possible all speech output from the control
microphones and the array were simultaneously and digitally recorded for later playback to the
speech recognition software. Note, however, that due to physical limitations multiple trials are
required; the linear and multi-array configurations cannot be tested simultaneously, as they require
a physical reconfiguration of the array. In this case the multiarray trial was performed immediately
after the main trial (controls and linear array). As will be described in Chapter 6, a test was
performed to determine whether it is (statistically) justifiable to make comparisons across trials.
Measurement      | Control                                                                   | Method
SNR              | Headset microphone close to speaker and one microphone element of array. | Simultaneous recording of controls and linear array output. Multiarray output is recorded separately.
Positional Error | Predetermined, fixed positions at various angles and distances.           | Discrete markings on floor. For localization comparisons, remain fixed in position.
Word Error Rate  | Predetermined text. Single close microphone.                              | Simultaneously record single and linear array output of recitals of the same text. Multiarray output is recorded separately.

Table 3: Procedure and control for each performance measurement.
5.3 Experimental Setup
For most trials, a single microphone array is mounted on the wall of a medium-sized
laboratory room (approximately 20'x30') at approximately eye level at a height of 63". The
camera's field of view covers practically the entire workspace. There are several background noise
sources, including the air conditioning unit and cooling fans of a large rack-mounted computer
system. The speaker is always the same person and the speech is read from a set of 39 untrained
and previously trained sentences. In the static case the speaker is located approximately five feet
from the array. In the dynamic condition the speaker is free to move around the room, but always
faces the array when speaking. Figure 20 graphically represents the experimental setup for the
static condition.
The speech recognition system, ViaVoice, requires substantial training for optimum
performance; to reduce the speaker dependency of the results, only the bare minimum of the entire
training set was used to train the speech recognition software. This does not affect relative
comparisons of WER performance, as all experimental trials will be performed at the same level of
training. Obviously, with more training, the actual WER values of all trials will be lower.
The reduced training set of sentences taken from the user enrollment portion of the
ViaVoice software was applied to three cases, one using the normal headset (close-talking)
microphone and the other two using the microphone array outputs. Since the microphone arrays
could be configured in two ways, linear and multiarray, separate training must be performed on
each configuration.
32"
60"
00
10.40
15.70
32.70
array @ 63" height
Figure 20: Experimental setup for single array, static condition. Arrows indicate speaker's
facing direction.
Even though the speech recognition software is capable of real-time processing, for reasons
discussed above the various system and control outputs are processed off-line using digital
recordings. Every other component of the system (the beamformer, localizers, and tracker) was
tested under real-time conditions.
For the static condition trials, discrete positions were marked on the floor five feet away
from the array, corresponding to the angles in the third column of Table 2. SNR, word error rate,
and positional error measurements are obtained from these fixed positions.
For the dynamic condition trials a free-form scenario, with the speaker roaming around the
environment while speaking, was used. Since there would be too many variables for each trial, no
attempt was made to fix the path, though a general pattern of (azimuthal) back-and-forth movement
was used.
Chapter 6 Results
6.1 Introduction
As was mentioned in Section 5.1.2, measurements were taken in two general experimental
trial conditions, static and dynamic. The static condition refers to measurements taken when the
sound source was spatially localized (i.e., stationary). Relevant quantitative measurements for the
static condition include sound and visual localization output, signal-to-noise ratio, and word error
rate. The dynamic condition refers to comparisons and measurements taken when the sound source
is in motion. The primary quantitative measurement for the dynamic condition is the word error
rate.
This chapter is organized as follows: comparisons and measurements for the static
condition are given first, followed by those for the dynamic condition. The outputs of the audio and
visual localizers are quantitatively compared. The localizer outputs are also fed separately to the
tracker and the relative performance is also measured. SNR and WER measurements are then used
to quantitatively compare different variations of the microphone array. For the dynamic condition
the audio-only, video-only, and integrated tracker outputs are compared. The combined tracker
output is also examined to determine the relative contribution of each modality to the final
estimate. Finally, WER measurements in the dynamic condition are quantitatively compared and a
final performance evaluation is made.
6.2 Static Condition
6.2.1 Signal-to-Noise Ratio
Signal-to-noise ratio (SNR) was defined in Section 5.2.1 to be the ratio (in dB) of the
signal and noise variances. It is a useful quantity to compare the signal quality of the different
microphone configurations.
The noise variance is computed from small segments of each sample before the speech
utterance begins. The signal variance is computed using an 11 ms sliding window across the sample
in the presence of speech, in a manner as described in Section 4.2.2. For each case, a single five
second utterance ("This is a test.") was used. Table 4 summarizes the SNR results for each
configuration, and Appendix A provides spectrograms for each utterance.
             | Close-talking Mic. | One Element Mic. | Linear Array | Multiarray | Multiple Linear Arrays | Multiple Multiarrays
SNR Max (dB) | 52.04              | 24.33            | 26.78        | 23.40      | 29.79                  | 27.11
SNR Avg (dB) | 23.17              | 10.76            | 17.33        | 17.06      | 18.48                  | 17.87

Table 4: SNR (dB) data for each microphone configuration, with the utterance "This
is a test."
As expected, the close-talking microphone has the best average and peak SNR, followed
by the linear array. It is expected that the linear array produces better SNR than the multiarray, as
the former has a narrower beam; the multiarray, with its wider beam, would pick up more noise
from a greater spatial area than the linear array. The one element microphone should produce the
lowest SNR values, and it is somewhat unexpected that it has a higher peak SNR than the
multiarray; but the mean values are more in-line with what we expect.
Two additional sets of data, corresponding to the multiple array configurations described in
Section 6.2.4.7, are also given. Having more microphones directed at the target should increase the
SNR and improve performance. For both the linear array and the multiarray, doubling the number
of elements has a marginal effect in improving SNR, but not enough to match the performance of
the close talking microphone. Section 6.2.4.7 has more details on the multiple array configurations.
6.2.2 Localization Output
As mentioned in Section 5.3, to test and compare the two localizers, a simple stimulus
pattern (the speaker moving to and speaking at three locations corresponding to the integral sample
delay angles, in order, 0, 15.7, and 32.7 degrees) was simultaneously provided to the audio and
visual localizers. Figure 21 shows the output of the audio localizer, while Figure 22 is the output
of the visual localizer. For both, the data points are in units of integral sample delays, which
represent the normalized location of the speaker at the particular iteration. The visual image pixel
coordinates have been converted into corresponding integral delays using a static look-up table.
Negative sample delays indicate left of dead center. The time scales (x-axis) of each plot are
different, as the two localizers run independently in separate tasks with different timing conditions.
Iterations with no data points indicate no valid localization information (i.e., no speech sounds or
no movement) at that particular moment.
Figure 21: Normalized audio localization
scatterplot. The y-axis is in units of integral
sample delays (representing beamformed
angles, with negative delay values
corresponding to angles to the left of the
array normal) and the x-axis is the iteration
number of the localizer and roughly
corresponds to time. Red points indicate the
output of the localizer, and green points
indicate actual position.
Figure 22: Normalized visual localization
scatterplot. The y-axis is in units of integral
sample delays (representing beamformed
angles, with negative delay values
corresponding to angles to the left of the
array normal) and the x-axis is the localizer
iteration number. Red points indicate the
output of the localizer, and green points
indicate actual position.
It is evident from looking at the two plots that while both seem to be in general alignment
and agreement in terms of locating the speaker at the three indicated locations, the audio localizer
seems to be considerably noisier in terms of spurious location estimates. This suggests that the
visual localizer, at least in the case of non-occluded TOs with simple movement patterns, is all that
is required for tracking.
Audio Localizer % correct | Video Localizer % correct
43.2%                     | 72.1%

Table 5: Localizer accuracy. Numbers indicate the percentage of localizer output
that correctly identifies the location of the target. (See Figure 21 and
Figure 22.)
Table 5 lists the accuracy of each localizer for this particular trial. Each respective
localizer output is compared to the actual or known location of the speaker during the trial and a
percentage of correctly identified target location is computed. It is evident from both the graphical
and quantitative representations that the video localizer is considerably more accurate than the
audio localizer.
6.2.3 Tracker Output
The localizer outputs were then sent to the tracker separately. Figure 23 shows the output
of the audio-only tracker. Compared to the audio localizer output shown in Figure 21, there is a
definite "smoothening," due to the integrating properties of the tracker, which is necessary in
directing the microphone array in a stable manner. At least the first two positions (0 and 15.7
degrees corresponding to delays of 0 and -3, respectively) are clearly evident. The third position
(32.7 degrees, or a delay of -6) also appears near the end of the plot, although it is corrupted by
other noise. Figure 24 shows the output of the visual tracker. Again the raw localizer output is
smoothed, although the video localizer output (see Figure 22) was much smoother to begin with
than the audio localizer output. The output of the video-only tracker coincides with the actual
stimulus pattern much closer than the audio-only tracker.
Figure 23: Audio-only normalized tracker output. The y-axis is in units of integral
sample delays (representing beamformed angles, with negative delay values
corresponding to angles to the left of the array normal) and the x-axis is the
tracker iteration number. Red points indicate the output of the tracker, and
green points indicate actual position.

Figure 24: Visual-only tracker output. The y-axis is in units of integral sample
delays (representing beamformed angles, with negative delay values
corresponding to angles to the left of the array normal) and the x-axis is the
tracker iteration number. Red points indicate the output of the tracker, and
green points indicate actual position.

Audio Tracker % correct | Video Tracker % correct
66.1%                   | 84.3%

Table 6: Tracker accuracy. Numbers indicate the percentage of tracker output that
correctly identifies the location of the target. (See Figure 23 and Figure 24.)
Table 6 lists the accuracy of each tracker for this particular trial. The percentage of
correctly tracked target location is computed in a manner similar to that of the localizer outputs
above. While the relative performance of the audio-only tracker compared to the video-only tracker
is much better than with the audio and video localizers, it is still clear from both the graphical and
quantitative representations that the video-only tracker is considerably more accurate and stable.
The performance of the combined or integrated tracker in a dynamic condition is examined in
Section 6.3.1.
6.2.4 WER Data
Word error rate (WER) is a ratio of the number of (incorrect) word additions, deletions,
and substitutions over the total number of words in the original speech. Nineteen sentences from
the training set (312 words) and twenty untrained sentences (287 words) were used in all
experiments. To calculate the WER for all combinations of configurations, a speech recognition
software scoring program was used. NIST provides the sctk toolkit to do the scoring in a standard
manner [34].
It is useful to separate previously trained and untrained utterances in measuring
performance, since speech recognition software naturally performs better on utterances on which
it has already been trained. In what follows, all WER measurements are divided into three
categories: trained, untrained, and total.
In any quantitative comparison of stochastic systems, it is important to be aware of the
statistical significance of any measurements. Significance tests start with the null hypothesis (H0)
that there is no performance difference between the configurations being compared. The test then
performs a specific comparison between the measurements and computes a "p" value, which is
defined to be Pr(data | H0), the probability of the observed (or more extreme) data given the null
hypothesis [35]. The lower the value of p, the more likely that the null hypothesis can be rejected
and that the difference between the observed measurement and the configuration with which it is
being compared is significant [36]. The sctk toolkit includes the Matched Pairs Sentence-Segment
Word Error (MAPSSWE) test, a statistical significance test, similar to the t-test, that
compares the number of errors occurring in either whole utterances or segments of utterances. The
MAPSSWE test uses the somewhat standard thresholds of p=.001, p=.01, and p=.05 to determine the
level of significance; measurements with a p value greater than .05 are considered statistically
similar [37].
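The MAPSSWE test itself is provided by sctk; as a rough illustration of the matched-pairs idea only (not the NIST implementation), per-segment error counts for two configurations can be compared with a paired t-test:

```python
from scipy import stats

def matched_pairs_p(errors_a, errors_b):
    """Approximate matched-pairs significance test on per-segment error counts.

    errors_a / errors_b -- number of word errors each system made on the same
                           utterance segments.
    Returns the two-sided p value; a small p suggests a significant difference.
    """
    t_stat, p_value = stats.ttest_rel(errors_a, errors_b)
    return p_value

# e.g., matched_pairs_p([3, 0, 2, 5, 1], [1, 0, 1, 2, 0])
```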
Section 5.2.4 discussed the necessity, due to experimental constraints, of testing whether
there is any statistically significant difference between successive trials of utterances, with the
system configuration and as much of the environmental conditions as possible kept constant.
Section 6.2.4.2 below presents the justification for assuming speaker dependent changes in the
utterance across trials do not significantly alter the results and conclusions of WER based
comparisons.
6.2.4.1 Control Cases
The close-talking headset microphone that was packaged with ViaVoice was used to
provide the baseline control for all word error rate measurements. The performance for this control
should be the highest of all experimental results, and represents the best possible case; the eventual
goal for beamformed output performance is to match this level.
Trained Set | Untrained Set | Total
16.5%       | 22.0%         | 19.2%

Table 7: WER for the control case. The control case is speech taken from a
close-talking headset microphone.
As expected, the trained set results in better performance than the untrained set. Note
however that even for the trained set in ideal conditions, the errors are substantial.
Another way of testing the effect of beamforming is to compare performance with a single
element microphone. The performance for this control represents the worst case, and should be the
lowest of all experimental results. For the following result a single microphone from the array was
used to record speech.
Trained Set | Untrained Set | Total
45.1%       | 63.9%         | 54.1%

Table 8: WER for the single element case. The speaker was located at dead center
(angle 0). A single microphone closest to the center of the array was
chosen.
Again, the untrained set leads to greater errors than the trained, though for both cases the
output is almost unintelligible. Together, the close-talking headset and single element microphone
cases provide a range within which the various configurations of the microphone array can be
compared.
6.2.4.2 Trial Variation
Since it is impossible with the current setup to take all the data for every configuration and
condition variations simultaneously from one speech trial, multiple trials must be performed, each
time changing a single configuration or condition. To see if there is a statistically significant
difference in the speaker generated speech (and background noise) across trials, two trials using the
linear array while fixing all other controllable variables were performed. Figure 25 shows that
there is no significant difference across trials for this particular case.
[Bar charts: WER for the original and repeated linear array trials, for the trained, untrained, and total sets.]
Figure 25: WER data for linear array for two trials: trained, untrained, and total
data sets, respectively. For each chart, the left bar represents the original
trial and the right bar represents the second trial. The gray bars indicate
no statistical difference in the results across trials.
Of course, we cannot state with absolute confidence that there is no difference
across trials, as human speech is extremely variable across even a short span of time. Furthermore,
background noise is a constantly changing element that may also induce variations. The best that
can be done is to perform as few trials as necessary to minimize such differences.
6.2.4.3 Configuration Variations
There are two major array configurations that are examined, the linear array and the
multiarray. (A third variation, the multiple microphone arrays, will be discussed separately in
Section 6.2.4.7) We compare performance of each with respect to the two controls mentioned
above, as well as with each other (See Table 9 and Figure 26-Figure 28.)
Configuration | Trained Set | Untrained Set | Total
Linear Array  | 31.1%       | 30.4%         | 30.8%
Multiarray    | 24.8%       | 31.7%         | 28.1%

Table 9: WER for linear array vs. multiarray configuration. In both cases, the
speaker was located dead center from the array.
Looking at Figure 26, we can see that on all data sets (trained, untrained, and total), the
performance of the linear array is in between that of the two controls, and that the differences
between the three are statistically significant at least to the p=0.1 level.
The multiarray performs similarly (Figure 27). An interesting point is that the performance
of the multiarray is somewhat closer to that of the close-talking microphone compared to the linear
array.
[Bar charts: WER for the close-talking microphone, linear array, and one-element microphone, for the trained, untrained, and total sets.]
Figure 26: WER data for linear array: trained, untrained, and total data sets,
respectively. For each chart, the bars are arranged in the order: close
talking microphone, linear array, and one microphone element. The
middle gray value (linear array) is the basis for statistical significance
testing. Red, green, and blue bars indicate a significance level of
p<=0.001, p<=0.01, and p<=0.05, respectively.
[Bar charts: WER for the close-talking microphone, multiarray, and one-element microphone, for the trained, untrained, and total sets.]
Figure 27: WER data for multiarray: trained, untrained, and total data sets,
respectively. For each chart, the bars are arranged in the order: close
talking microphone, multiarray, and one microphone element. The
middle gray value (multiarray) is the basis for statistical significance
testing. Red (grid), green (vertical lined), and blue (horizontal lined)
bars indicate a significance level of p<=0.001, p<=0.01, and p<=0.05,
respectively.
It is not evident from just Table 9, however, which array configuration performs better.
Performing the MAPSSWE test (Figure 28) reveals that statistically, there is no difference in the
performance of the two configurations, at least for the case of a stationary, on-beam, dead-center
sound source.
[Bar charts: WER for the linear array vs. the multiarray, for the trained, untrained, and total sets.]
Figure 28: WER data for linear array vs. multiarray: trained, untrained, and total
data sets, respectively. For each chart, the left bar represents the linear
array and the right bar represents the multiarray. Statistically, there is no
difference between the two in performance.
6.2.4.4 Angular Variations
Both the linear array and multiarray were tested at various on-beam conditions, where the
guided beam and the physical location of the speaker were aligned. Three angles (0°, 15.7°, and
32.7°), representing integral sample delays for the linear array and two of the multiarray subarrays,
were chosen, and the results are given in Table 10 and Table 11. We would expect the results to be
roughly similar regardless of angle.
Angle | Trained Set | Untrained Set | Total
0°    | 31.1%       | 30.4%         | 30.8%
15.7° | 30.1%       | 38.5%         | 34.1%
32.7° | 27.5%       | 30.1%         | 28.7%

Table 10: WER for on-beam, linear array. The three angles represent integral
sample delays.
Angle | Trained Set | Untrained Set | Total
0°    | 24.8%       | 31.7%         | 28.1%
15.7° | 28.5%       | 27.7%         | 28.1%
32.7° | 32.5%       | 31.8%         | 32.2%

Table 11: WER for on-beam, multiarray. The three angles correspond to integral
sample delays for the linear array as well as the A2 (mid frequency) and
A3 (low frequency) subarrays of the multiarray.
Figure 29 indicates that for all data sets the linear array performs similarly, which is
expected. Likewise, Figure 30 indicates that in general the multiarray performs similarly across
angles and data sets. However, in one specific case, angle 32.7° on trained data, there is a
difference in performance with respect to the baseline dead center (angle 0°) case. This can
possibly be explained by the fact that the beamwidth is proportional to the directed angle [26]; as
the angle increases, more interference noise will enter the beamformed signal. However, the
statistical significance of the difference is at a level of p=0.05, which is near the usual threshold of
what is considered to be significant; given that all other cases in this configuration are statistically
similar, no firm conclusion about the difference in performance can be made.
[Bar charts: WER for the linear array at angles 0°, 15.7°, and 32.7°, for the trained, untrained, and total sets.]
Figure 29: WER data for linear array at different angles: trained, untrained, and
total data sets, respectively. For each chart, the bars are arranged in the
order: 0°, 15.7°, and 32.7°. Statistically, there is no difference in
performance for the three angles for both data sets.
[Bar charts: WER for the multiarray at angles 0°, 15.7°, and 32.7°, for the trained, untrained, and total sets.]
Figure 30: WER data for multiarray at different angles: trained, untrained, and
total data sets, respectively. For each chart, the bars are arranged in the
order: 0°, 15.7°, and 32.7°. In the trained data set, there is a statistically
significant (at p<=0.1) difference between the performance at angle 0°
and at angle 32.7°. Otherwise, there is no statistically significant
difference between the angles in both data sets.
6.2.4.5 Beamforming Variation
One way of testing the efficacy of beamforming is to compare performance between results
from the on-beam case and the off-beam case. Doing so also qualitatively tests the spatial
selectivity of the beam. We expect to see a sharp drop in performance (i.e., increased WER) for the
linear array, which has a tighter beam than the multiarray.
Case     | Trained Set | Untrained Set | Total
On-beam  | 31.1%       | 30.4%         | 30.8%
Off-beam | 65.9%       | 68.1%         | 66.9%

Table 12: WER for linear array, on and off-beam. For both cases, the beam was
guided to dead center (angle 0°). For the off-beam case the speaker was
located at angle 15.7°.
Case     | Trained Set | Untrained Set | Total
On-beam  | 24.8%       | 31.7%         | 28.1%
Off-beam | 30.2%       | 39.2%         | 34.6%

Table 13: WER for multiarray, on and off-beam. For both cases, the beam was
guided to dead center (angle 0°). For the off-beam case the speaker was
located at angle 15.7°.
Comparing the results in Table 12 and Table 13 confirms our expectations; there is a
significant drop-off in performance on all data sets between the on-beam and off-beam case for the
linear array. The drop-off for the multiarray is not as significant (p<=0.1), due to the broad beam of
the multiarray, and is also expected.
[Bar charts: WER for the linear array, on-beam vs. off-beam, for the trained, untrained, and total sets.]
Figure 31: WER data for linear array, on-beam and off-beam: trained, untrained,
and total data sets, respectively. For each chart, the left bar represents
the on-beam case and the right bar represents the off-beam case. Red
(grid) bar indicates significance level of p<=0.001.
[Bar charts: WER for the multiarray, on-beam vs. off-beam, for the trained, untrained, and total sets.]
Figure 32: WER data for multiarray, on-beam and off-beam: trained, untrained,
and total data sets, respectively. For each chart, the left bar represents
the on-beam case and the right bar represents the off-beam case. Blue
(horizontal lined) bar indicates significance level of p<=0.05.
6.2.4.6 Interpolation Variation
In Section 4.2.1.2, it is claimed that interpolation is necessary to obtain intermediary
angles corresponding to non-integral sample delays. To determine whether such interpolation is
really necessary, WER results from a beamform angle corresponding to a non-integral sample
delay (10.4°) were compared with those from a location corresponding to the nearest integral sample
delay (15.7°; see Table 14). We expect the non-interpolated WER to be significantly higher than
that of the interpolated results.
Since there will always be some amount of interpolation in the multiarray case (subarray
Al always has interpolation), only the linear array was examined.
Case             | Trained Set | Untrained Set | Total
Interpolated     | 31.5%       | 32.5%         | 32.0%
Non-Interpolated | 44.5%       | 40.0%         | 42.4%

Table 14: WER for the interpolated vs. non-interpolated linear array case. For both
cases, the speaker was located at angle 10.4°. For the former, the beam
was guided to the same angle, while for the latter, the beam was guided
to the closest angle possible (15.7°).
[Bar charts: WER for the linear array, interpolated vs. non-interpolated, for the trained, untrained, and total sets.]
Figure 33: WER data for linear array, interpolated and non-interpolated: trained,
untrained, and total data sets, respectively. For each chart, the left bar
represents the interpolated case and the right bar represents the
non-interpolated case. Green (vertical lined) bar indicates significance
level of p<=0.01.
Figure 33 indicates that there is a significant difference (p<=0.01) between the interpolated
and non-interpolated case, which is as expected. Thus interpolation is required to obtain better
performance at non-integral sample delay angles.
6.2.4.7 Multiple Array Configuration
Obviously, as the number of microphone elements increases, the SNR and hence the WER
performance should improve. One extra configuration that was examined was that of two
eight-element microphone arrays arranged at right angles (see Figure 34). Each array is an independent
entity, with its own DSP board and beamformer. Both multiple linear arrays and multiple
multiarrays were examined. To simplify data collection, only a single location, the point where the
dead center normals of the two arrays intersect, was tested. More rigorous experimentation will be
left for future work.
16"
32"
-
!4
48"
arrays @ 63" height
Figure 34: Multiple array configuration. The speaker is located at the
intersection of the dead center normals of both arrays and is facing at 45
degrees from both.
Configuration          | Trained Set | Untrained Set | Total
Linear Array           | 31.1%       | 30.4%         | 30.8%
Multiple Linear Arrays | 17.1%       | 31.1%         | 23.8%

Table 15: WER for linear array vs. multiple linear arrays configuration. In the
linear array case the speaker was pointed directly towards the single
array. In the latter case the speaker was pointed in the direction 45
degrees off dead center for both arrays.
Configuration        | Trained Set | Untrained Set | Total
Multiarray           | 31.7%       | 30.4%         | 28.1%
Multiple Multiarrays | 33.3%       | 31.7%         | 29.2%

Table 16: WER for multiarray vs. multiple multiarrays configuration. In the
multiarray case the speaker was pointed directly towards the single array.
In the latter case the speaker was pointed in the direction 45 degrees off
dead center for both arrays.
As expected, having multiple linear arrays improves performance as compared to a single
linear array. However, the performance gain seems to occur only in the trained data set.
Furthermore, there does not appear to be a corresponding improvement for the multiarray
configurations; neither the trained nor the untrained data sets are significantly different. There are a
few possible explanations for the above results.
First, the current system was not designed to incorporate multiple arrays using independent
computing platforms. While it is easy to add localizer and tracker tasks to the system, adding an
additional beamformer task requires that the additional DSP board be added to the same host as the
original board. In addition, a DSP board based synchronizer needs to be implemented to
synchronize between the 16 or more channels that are simultaneously being captured. Currently,
only a simple post processing synchronizer has been implemented.
[Bar charts: WER for the linear array vs. multiple linear arrays, for the trained, untrained, and total sets.]
Figure 35: WER for linear array vs multiple linear arrays case: trained,
untrained, and total data sets, respectively. For each chart, the left bar
represents the single linear array and the right bar represents the
multiple linear arrays. The red (grid) bar indicates significance level of
p<=0.001. The blue (horizontal lined) bar indicates significance level of
p<=0.05.

[Bar charts: WER for the multiarray vs. multiple multiarrays, for the trained, untrained, and total sets.]
Figure 36: WER for multiarray vs multiple multiarrays case: trained, untrained,
and total data sets, respectively. For each chart, the left bar represents
the single multiarray and the right bar represents the multiple
multiarrays.
6.2.5 Summary
In general, most static condition results given above have been as expected. It should be
noted that given current speech recognition technology, totally perfect recognition is an
unattainable goal even with the best microphone configuration. Best performance, measured in
SNR as well as WER, is obtained using the headset microphone, followed by the linear array and
the multiarray. The single element microphone is practically useless in a dynamic acoustic
environment.
Both the linear array and multiarray have comparable performance on a straight on-beam
situation; both perform approximately 50% worse than the "ideal" close talking microphone.
Performance as measured in WER is not dependent on angle for either array, and is significantly
better for trained data sets than untrained.
It has been confirmed that the multiarray is much less spatially selective, as was predicted
by the calculations in Section 3.1.4. This is both an advantage and disadvantage. The multiarray is
more tolerant than the linear array with off-beam sound sources, which is helpful if the TO is not
exactly on-beam, but harmful if other noise sources are nearby. The lower SNR of the multiarray
compared to the linear array is another manifestation of this characteristic.
Not much analysis can be made regarding the multiple array configurations at the present
time. At least for the multiple linear array case there appears to be evidence that performance is
enhanced with two arrays over one. However, more sophisticated synchronization mechanisms are
necessary before more conclusive statements can be made.
6.3 Dynamic Condition
In the static condition, measuring system performance is made easy by the fact that the
speaker is stationary at well-defined and ideal locations (corresponding to integral sample delays).
In the dynamic condition, a single speaker moves in front of the linear array or the multiarray in a
free-form manner (roaming) to simulate a person walking around the room and dictating. The
speech was taken from the same trained and untrained data sets of the static condition experiments.
Unless the tracker perfectly tracks the speaker, errors will be introduced to the speech recognition
on the order of the static non-interpolated or off-beam conditions.
6.3.1 Tracker Output
In the combined or integrated track mode, the tracker incorporates both audio and visual
localizer information to generate a target location estimate in a manner described in Section 4.4.
Figure 37 and Figure 38 show portions of the integrated tracker output, the normalized direction
of the beam, for two roaming trials with a linear array and multiarray, respectively. The tracker
output is overlaid with the outputs of the individual modal trackers.
Figure 37: Integrated tracker output with a linear array. The y-axis is in units of
integral sample delays (representing beamformed angles, with negative delay
values corresponding to angles to the left of the array normal) and the x-axis is
the tracker iteration number. Red points indicate the audio localizer based
tracker output, blue points indicate video localizer based tracker output, and
the purple line indicates the actual tracked output.

Figure 38: Integrated tracker output with a multiarray. The y-axis is in units of
integral sample delays (representing beamformed angles, with negative delay
values corresponding to angles to the left of the array normal) and the x-axis is
the tracker iteration number. Red points indicate the audio localizer based
tracker output, blue points indicate video localizer based tracker output, and
the purple line indicates the actual tracked output.
It is evident from the plots that, for the most part, the integrated tracker used the output of
the video localizer based tracker. There are also portions where both the audio and video based
trackers agree. Table 17 lists the actual percentages of the individual modal tracker usage by the
integrated tracker in the two roaming trials.
Configuration | Video used | Audio used | Video only | Audio only
Linear array  | 97.8%      | 29.3%      | 69.5%      | 1%
Multiarray    | 98.1%      | 22.4%      | 76.4%      | 0.64%

Table 17: Modal tracker usage by the integrated tracker.
Only a small percentage of integrated tracker positions were determined solely by the
audio localizer based tracker, though it did provide corroborating information a considerable
percentage of the time.
In the final analysis, the video localizer based tracker works remarkably well, to the point
that for the most part the audio based tracker is redundant or unnecessary. Further experiments are
necessary, with more variations in stimuli, to determine conclusively the relative merits of each
modal tracker.
6.3.2 WER Data
Table 18 gives the WER scores for a roaming speaker and a linear array under three
tracker configurations: audio-only, video-only, and integrated. The same trained and untrained sets
from the static condition experiments are used. It is expected that all WER values will be similar or
worse than those obtained in the static condition experiments, as there are additional sources of
error that will cause speech signal degradation; a mismatch between the actual location of the
speaker and the array beam leads to increased recognition errors. In any tracking system with a
moving target, tracking delays or outright errors will cause such mismatches.
We further expect the combined and video-only tracker performances to be similar, as the
previous section showed that the video localizer has the greatest contribution to the integrated
tracker output. In addition, the audio-only tracker performance should be significantly worse. We
expect the resulting increased WER values to be within the range of values for the static non-interpolated or static off-beam cases (see Sections 6.2.4.5 and 6.2.4.6).
Configuration   | Trained Set | Untrained Set | Total
Audio info only | 55.3%       | 58.2%         | 56.7%
Video info only | 30.0%       | 36.6%         | 33.0%
Combined info   | 34.3%       | 37.0%         | 35.6%

Table 18: WER for linear array with roaming speaker, with audio only, video
only, and combined information for the tracker. Input from the trained,
untrained, and total sets.
As expected, the video only and integrated tracker cases are statistically similar (See
Figure 39), and the audio tracker case is significantly worse. The reason for this is evident from the
sample audio tracker output plot (Figure 23), which shows an output that occasionally strays very
far from the actual speaker location. The video and combined tracker outputs, on the other hand,
are for the most part stable and show the tracker correctly following the speaker; consequently the
WER values are close to that of the linear array in the static condition.
[Bar charts: WER for the linear array with a roaming speaker, comparing audio-only, video-only, and combined tracker input, for the trained, untrained, and total sets.]
Figure 39: WER for linear array with roaming speaker and audio only, video
only, and combined input to tracker. The charts are arranged in the
order: trained, untrained, and total data sets. For each chart, the left bar
represents the audio only case, the middle bar represents the video only
case, and the right bar represents the combined case. The red (grid) bar
indicates significance level of p<=0.001.
Table 19 gives the WER scores for a roaming speaker and the multiarray under the three
tracker configurations. We expect similar trends to that of the linear array, with values higher than
that of the multiarray in the static condition.
Configuration    | Trained Set | Untrained Set | Total
Audio info only  | 42.9%       | 51.1%         | 46.7%
Visual info only | 36.5%       | 44.6%         | 40.3%
Combined info    | 35.9%       | 43.3%         | 39.3%

Table 19: WER for multiarray with roaming speaker, with audio only, video
only, and combined information for the tracker. Input from the trained,
untrained, and total sets.
Figure 40 shows that while the video and combined trackers perform better than the
audio-only tracker, all three tracker outputs are similar across all data sets, at least statistically. It
appears that the multiarray is less sensitive to the tracking errors of the audio tracker than the linear
array. On the other hand, the video-only and combined trackers seem to perform worse than with
the linear array.
[Bar charts: WER for the multiarray with a roaming speaker, comparing audio-only, video-only, and combined tracker input, for the trained, untrained, and total sets.]
Figure 40: WER for multiarray with roaming speaker and audio only, video
only, and combined input to tracker. The charts are arranged in the
order: trained, untrained, and total data sets. For each chart, the left bar
represents the audio only case, the middle bar represents the video only
case, and the right bar represents the combined case.
Figure 41 provides a direct comparison between the linear array and the multiarray for a
roaming speaker. The results are mixed; the multiarray appears to perform statistically better than
the linear array with an audio-only tracker, but worse with a video-only tracker. The multiarray
combined tracker performs similarly to the linear array combined tracker.
[Figure 41 bar charts: WER for the linear array versus the multiarray with a roaming speaker; audio-only, video-only, and combined tracker panels.]
Figure 41: WER for linear array vs multiarray case with roaming speaker. The
charts are arranged in the order: audio only, video only, and combined
input to tracker. For each chart, the left bar represents the linear array
result, and the right bar presents the multiarray result. The blue
(horizontal lined) bar indicates significance level of p<=0.05.
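The significance markings in Figures 39 through 41 rest on matched-pairs comparisons of the kind discussed in [36] and [37]. As a rough, purely illustrative sketch of how two configurations can be compared on the same utterances (this is not the exact test behind the figures, and every name and number below is invented), a paired bootstrap on per-utterance error counts looks as follows:

# Paired bootstrap comparison of two systems scored on the same utterances.
# Illustrative only; not the exact significance test used for Figures 39-41.
import numpy as np

def paired_bootstrap_p(errors_a, errors_b, n_resamples=10000, seed=0):
    """Two-sided p-value for the null hypothesis that systems A and B
    have equal mean per-utterance error counts."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(errors_a, float) - np.asarray(errors_b, float)
    observed = diff.mean()
    extreme = 0
    for _ in range(n_resamples):
        sample = rng.choice(diff, size=len(diff), replace=True)
        # Recentering on the observed mean simulates the null of no difference.
        if abs(sample.mean() - observed) >= abs(observed):
            extreme += 1
    return extreme / n_resamples

# Hypothetical per-utterance word error counts for two array configurations.
rng = np.random.default_rng(1)
errors_linear = rng.poisson(3.0, size=39)
errors_multi = rng.poisson(2.5, size=39)
print(paired_bootstrap_p(errors_linear, errors_multi))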
6.3.3 Summary
In comparing the two array configurations, we are basically comparing the tradeoff
between two inversely related array characteristics, spatial selectivity and frequency bandwidth.
The multiarray has a broader frequency response but less spatial selectivity due to a wider beam.
The linear array has a narrower frequency response but better spatial selectivity.
The performance of the different trackers determines which side of this tradeoff is more favorable. With
the less stable, error-prone output of the audio tracker, the broader beam and lower spatial
selectivity are an advantage, and the multiarray performs better than the linear array. With the more
stable video tracker, better spatial selectivity is more advantageous in eliminating interfering
background noise, so the linear array performs better. Ideally, the combined tracker should perform
better than either of the individual trackers, but in practice it performs at least as well as the best
individual tracker (in this case the video tracker).
6.4 Overall Summary
For many experiments there were no clear or statistically significant advantages for either
array configuration. For others, the multiarray had a slight advantage, primarily due to its broader
beam and coarser spatial resolution (see Section 6.2.4.5). In a typical real-world application, with a moving speaker
and the currently implemented system, the multiarray is the best choice. Improvements in the
localizers and trackers, or an increase in the number of microphone elements, may change the relative
merits of each configuration, and further experiments are necessary.
In absolute terms, no configuration, including the close-talking microphone, currently
performs at a level acceptable for real-world applications. However, in a realistic deployment
more time and effort would be expended to tailor the system to the particular speaker; the speech
recognition software can be trained much more extensively to improve overall recognition scores.
6.5 Additional/Future Work
A DSP board level synchronizer will be implemented to allow more rigorous testing with
multiple arrays. In addition, various configurations of multiple arrays (right angle, planar, etc.) will
be examined. More experiments can also be performed to test other aspects of the system. For
example, more complex roaming patterns and additional sources of visual and audio noise are
possible.
In the analysis of the microphone array response, the array shading coefficients a_i in
Equation (4) have been assumed to be constant and unity, which is equivalent to a rectangular-window
FIR filter. Again from signal processing theory, the best tradeoff between beamwidth and
sidelobe level using static coefficients is obtained with coefficients based on the
Chebyshev window [6].
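As an illustration of the shading idea, the sketch below generates Chebyshev shading coefficients with SciPy and applies them in a simple delay-and-sum beamformer. This is not code from the implemented system; the element count, spacing, steering convention, and 40 dB sidelobe target are assumptions chosen for the example.

# Sketch: Chebyshev-shaded delay-and-sum beamforming for a linear array.
# Illustrative only; geometry and sidelobe target are assumed values.
import numpy as np
from scipy.signal.windows import chebwin

C = 343.0        # speed of sound, m/s
FS = 16000       # sample rate, Hz
N_ELEMENTS = 8
SPACING = 0.04   # interelement spacing in meters (assumed for the example)

def delay_and_sum(channels, steer_deg, shading=None):
    """channels: (N_ELEMENTS, n_samples) array of microphone signals."""
    if shading is None:
        shading = np.ones(N_ELEMENTS)      # rectangular window (unity coefficients)
    shading = shading / shading.sum()      # unity gain toward the steered direction
    out = np.zeros(channels.shape[1])
    for i in range(N_ELEMENTS):
        # Far-field steering delay for element i, in samples.
        delay = i * SPACING * np.sin(np.radians(steer_deg)) / C * FS
        out += shading[i] * np.roll(channels[i], -int(round(delay)))
    return out

# Chebyshev shading with 40 dB sidelobe attenuation (assumed design target).
cheb_coeffs = chebwin(N_ELEMENTS, at=40)

A call such as delay_and_sum(channels, steer_deg=0.0, shading=cheb_coeffs) then trades a somewhat wider mainlobe for roughly 40 dB of sidelobe suppression relative to the rectangular window.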
Different approaches can be employed to adapt the array response to the specific
conditions of the environment. One criticism of simple delay-and-sum beamformers is that,
apart from the natural attenuation of signals lying outside of the mainlobe, there is no active filtering
of interfering sources [38]. Interfering or competing noise sources include speech from other
people or coherent sounds such as air conditioning units. A common approach is to adaptively modify
the shading coefficients with a least mean squares (LMS) based algorithm, adjusting the nulls
in the array response so that their spatial locations correspond to those of the noise sources [26].
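The core of such schemes is the LMS weight update. The sketch below shows it in the simplest setting, a single FIR filter that cancels whatever part of a primary signal is predictable from a separate noise reference; adaptive beamformers apply the same update to spatial filter weights to steer nulls. This is a hedged illustration rather than the specific algorithm of [26], and the tap count and step size are arbitrary.

# Sketch: LMS adaptive cancellation of a correlated noise reference.
# Illustrative only; an adaptive beamformer applies the same update to array weights.
import numpy as np

def lms_cancel(primary, noise_ref, n_taps=32, mu=0.01):
    """Remove from `primary` the component predictable from `noise_ref`."""
    primary = np.asarray(primary, dtype=float)
    noise_ref = np.asarray(noise_ref, dtype=float)
    w = np.zeros(n_taps)
    out = np.zeros_like(primary)
    for n in range(n_taps, len(primary)):
        x = noise_ref[n - n_taps:n][::-1]   # most recent reference samples first
        y = w @ x                           # current estimate of the interference
        e = primary[n] - y                  # error signal = cleaned output sample
        w += 2 * mu * e * x                 # LMS weight update
        out[n] = e
    return out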
Still other approaches seek to reduce the effect of reverberation, which is caused by
reflections of the source signal off walls and objects. Most dereverberation techniques require an
estimate of the acoustic room transfer function, which represents how the source signal is modified
by the physical environment. In theory, convolving the inverse of this function with the
microphone output will remove the effects of the reflections. In practice, computing the direct
inverse is often not stable or possible. One computationally efficient method, matched filter array
processing, uses the time-reverse of the transfer function as a "pseudo-inverse" and results in
improved SNR of the beamformed signal [5].
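A minimal sketch of the matched filter idea is given below, using a toy two-tap impulse response invented for the example; a real system would estimate the room transfer function for each microphone and apply its time-reversed estimate to that channel before the array sum.

# Sketch: matched filtering with a time-reversed room impulse response estimate.
# Illustrative only; the impulse response here is a toy direct path plus one echo.
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
source = rng.standard_normal(fs)      # stand-in for one second of a speech signal

rir = np.zeros(400)                   # toy room impulse response
rir[0] = 1.0                          # direct path
rir[250] = 0.6                        # one attenuated reflection

mic = np.convolve(source, rir)        # what the microphone records

# Matched filter: convolve with the time-reversed impulse response estimate.
# The echo energy is folded back toward a single peak, which is what the
# per-channel alignment in a matched filter array exploits before summation.
matched = np.convolve(mic, rir[::-1])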
Chapter 7 Conclusion
A real-time tracking and speech extraction system can be immensely useful in any
intelligent workspace or human-computer interaction application. This thesis presents a simple
low-cost modular system that automatically tracks and extracts the speech of a single, moving
speaker in a realistic environment. Just as important as the implementation of the system are the
experiments performed to test it and to explore the tradeoffs in changing various
system variables, including microphone array configuration and visual or audio modality usage.
While neither the linear array nor the multiarray has proven to be overwhelmingly better
than the other, the multiarray, with a broad frequency response but a coarser spatial resolution,
appears to perform slightly better.
We sought to show that multimodal integration provided benefits over the traditional means
of locating the speaker, sound localization. The experimental results have shown that visual
localization is a very powerful sensory modality that renders audio localization a redundant
modality at best, and even unnecessary in certain circumstances. More experiments, involving a
more complex environment with multiple speakers, are necessary to determine whether the reverse
is true in other circumstances.
Several possibilities exist for future expansion and work. Adding and testing adaptive and
dereverberation features as discussed in Section 6.5 is the immediate next step. Different
configurations of both the visual and audio hardware may also be examined. Currently, the
microphone array is linear and can therefore handle only one dimension of spatial separation. A
two-dimensional array capable of handling two spatial dimensions is possible with the development of
a DSP board level synchronizer. Similarly, only a single camera is currently being used. More
precise tracking and object detection can be performed with multiple cameras placed at different
locations.
Appendix A Speech Spectrograms
For all of the following spectrogram figures, the utterance was "This is a test." Red denotes
high signal energy, while blue indicates low energy.
A.1 Controls
Figure 42: One channel microphone spectrogram
Figure 43: Close talking microphone spectrogram
It is evident from the spectrogram plots that the close-talking microphone output has a
greater contrast (red to blue) between high and low signal energy than the one channel microphone
output, which is mostly red or orange. Having a higher contrast corresponds directly to having a
higher SNR, which confirms the results of Section 6.2.1.
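For readers who wish to produce comparable plots, the sketch below computes and displays a spectrogram with SciPy and Matplotlib. The file name, window length, and color map are arbitrary choices for illustration and do not necessarily match the settings used for the figures in this appendix.

# Sketch: computing and plotting a spectrogram of a recorded utterance.
# Illustrative only; the wav file name and analysis parameters are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, audio = wavfile.read("this_is_a_test.wav")   # hypothetical recording
if audio.ndim > 1:
    audio = audio[:, 0]                          # keep a single channel

f, t, Sxx = spectrogram(audio.astype(float), fs=fs, nperseg=256, noverlap=192)
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="auto", cmap="jet")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.colorbar(label="Energy (dB)")
plt.show()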
A.2 Single Array Configurations
Figure 44: Linear array spectrogram
Figure 45: Multiarray spectrogram
The linear array spectrogram appears to have a higher SNR and better frequency
resolution, which is also in agreement with the results of Section 6.2.1 and our understanding of
the benefits of the linear array over the multiarray. It is surprising that the multiarray works as well
as it does in the experiments, given its poorer SNR and frequency resolution.
A.3 Multiple Array Configurations
Figure 46: Multiple linear arrays spectrogram
Figure 47: Multiple multiarrays spectrogram
In the multiple arrays case there is a very clear increase in SNR compared to the single
array case for both the linear array and the multiarray. The much improved spectrogram for the multiple
multiarray case over the single multiarray case is at odds with the actual WER results, where it was
found that performance was not much improved. A probable explanation is that the
synchronization of the two arrays happened to coincide in this very short sample, while it drifted
over the course of a much longer speech trial.
Appendix B Speech Sets and Sample Results
B.1 Actual Text
B.1.1 Trained Set
1. to enroll you need to read these sentences aloud speaking naturally and as clearly as possible, then wait for the next sentence to appear
2. this first section will describe how to complete this enrollment
3. additional sections will introduce you to this continuous dictation
system and let you read some other entertaining material
4. the purpose of enrollment and training is to help the computer
learn to recognize your speech more accurately
5. during enrollment you read sentences to the computer the computer records your speech and saves it for later processing
6. during training the computer processes your speech information
helping it to learn your individual way of speaking
7. read normal text with any pauses you need such as for or any time
you need to take a breath
8. but be sure to read hyphenated commands like like a single word
with no pause
9. how can you tell if the computer has correctly understood the
sentence you just read?
10.as you read a sentence check it to see if
it turned red
11.if the computer understood what you said the next sentence to read
will appear automatically
12.if the sentence you just spoke turned red then the computer did not
understand everything you said
13.the first time this happens with a sentence try reading the
sentence again
14.if it happens again click the playback button to hear what you just
recorded
15.and pay special attention to the way you said the words and to any
strong background noise
16.if everything sounds right to you click the next button to go to
the next sentence you do not need to fix all red sentences
17.if you heard anything that did not sound right try recording the
sentence one more time
87
18.if all or most of your sentences are turning red click the options
button
19.then move the slider for match word to sound closer to approximate
B.1.2 Untrained Set
1. consider the difficulties you experience when you encounter someone
with an unusual accent
2. or if someone says a word which you don't know or when you can't
hear someone clearly
3. fortunately people use speech in social situations
4. a social setting helps listeners figure out what speakers are
trying to convey
5. in these situations you can exploit your knowledge of english and
the topic of conversation to figure out what people are saying
6. first you use the context to make up for the unfamiliar or
insufficient acoustic information
7. then if you still can't decipher the word you might ask the person
to repeat it slowly
8. most people don't realize how typical it is for people to use
context to fill in any blanks during normal conversation
9. but machines don't have this source of supplementary information
10.analyses based on meaning and grammar are not yet powerful enough
to help in the recognition task
11.therefore current speech recognition relies heavily on the sounds
of the words themselves
12.even under quiet conditions the recognition of words is difficult
13.that's because no one ever says a word in exactly the same way
twice
14.so the computer can't predict exactly how you will say any given
word
15.some words also have the same pronunciation even though they have
different spellings
16.so the system can not determine what you said based solely on the
sound of the word
17.to aid recognition we've supplemented the acoustic analysis with a
language model
18.this is not based on rules like a grammar of english
19.it is based on an analysis of many sentences of the type that
people typically dictate
20.context also helps in distinguishing among sets of words that are
similar in sound although not identical
88
B.2 Headset (Close) Microphone Data Set
B.2.1 Trained Set Results
1. To enroll you need to read the sentences is aloud speaking
naturally in as clearly as possible, then wait for the next
sentence to appear (rei_1)
2. This first section would describe how to complete this enrollment
(rei 2)
3. Additional sections will introduce you to this continues dictation
system and (let) you read some other entertaining material (rei_3)
4. The purpose of enrollment and training is to help the computer
learn to recognize your speech more accurately (rei 4)
5. During enrollment you read censuses to the computer Computer records to speech and sees it for later processing (rei_5)
6. During training the computer processes is speech information hoping it to learn your individual lawyers speaking (rei 6)
7. Read normal text with any pauses you need such as for or any time
in need to take a breath (rei 7)
8. But be sure to be hyphenated commands like but a single word with
no pause (rei 8)
9. How can you tell the computer screen is to the sense just read can? (rei 9)
10.As you to sentence ticket to see if it turned red
(rei 10)
11.If the computer understood what he said the next sentence read
appear automatically (rei 11)
12.If the sentencing judge spoke to and read and computer did not
understand everything said (rei 12)
13.The first time this happens with the sentence try reading the
sentence again (rei 13)
14.If that happens again click the playback button to hear what you
just recorded (rei_14)
15.And pay special attention to the way said the words and to any
strong background noise (rei 15)
16.If everything sounds right to you click the next but then to go to
the next sentence he did not need to fix all read sentences
(rei 16)
17.If you heard anything did not sound right track recording the
sentence one more time (rei 17)
18.If all almost is sentences that turning red click the options but
some (rei 18)
19. Then move the slider for a match with the sound closer to approximate (rei_19)
89
B.2.2 Untrained Set Results
1. Consider the difficulties expense with intent to someone with an
unusual accent (rei 51)
2. Or if someone says a word which don't know when he can't hear
someone clearly (rei 52)
3. Fortunately people use speech in social situations
(rei 53)
4. A social setting helps listeners figure out what speakers are
trying to convey (rei 54)
5. In these situations you can exploit your knowledge of English at
the topic of conversation to figure out what people are saying
(rei 55)
6. Foresees a context to make up for the unfamiliar or insufficient
acoustic intimation (rei_56)
7. Beneath the sick and decipher the word you might ask the person to
repeat
slowly (rei 57)
8. Most people don't realize how to acquit is for people accuse
context to felony blanks to normal conversations (rei_58)
9. But machines don't have this was a supplementary information
(rei 59)
10.Analyses based on meaning Grandma how not yet powerful enough to
help in the recognition task (rei_60)
11.Therefore current speech recognition relies heavily on the sounds
of the words themselves (rei_61)
12.Even on a quiet conditions the recognition of words is difficult
(rei 62)
13.That's because no one ever says a word in exactly the same way
twice (rei 63)
14.Seven computer can't predict exactly how you say any given word
(rei 64)
15.Some words also have the same pronunciation even though their
defense spellings (rei 65)
16.Service system cannot determine what you said based solely on the
sound of a word (rei_66)
17.To aid recognition with supplemented acoustic announces with the
line which model (rei 67)
18.This is not based on rules like a grandmother English (rei 68)
19.Is based on an analysis of many sentences of a type that people to
Berkeley dictate (rei 69)
20.Context also helps in distinguishing among assets of words that a
similar and sound of the not identical (rei_70)
90
B.3 Single Element
B.3.1 Trained Set
1. To enroll you need to be the sentences fallout speaking naturally
and as clearly as possible and wait for the next sentence to appear
(rei_1)
2. This first section would describe how the complete this and Rome
(rei_2)
3. Additional sessions will introduce you to this continuation to use
efficient system the only reason other entertainment
(rei 3)
4. The purpose of enrollment and training is to help the computer
learn to recognize is the more accurately (rei_4)
5. During enrollment fee resemblances to the computer Peter of course feature SEC for later processing (rei 5)
6. During training the computer processes your speech definition
helping it to learn your individual is speaking (rei 6)
7. Make no more intensive if any pause is unique such as four is for
any time in the fifth (rei 7)
8. But be sure to read hyphenated commands let me have raft when the
single word with no is behalf (rei_8)
9. How can you tell me if the computer has correctly understood the
sentence you just read (rei_9)
10.As you read a sentence a ticket to see if it turned red
(rei 10)
11.The computer others that were used the the next sentence to read
will appear automatically (rei 11)
12.If the Senate considers both turned red (rei_12)
13.The first time this happens for sentence try reading the sentence
again (rei_13)
14.It happens again with Iraq but here we just reported (rei_14)
15.A special attention to the ways of the words and to many strong
background noise (rei 15)
16.Getting some 40 but the next button to go to the next up The money to fix all right set (rei 16)
17.Here and in the the not sound right time recording the Senate,
(rei 17)
18.It almost is sentences are turning red, the auctions by (rei 18)
19.Then move this order or a match for the sound of to to oppose
(rei_19)
91
B.3.2 Untrained Set
1. Severe difficulties this fence running counter someone with an
unusual accents (rei 51)
2. For someone says a word of tonal and here's some clearly (rei_52)
3. Fortunately he please finish in social situations (rei 53)
4. For social setting helplessness figure out what speakers are trying
to convey (rei_54)
5. In these situations finance for your knowledge of English hero with
it was saying (rei 55)
6. For Caesar conquered from the fund familial were insufficient of
Pacific nation (rei 56)
7. Then if is to have to suffer the work he might as the person to
peacefully (rei 57)
8. To speed with unrealized profit to is for people whose contents Of
the land any flights during normal conversation (rei_58)
9. machines that have resources of woman fifth majore
(rei 59)
10.Analyses based on meaning and drummer for an idea how far enough to
hold further recognition has (rei_60)
11.Therefore current speech recognition was heavily on.is also worth
saw (rei 61)
12.In another part Commission the recognition of words as the "
(rei 62)
13.Best because no one ever says a word is that is simply point
(rei 63)
14.Still popular tactic SEC economists say any import (rei_64)
15.Some words or so as in pronunciation even though their incomes by
(rei_65)
16.Service system cannot summon ways that is so we are some of the
word (rei_66)
17.To recognition result: the purpose but announces what language,
(rei 67)
18.Is not based on Wall by the Liberal mobbing which
(rei_68)
19.is based on analysis of many senses of retired (rei 69)
20.Complex also hopes of distinguishing among sense of words (rei_70)
92
B.4 Linear Array, On-beam Angle=0
B.4.1 Trained Set
1. To enroll in need to the face sentences aloud naturally and as
clearly as possible and wait for the next sentence to appear
(rei_1)
2. This first section were describe how to complete this enrollment
(rei 2)
3. Additional sections will introduce you to this continues dictation system and the reason other entertaining material (rei 3)
4. Purpose of enrollment and training is hoped the computer learn to
recognize your speech more accurately (rei_4)
5. during enrollment you read sentences to the computer Peter of court to speech and Caesar for later processing (rei_5)
6. during training in the computer processes your speech defamation hoping it to learn your individual lawyers speaking (rei 6)
7. Lead normal text with any pauses unique scissors for or any time
any take a breath (rei_7)
8. But he sure to read hyphenated commands like with a single word
with no pause (rei_8)
9. Hundreds tell the computer has correctly understood the sentence just read? (rei 9)
10.Busily the sentence check in see if it turned red (rei 10)
11.Is the computer understood we said the next sentence should read
will appear automatically (rei 11)
12.If the sentence judge spoke to and red The not understand it in is
set (rei_12)
13.The for is time this happens for the sentence try reading the
sentence again (rei_13)
14.If it happens again put that there that want to hear we just
reported (rei_14)
15.That the special attention to the way said the words after any
strong background Alex (rei_15)
16.If everything sounds trite you click the next autumn to go to the next sentence money you do not need to fix all red Sox (rei 16)
17.To let a that the not sound right track recording the sentence one
more time (rei 17)
18.Is almost is sentences are turning red but the options but (rei_18)
19.Then move the slider for much work to sell closer to boxed (rei_19)
93
B.4.2 Untrained Set
1. Consider the difficulties experience an encounter someone with an
unusual accent (rei 51)
2. Or someone says a word which don't know you can't hear someone
clearly (rei 52)
3. Fortunately people use speech in social situations
(rei 53)
4. The social setting hopes listeners figure out what speakers are
trying to convey (rei 54)
5. In these situations even explain the knowledge of English and the
topic of conversation to figure out for St in
(rei_55)
6. First peace conference to make up for the unfamiliar or
insufficient the acoustic information (rei_56)
7. Then he still can't decipher the word you might pass the person to
repeat
slowly (rei 57)
8. Most people don't realize how to put
is for people to use context
to fill and enable lines during normal conversations (rei 58)
9. But machines don't have this source of supplementary information
(rei 59)
10.Analyses based on meaning and Grandma not yet offer enough to help
in the recognition task (rei_60)
11.Therefore current speech recognition was heavily on the sounds of
the words and soaps (rei_61)
12.Even under what conditions the recognition of words is to
(rei_62)
13.That's because no one of us as a word and set is in which buys
(rei 63)
14.To the computer can't predict exactly how you say any deport
(rei 64)
15.Some words also of the same pronunciation even though they defense
blowups (rei 65)
16.Service system can not determine ways that based solely on the
sound of a word (rei_66)
17.To late recognition will supplemented the acoustic announces with
the language of (rei 67)
18.This is not based on goals by the ballot in which (rei 68)
19.Is based on analysis of many sentences but the type people to
predict (rei 69)
20.Context also helps stocks finished in amounts of words is similar
in sound can not adapt (rei_70)
94
B.5 Multiarray, On-beam Angle=0
B.5.1 Trained Set
1. Drumroll of you need to read these sentences aloud, speaking
naturally as clearly as possible and wait for the next sentence to
appear (rei_1)
2. This for a section with describe how to complete this and Roman
(rei_2)
3. Additional sections will introduce you to this continuous dictation
system and let me read some other entertaining material (rei 3)
4. The purpose of enrollment and training is to help the computer
learn to recognize your speech more accurately (rei_4)
5. Turn enrollment Sentences to the computer As a set for later
processing (rei_5)
6. During training the computer processes speeds information cocaine
to learn your individual lawyers speaking (rei_6)
7. Read normal text with any pauses the need such as for were any time
any typical crime (rei_7)
8. Be sure to read hyphenated commands like what the single word no
pause (rei_8)
9. And needs no of the computer has correctly understood the sentence
just read (rei_9)
10.As you read a sentence check in to see if it turned red (rei 10)
11.If the computer understood we said the next sentence to read will
appear automatically (rei_11)
12.If the center suggests to red and computer can not understand
levees said (rei_12)
13.The first time this happens with a sentence try reading the
sentence again (rei_13)
14.If it happens again click the playback button to hear what you
just recorded (rei_14)
15.Pay special attention to the way you said the words and to any
strong background bonds (rei_15)
16.If everything sounds right to you click the next one to close the
next sentence if the money to fix all red a sentence (rei 16)
17.If you heard of anything that the not sound right track recording
the sentence one more time (rei 17)
18.All most of the sentences are turning red click the options but
(rei 18)
19.Then move the slider for match were to sell closer to box (rei_19)
95
B.5.2 Untrained Set
1. Consider the difficulties experience an encounter someone with an
unusual access (rei 51)
2. Or someone says a word to Donald note: can't hear someone clearly
(rei 52)
3. Fortune and the people use speech and social situations (rei 53)
4. A social setting helps listeners figure out what speakers are
trying to convey (rei_54)
5. The situations begin exporting your knowledge of English topic of
conversation to figure out what people sang (rei_55)
6. Firsts use the context to make up for the unfamiliar or
insufficient but this information (rei 56)
7. Then if you still can't decide for the word As the person to repeat
slowly
(rei 57)
8. Most people don't realize how to put it is for people to use
context of felony blanks to normal conversations (rei_58)
9. Machines don't have the source of supplementary information
(rei 59)
10.Analyses meaning and Grandma not yet for now to have been the
recognition past (rei 60)
11.therefore current speech recognition relies heavily on the sounds
of the words themselves
(rei_61)
12.Even on a quiet conditions for recognition awards is difficult
(rei 62)
13.Us because no one ever says a word in excess of the same way
towards (rei 63)
14.So the computer can't predict exactly how you say any thing worth
(rei 64)
15.Some words most of the same pronunciation even though they have
defense balance (rei 65)
16.Services and can not determine what you said based solely on some
of the word (rei_66)
17.To name recognition we've supplemented the acoustic announces with
a language more (rei 67)
18.this is not based on rules for Grumman of English (rei 68)
19.Is based on analysis of many sentences of the tight the people of
the dictate (rei 69)
20.Context also hopes of distinguishing among sense of words and a
similar unsolved problem not identical (rei_70)
96
Appendix C 16 Element Array Design
Figure 48: Microphone placement for 16 element compound array.
Additional elements can be easily added to the system by incorporating extra DSP32C
boards, which have built-in inter-board data sharing capabilities [39]. Figure 48 shows the
configuration for a 16 element compound array. Interelement spacing is the same as for the eight
element case. Figure 49 and Figure 50 are the high and mid frequency subarray beam patterns,
respectively. The low frequency subarray is unchanged from the one in the eight element
compound array.
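As a hedged sketch of how sub-array patterns such as those in Figure 49 and Figure 50 can be computed, the code below evaluates the far-field response of a uniformly weighted linear sub-array at several frequencies. The element count and spacing are parameters of the function; the particular values used here are assumptions for illustration, not necessarily those of the implemented sub-arrays.

# Sketch: far-field array factor of a uniformly weighted linear sub-array.
# Illustrative only; element count and spacing are example values.
import numpy as np

C = 343.0  # speed of sound, m/s

def array_factor(n_elements, spacing_m, freq_hz, angles_rad):
    """Magnitude of the delay-and-sum response to a far-field plane wave
    arriving from each angle (0 rad = broadside)."""
    k = 2 * np.pi * freq_hz / C
    positions = np.arange(n_elements) * spacing_m
    phases = k * np.outer(np.sin(angles_rad), positions)
    return np.abs(np.exp(1j * phases).sum(axis=1)) / n_elements

angles = np.linspace(-np.pi / 2, np.pi / 2, 361)
for f in (500, 1000, 2000, 8000):
    af = array_factor(n_elements=10, spacing_m=0.02, freq_hz=f, angles_rad=angles)
    main = angles[af >= 1 / np.sqrt(2)]          # angles within the -3 dB mainlobe
    print(f"{f:5d} Hz: -3 dB beamwidth ~ {np.degrees(main.max() - main.min()):.1f} deg")

Plotting the same response on a polar axis reproduces the style of the beam pattern panels in the figures that follow.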
[Polar beam pattern plots for the high frequency sub-array; panels labeled [10, 0.02, 8000], [10, 0.02, 2000], [10, 0.02, 1000], and [10, 0.02, 500].]
Figure 49: High Frequency Sub-Array Pattern
[Polar beam pattern plots for the mid frequency sub-array; panels labeled [10, 0.06, 8000], [10, 0.06, 2000], [10, 0.06, 1000], and [10, 0.06, 500].]
Figure 50: Mid Frequency Sub-Array Pattern
References
[1] Durlach, N. I. and Mayor, A. S., "Virtual Reality: Scientific and Technological Challenges," Washington, D.C.: National Academy Press, 1995, pp. 542.
[2] Shockley, E. D., "Advances in Human Language Technologies," IBM, White Paper, 1999.
[3] Brookner, E., Tracking and Kalman Filtering Made Easy. New York: John Wiley & Sons, Inc., 1998.
[4] Flanagan, J. L., Berkley, D. A., Elko, G. W., West, J. E., and Sondhi, M. M., "Autodirective Microphone Systems," Acustica, vol. 73, 1991.
[5] Rabinkin, D. V., "Optimum Sensor Placement for Microphone Arrays," Ph.D. thesis, Dept. of Electrical and Computer Engineering, New Brunswick, NJ: Rutgers, State University of New Jersey, 1998, pp. 169.
[6] Lustberg, R. J., "Acoustic Beamforming Using Microphone Arrays," M.S. thesis, Dept. of Electrical Engineering and Computer Science, Cambridge: MIT, 1993, pp. 72.
[7] Bub, V., Hunke, M., and Waibel, A., "Knowing Who to Listen to in Speech Recognition: Visually Guided Beamforming," presented at Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995.
[8] Teranishi, R., "Temporal aspects in hearing perception," in Handbook of Hearing, Namba, S., Ed. Kyoto: Nakanishiya Shuppan (in Japanese), 1984.
[9] Cole, R. A., Mariani, J., Uszkoreit, H., Zaene, A., and Zue, V., "Survey of the State of the Art in Human Language Technology," National Science Foundation, 1995, pp. 590.
[10] Silsbee, P., "Sensory Integration in Audiovisual Automatic Speech Recognition," presented at 28th Asilomar Conference on Signals, Systems and Computers, 1994.
[11] Bregler, C., Omohundro, S., and Konig, Y., "A hybrid approach to bimodal speech recognition," presented at 28th Asilomar Conference on Signals, Systems and Computers, 1994.
[12] Irie, R. E., "Multimodal Sensory Integration for Localization in a Humanoid Robot," presented at Second IJCAI Workshop on Computational Auditory Scene Analysis, Nagoya, Japan, 1997.
[13] Irie, R. E., "Multimodal Integration for Clap Detection," NTT Basic Research Laboratory, Japan, Internal Report, 1998.
[14] Knudsen, E. I. and Brainard, M. S., "Creating a Unified Representation of Visual and Auditory Space in the Brain," Annual Review of Neuroscience, vol. 18, pp. 19-43, 1995.
[15] Stein, B. E. and Meredith, M. A., The Merging of the Senses. Cambridge: MIT Press, 1993.
[16] Meredith, M. A., Nemitz, J. W., and Stein, B. E., "Determinants of multisensory integration in superior colliculus neurons," Journal of Neuroscience, vol. 7, pp. 3215-29, 1987.
[17] Bracewell, R., The Fourier Transform and Its Applications: McGraw-Hill, 1986.
[18] Chou, T. C., "Broadband Frequency-Independent Beamforming," M.S. thesis, Dept. of Electrical Engineering and Computer Science, Cambridge: MIT, 1995, pp. 105.
[19] Inoue, K., "Trainable Vision based Recognizer of Multi-person Activities," Dept. of Electrical Engineering and Computer Science, Cambridge: MIT, 1996, pp. 79.
[20] Knapp, C. H. and Carter, G. C., "The Generalized Correlation Method for Estimation of Time Delay," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-24, 1976.
[21] Omologo, M. and Svaizer, P., "Acoustic Source Location in Noisy and Reverberant Environment using CSP Analysis," presented at ICASSP96, 1996.
[22] Rosenberg, A. E. and Soong, F. K., "Recent Research in Automatic Speaker Recognition," in Advances in Speech Signal Processing, Furui, S. and Sondhi, M. M., Eds. New York: Marcel Dekker, 1992, pp. 701-738.
[23] Murase, H. and Nayar, S. K., "Visual Learning and Recognition of 3-D Objects from Appearance," International Journal of Computer Vision, vol. 14, pp. 5-24, 1995.
[24] Zhang, Z. and Faugeras, O., 3D Dynamic Scene Analysis: Springer-Verlag, 1992.
[25] Swain, M. J. and Ballard, D. H., "Color Indexing," International Journal of Computer Vision, vol. 7, pp. 11-32, 1991.
[26] Johnson, D. H. and Dudgeon, D. E., Array Signal Processing: Concepts and Techniques. NJ: Prentice Hall, 1993.
[27] Lee, J., "Acoustic Beamforming in a Reverberant Environment," Dept. of Electrical Engineering and Computer Science, Cambridge: MIT, 1999, pp. 64.
[28] Dudgeon, D. E. and Mersereau, R. M., Multidimensional Digital Signal Processing. New Jersey: Prentice Hall Inc., 1984.
[29] Goodwin, M. M. and Elko, G., "Constant Beamwidth Beamforming," presented at Proceedings of the 1993 IEEE ICASSP, 1993.
[30] Oppenheim, A. V. and Schafer, R. W., Discrete-Time Signal Processing. New Jersey: Prentice Hall, 1989.
[31] Parker, J. R., Algorithms for Image Processing and Computer Vision. New York: Wiley Computer Publishing, 1997.
[32] Gose, E., Johnsonbaugh, R., and Jost, S., Pattern Recognition and Image Analysis. NJ: Prentice Hall PTR, 1996.
[33] Rabiner, L. and Juang, B.-H., Fundamentals of Speech Recognition. New Jersey: Prentice Hall, 1993.
[34] NIST, "SCTK NIST Scoring Toolkit," 1.2 ed: NIST, 1998.
[35] USGS, "The Insignificance of Statistical Significance Testing," USGS Northern Prairie Wildlife Research Center, 1999.
[36] Gillick, L. and Cox, S., "Some Statistical Issues in the Comparison of Speech Recognition Algorithms," presented at ICASSP 89, 1989.
[37] Pallett, D. et al., "Tools for the Analysis of Benchmark Speech Recognition Tests," presented at ICASSP 90, 1990.
[38] Haykin, S., Adaptive Filter Theory. New Jersey: Prentice Hall, 1996.
[39] Signalogic, SIG32C-8 System User Manual. Texas, 1994.