A Visually-Guided Microphone Array for Automatic Speech Transcription

by

Robert Eiichi Irie

S.B., Engineering Science, Harvard University (1993)
S.M., Electrical Engineering and Computer Science, Massachusetts Institute of Technology (1995)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology, September 2000.

© 2000 Massachusetts Institute of Technology. All rights reserved.

Signature of Author: Department of Electrical Engineering and Computer Science, October 2, 2000

Certified by: Rodney A. Brooks, Fujitsu Professor of Computer Science and Engineering, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Departmental Committee on Graduate Studies

A Visually-Guided Microphone Array for Speech Transcription

by Robert E. Irie

Submitted to the Department of Electrical Engineering and Computer Science on October 6, 2000, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science.

ABSTRACT

An integrated, modular, real-time microphone array system has been implemented to detect, track, and extract speech from a person in a realistic office environment. Multimodal integration, whereby audio and visual information are used together to detect and track the speaker, is examined to determine its comparative advantages over unimodal processing. An extensive quantitative comparison is also performed on a number of system variables (linear/compound arrays, interpolation, audio/visual tracking, etc.) to determine the system configuration that represents the best compromise between performance, robustness, and complexity. Given a fixed number of microphone elements, the compound array, with a broader frequency response but a coarser spatial resolution, has been determined to have a slight performance advantage over the linear array in the currently implemented system.

Thesis Supervisor: Rodney A. Brooks
Title: Fujitsu Professor of Computer Science and Engineering

Acknowledgments

I would like to thank my advisor, Prof. Rodney Brooks, for giving me the freedom to pursue my own avenues of research, and for being patient when some paths turned out to be dead ends. Under his tutelage I have learned to think independently and to motivate myself. I would also like to thank the rest of my committee, who have helped with technical as well as overall advice.

Being part of the Cog Group was always an enriching and exciting experience. I have had numerous thought-provoking conversations with everyone, but especially with Charlie Kemp, Cynthia Ferrell, and Matthew Marjanovic. I must also thank Brian, Juan, Naoki, Junji, Takanori, Kazuyoshi, and all other members, past and present.

Finally, I would like to thank Shiho Kobayashi, the warmest and most stimulating person that I have had the good fortune to meet. She has been very supportive during times of much stress and confusion, and I owe the completion of this dissertation to her.

Contents

Chapter 1 Introduction
Chapter 2 Background
  2.1 Microphone Arrays
    2.1.1 Source Location
    2.1.2 Sound Acquisition
  2.2 Multimodal Integration
  2.3 Problem Definition
Chapter 3 Design
  3.1 Microphone Array Configuration
    3.1.1 Array Response
    3.1.2 Spatial Aliasing
    3.1.3 Beamwidth Variations
    3.1.4 Sensor Placement and Beam Patterns
  3.2 Beam Guiding
    3.2.1 Detection
    3.2.2 Tracking
Chapter 4 Implementation
  4.1 General System Overview
    4.1.1 Hardware Components
    4.1.2 Software Components
    4.1.3 System Architecture
  4.2 Audio Processing
    4.2.1 Beamformer
    4.2.2 Audio Localizer
  4.3 Visual Processing
  4.4 Tracker
Chapter 5 Procedure
  5.1 Experimental Variables
    5.1.1 System Configuration
    5.1.2 Trial Condition
  5.2 Measurements
    5.2.1 SNR
    5.2.2 Position Estimates
    5.2.3 Word Error Rate
    5.2.4 Method and Controls
  5.3 Experimental Setup
Chapter 6 Results
  6.1 Introduction
  6.2 Static Condition
    6.2.1 Signal-to-Noise Ratio
    6.2.2 Localization Output
    6.2.3 Tracker Output
    6.2.4 WER Data
    6.2.5 Summary
  6.3 Dynamic Condition
    6.3.1 Tracker Output
    6.3.2 WER Data
    6.3.3 Summary
  6.4 Overall Summary
  6.5 Additional/Future Work
Chapter 7 Conclusion
Appendix A Speech Spectrograms
  A.1 Controls
  A.2 Single Array Configurations
  A.3 Multiple Array Configurations
Appendix B Speech Sets and Sample Results
  B.1 Actual Text
    B.1.1 Trained Set
    B.1.2 Untrained Set
  B.2 Headset (Close) Microphone Data Set
    B.2.1 Trained Set Results
    B.2.2 Untrained Set Results
  B.3 Single Element
    B.3.1 Trained Set
    B.3.2 Untrained Set
  B.4 Linear Array, On-Beam Angle = 0
    B.4.1 Trained Set
    B.4.2 Untrained Set
  B.5 Multiarray, On-Beam Angle = 0
    B.5.1 Trained Set
    B.5.2 Untrained Set
Appendix C 16 Element Array Design
References

Chapter 1 Introduction

As computation becomes increasingly more powerful and less expensive, there have been efforts to make the workspace environment more intelligent and the interaction between humans and computer systems more natural [1]. One of the most natural means of communication is speech, and a common task is to transcribe a person's dictated speech. Speech recognition technology has progressed sufficiently that it is now possible to automate transcription of dictation with a reasonable degree of accuracy using commercial solutions [2]. One particular scenario under consideration is an intelligent examination room, where it is desirable for a physician to make an oral examination report of a patient.
Current systems require the physician to be seated by the transcription device (either a computer or a phone) or to carry a wireless microphone that is cumbersome and requires periodic maintenance. One possible solution is to embed in the physical room an intelligent system that is able to track and capture the physician's speech and send it to the speech recognition software for transcription. An advantage of this solution is that it requires no extra or special action on the part of the physician. The disadvantage is that it adds complexity and sources of error to the speech recognition process. Allowing the speaker to roam around the office freely forces the system to handle background acoustic noise and to take into account his/her motion; current speech recognition technology requires a microphone placed close to the speaker to avoid such issues.

To counteract noise sources and to localize sound capture over a specified spatial region, arrays of microphone elements are often used. By appropriately delaying and summing the outputs of multiple microphones, signals coming from a desired direction are coherently combined and have improved signal-to-noise ratio (SNR), while signals from other directions are incoherently combined and attenuated. One can imagine a beam of spatial selectivity that can be digitally formed and steered by adjusting delays.

The final requirement for a microphone array is the automatic steering of the beam. Most arrays use audio-only techniques, which are either computationally expensive or prone to errors induced by acoustic noise. We introduce an additional modality, visual information, to guide the beam to the desired location. Our hypothesis is that a multimodal sensor system will be able to track people in a noisy, realistic environment and transcribe their speech with better performance and robustness than a unimodal system. We also seek to determine if such an integration of modalities will allow simpler, less computationally intensive components to be used in real time.

The organization of the rest of this thesis is as follows. Chapter 2 provides background information, including past work on microphone arrays and multimodal integration, along with a more thorough formulation of the problem and the expected contributions of this project. Chapter 3 outlines major design considerations and the proposed solutions. Chapter 4 discusses actual implementation details and issues and describes the currently implemented system. Chapter 5 outlines the experimental procedures used to test the performance of the system based on several well-defined controls. Chapter 6 presents the results of the experiments, as well as an analysis of the relative merits of various system parameters. Finally, Chapter 7 concludes with a discussion of the impact of this project and extensions for future work.

Chapter 2 Background

2.1 Microphone Arrays

Array signal processing is a well-developed field, and much of the theoretical foundation of microphone arrays and target trackers is based on narrowband radar signal processing [3]. Electronically steered microphone array systems have been extensively developed since the early 1980s. They range from simple linear arrays with tens of sensors to complex two- and three-dimensional systems with hundreds of elements [4]. Regardless of size and complexity, all sound/speech capturing systems need to perform two basic functions: locating the sound source of interest and then acquiring the actual sound signal [5].
2.1.1 Source Location

Most sound source location methods fall into one of two categories: time delay of arrival (TDOA) estimation and power scanning. The former determines source direction by estimating the time delay of signals arriving at two or more elements; microphones located closer to the sound source will receive the signal before those farther away. TDOA estimation provides accurate estimates of source location, but is sensitive to reflections and multiple sound sources. In this project a simple TDOA estimator will be supplemented with a visual localization system to provide robust source location estimates. Power scanning usually involves forming multiple beams that are spatially fixed; the beam with the highest energy output is then selected [6]. While power scanning is conceptually simple and easy to implement, it requires huge amounts of computation for all but the coarsest spatial resolution, as every possible spatial location of interest must be represented by its own beam.

2.1.2 Sound Acquisition

The classical method for sound acquisition is the delay-and-sum beamformer, which will be discussed in depth in Section 3.1. Numerous modifications of this basic method have been proposed, including matched filtering, reflection subtraction, and adaptive filtering. All these methods attempt to improve performance by more actively handling various acoustic noise sources such as reverberation and interfering signals. They rely on noise modeling and require simplifying assumptions about the noise source and acoustic enclosure (i.e., the room) [5].

2.2 Multimodal Integration

Visually guided beamforming has been examined before; Bub et al. use a linear, nonadaptive array of 15 elements and a detection-based source location scheme (refer to Section 3.2.2) [7]. Using a gating mechanism, either visual or sound localization information was used to guide the beam, but not both simultaneously. It was shown that recognition rates for a single speaker in background and competing noise were significantly higher for the visual localization case.

Vision and audition are complementary modalities that together provide a richer sensory space than either alone. Fundamental differences in signal source and transmission characteristics between the two modalities account for their complementary nature. In audition, the information source (the audible object or event) and the signal (sound wave) source are often one and the same, whereas in vision the signal (light) source is usually separate from the information source (the visible object). Furthermore, most visible objects of interest are spatially localized and relatively static, while perceived sounds are usually the result of transient changes and are thus more dynamic in nature, requiring more care in temporal processing [8]. Noise sources in one domain can be more easily handled or filtered in the other domain. For example, while audio-based detection routines are sensitive to sound reflections, visual routines are unaffected. Also, advantages in one modality can overcome deficiencies in others. Visual localization can be precise, but is limited by camera optics to the field of view. Sound localization in general provides much coarser spatial resolution, but is useful over a larger spatial region.

Research in machine vision and audition has progressed enough separately that the integration of the two modalities is now being examined, though most such integration still involves limited, task-specific applications [9].
Most integration work has been performed in the context of human-computer interaction (HCI), which seeks to provide more natural interfaces to computers by emulating basic human communication and interaction, including speech, hand gestures, facial movement, etc. In particular, substantial work has been done in using visual cues to improve automatic speech recognition. Image sequences of the region around the mouth of a speaker are analyzed, with size and shape parameters of the oral cavity extracted to help disambiguate similar-sounding phonemes. The integration of audio and visual information can occur at a high level, in which recognition is performed independently in both domains and then compared [10], or at a lower level, with a combined audio-visual feature vector feeding, for example, a neural network [11]. Performance of integrated recognizers has regularly been greater than that of unimodal ones.

Previous work involving multimodal sensory integration at the AI Lab was performed on the humanoid robot Cog and on prototyping head platforms. A multimodal, self-calibrating azimuthal localization and orientation system [12] and a multimodal event (hand clapping) detector were implemented [13]. All such work seeks to establish some sort of biological relevance; there is ample neurophysiological evidence that multimodal integration occurs at many different levels in animals, including birds, reptiles, and mammals. The integration can happen at the neuronal level, where a single neuron can be sensitive to both visual and audio stimuli, or at a more abstract level of spatial and motor maps [14]. The area of the brain best understood in terms of multimodal representation and interaction is the optic tectum (superior colliculus in mammals), which is a layered midbrain nucleus involved in stimulus localization, gaze orientation, and attention [15].

Localization is a key problem that must be solved by many animals for survival. It comes as no surprise, therefore, that the problems such animals face are the same ones that had to be solved for this project, which relies on accurate localization for good tracking performance. Sound localization is a much more difficult problem than visual localization, since acoustic stimuli are not spatially mapped to the sensors used (microphones or ears); some form of computation is therefore necessary in both engineered and biological systems so that localization cues (e.g., time and intensity differences) can be extracted from the set of one-dimensional acoustic signals. Visual localization, on the other hand, is much easier, since the sensors used (cameras or eyes) are already spatially organized (CCD array or retina) in such a way that visual stimuli are mapped to sensor space; the image of the stimulus appears at a corresponding location on the array or the retina. This difference in representation (computation vs. spatial organization in sensor space) requires some form of normalization of coordinate frames when integrating both types of localization information. In many animals one modality, usually vision in mammals, dominates over the other (usually audition) in actually determining stimulus location [14]. As will be reported later in this thesis, visual localization also dominates in the currently implemented system. Of course, most engineering approaches use what we know about the neurophysiology of multimodal integration only as an inspiration.
For example, the representations of visual and auditory space are not merely superimposed in the superior colliculus; they are integrated in a nonlinear manner by bimodal neurons to yield a unified representation of stimulus locations [16], whereas most engineering implementations combine audio-visual information linearly. Also, in biological systems the integration occurs in multiple locations at different levels of abstraction. Our system integrates at a much higher level, and at only one point.

2.3 Problem Definition

Our primary goal was to design and implement a sound capture system that is capable of extracting the speech of a single person of interest in a normal examination or office room environment. To be useful, the system must track, beamform, and perform speech recognition in a timely manner. One of the key design goals was therefore real-time operation. For any system designed to run in real time, various compromises must be made in terms of computational cost, complexity, robustness, and performance. To be able to perform such optimizations, a modular system of easily modifiable and interchangeable components is necessary. This allows different types of algorithms to be tested. An added advantage is that the modules may be distributed across different processors. The following three components will be examined in this thesis:

* Source Location Detection - The benefits of multimodality when applied to guiding the beam have been examined. The hypothesis is that the integration of visual and audio detection routines will be robust to various error sources (acoustic reverberations, cluttered visual environment). Specifically, visual localization combined with sound localization can be used to determine candidates for tracking. Visual localization is performed using a combination of motion analysis and color matching. Sound localization is performed using a TDOA estimator.

* Microphone Array Configuration - The simplest configuration for a collection of microphone elements is a linear array, where all microphones are spaced equally apart. As will be seen, this has a less than optimal frequency and spatial response. A better configuration is a compound or nested array that consists of several linear subarrays. See Section 3.1.4.

* Tracking - Simple detection methods to guide the beam may be insufficient for any realistic operating environment. A simple tracking mechanism that follows a single speaker around the room and takes into account source motion has been implemented and tested. See Section 3.2.2.

A totally general-purpose person identification, tracking, and speech capture system is beyond the scope of this project. This thesis presents a solution for focusing on a particular person and tracking and extracting only his/her speech. In limiting the scope of the solution, some assumptions must be made concerning the nature of the interaction; these will be discussed in Section 3.2.

Chapter 3 Design

This chapter describes in more detail the three system components listed in the previous chapter. The high-level design issues and decisions, as well as some theoretical grounding, are discussed; for details of the actual implementation of the system components, see Chapter 4. The response of the simple delay-and-sum beamformer, shown in Figure 1, is first derived, and the related design issues discussed. In the analysis that follows, a plane wave approximation of the sound signal is assumed.
Figure 1: Simple delay-and-sum beamformer block diagram. Each microphone channel is scaled and delayed before the channels are averaged; the output of the beamformer is normalized by dividing the output of the summer by the number of microphone elements.

3.1 Microphone Array Configuration

Figure 2: Geometrical analysis of an incoming acoustic plane wave.

3.1.1 Array Response

From Figure 2, we see that the interelement delay τ, assuming equally spaced microphone elements, is given by

\tau = \frac{d \sin\theta}{c}    (1)

where c is the speed of sound and θ is the incident angle of the incoming plane wave. Using complex exponentials to denote time delays, and normalizing the amplitude of the incoming signal to one, the total response of an N-element array (where N is even) can be expressed as

H(\omega, \theta) = \sum_{n=-N/2}^{N/2} a_n e^{-j\omega n d \sin\theta / c}    (2)

where ω is the temporal frequency variable associated with the Fourier series and θ is as above. It is clear that with the appropriate choice of coefficients a_n, the time delays associated with the incoming wavefront can be taken into account. Substituting

a_n = \hat{a}_n e^{j\omega n d \sin\theta_0 / c}    (3)

into Equation (2) results in a generalized, steerable microphone array response:

H(\omega, \theta, \theta_0) = \sum_{n=-N/2}^{N/2} \hat{a}_n e^{-j\omega n d (\sin\theta - \sin\theta_0) / c}    (4)

The parameter θ_0 is the beamforming, or steering, angle. By modifying this variable, the angle at which the microphone array has its maximal response (the "beam") to incident sound waves can be changed. When the steering angle equals the incident angle of the incoming wavefront, the complex exponential term becomes unity and drops out. The scaling and delay components in Figure 1 correspond to \hat{a}_n and e^{j\omega n d \sin\theta_0 / c}, respectively.

It is useful to note that Equation (4) is similar in form to the discrete-time Fourier transform (DTFT), given by

H(\omega_s) = \sum_n a[n] e^{-j\omega_s n}    (5)

where ω_s, the temporal frequency variable of the DTFT, should not be confused with ω, the frequency of the incoming wave. The analogy between the array response and the DTFT is useful since it means microphone array design is equivalent to FIR filter design. Letting k = ω/c, we have the mapping ω_s → kd(sinθ − sinθ_0). To obtain the array response as a function of θ, we use the transformation θ = sin^{-1}(ω_s/(kd) + sinθ_0).

Figure 3 shows the theoretical response of an eight-element linear array, center- and off-center-steered (θ_0 = 0° and θ_0 = 30°, respectively), with uniformly weighted (unity) array coefficients a_n and equally spaced microphones (d = 0.06 m). The microphone array would be located on the 90-270° axis, with the normal corresponding to 0°. The response is symmetrical about the array axis, but in this application the array would be mounted on a wall and the other half plane is not of interest. Note that the response is a function of both signal frequency and interelement distance. This raises two issues in the design and implementation of the array: spatial aliasing and variations in the beamwidth.

3.1.2 Spatial Aliasing

From basic signal processing theory, we know that sampling a continuous domain is subject to aliasing unless certain constraints on the sampling rate are met. Aliasing refers to the phenomenon whereby unwanted frequencies are mapped into the frequency band of interest due to improper sampling [17]. In dealing with discrete-element digital microphone arrays, in addition to temporal aliasing, we must be aware of aliasing that may occur due to the spatial sampling of the waveform by a discrete number of microphones.
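Both the steered response of Equation (4) and the spatial aliasing behavior can be made concrete numerically. The following is a minimal numpy sketch, not the thesis implementation; the speed of sound is an assumed constant, and the geometry matches the eight-element, 0.06 m array of Figure 3. Raising the frequency beyond the limit derived below makes grating lobes appear in the returned pattern.

```python
import numpy as np

C = 343.0  # assumed speed of sound, m/s

def array_response(freq_hz, theta_deg, theta0_deg=0.0, n_mics=8, d=0.06):
    """|H(omega, theta, theta_0)| of Eq. (4) with uniform (unity) weights a_n."""
    omega = 2 * np.pi * freq_hz
    theta = np.radians(np.atleast_1d(theta_deg))
    theta0 = np.radians(theta0_deg)
    n = np.arange(n_mics) - (n_mics - 1) / 2.0         # symmetric element indices
    phase = -1j * omega * d / C * np.outer(np.sin(theta) - np.sin(theta0), n)
    return np.abs(np.exp(phase).sum(axis=1)) / n_mics  # normalized by N

# Beam pattern of the 8-element, 6 cm array at 1.6 kHz, steered to 30 degrees
angles = np.linspace(-90.0, 90.0, 361)
pattern = array_response(1600.0, angles, theta0_deg=30.0)
print("peak at %.1f degrees" % angles[np.argmax(pattern)])
```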
To avoid spatial aliasing, we use the metaphor of temporal sampling to determine the constraints on spatial sampling, in this case involving the interelement spacing d. In the temporal case, aliasing is avoided when |ω_s| ≤ π; equivalently, to avoid spatial aliasing we require |kd(sinθ − sinθ_0)| ≤ π. Substituting k = ω/c = 2π/λ, we get the inequality

d \le \frac{\lambda_{\min}}{2}

where λ_min is the wavelength of the highest frequency component of the incident plane wave. We can thus conclude that the higher the frequency (and thus the smaller the wavelength) we are interested in capturing, the smaller the necessary interelement spacing.

3.1.3 Beamwidth Variations

As mentioned in Section 3.1.1, the array response is also a function of the frequency of the waveform. The direct consequence of this is that the beamwidth, defined here to be the angular separation of the nulls bounding the main lobe of the beam, is dependent on the frequency of the incoming signal. Setting Equation (2) to zero, with the coefficients a_n again uniformly unity, and solving for θ at the main lobe nulls, we obtain an expression for the beamwidth,

BW = 2 \sin^{-1}\left(\frac{2\pi c}{N \omega d}\right)    (6)

The beamwidth therefore increases for lower frequencies as well as for smaller interelement spacing; both lead to lower spatial resolution. In other terms, broadband signals such as speech that are not exactly on-axis will experience frequency-dependent filtering by a single linear beamformer.

3.1.4 Sensor Placement and Beam Patterns

From the above discussion, it is clear that some care must be taken in the design of the array in terms of microphone placement. Spatial aliasing and beamwidth variation considerations impose opposing design goals: a smaller interelement distance d for capturing higher frequencies without aliasing, and a larger d for a smaller beamwidth, i.e., higher spatial resolution at low signal frequencies. A compromise can be made with a linearly spaced array capable of moderate frequency bandwidth and spatial beamwidth. A better solution is a compound array composed of subarrays, each with different microphone spacing and specifically designed for a particular frequency range; beamwidth variation is lessened across a broad frequency range [18]. A linearly spaced array (referred to as the linear array) and a compound array (multiarray) composed of three subarrays have been implemented (see Section 4.2) and compared. Figure 3 shows the theoretical beam patterns for an eight-element linear array with an interelement spacing of 6 cm. Figure 4 shows the configuration of an eight-element multiarray, and Table 1 lists the relevant characteristics of the subarrays. Figures 5-7 are the theoretical responses for each subarray. Note that, as expected, low frequency spatial resolution is poor for the small-spacing, high frequency subarray, and there is substantial high frequency aliasing for the large-spacing, low frequency subarray. As a control for relative performance measurement between the two types of arrays, the number of elements for each is held constant at eight.

A compound microphone array requires slightly different processing than the simple linear delay-and-sum beamformer. Figure 8 is the modified block diagram. The signal redirector reuses signals from some of the microphone elements and feeds them to multiple subarrays. Bandpass filters isolate the desired frequency range for each subarray.
Sixth-order elliptic (IIR) filters were chosen (see Figure 9 for frequency response curves) to provide a combination of sharp edge-cutoff characteristics and low computational requirements [6]. Interpolators allow non-integral sample delay shifts for more possible angular positions of the beam (see Section 4.2.1.1).

Figure 3: Eight-element linear array beam pattern. The array is located on the 90-270 degree axis, with the normal corresponding to 0 degrees. The left half plane is not relevant. The array consists of 8 elements with 0.06 m interelement spacing. Frequencies are, from left to right, 400 Hz, 800 Hz, 1.6 kHz, and 3.2 kHz. The top row shows a beam centered at 0 degrees, the bottom row at 30 degrees.

Figure 4: Microphone placement for an 8-element compound array. d1, d2, and d3 correspond to the interelement spacing of subarrays 1, 2, and 3, respectively.

Subarray   Interelement spacing   Frequency range (max frequency)
A1         d1 = 0.02 m            High: 8.625 kHz (~8 kHz)
A2         d2 = 0.06 m            Mid: 2.875 kHz (~3 kHz)
A3         d3 = 0.18 m            Low: 958 Hz (~1 kHz)

Table 1: Eight-element compound array configuration (see Figure 4). Each subarray has a corresponding frequency response over a different range. Indicated values are the highest frequency each response can handle without aliasing.

Figure 5: High frequency subarray beam pattern. The array is located on the 90-270 degree axis, with the normal corresponding to 0 degrees. The left half plane is not relevant. The subarray consists of 4 elements with 0.02 m interelement spacing. Frequencies are, from left to right, 500 Hz, 1 kHz, 2 kHz, and 8 kHz. The top row shows a beam centered at 0 degrees, the bottom row at 30 degrees.

Figure 6: Mid frequency subarray (4 elements, 0.06 m spacing) beam pattern. The top row shows a beam centered at 0 degrees, the bottom row at 30 degrees.

Figure 7: Low frequency subarray (4 elements, 0.18 m spacing) beam pattern.

Figure 8: Block diagram of the compound array. The eight channels in the array are directed by the signal redirector into three subarrays, each with four channels; each subarray is interpolated, beamformed by a simple delay-and-sum beamformer, and bandpass filtered before the subarray outputs are summed. See Figure 4 for channel assignments.
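The band-splitting stage in Figure 8 can be prototyped with standard IIR design routines. The sketch below uses scipy to build sixth-order elliptic filters in the same spirit as Figure 9; the ripple/attenuation values and crossover frequencies are assumptions for illustration, not the thesis's actual design parameters.

```python
import numpy as np
from scipy import signal

FS = 22050              # system sampling rate, Hz

# Assumed design values; the thesis's exact specifications are not restated here.
RP, RS = 1, 40          # passband ripple (dB), stopband attenuation (dB)
b_hi, a_hi = signal.ellip(6, RP, RS, 3000, btype='highpass', fs=FS)            # A1
b_mid, a_mid = signal.ellip(6, RP, RS, [1000, 3000], btype='bandpass', fs=FS)  # A2
b_lo, a_lo = signal.ellip(6, RP, RS, 1000, btype='lowpass', fs=FS)             # A3

def split_bands(x):
    """Split one channel into the three subarray bands (A1, A2, A3)."""
    return (signal.lfilter(b_hi, a_hi, x),
            signal.lfilter(b_mid, a_mid, x),
            signal.lfilter(b_lo, a_lo, x))

# Example: one second of white noise split into the three bands
bands = split_bands(np.random.randn(FS))
```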
Figure 9: Frequency responses of the elliptic filters for the three subarrays (A1: highpass, A2: bandpass, A3: lowpass).

3.2 Beam Guiding

With the beam of the microphone array properly formed, it must be guided to the proper angular location, in this case a speaking person. This is a very complex task, involving detecting people, selecting a single person from whom to extract speech, and tracking that person as he/she moves about the room. A complete and generalized solution to the target detection, identification, and tracking problem is beyond the scope of this thesis, and is indeed an entire research focus in itself [19]. Fortunately, the constrained nature of the particular problem, and design decisions of the array itself, allow various simplifications. As part of an intelligent examination or office room, the system will be situated in a relatively small room with few people, as opposed to a large conference hall or a highly occupied work area. As will be discussed in Section 4.2, the possible steering angles will be limited to discrete positions. These factors simplify the detection task, since there will be fewer candidates (usually only one or two) to process. The tracking task will also be simpler and more robust, as discrete angular positions allow the tracker to be less sensitive to slight errors in target location.

3.2.1 Detection

As mentioned in the background chapter, in most previous work with microphone arrays, target detection and tracking are handled very simply, usually using only sound localization. TDOA estimation utilizes the spatial separation of multiple microphones in much the same way as beamforming in microphone arrays. The signals from two or more microphones are compared and the interelement time delay τ (see Section 3.1) is estimated, which is equivalent to the sound source's direction. The comparison is usually in the form of a generalized cross-correlation,

\hat{\tau}_{gcc} = \arg\max_{\tau} R_{x_1 x_2}(\tau)

where x_1 and x_2 are the signals from two microphones and R_{x_1 x_2}(\tau) = \int_{-\infty}^{\infty} x_1(t + \tau)\, x_2(t)\, dt [20]. Refinements to the objective function have been proposed to minimize the effects of reverberation [21].

One of the simplest yet still effective visual methods for detection is motion analysis using thresholded image subtraction. The current captured image is subtracted pixel-wise from a background image. Pixels that have changed intensity, usually corresponding to objects that have moved, can then be detected; these pixels are thresholded to a value of either 0 or 1, and the result will be referred to as motion pixels. The underlying assumption for this process is that the background does not change significantly over time and that the objects (people) are sufficiently distinct from the background. The background image I_bg is a running average of image frames of the form

I_{bg}[n] = \alpha\, I_{bg}[n-1] + (1 - \alpha)\, I[n]    (7)

where I[n] is the current captured image and α (range 0-1.0) determines how heavily the existing background is weighted relative to the current image [19]. Note that with a high α, this method can detect stationary objects that have recently moved, for example a person who has entered the room but is currently sitting or standing still.
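A minimal version of this background-subtraction detector (Equation (7) plus thresholding) might look like the following numpy sketch; the threshold value and α below are illustrative choices, not the values used in the system.

```python
import numpy as np

class MotionDetector:
    """Running-average background model with thresholded image differencing."""

    def __init__(self, first_frame, alpha=0.95, threshold=25):
        self.bg = first_frame.astype(np.float32)   # background estimate I_bg
        self.alpha = alpha                         # weight on the previous background
        self.threshold = threshold                 # intensity-change threshold

    def update(self, frame):
        frame = frame.astype(np.float32)
        # Motion pixels: 1 where the current frame differs enough from the background
        motion = (np.abs(frame - self.bg) > self.threshold).astype(np.uint8)
        # Equation (7): I_bg[n] = alpha * I_bg[n-1] + (1 - alpha) * I[n]
        self.bg = self.alpha * self.bg + (1.0 - self.alpha) * frame
        return motion

# Usage with grayscale frames of the system's 160x120 image size
det = MotionDetector(np.zeros((120, 160)))
motion = det.update(np.random.randint(0, 255, (120, 160)))
```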
3.2.2 Tracking

In a completely dynamic environment, there may be multiple objects or people simultaneously speaking and moving. In the more constrained environment of an office or examination room, usually one person is speaking at a time (realistically, handling simultaneous speech would require a microphone array with more elements and greater spatial resolution). Depending on the intended application, two modes of beam steering are possible.

The beam may be guided to each speaker location in turn using the above detection methods, with no explicit maintenance of detected-object state information. In this case sound localization information coupled with a sound energy detector (to determine when there is actually speech being spoken) is most useful. A detection-based method using only sound localization is the way beam guiding is handled in most microphone array projects.

Alternatively, as outlined in the problem statement, one particular person must be tracked as he/she moves, regardless of background noise and even in the absence of speech, when there are no acoustic cues of motion. The task of initially identifying this specific person will not be explored in this thesis; speaker recognition [22], identifiable markers [1], and appearance-based searching [23] are all possible options. Currently, the first object detected after a long period of visual and audio inactivity will be tracked, and will be referred to as the TO (tracked object).

One of the major issues in visual tracking is the correspondence problem; detected objects in one image must be matched to objects in a successive image [24]. Color histogram matching is often used in real-time trackers to find this correspondence [19]. Once the TO and its visual bounding box are determined, a color histogram of the image pixel values is constructed and used as a match template. In the next time frame, bounding boxes and histograms for each detected object are constructed and the intersection with the template is computed. The normalized intersection of the test object histogram H^t with the match template histogram H^m is defined to be

I(H^t, H^m) = \frac{\sum_{j=1}^{N} \min(H^t_j, H^m_j)}{\sum_{j=1}^{N} H^m_j}

where N is the number of bins in the histogram. The detected object that has the highest normalized intersection value, above a threshold, is considered the current tracked object, and the template histogram is updated. The benefits of color histogram matching over more complex model-based techniques include relative insensitivity to changes in object scale, deformation, and occlusion [25]. It is also very computationally efficient and can be used in real-time tracking.

Additionally, a simple prediction-correction algorithm involving source motion estimates can be used to narrow the search for a match [26]. The tracker estimates the velocity of the TO and uses the estimate to predict the probable location in the next time frame. Detected objects in the vicinity of the predicted location are compared first. Once the location of the TO in the next time frame is determined, the tracker corrects its velocity estimate.

Finally, if the system is unable to maintain a track of the TO (when it does not appear visually or aurally in the predicted location, and there are no suitable visual histogram matches), it must acquire a new TO. In the presence of valid but ambiguous visual and audio location cues, it picks the location indicated by the modality with the higher confidence level at that particular location. Confidence in a particular modality M at a given location L is measured as a signal-to-noise ratio, SNR_{M,L}, of the beamformed output. In other words, if the visual detection module indicates loc_V as the location of a valid target and the audio detection module indicates a different loc_A, the tracker selects whichever of the two locations has the higher SNR for its modality, i.e., arg max(SNR_{A,loc_A}, SNR_{V,loc_V}). A running table of SNR_{M,L} for M in {A, V} at every discrete location L is maintained and updated at every iteration of the tracker. See Section 4.4 for more details.
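The histogram-based correspondence test described above is compact enough to state directly. The following numpy sketch computes the normalized intersection I(H^t, H^m) and selects the best-matching candidate; the bin count and match threshold are assumptions, not the system's actual parameters.

```python
import numpy as np

def histogram_intersection(h_test, h_template):
    """Normalized intersection: sum_j min(Ht_j, Hm_j) / sum_j Hm_j."""
    return np.minimum(h_test, h_template).sum() / h_template.sum()

def best_match(template_hist, candidate_hists, threshold=0.6):
    """Index of the candidate histogram that best matches the template,
    or None if no candidate exceeds the (illustrative) match threshold."""
    scores = [histogram_intersection(h, template_hist) for h in candidate_hists]
    best = int(np.argmax(scores))
    return best if scores[best] > threshold else None

# Example with 8x8x8 RGB histograms (512 bins, an assumed quantization)
template = np.random.rand(512)
candidates = [np.random.rand(512) for _ in range(3)]
print(best_match(template, candidates))
```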
While the above techniques are useful for large, cluttered scenes, the tracking environment in this project is relatively simple: a small office room with only a small number of people to be tracked. A sensibly mounted microphone array/camera unit will result in visual images containing mostly horizontal motion, and only one object is of interest at any given time. Figure 10 is the general dataflow diagram of the proposed system.

Figure 10: Data flow diagram. The dashed box indicates DSP code (the delay-and-sum beamformer). All other software components (sound localization, visual localization with the motion detector and color histogram matcher, the integrated tracker, and speech recognition) are on the host or other PCs. Each component will be discussed in detail in Chapter 4.

Chapter 4 Implementation

In this chapter, the implemented system is presented. As seen in Figure 11, the array can be configured as a linear array or a multiarray by simply rearranging the placement of microphones. One color CCD camera is mounted on the centerline of the array to provide visual information.

Figure 11: The integrated array. The CCD camera is mounted directly above the microphones at the array center. The image is of the linear array configuration; the multiarray configuration requires relocating some of the microphone elements.

This chapter is organized as follows. The first section is a general system overview, presenting the low-level system details and issues not directly related to the high-level algorithms; a detailed system architecture is also given. The remaining sections describe the implementation in three major groupings: audio processing (including sound localization and beamforming), visual processing (visual localization), and the tracker.

4.1 General System Overview

A few underlying principles were followed in the design and implementation of the system. Computation is not only modular and multithreaded, but can be distributed across separate platforms. Inter-module communication is asynchronous and queued, to make a single system clock unnecessary and computation independent of individual platform speeds.

4.1.1 Hardware Components

Computation is split between the host PC, an add-on DSP board, and an optional additional PC. The actual microphone array hardware is straightforward: eight electret condenser microphones are connected to custom-built microphone preamplifiers. A more detailed description of the hardware is given in [27]. (Joyce Lee worked extensively with the author in designing and implementing the actual hardware.) The CCD camera is connected to the Matrox Meteor framegrabber board, which can be hosted on any PC.

The actual beamforming is performed in software on the Signalogic DSP32C-8 DSP board, a combined DSP engine and data acquisition system. (The DSP used is an 80 MHz AT&T DSP32C processor; it is very popular in the microphone array research community and has certain advantages over other DSPs, including easy A/D interfacing and seamless 8-, 16-, and 32-bit memory access.) The board was chosen for the dual advantages of simultaneous multi-channel data acquisition and offloading beamforming calculations from the host. It is capable of performing simultaneous A/D conversion on eight channels at a sampling rate of 22.05 kHz. The signals are filtered, delayed, and combined appropriately, and then passed on to the host PC. The host PC (and additional PC, if present) performs the sound and visual localization, tracking, and the actual speech recognition using the commercially available software ViaVoice.
Other than the host/DSP interface, all software components are modular and network-based; visual processing of the single camera can be performed on another computer. A modular system lends itself to easy expansion of components and features. Additional cameras and processing can be added in a straightforward manner. The DSP board itself can be supplemented with another board to allow more microphones to be added to the array. See Section 6.5 for possible future extensions.

4.1.2 Software Components

Before discussing the actual software architecture details, some terms must be defined. A task refers to a host-based thread of execution that corresponds to a well-defined, self-contained operation. In the system there are four tasks: the Tracker, the Audio Localizer, the Vision Localizer, and the DSP Beamformer Interface. (The actual DSP-based beamformer routine is not considered here, as its execution is fixed to be on the DSP board.) An application refers to a platform-specific process that may contain one or more tasks. Its only purpose is to serve as a front-end wrapper so that the underlying OS may invoke and interact with the tasks using a graphical user interface. A COM object is an application that supports a standard programmatic interface for method invocation and property access. COM, or Component Object Model, is a CORBA-like interface standard prevalent on Windows platforms. DCOM (Distributed COM) is an RPC-based extension that provides remote invocation and property access.

To provide some flexibility in the incorporation of additional computational resources, the system was designed from the beginning to allow a simple manner of distributing computation. The logical boundary of separation is at the application level, as there are numerous methods of inter-application (inter-process) communication. Since the Audio Localizer is closely tied to both the Tracker and the Beamformer Interface task, it makes sense to place them in a single COM object; however, they can easily be separated if necessary. The Vision Localizer task is placed in a separate COM object that can be executed on a different PC if desired. Applications that support COM interface conventions can communicate with other COM objects on the same computer, or with objects on remote computers, automatically and seamlessly by employing DCOM as a transparent mechanism for inter-application communication. Note that DCOM is not suitable for low-level distributed computing (e.g., passing raw signal data as opposed to processed localization estimates) due to overhead and bandwidth limitations, but it is more than sufficient for high-level message passing, as is the case in this system.

4.1.3 System Architecture

Figure 12 is the overall system architecture diagram, from the perspective of the hardware and software components mentioned above. Each software component is explained in further detail in the sections below.

Figure 12: System architecture diagram. Dashed boxes indicate application groupings, solid boxes indicate tasks, and double-border boxes indicate hardware components. Dark lines depict hardware (bus or cable) connections.
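Setting the COM/DCOM specifics aside, the asynchronous, queued hand-off between tasks described in Section 4.1 can be illustrated with ordinary threads and queues. The sketch below is purely illustrative (the class and field names are hypothetical) and is not the system's actual interface.

```python
import queue
import threading
import time

class LocalizationEstimate:
    """One localization message passed between tasks (fields are hypothetical)."""
    def __init__(self, modality, angle_deg, confidence):
        self.modality = modality      # 'audio' or 'vision'
        self.angle_deg = angle_deg    # estimated source direction
        self.confidence = confidence  # e.g., SNR for the audio case

estimates = queue.Queue()             # shared, asynchronous message queue

def tracker():
    # The tracker drains the queue at its own rate; no global clock is needed.
    while True:
        est = estimates.get()
        print("tracker received", est.modality, est.angle_deg)

threading.Thread(target=tracker, daemon=True).start()

# A localizer task simply posts an estimate whenever it has one ready.
estimates.put(LocalizationEstimate("vision", 15.67, confidence=0.8))
time.sleep(0.1)                       # give the tracker thread a moment to run
```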
4.2 Audio Processing

4.2.1 Beamformer

Figure 13 is the block diagram of the beamformer, the only task that does not run on the host but on the DSP board. It handles the simultaneous capture of sound samples from the eight channels, prefiltering, beamforming, and postprocessing.

Figure 13: Beamformer block diagram. The microphone channels pass through AGC and microphone compensation filters and are grouped into subarrays, each of which is interpolated, delayed, summed, and bandwidth filtered. This task resides on the DSP board. Dashed boxes are present only in the multiarray configuration. The phase (steering angle) is obtained from the Tracker task on the host PC. The two channel outputs and the beamformed output are sent back to the host PC to the Audio Localizer task; the beamformed output also feeds the speech recognizer.

4.2.1.1 Prefilter

Raw sound data from the eight channels pass through a series of prefiltering stages before the channels are grouped together for beamforming. An automatic gain control (AGC) filter scales the signals to normalize amplitude and to reduce the effect of the distance between the speech source and the microphone array. Fifteenth-order FIR filters are then applied to compensate for individual microphone characteristics so that each channel has a uniform frequency response. (Refer to [27] for a full discussion of the design of the compensation filters.) As discussed in Section 3.1.4, in the multiarray case each subarray "shares" some microphone channels with another. A channel grouper redirects the channels to the appropriate subarray.

4.2.1.2 Interpolation

Since the incoming signal is being temporally as well as spatially sampled, arbitrary beamforming angles are not available using the discretely spaced samples [28]. While it is possible to obtain arbitrary angles using upsampling or interpolation techniques, it makes more sense to select a few discrete angles that correspond to as many integral sample delays as possible. Discrete beamforming also makes sense when the beamwidth is relatively large. Setting Equation (2) to zero and again solving for θ at the main lobe nulls, we get a minimum possible beamwidth of 10 degrees for the eight-element linear array and 65 degrees for the compound array [29]. (The eight-element compound array consists of three four-element subarrays, which explains the larger beamwidth. With the current definition of the beamwidth these values are somewhat overstated; another definition is the angular separation of the two -3 dB points of the main lobe, and with that definition the spatial resolution is much better.)

To calculate the possible angular beam positions using discrete sample delays, we return to Equation (1), this time substituting nT_SR for τ and mD for d. For our system T_SR, the sampling period, is 4.5e-5 s (corresponding to a sampling rate of 22.05 kHz) and D, the smallest unit of interelement spacing, is 0.02 m.

n     m=1 (d=0.02 m)   m=3 (d=0.06 m)   m=9 (d=0.18 m)
0     0°               0°               0°
1     54.8°            15.67°           5.16°
2                      32.7°            10.37°
3                      54.3°            15.67°
4                                       21.1°
5                                       26.74°
6                                       32.7°
7                                       39.05°
8                                       46.05°
9                                       54.8°
10                                      64.16°
11                                      81.89°

Table 2: Angular beam positions. The high, mid, and low frequency subarrays of the compound array correspond to d = 0.02, 0.06, and 0.18 m, respectively; n is the integral sample delay. The discrete angle positions chosen for the compound array are 0°, ±15.67°, ±32.7°, and ±54.8°; interpolation is necessary for the high frequency subarray to obtain the other two positions. The linear eight-element array is equivalent to the mid frequency subarray (second column).
Table 2 lists the possible angular positions given an integral sample delay (n) and interelement separation (m). 0° corresponds to a dead-center beam perpendicular to the array. The angles 0°, ±15.67°, ±32.7°, and ±54.8° are the chosen discrete positions for the beam in the compound array. The linear array corresponds to the middle column (d = 0.06 m). The beamwidth for the linear array is too small for full coverage, so interpolation is necessary: a factor-of-3 upsampler results in 8 additional angles (16 for the full 180° space), corresponding to the remaining entries in the third column of Table 2 [30]. For computational reasons, a simple linear interpolator is implemented. Others have used window-based FIR filters [6]; Figure 14 is a comparison of a Hamming-window-based filter and the linear interpolator. The high, mid, and low frequency subarrays of the compound array correspond to d = 0.02, 0.06, and 0.18 m, respectively. For the compound array, interpolation is necessary only in the high frequency subarray; to obtain the other two chosen angles (15.67° and 32.7°), the same factor-of-3 interpolator can be used.

Figure 15 shows the coverage area for the microphone array and the camera. Currently, a wide-angle CCD camera with a field of view of 100 degrees is being used. The tracker will integrate both visual and audio information in the area of overlapping coverage. Outside the field of view of the camera, sound localization estimates will be the sole source of information.

Figure 14: Frequency responses for the linear and Hamming-window-based interpolators, with the ideal response as a reference. The interpolators shown are for a factor-of-3 upsampler.

Figure 15: Coverage area for the microphone array and camera. Discrete radials correspond to possible beamforming angles (shown for the mid frequency subarray (A2)/linear array and the low frequency subarray (A3)). Arcs denote the visual field of view: the inner arc corresponds to the current camera, the outer arc to a wide-angle camera.

4.2.1.3 Beamforming and Post-Processing

The actual beamforming is relatively straightforward and is implemented exactly as discussed in Section 3.1. For the multiarray case, sixth-order elliptic bandpass filters provide the post-processing that allows the three separate subarrays to be summed together to form the beamformed output. In addition, two pre-processed channel signals, corresponding to the two central microphones in subarray A2, are sent along to the host for the Audio Localizer task.
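A stripped-down, host-side view of the delay-and-sum operation described in Section 4.2.1.3 is sketched below. It is not the DSP32C implementation: it uses only whole-sample delays (as in Table 2) and omits the AGC, compensation, interpolation, and band-splitting stages; the speed of sound is an assumed constant.

```python
import numpy as np

FS = 22050         # sampling rate, Hz
C = 343.0          # assumed speed of sound, m/s

def delay_and_sum(channels, spacing, steer_deg):
    """Delay-and-sum beamforming with integral sample delays.

    channels  : (n_mics, n_samples) array of simultaneously captured signals
    spacing   : interelement spacing in meters (0.06 for the linear array)
    steer_deg : steering angle theta_0 in degrees
    """
    n_mics, n_samples = channels.shape
    tau = spacing * np.sin(np.radians(steer_deg)) / C      # Eq. (1), per element
    out = np.zeros(n_samples)
    for m in range(n_mics):
        shift = int(round(m * tau * FS))                   # integral sample delay
        out += np.roll(channels[m], shift)                 # circular shift, for brevity
    return out / n_mics                                    # normalize by N

# Example: eight channels of noise, beam steered to one of the discrete angles
x = np.random.randn(8, 4096)
y = delay_and_sum(x, spacing=0.06, steer_deg=15.67)
```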
A simple speech detector uses the running average of signal energy and a thresholder to determine when there is an actual signal or just background noise. In the absence of speech, the value of the signal energy of the beamformed output is used to update a running value of the background noise energy (E(N)). In the presence of speech, a cross-correlator is applied to the two channel frames and a location estimate, corresponding to a peak in the cross-correlation output, is computed. The value of the signal energy of the beamformed output (E(S+N)), containing both signal (S) and noise (N) components, is used in conjunction with the running background noise energy to compute the SNR. SNR is defined to be the ratio of the signal energy to the noise energy. Assuming that the signal and noise components are independent, we get the following expression for the SNR in dB:

\mathrm{SNR} = 10 \log_{10}\!\left( \frac{E(S+N)}{E(N)} - 1 \right)

The sound localization estimate and SNR value are sent to the Tracker task.

[Figure 16 block diagram: the beamformed output passes through preprocessing, energy computation, and speech detection to update the background noise estimate and the SNR, which feeds the SNR Queue (Tracker); the two sound channels pass through preprocessing, cross-correlation, and localization to produce the Sound Localization Distribution (Tracker).]

Figure 16: Audio Localizer block diagram. The sound channels and beamformed output are obtained from the Beamformer Task on the DSP board.

4.3 Visual Processing

The visual localizer obtains images from the camera through the frame grabber board and places them in a circular buffer. As described in Section 3.2.1, the images are used to update the background image and create a motion image. The motion image is computed by taking the absolute value of the difference of the current and background images, all computations being performed on a per-pixel basis. The image is then thresholded to produce a bilevel image, and then dilated to create regions of pixels corresponding to moving objects in the camera image. Dilation is a standard morphological image operation that removes spurious holes in an aggregate collection of pixels to create a uniform "blob" [31]. A segmentation routine utilizes a connected components (CC) clustering algorithm to further associate spatially localized blobs to form clusters. The CC algorithm simply associates all blobs that are adjacent to each other into a single cluster. A standard k-means algorithm is also employed to further segment the clusters into "object" candidates [32]. As was mentioned in Section 3.2.2, it is assumed that the desired target (referred to as the Tracked Object, or TO) has been determined previously. If not, the first single object after a period of inactivity is arbitrarily assigned to be the TO. The relevant information for the TO (position, pixel velocity, bounding box, and color histogram) as well as that of other candidates is maintained and updated on a frame-by-frame basis. The color histogram of the TO is used as a reference template to search among the current frame object candidates for the best match. The search is narrowed by considering the past positions and pixel velocities of the TO to predict the current location. Visual localization is accomplished by computing the centroid of all the constituent clusters of the TO. In the current design of the system, a one dimensional microphone array is used, resulting in azimuthal steering of the beam. Similarly, the camera is mounted centered and directly on top of the array at approximately eye level.
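The core of the motion-based localization just described can be summarized in the sketch below: update the running background, difference, threshold, dilate, and take the horizontal centroid of the moving pixels. It is a simplified stand-in, not the thesis implementation: it assumes grayscale float images, the update rate alpha and the threshold are illustrative (Equation (7) in Section 3.2.1 defines the actual background update), scipy's binary_dilation replaces the dilation operator used on the real system, and the connected-components/k-means segmentation and histogram matching stages are omitted.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def visual_localize(frame, background, alpha=0.05, thresh=25.0):
    """One step of a motion-based localizer: running background update,
    absolute difference image, threshold to a bilevel image, dilation, and the
    horizontal centroid of the resulting moving-pixel blob (or None if empty)."""
    background = (1 - alpha) * background + alpha * frame   # running background image
    motion = np.abs(frame - background)                     # per-pixel difference
    bilevel = motion > thresh                               # bilevel motion image
    blob = binary_dilation(bilevel, iterations=2)           # fill spurious holes
    ys, xs = np.nonzero(blob)
    x_centroid = float(xs.mean()) if xs.size else None      # horizontal component only
    return x_centroid, background
```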
Thus with visual localization only the horizontal component of the TO location needs to be determined. 7 It should be noted that the term object as used here is not related to that used in the image processing community, specifically in "object detection." No effort has been made to determine the identity or nature of the group of clusters. In this thesis, object detection refers to the process by which a group of clusters is aggregated into a single collection. 43 Frame Grabber Rotating Image Buffer Background Image Motion Image Threshold Reference Morphological Histogram Operators (Dilation) Histogram Matching Segmentation (CC, kmeans) Tracked Object (TO) ( Detected + 1Objects - -- Localization Visual Localization Distribution (Tracker) Figure 17: Visual Localizer block diagram. The segmentation routine incorporates both a connected components (CC) and a k-means segmentation algorithm. The Tracked Object and Detected Objects boxes indicate stored state information obtained from the segmentation. Likewise the Reference Histogram is a running color histogram of the TO. 44 Figure 18 shows sample output of the currently implemented visual object detector. The value here of a from Equation (7) (in Section 3.2.1) is low and therefore this is basically a motion detector. The vertical (green) line represents the output of the localizer, in image coordinates. With the current setup each image is 160 by 120 pixels with 24 bit color. The red and green boxes are the bounding boxes from the motion segmentation routine. The sub-image on the right side is the current TO. Note that, even with only a single moving speaker, there are still spurious motion pixels not associated with the target. These arise from, among other noise sources, flickering lights. Figure 18: Sample motion based localization output. The vertical line represents the computed location. The middle picture is the output of the motion segmentation routine. The rightmost picture is that of the currently tracked object (TO). 4.4 Tracker The tracker task (Figure 19) guides the microphone array beam based on the location estimates from both sound and vision localizers, using either or both modality depending on the existence of valid estimates and on a simple persistence model based on position and velocity. This allows it to handle cases when the TO may temporarily be hidden from view but still speaking, or when the TO is moving but not speaking. If there is inactivity for an extended period of time, the state model is reset and a new TO is selected. Localization estimates from both modalities are maintained in spatial distribution maps, and a single location value is computed for each modality, by finding peaks in the distributions 45 [12]. These maps have an integrating or "smoothening" effect on the raw localization estimates and make the tracker more robust to spurious noise data points. If both modalities indicate the same location, or if there is only one valid estimate, then that value is passed onto the Beamformer task. If each modality indicates different locations, a position estimate from the persistence model, which computes the likely location of the TO from the previously estimated location and the current (azimuthal) target velocity estimate, is obtained and compared to the ambiguous location estimates. If there is a match, that location estimate is sent to the Beamformer and the persistence model is updated with the new location estimate and target velocity. 
If there is no match, then the tracker has basically lost track of the TO and a new track object must be obtained using the SNR Map as described in Section 3.2.2. As will be discussed below, in all performed experiments the tracker never lost track of the TO so this feature was never used. 46 Sound Localization Distribution S NR Queue Vision Localization Distribution LocationSN Estimator Sing or Unambg uous Cues S Ambiguous Cues Map Predictor 4 Persistence On Track Lost Track SNR 4- Comparator Phase Conversion Phase (DSP) Figure 19: Tracker block diagram. The two localization distributions are updated asynchronously from the respective Localization tasks. 47 48 Chapter 5 Procedure As outlined in Section 2.3, the major contribution of this thesis is the comparison of various methods and techniques in visually-guided beamforming and tracking to determine the optimum system configuration; the goal of the project is to improve overall system performance given certain quantifiable constraints. This chapter will first discuss the system configurations and trial conditions that are the experimental variables. A discussion on the various measures used to evaluate performance will then be presented. Finally, the actual experimental setup and procedure will be given. 5.1 Experimental Variables 5.1.1 System Configuration Since there are several components to the entire system, evaluation of overall system performance must first start at the evaluation of each component in as close to an isolated configuration as possible. The three main components to be examined are the beamformer, multimodal localizers (Video and Audio), and tracker. The microphone array (and consequently, the beamformer task) can be operated in two modes, a single linear array and a compound multiarray. The linear array provides a finer spatial resolution, but has a limited frequency bandwidth. The multiarray has a wider, more uniform frequency bandwidth, but has much coarser spatial selectivity. 49 Most microphone array systems use only sound localization to guide the beam. This system has both audio and visual localizers, and their individual and relative performance must be measured. The tracker can accept input from either localizer, or both. By comparing its performance in these three configuration modes, we can determine which localizer results in better overall performance; we expect the visual localizer to provide more stable and accurate position information under most trial conditions. In addition, we need to determine if both localizers are required for optimal performance. Finally, a second microphone array was arranged to form a planar coverage area (as opposed to a radial coverage area with a single array) to determine how much of an improvement the addition of more elements (for a total of 16) will provide. Section 6.2.4.7 gives more details about the planar multiple array configuration, and Appendix C includes a design for a single array with 16 microphone elements. 5.1.2 Trial Condition In addition to varying the system configuration, the experimental trial condition can be varied by changing the stimulus presentation, which mainly involves changing the location of the speaker or the microphone array beam. The variations can be classified in two broad categories, static and dynamic. In the static condition, the speaker is located at a fixed position. In the dynamic condition, the speaker roams around the room in an unstructured pattern for each trial. 
For either condition, only one system configuration variable is changed at a time. This results in a large amount of data collection, but is necessary to isolate the experimental variable. By restricting the speaker's position, the static condition allows a broad range of experiments. The speaker may be located at various angles off the dead center array normal, and the beam itself may be guided directly at the speaker (on-beam condition) or away (off-beam). On-beam angles may correspond to integral sample delays, or involve non-integral (interpolated) delays.

5.2 Measurements

The primary measure used for overall system evaluation and local comparisons will be the performance, or word error rate, of the commercial speech recognizer. Intermediate measures will be utilized to compare performance improvements within each system component. For example, SNR will be used to compare the outputs of the single linear array and the compound array with the controls, the close-talking and one element microphones. Performance of the localizers will be compared by their percentages of correct position estimates.

5.2.1 SNR

One direct measure of the effect of a filter (in this case, the beamformer) is the improvement in the signal to noise ratio (SNR) of the input signal. SNR in dB is defined to be a log-ratio of the signal and noise variances, \sigma_s^2 and \sigma_n^2 respectively:

\mathrm{SNR} = 10 \log_{10}\!\left( \frac{\sigma_s^2}{\sigma_n^2} \right)    (8)

This expression is equivalent to that discussed in Section 4.2.2, since for a zero-centered waveform the variance is equivalent to the signal energy.

5.2.2 Position Estimates

The tracker and localizer tasks maintain a log of detected or tracked positions of object candidates on a per frame or per iteration basis. The visual and audio localizers perform computations on every frame, where a frame is defined to be an image for the former and 185 ms of sound samples for the latter. The tracker task computes the location of the TO each iteration, every 100 ms. In the static trial condition, the location of the speaker is known and fixed. By speaking and slightly moving in place, both the audio and video localizers will have valid position estimates. A simple measure of the percentage of estimates that coincide with the known location can be used to compare performance.

5.2.3 Word Error Rate

Performance of the speech recognition software ViaVoice is measured by word error rate, defined to be the ratio of erroneous words to total words. Erroneous words are defined to be words that are incorrectly added, omitted, or substituted by the speech recognition software [33]. We are concerned more with relative differences in WER between various configurations, as opposed to absolute values; absolute levels of performance can be improved by more training, better recognition software, etc.

5.2.4 Method and Controls

For each measure, a mechanism must be defined to identify and establish controls for the component variables. Table 3 lists the methods for collecting data as well as the specific control used for the measurement. To ensure that the same sound stimulus is present for both the control and variable, and to allow off-line analysis, whenever possible all speech output from the control microphones and the array was simultaneously and digitally recorded for later playback to the speech recognition software. Note however that due to physical limitations multiple trials are required; the linear and multi-array configurations cannot be tested simultaneously, as they require a physical reconfiguration of the array.
In this case the multiarray trial was performed immediately after the main trial (controls and linear array). As will be described in Chapter 6, a test was performed to determine if it is (statistically) justifiable to make comparisons across trials.

Measurement: SNR
  Control: Headset microphone close to speaker and one microphone element of array.
  Method: Simultaneous recording of controls and linear array output. Multiarray output is recorded separately.

Measurement: Positional Error
  Control: Predetermined, fixed positions at various angles and distances.
  Method: Discrete markings on floor. For localization comparisons, remain fixed in position.

Measurement: Word Error Rate
  Control: Predetermined text. Single close microphone.
  Method: Simultaneously record single and linear array output of recitals of same text. Multiarray output is recorded separately.

Table 3: Procedure and control for each performance measurement.

5.3 Experimental Setup

For most trials, a single microphone array is mounted on the wall of a medium-sized laboratory room (approximately 20'x30') at approximately eye level, at a height of 63". The camera's field of view covers practically the entire workspace. There are several background noise sources, including the air conditioning unit and cooling fans of a large rack-mounted computer system. The speaker is always the same person and the speech is read from a set of 39 untrained and previously trained sentences. In the static case the speaker is located approximately five feet from the array. In the dynamic condition the speaker is free to move around the room, but always faces the array when speaking. Figure 20 graphically represents the experimental setup for the static condition.

The speech recognition system, ViaVoice, requires substantial training for optimum performance; to reduce the speaker dependency of the results, only the bare minimum of the entire training set was used to train the speech recognition software. This does not affect relative comparisons of WER performance, as all experimental trials will be performed at the same level of training. Obviously, with more training, the actual WER values of all trials will be lower. The reduced training set of sentences taken from the user enrollment portion of the ViaVoice software was applied to three cases, one using the normal headset (close-talking) microphone and the other two using the microphone array outputs. Since the microphone arrays could be configured in two ways, linear and multiarray, separate training must be performed on each configuration.

[Figure 20 diagram: speaker positions at 0°, 10.4°, 15.7°, and 32.7°, with marked distances of 32" and 60"; the array is mounted at a height of 63".]

Figure 20: Experimental setup for single array, static condition. Arrows indicate speaker's facing direction.

Even though the speech recognition software is capable of real-time processing, for reasons discussed above the various system and control outputs are processed off-line using digital recordings. Every other component of the system (the beamformer, localizers, and tracker) was tested in real-time conditions. For the static condition trials, discrete positions were marked on the floor five feet away from the array, corresponding to the angles in the third column (d=.18) of Table 2. SNR, word error rate, and positional error measurements are obtained from these fixed positions. For the dynamic condition trials a free-form scenario, with the speaker roaming around the environment while speaking, was used. Since there would be too many variables for each trial, no attempt was made to fix the path, though a general pattern of (azimuthal) back-and-forth movement was used.
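As a concrete illustration of the position-estimate measure from Section 5.2.2, the short sketch below computes the percentage of localizer (or tracker) estimates that coincide with the known speaker location. The tolerance used to decide whether an estimate "coincides" is an assumption; the thesis does not state one.

```python
import numpy as np

def percent_correct(estimates, true_position, tol=0.5):
    """Fraction (in percent) of estimates, in integral-sample-delay units, that
    fall within tol of the known speaker position. tol=0.5 is an assumed value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean(np.abs(estimates - true_position) <= tol)) * 100.0

# e.g. a log of audio localizer outputs while the speaker stood at delay -3 (15.7 degrees)
log = [-3, -3, -2, -3, 0, -3, -6, -3, -3, -3]
print(f"{percent_correct(log, -3):.1f}% correct")
```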
Chapter 6

Results

6.1 Introduction

As was mentioned in Section 5.1.2, measurements were taken in two general experimental trial conditions, static and dynamic. The static condition refers to measurements taken when the sound source was spatially localized (i.e., stationary). Relevant quantitative measurements for the static condition include sound and visual localization output, signal-to-noise ratio, and word error rate. The dynamic condition refers to comparisons and measurements taken when the sound source is in motion. The primary quantitative measurement for the dynamic condition is the word error rate.

This chapter is organized as follows: comparisons and measurements for the static condition are given first, followed by those for the dynamic condition. The outputs of the audio and visual localizers are quantitatively compared. The localizer outputs are also fed separately to the tracker and the relative performance is also measured. SNR and WER measurements are then used to quantitatively compare different variations of the microphone array. For the dynamic condition the audio-only, video-only, and integrated tracker outputs are compared. The combined tracker output is also examined to determine the relative contribution of each modality to the final estimate. Finally, WER measurements in the dynamic condition are quantitatively compared and a final performance evaluation is made.

6.2 Static Condition

6.2.1 Signal-to-Noise Ratio

Signal-to-noise ratio (SNR) was defined in Section 5.2.1 to be the ratio (in dB) of the signal and noise variances. It is a useful quantity to compare the signal quality of the different microphone configurations. The noise variance is computed from small segments of each sample before the speech utterance begins. The signal variance is computed using an 11 ms sliding window across the sample in the presence of speech, in the manner described in Section 4.2.2. For each case, a single five second utterance ("This is a test.") was used. Table 4 summarizes the SNR results for each configuration, and Appendix A provides spectrograms for each utterance.

               Close-talking  One Element  Linear   Multiarray  Multiple        Multiple
               Mic.           Mic.         Array                Linear Arrays   Multiarrays
SNR Max (dB)   52.04          24.33        26.78    23.40       29.79           27.11
SNR Avg (dB)   23.17          10.76        17.33    17.06       18.48           17.87

Table 4: SNR (dB) data for each microphone configuration, with utterance, "This is a test."

As expected, the close-talking microphone has the best average and peak SNR, followed by the linear array. It is expected that the linear array produces better SNR than the multiarray, as the former has a narrower beam; the multiarray, with its wider beam, would pick up more noise from a greater spatial area than the linear array. The one element microphone should produce the lowest SNR values, and it is somewhat unexpected that it has a higher peak SNR than the multiarray; but the mean values are more in line with what we expect. Two additional sets of data, corresponding to the multiple array configurations described in Section 6.2.4.7, are also given. Having more microphones directed at the target should increase the SNR and improve performance. For both the linear array and the multiarray, doubling the number of elements has a marginal effect in improving SNR, but not enough to match the performance of the close talking microphone. Section 6.2.4.7 has more details on the multiple array configurations.
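The measurement just described can be made concrete with the sketch below: the noise variance comes from a pre-speech segment, the signal variance from an 11 ms sliding window during speech, and the SNR follows Equation (8). The segment boundary and the use of non-overlapping windows are assumptions for illustration.

```python
import numpy as np

def snr_stats(x, fs, speech_start_s, win_ms=11):
    """Peak and average SNR (dB) per Equation (8): 10*log10(sigma_s^2 / sigma_n^2).
    The noise variance is taken from the segment before the utterance begins;
    the signal variance from an 11 ms sliding window during speech. Treating
    everything after speech_start_s as speech is a simplification."""
    split = int(speech_start_s * fs)
    noise_var = np.var(x[:split]) + 1e-12
    win = int(win_ms * 1e-3 * fs)
    speech = x[split:]
    snrs = [10 * np.log10((np.var(speech[i:i + win]) + 1e-12) / noise_var)
            for i in range(0, len(speech) - win, win)]
    return max(snrs), float(np.mean(snrs))

# usage: x, fs = a mono recording of "This is a test." and its sampling rate
# snr_max, snr_avg = snr_stats(x, fs, speech_start_s=0.5)
```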
6.2.2 Localization Output As mentioned in Section 5.3, to test and compare the two localizers, a simple stimulus pattern (the speaker moving to and speaking at three locations corresponding to the integral sample delay angles, in order, 0, 15.7, and 32.7 degrees) was simultaneously provided to the audio and visual localizers. Figure 21 shows the output of the audio localizer, while Figure 22 is the output of the visual localizer. For both, the data points are in units of integral sample delays, which represent the normalized location of the speaker at the particular iteration. The visual image pixel coordinates have been converted into corresponding integral delays using a static look-up table. Negative sample delays indicate left of dead center. The time scales (x-axis) of each plot are different, as the two localizers run independently in separate tasks with different timing conditions. Iterations with no data points indicate no valid localization information (i.e., no speech sounds or no movement) at that particular moment. Figure 21: Normalized audio localization scatterplot. The y-axis is in units of integral sample delays (representing beamformed angles, with negative delay values corresponding to angles to the left of the array normal) and the x-axis is the iteration number of the localizer and roughly corresponds to time. Red points indicate the output of the localizer, and green points indicate actual position. Figure 22: Normalized visual localization scatterplot. The y-axis is in units of integral sample delays (representing beamformed angles, with negative delay values corresponding to angles to the left of the array normal) and the x-axis is the localizer iteration number. Red points indicate the output of the localizer, and green points indicate actual position. 57 It is evident from looking at the two plots that while both seem to be in general alignment and agreement in terms of locating the speaker at the three indicated locations, the audio localizer seems to be considerably noisier in terms of spurious location estimates. This suggests that the visual localizer, at least in the case of non-occluded TOs with simple movement patterns, is all that is required for tracking. Audio Localizer % correct Video Localizer % correct 43.2% 72.1% Table 5: Localizer accuracy. Numbers indicate percentage of localizer output that correctly identifies location of target. (See Figure 21 and Figure 22) Table 5 lists the accuracy of each localizer for this particular trial. Each respective localizer output is compared to the actual or known location of the speaker during the trial and a percentage of correctly identified target location is computed. It is evident from both the graphical and quantitative representations that the video localizer is considerably more accurate than the audio localizer. 6.2.3 Tracker Output The localizer outputs were then sent to the tracker separately. Figure 23 shows the output of the audio-only tracker. Compared to the audio localizer output shown in Figure 21, there is a definite "smoothening," due to the integrating properties of the tracker, which is necessary in directing the microphone array in a stable manner. At least the first two positions (0 and 15.7 degrees corresponding to delays of 0 and -3, respectively) are clearly evident. The third position (32.7 degrees, or a delay of -6) also appears near the end of the plot, although it is corrupted by other noise. Figure 24 shows the output of the visual tracker. 
Again the raw localizer output is smoothed, although the video localizer output (see Figure 22) was much smoother to begin with than the audio localizer output. The output of the video-only tracker coincides with the actual stimulus pattern much closer than the audio-only tracker. 58 Figure 23: Audio-only normalized tracker output. The y-axis is in units of integral sample delays (representing beamformed angles, with negative delay values corresponding to angles to the left of the array normal) and the x-axis is the tracker iteration number. Red points indicate the output of the tracker, and green points indicate actual position. Audio Tracker % correct Figure 24: Visual only tracker output. The yaxis is in units of integral sample delays (representing beamformed angles, with negative delay values corresponding to angles to the left of the array normal) and the x-axis is the tracker iteration number. Red points indicate the output of the tracker, and green points indicate actual position. Video Tracker % correct 66.1% 84.3% Table 6: Tracker accuracy. Numbers indicate percentage of tracker output that correctly identifies location of target. (See Figure 23 and Figure 24.) Table 6 lists the accuracy of each tracker for this particular trial. The percentage of correctly tracked target location is computed in a manner similar to that of the localizer outputs above. While the relative performance of the audio-only tracker compared to the video-only tracker is much better than with the audio and video localizers, it is still clear from both the graphical and quantitative representations that the video-only tracker is considerably more accurate and stable. The performance of the combined or integrated tracker in a dynamic condition is examined in Section 6.3.1. 59 6.2.4 WER Data Word error rate (WER) is a ratio of the number of (incorrect) word additions, deletions, and substitutions over the total number of words in the original speech. Nineteen sentences from the training set (312 words) and twenty untrained sentences (287 words) were used in all experiments. To calculate the WER for all combinations of configurations, a speech recognition software scoring program was used. NIST provides the sctk toolkit to do the scoring in a standard manner [34]. It is useful to separate previously trained and untrained utterances in measuring performance since speech recognition software naturally perform better on utterances on which they have already been trained. In what follows, all WER measurements are divided into three categories: trained, untrained, and total. In any quantitative comparison of stochastic systems, it is important to be aware of the statistical significance of any measurements. Significance tests start with the null hypothesis (H0 ) that there is no performance difference between the configurations being compared. The test then performs a specific comparison between the measurements and computes a "p" value, which is defined to be Pr(datalH),the probability of the observed (or more extreme) data given the null hypothesis [35]. The lower the value of p, the more likely that the null hypothesis can be rejected and that the difference between the observed measurement and the configuration with which it is being compared is significant [36]. The sctk toolkit includes the Matched Pairs Sentence-Segment Word Error (MAPSSWE) test, which is a statistical significance test, similar to the t-test, that compares the number of errors occurring in either whole utterances of segments of utterances. 
The MAPSSWE test uses somewhat standard thresholds of p=.001, p=.Ol, and p=.05 to determine the level of significance; measurements with a p value greater than .05 are considered statistically similar [37]. Section 5.2.4 discussed the necessity, due to experimental constraints, of testing whether there is any statistically significant difference between successive trials of utterances, with the system configuration and as much of the environmental condition as possible is kept constant. Section 6.2.4.2 below presents the justification for assuming speaker dependent changes in the utterance across trials do not significantly alter the results and conclusions of WER based comparisons. 60 6.2.4.1 Control Cases The close-talking headset microphone that was packaged with ViaVoice was used to provide the baseline control for all word error rate measurements. The performance for this control should be the highest of all experimental results, and represents the best possible case; the eventual goal for beamformed output performance is to match this level. Trained Set Untrained Set Total 16.5% 22.0% 19.2% Table 7: WER for control case. The control case is speech taken from a closetalking headset microphone. As expected, the trained set results in better performance than the untrained set. Note however that even for the trained set in ideal conditions, the errors are substantial. Another way of testing the effect of beamforming is to compare performance with a single element microphone. The performance for this control represents the worst case, and should be the lowest of all experimental results. For the following result a single microphone from the array was used to record speech. Trained Set Untrained Set Total 45.1% 63.9% 54.1% Table 8: WER for single element case. The speaker was located at dead center (angle 0). A single microphone closest to the center of the array was chosen. Again, the untrained set leads to greater errors than the trained, though for both cases the output is almost unintelligible. Together, the close-talking headset and single element microphone cases provide a range within which the various configurations of the microphone array can be compared. 61 6.2.4.2 Trial Variation Since it is impossible with the current setup to take all the data for every configuration and condition variations simultaneously from one speech trial, multiple trials must be performed, each time changing a single configuration or condition. To see if there is a statistically significant difference in the speaker generated speech (and background noise) across trials, two trials using the linear array while fixing all other controllable variables were performed. Figure 25 shows that there is no significant difference across trials for this particular case. Unear Array0 - Redo Unear Arraya Trained LI near Ar r yO - Redo LI near Ar rayO Untrai ned Uneo, Array0 - Redo Unear Arrayl0 00.0% 00.0%._ _ _ __ _ _ _ 40.0% 50.0%. 40.0% 40.0% 35.0% 40.0% 33.7% *30.4% 130.0% 30.0%. 20.0% 20.0% 20.0% 10y0% 10.0%. 10y0% 0.0% 0.0% *mem &Wytaned raef ISO bl - 0.0% 11inow eryO untr Mnod a ledola~untrailed bern -M0 rf b o Figure 25: WER data for linear array for two trials: trained, untrained, and total data sets, respectively. For each chart, the left bar represents the original trial and the right bar represents the second trial. The gray bars indicate no statistical difference in the results across trials. 
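For reference, the WER figures reported in this chapter are produced with the NIST sctk scoring tools; the sketch below only illustrates the underlying metric from Section 5.2.3, a word-level edit-distance alignment that counts substitutions, insertions, and deletions against the reference text. It is an illustration, not the sctk implementation.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / number of reference words,
    computed with a standard word-level edit-distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("this is a test", "this is test of"))  # 0.5
```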
Of course, we can not state generally with absolute confidence that there is no difference across trials, as human speech is extremely variable across even a short span of time. Furthermore, background noise is a constantly changing element that may also induce variations. The best that can be done is to perform as few trials as necessary to minimize such differences.

6.2.4.3 Configuration Variations

There are two major array configurations that are examined, the linear array and the multiarray. (A third variation, the multiple microphone arrays, will be discussed separately in Section 6.2.4.7.) We compare performance of each with respect to the two controls mentioned above, as well as with each other (see Table 9 and Figures 26-28).

Configuration   Trained Set   Untrained Set   Total
Linear Array    31.1%         30.4%           30.8%
Multiarray      24.8%         31.7%           28.1%

Table 9: WER for linear array vs. multiarray configuration. In both cases, the speaker was located dead center from the array.

Looking at Figure 26, we can see that on all data sets (trained, untrained, and total), the performance of the linear array is in between that of the two controls, and that the differences between the three are statistically significant at least to the p=0.1 level. The multiarray performs similarly (Figure 27). An interesting point is that the performance of the multiarray is somewhat closer to that of the close-talking microphone compared to the linear array.

Figure 26: WER data for linear array: trained, untrained, and total data sets, respectively. For each chart, the bars are arranged in the order: close talking microphone, linear array, and one microphone element. The middle gray value (linear array) is the basis for statistical significance testing. Red, green, and blue bars indicate a significance level of p<=0.001, p<=0.01, and p<=0.05, respectively.

Figure 27: WER data for multiarray: trained, untrained, and total data sets, respectively. For each chart, the bars are arranged in the order: close talking microphone, multiarray, and one microphone element. The middle gray value (multiarray) is the basis for statistical significance testing. Red (grid), green (vertical lined), and blue (horizontal lined) bars indicate a significance level of p<=0.001, p<=0.01, and p<=0.05, respectively.

It is not evident from just Table 9, however, which array configuration performs better. Performing the MAPSSWE test (Figure 28) reveals that statistically, there is no difference in the performance of the two configurations, at least for the case of a stationary, on-beam, dead-center sound source.

Figure 28: WER data for linear array vs. multiarray: trained, untrained, and total data sets, respectively.
For each chart, the left bar represents the linear array and the right bar represents the multiarray. Statistically, there is no difference between the two in performance. 64 6.2.4.4 Angular Variations Both the linear array and multiarray were tested at various on-beam conditions, where the guided beam and the physical location of the speaker were aligned. Three angles (0', 15.70, and 32.7'), representing integral sample delays for the linear array and two of the multiarray subarrays, were chosen, and the results are given in Table 10 and Table 11. We would expect the results to be roughly similar regardless of angle. Angle Trained Set Untrained Set Total 00 31.1% 30.4% 30.8% 15.70 30.1% 38.5% 34.1% 32.70 27.5% 30.1% 28.7% Table 10: WER for on-beam, linear array. The three angles represent integral sample delays. Angle Trained Set Untrained Set Total 00 24.8% 31.7% 28.1% 15.70 28.5% 27.7% 28.1% 32.70 32.5% 31.8% 32.2% Table 11: WER for on-beam, multiarray. The three angles correspond to integral sample delays for the linear array as well as the A2 (mid frequency) and A3 (low frequency) subarrays of the multiarray. Figure 29 indicates that for all data sets the linear array performs similarly, which is expected. Likewise, Figure 30 indicates that in general the multiarray performs similarly across angles and data sets. However, in one specific case, angle 32.70 on trained data, there is a difference in performance with respect to the baseline dead center (angle 00) case. This can possibly be explained by the fact that the beamwidth is proportional to the directed angle [26]; as the angle increases, more interference noise will enter the beamformed signal. However; the statistical significance of the difference is at a level p=0.05, which is near the usual threshold of 65 what is considered to be significant; given that all other cases in this configuration are statistically similar, no firm conclusion about the difference in performance can be made. LA-LA34A6 Trkined LA04A3-LA6 Untraind 00.0% 8D.0% 50.0% 60.0% 30.1% 31.1% 77 30.0% 20.0% 10.0% 10.0% lieaayO Vraned kwnw~ay traned 3.% 30.4% 30.0% 75 20.0% 0.0% -0.0% 50.0% 40.0% 40.0% 0.0% imay~ra6 brabned LAO4A3-A6 - linwrayo Unnakned 40.0%34.1% -28.7% 301% WON-aaray3 unkaid Hmneraay6 unnakwed 0.0% Figure 29: WER data for linear array at different angles: trained, untrained, and total data sets, respectively. For each chart, the bars are arranged in the order: 00, 15.70, and 32.70. Statistically, there is no difference in performance for the three angles for both data sets. MA0-MA3MA 60.0% Traknd MA0.AA1-MA2 MA04MA14-A2 60.0% n 60.0% 50.0% 40.0% 40.0% 31.7% 32.5% 30.0% 2.% 28.5% 31.8% 2.% 30.0% P 3.%28.1% 28.1% SM2D.0% 20.0% 10.0% 0.0%1 J 00 mWWW MLW3 Muwy 0.0% Figure 30: WER data for multiarray at different angles: trained, untrained, and total data sets, respectively. For each chart, the bars are arranged in the order: 00, 15.70, and 32.70. In the trained data set, there is a statistically significant (at p<=O.1) difference between the performance at angle 0 and at angle 32.7. Otherwise, there is no statistically signficant difference between the angles in both data sets. 6.2.4.5 Beamforming Variation One way of testing the efficacy of beamforming is to compare performance between results from the on-beam case and the off-beam case. Doing so also qualitatively tests the spatial 66 selectivity of the beam. We expect to see a sharp drop in performance (i.e., increased WER) for the linear array, which has a tighter beam than the multiarray. 
Case Trained Set Untrained Set Total On-beam 31.1% 30.4% 30.8% Off-beam 65.9% 68.1% 66.9% Table 12: WER for linear array, on and off-beam. For both cases, the beam was guided to dead center (angle 00). For the off-beam case the speaker was located at angle 15.7'. Case Trained Set Untrained Set Total On-beam 24.8% 31.7% 28.1% Off-beam 30.2% 39.2% 34.6% Table 13: WER for multiarray, on and off-beam. For both cases, the beam was guided to dead center (angle 00). For the off-beam case the speaker was located at angle 15.70. Comparing the results in Table 12 and Table 13 confirms our expectations; there is a significant drop-off in performance on all data sets between the on-beam and off-beam case for the linear array. The drop-off for the multiarray is not as significant (p<=O. 1), due to the broad beam of the multiarray, and is also expected. 67 LA-LAO9 Unalned LA-LAOS Trined LA-LAO 681% -0.0% 60.0% -0.0%O 50.0% 40.0% 50.0% 40.0% 40.0% 3.% 30.0% 30.4% 30.8% 3.% 20.0% 20.0% 10.0% 10.0% 20.0% 10.0% 10.0% 10.0% O Figure 31: WER data for linear array, on-beam and off-beam: trained, untrained, and total data sets, respectively. For each chart, the left bar represents the on-beam case and the right bar represents the off-beam case. Red (grid) bar indicates significance level of p<=0.001. MA-MAO MA-MA06 Trahned 00.0%f 60.0% 50.0% 50.0% 40,Y% 40.0% Untraied MAMAO 00-04W.0% 30.0% 20.% 24.9% 302 30.0% 30.0% 2.1 20.0 - 10.0% 31.7% - 0.0% 10.0% 10.0% m0.0%mnW fWwoO tW 0.0% - -"". Figure 32: WER data for multiarray, on-beam and off-beam: trained, untrained, and total data sets, respectively. For each chart, the left bar represents the on-beam case and the right bar represents the off-beam case. Blue (horizontal lined) bar indicates significance level of p<=0.05. 6.2.4.6 Interpolation Variation In Section 4.2.1.2, it is claimed that interpolation is necessary to obtain intermediary angles corresponding to non-integral sample delays. To determine whether such interpolation is really necessary, WER results from a beamform angle corresponding to a non-integral sample delay (10.40) were compared with that from a location corresponding to the nearest integral sample 68 delay (15.7*, see Table 14). We expect the non-interpolated WER will be significantly higher than that of the interpolated results. Since there will always be some amount of interpolation in the multiarray case (subarray Al always has interpolation), only the linear array was examined. Case Trained Set Untrained Set Total Interpolated 31.5% 32.5% 32.0% Non-Interpolated 44.5% 40.0% 42.4% Table 14: WER for interpolated vs. non-interpolated linear array case. For both cases, the speaker was located at angle 10.4'. For the former, the beam was guided to the same angle, while for the latter, the beam was guided to the closest angle possible (15.70). LA2 I0 Trained4A2 NI Trained LA2 60.0% Int Untrained4A3 Nit Untrained LA2 Int-LA3 NI 60.0% 50.0% 445% 42.4% 40.0% 4A 40.0% 30.0% 4300% 20.0% 10.0%3 0.0% - fineararray2 int trained lineararray3 ni trained 200 o.0% finearanay2 int untrained 0% 10.0% 0 10.0% I 30.0% 0.0%L knwararray3 Ni untrained fineaany2 it lier r "n Figure 33: WER data for linear array, interpolated and non-interpolated: trained, untrained, and total data sets, respectively. For each chart, the left bar represents the interpolated case and the right bar represents the noninterpolated case. Green (vertical lined) bar indicates significance level of p<=0.01. 
Figure 33 indicates that there is a significant difference (p<=0.01) between the interpolated and non-interpolated case, which is as expected. Thus interpolation is required to obtain better performance at non-integral sample delay angles. 6.2.4.7 Multiple Array Configuration Obviously, as the number of microphone elements increase, the SNR and hence WER performance should improve. One extra configuration that was examined was that of two eight 69 element microphone arrays arranged at right angles (see Figure 34). Each array is an independent entity, with its own DSP board and beamformer. Both multiple linear arrays and multiple multiarrays were examined. To simplify data collection, only a single location, the point where the dead center normals of the two arrays intersect, was tested. More rigorous experimentation will be left for future work. 16" 32" - !4 48" arrays @ 63" height Figure 34: Multiple array configuration. The speaker is located at the intersection of the dead center normals of both arrays and is facing at 45 degrees from both. Configuration Trained Set Untrained Set Total Linear Array 31.1% 30.4% 30.8% Multiple Linear Arrays 17.1% 31.1% 23.8% Table 15: WER for linear array vs. multiple linear arrays configuration. In the linear array case the speaker was pointed directly towards the single array. In the latter case the speaker was pointed in the direction 45 degrees off dead center for both arrays. 70 Configuration Trained Set Untrained Set Total Multiarray 31.7% 30.4% 28.1% Multiple Multiarrays 33.3% 31.7% 29.2% Table 16: WER for linear array vs. multiarray configuration. In the multiarray case the speaker was pointed directly towards the single array. In the latter case the speaker was pointed in the direction 45 degrees off dead center for both arrays. As expected, having multiple linear arrays improves performance as compared to a single linear array. However, the performance gain seems to occur only in the trained data set. Furthermore, there does not appear to be a corresponding improvement for the multiarray configurations; neither the trained nor the untrained data sets are significantly different. There a few possible explanations for the above results. First, the current system was not designed to incorporate multiple arrays using independent computing platforms. While it is easy to add localizer and tracker tasks to the system, adding an additional beamformer task requires that the additional DSP board be added to the same host as the original board. In addition, a DSP board based synchronizer needs to be implemented to synchronize between the 16 or more channels that are simultaneously being captured. Currently, only a simple post processing synchronizer has been implemented. 71 Unear Array - Multiple Linear ArrayC Trained Linear ArrayC - Multiple Arreyc Untrained 60.0% 00.0% 50.0% 5D.0% 40.0% 40.0% Mu"pl Lnear Arrmy 0 60.0% a 50.0% 40.0% 31.1% 13.% 30.0% Linear Arrfy" - 30.8% 31.1% 30.4% 30.0% 2D.0% 20.0% 2X0% 10.0% 10.0% 0.0% nemf0W 0.0% rbed mulrW VWtraaned Figure 35: WER for untrained, and represents the multiple linear p<=0.001. The p<=0. 05 . MultiarrayO - Multiple Multlarray0 Trained 60.0% WwnwrarayO un*raWnd mulWpO 100 Wntried 0.0% "Mea Wrray MuL*0pl Wa linear array vs multiple linear arrays case: trained, total data sets, respectively. For each chart, the left bar single linear array and the right bar represents the arrays. 
The red (grid) bar indicates significance level of blue (horizontal lined) bar indicates significance level of MuloirrayO - Mufti*l Mularray0 Untrained Muftlerrayo - Multiple Multiarrayc 80.0% t 50.0% 50.0% 40.0% 40.0% -0.0% 31% 20.0% 10.0% 20.0% 10.0% --- 10.0% 0.0% L mnfiwnayO tmhied mufip* MOO tahned fmkltffayO untraied muln*pl MRO untned 0.0% MUM~rry mutdp loa Figure 36: WER for multiarray vs multiple multiarrays case: trained, untrained, and total data sets, respectively. For each chart, the left bar represents the single multiarray and the right bar represents the multiple multiarrays. 6.2.5 Summary In general, most static condition results given above have been as expected. It should be noted that given current speech recognition technology, totally perfect recognition is an unattainable goal even with the best microphone configuration. Best performance, measured in SNR as well as WER, is obtained using the headset microphone, followed by the linear array and 72 the multiarray. The single element microphone is practically useless in a dynamic acoustic environment. Both the linear array and multiarray have comparable performance on a straight on-beam situation; both perform approximately 50% worse than the "ideal" close talking microphone. Performance as measured in WER is not dependent on angle for either array, and is significantly better for trained data sets than untrained. It has been confirmed that the multiarray is much less spatially selective, as was predicted by the calculations in Section 3.1.4. This is both an advantage and disadvantage. The multiarray is more tolerant than the linear array with off-beam sound sources, which is helpful if the TO is not exactly on-beam, but harmful if other noise sources are nearby. The lower SNR of the multiarray compared to the linear array is another manifestation of this characteristic. Not much analysis can be made regarding the multiple array configurations at the present time. At least for the multiple linear array case there appears to be evidence that performance is enhanced with two arrays over one. However, more sophisticated synchronization mechanisms are necessary before more conclusive statements can be made. 6.3 Dynamic Condition In the static condition, measuring system performance is made easy by the fact that the speaker is stationary at well-defined and ideal locations (corresponding to integral sample delays). In the dynamic condition, a single speaker moves in front of the linear array or the multiarray in a free-form manner (roaming) to simulate a person walking around the room and dictating. The speech was taken from the same trained and untrained data sets of the static condition experiments. Unless the tracker perfectly tracks the speaker, errors will be introduced to the speech recognition on the order of the static non-interpolated or off-beam conditions. 6.3.1 Tracker Output In the combined or integrated track mode, the tracker incorporates both audio and visual localizer information to generate a target location estimate in a manner described in Section 4.4. Figure 37 and Figure 38 show portions of the integrated tracker output, the normalized direction of the beam, for two roaming trials with a linear array and multiarray, respectively. The tracker output is overlaid with the outputs of the individual modal trackers. 73 Figure 37: Integrated tracker output with a Figure 38: Integrated tracker output with a linear array. 
The y-axis is in units of integral sample delays (representing beamformed angles, with negative delay values corresponding to angles to the left of the array normal) and the x-axis is the tracker iteration number. Red points indicate the audio localizer based tracker output, blue points indicate video localizer based tracker output, and the purple line indicates the actual tracked output. multiarray. The y-axis is in units of integral sample delays (representing beamformed angles, with negative delay values corresponding to angles to the left of the array normal) and the x-axis is the tracker iteration number. Red points indicate the audio localizer based tracker output, blue points indicate video localizer based tracker output, and the purple line indicates the actual tracked output. It is evident from the plots that, for the most part, the integrated tracker used the output of the video localizer based tracker. There are also portions where both the audio and video based trackers agree. Table 17 lists the actual percentages of the individual modal tracker usage by the integrated tracker in the two roaming trials. Configuration Video used Audio used Video only Audio only Linear array 97.8% 29.3% 69.5% 1% Multiarray 98.1% 22.4% 76.4% 0.64% Table 17: Modal tracker usage by the integrated tracker 74 Only a small percentage of integrated tracker positions were determined solely by the audio localizer based tracker, though it did provide corroborating information in a considerable percentage. In the final analysis, the video localizer based tracker works remarkably well, enough that for the most part, the audio based tracker is redundant or unnecessary. Further experiments are necessary, with more variations in stimuli, to determine conclusively the relative merits of each modal tracker. 6.3.2 WER Data Table 18 gives the WER scores for a roaming speaker and a linear array under three tracker configurations: audio-only, video-only, and integrated. The same trained and untrained sets from the static condition experiments are used. It is expected that all WER values will be similar or worse than those obtained in the static condition experiments, as there are additional sources of error that will cause speech signal degradation; a mismatch between the actual location of the speaker and the array beam leads to increased recognition errors. In any tracking system with a moving target, tracking delays or outright errors will cause such mismatches. We further expect the combined and video-only tracker performances to be similar, as the previous section showed that the video localizer has the greatest contribution to the integrated tracker output. In addition, the audio-only tracker performance should be significantly worse. We expect the resulting increased WER values to be within the range of values for the static noninterpolated or static off-beam cases (see sections 6.2.4.5 and 6.2.4.6). Configuration Trained Set Untrained Set Total Audio info only 55.3% 58.2% 56.7% Video info only 30.0% 36.6% 33.0% Combined info 34.3% 37.0% 35.6% Table 18: WER for linear array with roaming speaker, with audio only, video only, and combined information for the tracker. Input from trained, untrained, and total set. 75 As expected, the video only and integrated tracker cases are statistically similar (See Figure 39), and the audio tracker case is significantly worse. 
The reason for this is evident from the sample audio tracker output plot (Figure 23), which shows an output that occasionally strays very far from the actual speaker location. The video and combined tracker outputs, on the other hand, are for the most part stable and show the tracker correctly following the speaker; consequently the WER values are close to that of the linear array in the static condition. Linear Array Roam Trained: Audio - Video Combined LinearArray Roam Untrained: Audio - Video Combined o o0.0% Unear Array Roam Total: Audio - Video Combined 60.0% 50.0% 50.0% 40.0%34.3% 30.0.%% 30.0% 20.0% 20.0% 10.0% 10.0% 50.0%a 0.0% 1L I I I I 0.%lknoamAu lIOnraVu KirvoamAMu % oamA 1nroamv lkwoa AV Figure 39: WER for linear array with roaming speaker and audio only, video only, and combined input to tracker. The charts are arranged in the order: trained, untrained, and total data sets. For each chart, the left bar represents the audio only case, the middle bar represents the video only case, and the right bar presents the combined case. The red (grid) bar indicates significance level of p<=0.001. Table 19 gives the WER scores for a roaming speaker and the multiarray under the three tracker configurations. We expect similar trends to that of the linear array, with values higher than that of the multiarray in the static condition. Configuration Trained Set Untrained Set Total Audio info only 42.9% 51.1% 46.7% Visual info only 36.5% 44.6% 40.3% Combined info 35.9% 43.3% 39.3% Table 19: WER for multiarray with roaming speaker, with audio only, video only, and combined information for the tracker. Input from trained, untrained, and total set. 76 Figure 40 shows that while the video and combined trackers perform better than the audioonly tracker, all three tracker outputs are similar across all data sets, at least statistically. It appears that the multiarray is less sensitive to the tracking errors of the audio tracker than the linear array. On the other hand, the video-only and combined trackers seem to be performing worse than with the linear array. MultiArray Roam Trained: Audio - Video Combined - MultiArray Roam Untralned: Audio - Video Combined MuNlArray Roam Total: Audio - Vkleo - Combined 60.0% a0.0% 51.1% 60.0% 50.0% A- 40,3% 39.3% nh*kwv MukkoaffkAV 40.0% 30.0% - 2D.0% 40 - 460066 oke~ mjoa0% 10.0% 0.0% nvikKoafflA Figure 40: WER for multiarray with roaming speaker and audio only, video only, and combined input to tracker. The charts are arranged in the order: trained, untrained, and total data sets. For each chart, the left bar represents the audio only case, the middle bar represents the video only case, and the right bar presents the combined case. Figure 41 provides a direct comparison between the linear array and the multiarray for a roaming speaker. The results are mixed; the multiarray appears to perform statistically better than the linear array with an audio-only tracker, but worse with a video-only tracker. The combined tracker performs similarly with the linear array combined tracker. 77 . ...... ............. ... .. .... . ...... ... . Linear Array Room Audio - MultiArray Roam Audio Linear Array Linear Array Roam Combined - MulliArray Room Combined %0.0% M7 50.0% Room Video - MultiArray Room Video 50.0% 30 40.0% 40.0% 20.0% 00% 10.0% mvjkkoanA - 10.0% 0.0% HnroafA 39.3% 0 andrownO *VmanV - 1u1 n0% - flnooAV A muftoenAV Figure 41: WER for linear array vs multiarray case with roaming speaker. 
The charts are arranged in the order: audio only, video only, and combined input to tracker. For each chart, the left bar represents the linear array result, and the right bar presents the multiarray result. The blue (horizontal lined) bar indicates significance level of p<=0.05. 6.3.3 Summary In comparing the two array configurations, we are basically comparing the tradeoff between two inversely related array characteristics, spatial selectivity and frequency bandwidth. The multiarray has a broader frequency response but less spatial selectivity due to a wider beam. The linear array has a narrower frequency response but better spatial selectivity. The performance of the different trackers determines which tradeoff is better suited. With the more unstable and error-filled output of the audio tracker, the broader beam and less spatial selectivity is an advantage, and the multiarray performs better than the linear array. With the more stable video tracker, better spatial selectivity is more advantageous in eliminating interfering background noise, so the linear array performs better. Ideally, the combined tracker should perform better than either of the individual trackers, but in practice it performs at least as well as the best individual tracker (in this case video). 6.4 Overall Summary For many experiments there were no clear or statistically significant advantages for either array configuration. For others, the multiarray had a slight advantage, primarily due to the broader spatial resolution (see Section 6.2.4.5). In a typical real-world application, with a moving speaker and the currently implemented system, the multiarray is the best choice. Improvements in the 78 localizers and trackers, or increasing the number of microphone elements may change the relative merits of each configuration, and further experiments are necessary. In absolute terms, no configuration, including the close-talking microphone, currently perform at a level acceptable for real world applications. However, in a realistic implementation more time and effort would be expended to tailor the system to the particular speaker; the speech recognition software can be trained much more to improve overall recognition scores. 6.5 Additional/Future Work A DSP board level synchronizer will be implemented to allow more rigorous testing with multiple arrays. In addition, various configurations of multiple arrays (right angle, planar, etc.) will be examined. More experiments can also be performed to test other aspects of the system. For example, more complex roaming patterns and additional sources of visual and audio noise are possible. In the analysis of microphone array response, the array shading coefficients, ai in Equation (4), have been assumed to be constant and unity, which is equivalent to a rectangular window FIR filter. Again from signal processing theory, it has been determined that the best tradeoff between beamwidth and sidelobe level using static coefficients are those based on the Chebyshev window [6]. Different approaches can be employed to adapt the array response to the specific conditions of the environment. One of the criticisms of simple delay-and-sum beamformers is that besides the natural attenuation of signals lying outside of the mainlobe, there is no activefiltering of interfering sources [38]. Interfering or competing noise sources include speech from other people or coherent sounds like air conditioning units. 
A common approach is to adaptively modify the shading coefficients a_i using a least mean squares (LMS) based algorithm that adjusts the nulls in the array response so that their spatial locations correspond to those of the noise sources [26]. Still other approaches seek to reduce the effect of reverberation, which is caused by reflections of the source signal off walls and objects. Most dereverberation techniques require an estimate of the acoustic room transfer function, which represents how the source signal is modified by the physical environment. In theory, convolving the inverse of this function with the microphone output will remove the effects of the reflections; in practice, computing the direct inverse is often not stable or even possible. One computationally efficient method, matched filter array processing, uses the time reverse of the transfer function as a "pseudo-inverse" and results in improved SNR of the beamformed signal [5].

Chapter 7 Conclusion

A real-time tracking and speech extraction system can be immensely useful in any intelligent workspace or human-computer interaction application. This thesis presents a simple, low-cost, modular system that automatically tracks and extracts the speech of a single, moving speaker in a realistic environment. Just as important as the implementation of the system, several experiments have been performed to test the system and to explore the tradeoffs involved in changing various system variables, including the microphone array configuration and the use of the visual and audio modalities. While neither the linear array nor the multiarray has proven to be overwhelmingly better than the other, the multiarray, with a broad frequency response but a coarser spatial resolution, appears to perform slightly better.

We sought to show that multimodal integration provides benefits over the traditional means of locating the speaker, sound localization. The experimental results have shown that visual localization is a very powerful sensory modality that renders audio localization a redundant modality at best, and even an unnecessary one in certain circumstances. More experiments, involving a more complex environment with multiple speakers, are necessary to determine whether the reverse holds in other circumstances.

Several possibilities exist for future expansion and work. Adding and testing the adaptive and dereverberation features discussed in Section 6.5 is the immediate next step. Different configurations of both the visual and audio hardware may also be examined. Currently, the microphone array is linear and can therefore handle only one dimension of spatial separation. A two-dimensional array capable of handling two spatial dimensions is possible with the development of a DSP board-level synchronizer. Similarly, only a single camera is currently being used. More precise tracking and object detection can be performed with multiple cameras placed at different locations.

Appendix A Speech Spectrograms

For all of the following spectrogram figures, the utterance was "This is a test." Red denotes high signal energy, while blue indicates low energy.

A.1 Controls

Figure 42: One-channel microphone spectrogram

Figure 43: Close-talking microphone spectrogram

It is evident from the spectrogram plots that the close-talking microphone output has a greater contrast (red to blue) between high and low signal energy than the one-channel microphone output, which is mostly red or orange. Having a higher contrast corresponds directly to having a higher SNR, which confirms the results of Section 6.2.1.
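The spectrograms in this appendix are standard short-time Fourier analyses of the recorded waveforms. The sketch below shows, under assumed parameters (Hann window, 512-sample frames, 50% overlap; none of these values are taken from the thesis), how a log-magnitude spectrogram of this kind can be computed.

```python
import numpy as np

def spectrogram(x, fs, win_len=512, hop=256):
    """Log-magnitude short-time Fourier transform of a mono signal x.

    Returns (times, freqs, s_db), where s_db[f, t] is signal energy in dB.
    Window length and hop are illustrative defaults, not the values used to
    generate the spectrogram figures in this appendix.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # power spectrum
    s_db = 10.0 * np.log10(spec.T + 1e-12)                # dB, freq x time
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + win_len / 2) / fs
    return times, freqs, s_db
```

Plotting s_db with a red-to-blue colormap reproduces the convention used here: a wider spread between the red speech harmonics and the blue background corresponds to a higher effective SNR, which is the informal comparison made in Sections A.1 through A.3.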
83 ... . .... . ..... .... ...... A.2 Single Array Configurations Figure 44: Linear array spectrogram Figure 45: Multiarray spectrogram The linear array spectrogram appears to have a higher SNR and better frequency resolution, which is also in agreement with the results of Section 6.2.1 and our understanding of the benefits of the linear array over the multiarray. It is surprising that the multiarray works as well as was found in the experiments given the poor SNR and frequency resolution. 84 .. ...... . A.3 Multiple Array Configurations Figure 46: Multiple spectrogram linear arrays Figure 47: spectrogram Multiple multiarrays In the multiple arrays case there is a very clear increase in SNR compared to the single array case for both the linear and multiarray. The much improved spectrogram for the multiple multiarray case over the single multiarray case is at odds with the actual WER results, where it was found that performance was not much improved. A probable explanation is that the synchronization of the two arrays happened to coincide in this very short sample, while it drifted over the course of a much longer speech trial. 85 86 Appendix B Speech Sets and Sample Results B.1 Actual Text B.1.1 Trained Set 1. to enroll you need to read these sentences aloud speaking naturally and as clearly as possible, then wait for the next sentence to appear 2. this first section will describe how to complete this enrollment 3. additional sections will introduce you to this continuous dictation system and let you read some other entertaining material 4. the purpose of enrollment and training is to help the computer learn to recognize your speech more accurately 5. during enrollment you read sentences to the computer the computer records your speech and saves it for later processing 6. during training the computer processes your speech information helping it to learn your individual way of speaking 7. read normal text with any pauses you need such as for or any time you need to take a breath 8. but be sure to read hyphenated commands like like a single word with no pause 9. how can you tell if the computer has correctly understood the sentence you just read? 10.as you read a sentence check it to see if it turned red 11.if the computer understood what you said the next sentence to read will appear automatically 12.if the sentence you just spoke turned red then the computer did not understand everything you said 13.the first time this happens with a sentence try reading the sentence again 14.if it happens again click the playback button to hear what you just recorded 15.and pay special attention to the way you said the words and to any strong background noise 16.if everything sounds right to you click the next button to go to the next sentence you do not need to fix all red sentences 17.if you heard anything that did not sound right try recording the sentence one more time 87 18.if all or most of your sentences are turning red click the options button 19.then move the slider for match word to sound closer to approximate B.1.2 Untrained Set 1. consider the difficulties you experience when you encounter someone with an unusual accent 2. or if someone says a word which you don't know or when you can't hear someone clearly 3. fortunately people use speech in social situations 4. a social setting helps listeners figure out what speakers are trying to convey 5. in these situations you can exploit your knowledge of english and the topic of conversation to figure out what people are saying 6. 
first you use the context to make up for the unfamiliar or insufficient acoustic information 7. then if you still can't decipher the word you might ask the person to repeat it slowly 8. most people don't realize how typical it is for people to use context to fill in any blanks during normal conversation 9. but machines don't have this source of supplementary information 10.analyses based on meaning and grammar are not yet powerful enough to help in the recognition task 1l.therefore current speech recognition relies heavily on the sounds of the words themselves 12.even under quiet conditions the recognition of words is difficult 13.that's because no one ever says a word in exactly the same way twice 14.so the computer can't predict exactly how you will say any given word 15.some words also have the same pronunciation even though they have different spellings 16.so the system can not determine what you said based solely on the sound of the word 17.to aid recognition we've supplemented the acoustic analysis with a language model 18.this is not based on rules like a grammar of english 19.it is based on an analysis of many sentences of the type that people typically dictate 20.context also helps in distinguishing among sets of words that are similar in sound although not identical 88 B.2 Headset (Close) Microphone Data Set B.2.1 Trained Set Results 1. To enroll you need to read the sentences is aloud speaking naturally in as clearly as possible, then wait for the next sentence to appear (rei_1) 2. This first section would describe how to complete this enrollment (rei 2) 3. Additional sections will introduce you to this continues dictation system and (let) you read some other entertaining material (rei_3) 4. The purpose of enrollment and training is to help the computer learn to recognize your speech more accurately (rei 4) 5. During enrollment you read censuses to the computer Computer records to speech and sees it for later processing (rei_5) 6. During training the computer processes is speech information hoping it to learn your individual lawyers speaking (rei 6) 7. Read normal text with any pauses you need such as for or any time in need to take a breath (rei 7) 8. But be sure to be hyphenated commands like but a single word with no pause (rei 8) 9. How can you tell the computer screen is to the sense just read can? (rei 9) 10.As you to sentence ticket to see if it turned red (rei 10) 11.If the computer understood what he said the next sentence read appear automatically (rei 11) 12.If the sentencing judge spoke to and read and computer did not understand everything said (rei 12) 13.The first time this happens with the sentence try reading the sentence again (rei 13) 14.If that happens again click the playback button to hear what you just recorded (rei_14) 15.And pay special attention to the way said the words and to any strong background noise (rei 15) 16.If everything sounds right to you click the next but then to go to the next sentence he did not need to fix all read sentences (rei 16) 17.If you heard anything did not sound right track recording the sentence one more time (rei 17) 18.If all almost is sentences that turning red click the options but some (rei 18) 19. Then move tihe slider for a match with the sound closer to approximate (rei_19) 89 B.2.2 Untrained Set Results 1. Consider the difficulties expense with intent to someone with an unusual accent (rei 51) 2. Or if someone says a word which don't know when he can't hear someone clearly (rei 52) 3. 
Fortunately people use speech in social situations (rei 53) 4. A social setting helps listeners figure out what speakers are trying to convey (rei 54) 5. In these situations you can exploit your knowledge of English at the topic of conversation to figure out what people are saying (rei 55) 6. Foresees a context to make up for the unfamiliar or insufficient acoustic intimation (rei_56) 7. Beneath the sick and decipher the word you might ask the person to repeat slowly (rei 57) 8. Most people don't realize how to acquit is for people accuse context to felony blanks to normal conversations (rei_58) 9. But machines don't have this was a supplementary information (rei 59) 10.Analyses based on meaning Grandma how not yet powerful enough to help in the recognition task (rei_60) 11.Therefore current speech recognition relies heavily on the sounds of the words themselves (rei_61) 12.Even on a quiet conditions the recognition of words is difficult (rei 62) 13.That's because no one ever says a word in exactly the same way twice (rei 63) 14.Seven computer can't predict exactly how you say any given word (rei 64) 15.Some words also have the same pronunciation even though their defense spellings (rei 65) 16.Service system cannot determine what you said based solely on the sound of a word (rei_66) 17.To aid recognition with supplemented acoustic announces with the line which model (rei 67) 18.This is not based on rules like a grandmother English (rei 68) 19.Is based on an analysis of many sentences of a type that people to Berkeley dictate (rei 69) 20.Context also helps in distinguishing among assets of words that a similar and sound of the not identical (rei_70) 90 B.3 Single Element B.3.1 Trained Set 1. To enroll you need to be the sentences fallout speaking naturally and as clearly as possible and wait for the next sentence to appear (rei_1) 2. This first section would describe how the complete this and Rome (rei_2) 3. Additional sessions will introduce you to this continuation to use efficient system the only reason other entertainment (rei 3) 4. The purpose of enrollment and training is to help the computer learn to recognize is the more accurately (rei_4) 5. During enrollment fee resemblances to the computer Peter of course feature SEC for later processing (rei 5) 6. During training the computer processes your speech definition helping it to learn your individual is speaking (rei 6) 7. Make no more intensive if any pause is unique such as four is for any time in the fifth (rei 7) 8. But be sure to read hyphenated commands let me have raft when the single word with no is behalf (rei_8) 9. How can you tell me if the computer has correctly understood the sentence you just read (rei_9) 1O.As you read a sentence a ticket to see if it turned red (rei 10) 11.The computer others that were used the the next sentence to read will appear automatically (rei 11) 12.If the Senate considers both turned red (rei_12) 13.The first time this happens for sentence try reading the sentence again (rei_13) 14.It happens again with Iraq but here we just reported (rei_14) 15.A special attention to the ways of the words and to many strong background noise (rei 15) 16.Getting some 40 but the next button to go to the next up The money to fix all right set (rei 16) 17.Here and in the the not sound right time recording the Senate, (rei 17) 18.It almost is sentences are turning red, the auctions by (rei 18) 19.Then move this order or a match for the sound of to to oppose (rei_19) 91 B.3.2 Untrained Set 1. 
Severe difficulties this fence running counter someone with an unusual accents (rei 51) 2. For someone says a word of tonal and here's some clearly (rei_52) 3. Fortunately he please finish in social situations (rei 53) 4. For social setting helplessness figure out what speakers are trying to convey (rei_54) 5. In these situations finance for your knowledge of English hero with it was saying (rei 55) 6. For Caesar conquered from the fund familial were insufficient of Pacific nation (rei 56) 7. Then if is to have to suffer the work he might as the person to peacefully (rei 57) 8. To speed with unrealized profit to is for people whose contents Of the land any flights during normal conversation (rei_58) 9. machines that have resources of woman fifth majore (rei 59) 10.Analyses based on meaning and drummer for an idea how far enough to hold further recognition has (rei_60) 11.Therefore current speech recognition was heavily on.is also worth saw (rei 61) 12.In another part Commission the recognition of words as the " (rei 62) 13.Best because no one ever says a word is that is simply point (rei 63) 14.Still popular tactic SEC economists say any import (rei_64) 15.Some words or so as in pronunciation even though their incomes by (rei_65) 16.Service system cannot summon ways that is so we are some of the word (rei_66) 17.To recognition result: the purpose but announces what language, (rei 67) 18.Is not based on Wall by the Liberal mobbing which (rei_68) 19.is based on analysis of many senses of retired (rei 69) 20.Complex also hopes of distinguishing among sense of words (rei_70) 92 B.4 Linear Array, On-beam Angle=O B.4.1 Trained Set 1. To enroll in need to the face sentences aloud naturally and as clearly as possible and wait for the next sentence to appear (rei_1) 2. This first section were describe how to complete this enrollment (rei 2) 3. Additional sections will introduce you to this continues dictation 3) system and the reason other entertaining material (rei 4. Purpose of enrollment and training is hoped the computer learn to recognize your speech more accurately (rei_4) 5. during enrollment you read sentences to the computer Peter of (rei_5) for later processing court to speech and Caesar 6. during training in the computer processes your speech defamation 6) to learn your individual lawyers speaking (rei hoping it 7. Lead normal text with any pauses unique scissors for or any time any take a breath (rei_7) 8. But he sure to read hyphenated commands like with a single word with no pause (rei_8) 9. Hundreds tell the computer has correctly understood the sentence (rei 9) just read? (rei 10) 10.Busily the sentence check in see if it turned red 11.Is the computer understood we said the next sentence should read will appear automatically (rei 11) 12.If the sentence judge spoke to and red The not understand it in is set (rei_12) 13.The for is time this happens for the sentence try reading the sentence again (rei_13) 14.If it happens again put that there that want to hear we just reported (rei_14) 15.That the special attention to the way said the words after any strong background Alex (rei_15) 16.If everything sounds trite you click the next autumn to go to the red Sox (rei 16) next sentence money you do not need to fix all 17.To let a that the not sound right track recording the sentence one more time (rei 17) 18.Is almost is sentences are turning red but the options but (rei_18) closer to boxed (rei_19) 19.Then move the slider for much work to sell 93 B.4.2 Untrained Set 1. 
Consider the difficulties experience an encounter someone with an unusual accent (rei 51) 2. Or someone says a word which don't know you can't hear someone clearly (rei 52) 3. Fortunately people use speech in social situations (rei 53) 4. The social setting hopes listeners figure out what speakers are trying to convey (rei 54) 5. In these situations even explain the knowledge of English and the topic of conversation to figure out for St in (rei_55) 6. First peace conference to make up for the unfamiliar or insufficient the acoustic information (rei_56) 7. Then he still can't decipher the word you might pass the person to repeat slowly (rei 57) 8. Most people don't realize how to put is for people to use context to fill and enable lines during normal conversations (rei 58) 9. But machines don't have this source of supplementary information (rei 59) 10.Analyses based on meaning and Grandma not yet offer enough to help in the recognition task (rei_60) 11.Therefore current speech recognition was heavily on the sounds of the words and soaps (rei_61) 12.Even under what conditions the recognition of words is to (rei_62) 13.That's because no one of us as a word and set is in which buys (rei 63) 14.To the computer can't predict exactly how you say any deport (rei 64) 15.Some words also of the same pronunciation even though they defense blowups (rei 65) 16.Service system can not determine ways that based solely on the sound of a word (rei_66) 17.To late recognition will supplemented the acoustic announces with the language of (rei 67) 18.This is not based on goals by the ballot in which (rei 68) 19.Is based on analysis of many sentences but the type people to predict (rei 69) 20.Context also helps stocks finished in amounts of words is similar in sound can not adapt (rei_70) 94 B.5 Multiarray, On-beam Angle=O B.5.1 Trained Set 1. Drumroll of you need to read these sentences alloud, speaking naturally as clearly as possible and wait for the next sentence to appear (rei_1) 2. This for a section with describe how to complete this and Roman (rei_2) 3. Additional sections will introduce you to this continuous dictation system and let me read some other entertaining material (rei 3) 4. The purpose of enrollment and training is to help the computer learn to recognize your speech more accurately (rei_4) 5. Turn enrollment Sentences to the computer As a set for later processing (rei_5) 6. During training the computer processes speeds information cocaine to learn your individual lawyers speaking (rei_6) 7. Read normal text with any pauses the need such as for were any time any typical crime (rei_7) 8. Be sure to read hyphenated commands like what the single word no pause (rei_8) 9. 
And needs no of the computer has correctly understood the sentence just read (rei_9) 10.As you read a sentence check in to see if it turned red (rei 10) 11.If the computer understood we said the next sentence to read will appear automatically (rei_11) 12.If the center suggests to red and computer can not understand levees said (rei_12) 13.The first time this happens with a sentence try reading the sentence again (rei_13) 14.If it happens again click the playback button to hear what you just recorded (rei_14) 15.Pay special attention to the way you said the words and to any strong background bonds (rei_15) 16.If everything sounds right to you click the next one to close the next sentence if the money to fix all red a sentence (rei 16) 17.If you heard of anything that the not sound right track recording the sentence one more time (rei 17) 18.All most of the sentences are turning red click the options but (rei 18) 19.Then move the slider for match were to sell closer to box (rei_19) 95 B.5.2 Untrained Set 1. Consider the difficulties experience an encounter someone with an unusual access (rei 51) 2. Or someone says a word to Donald note: can't hear someone clearly (rei 52) 3. Fortune and the people use speech and social situations (rei 53) 4. A social setting helps listeners figure out what speakers are trying to convey (rei_54) 5. The situations begin exporting your knowledge of English topic of conversation to figure out what people sang (rei_55) 6. Firsts use the context to make up for the unfamiliar or insufficient but this information (rei 56) 7. Then if you still can't decide for the word As the person to repeat slowly (rei 57) 8. Most people don't realize how to put it is for people to use context of felony blanks to normal conversations (rei_58) 9. Machines don't have the source of supplementary information (rei 59) 10.Analyses meaning and Grandma not yet for now to have been the recognition past (rei 60) 11.therefore current speech recognition relies heavily on the sounds of the words themselves (rei_61) 12.Even on a quiet conditions for recognition awards is difficult (rei 62) 13.Us because no one ever says a word in excess of the same way towards (rei 63) 14.So the computer can't predict exactly how you say any thing worth (rei 64) 15.Some words most of the same pronunciation even though they have defense balance (rei 65) 16.Services and can not determine what you said based solely on some of the word (rei_66) 17.To name recognition we've supplemented the acoustic announces with a language more (rei 67) 18.this is not based on rules for Grumman of English (rei 68) 19.Is based on analysis of many sentences of the tight the people of the dictate (rei 69) 20.Context also hopes of distinguishing among sense of words and a similar unsolved problem not identical (rei_70) 96 Appendix C 16 Element Array Design d2 Sub-array 3 d Figure 48: Microphone placement for 16 element compound array. Additional elements can be easily added to the system by incorporating extra DSP32C boards, which have built-in inter-board data sharing capabilities [39]. Figure 48 shows the configuration for a 16 element compound array. Interelement spacing is the same as for the eight element case. Figure 49 and Figure 50 are the high and mid frequency subarray beam patterns, respectively. The low frequency subarray is unchanged from the one in the eight element compound array. 97 51 15 . ..... 0 0 15 2 ..... 18 . 2 0 ... 0.. - 0 2 21 [10,0.02,8000] I 0 1 [10,0.02,2000] 1 0 1. 
Figure 49: High Frequency Sub-Array Pattern (polar beam patterns, panels [10,0.02,500] through [10,0.02,8000])

Figure 50: Mid Frequency Sub-Array Pattern (polar beam patterns, panels [10,0.06,500] through [10,0.06,8000])

References

[1] Durlach, N. I. and Mayor, A. S., "Virtual Reality: Scientific and Technological Challenges." Washington, D.C.: National Academy Press, 1995, pp. 542.
[2] Shockley, E. D., "Advances in Human Language Technologies," IBM White Paper, 1999.
[3] Brookner, E., Tracking and Kalman Filtering Made Easy. New York: John Wiley & Sons, Inc., 1998.
[4] Flanagan, J. L., Berkley, D. A., Elko, G. W., West, J. E., and Sondhi, M. M., "Autodirective Microphone Systems," Acustica, vol. 73, 1991.
[5] Rabinkin, D. V., "Optimum Sensor Placement for Microphone Arrays," Ph.D. dissertation, Dept. of Electrical and Computer Engineering, Rutgers, State University of New Jersey, New Brunswick, NJ, 1998, pp. 169.
[6] Lustberg, R. J., "Acoustic Beamforming Using Microphone Arrays," M.S. thesis, Dept. of Electrical Engineering and Computer Science, Cambridge: MIT, 1993, pp. 72.
[7] Bub, V., Hunke, M., and Waibel, A., "Knowing Who to Listen to in Speech Recognition: Visually Guided Beamforming," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995.
[8] Teranishi, R., "Temporal aspects in hearing perception," in Handbook of Hearing, Namba, S., Ed. Kyoto: Nakanishiya Shuppan (in Japanese), 1984.
[9] Cole, R. A., Mariani, J., Uszkoreit, H., Zaene, A., and Zue, V., "Survey of the State of the Art in Human Language Technology," National Science Foundation, 1995, pp. 590.
[10] Silsbee, P., "Sensory Integration in Audiovisual Automatic Speech Recognition," presented at the 28th Asilomar Conference on Signals, Systems and Computers, 1994.
[11] Bregler, C., Omohundro, S., and Konig, Y., "A hybrid approach to bimodal speech recognition," presented at the 28th Asilomar Conference on Signals, Systems and Computers, 1994.
[12] Irie, R. E., "Multimodal Sensory Integration for Localization in a Humanoid Robot," presented at the Second IJCAI Workshop on Computational Auditory Scene Analysis, Nagoya, Japan, 1997.
[13] Irie, R. E., "Multimodal Integration for Clap Detection," NTT Basic Research Laboratory, Japan, Internal Report, 1998.
[14] Knudsen, E. I. and Brainard, M. S., "Creating a Unified Representation of Visual and Auditory Space in the Brain," Annual Review of Neuroscience, vol. 18, pp. 19-43, 1995.
[15] Stein, B. E. and Meredith, M. A., The Merging of the Senses. Cambridge: MIT Press, 1993.
[16] Meredith, M. A., Nemitz, J. W., and Stein, B. E., "Determinants of multisensory integration in superior colliculus neurons," Journal of Neuroscience, vol. 7, pp. 3215-29, 1987.
[17] Bracewell, R., The Fourier Transform and Its Applications. McGraw-Hill, 1986.
[18] Chou, T. C., "Broadband Frequency-Independent Beamforming," M.S. thesis, Dept. of Electrical Engineering and Computer Science, Cambridge: MIT, 1995, pp. 105.
[19] Inoue, K., "Trainable Vision based Recognizer of Multi-person Activities," Dept. of Electrical Engineering and Computer Science, Cambridge: MIT, 1996, pp. 79.
[20] Knapp, C. H. and Carter, G. C., "The Generalized Correlation Method for Estimation of Time Delay," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-24, 1976.
[21] Omologo, M. and Svaizer, P., "Acoustic Source Location in Noisy and Reverberant Environment using CSP Analysis," presented at ICASSP 96, 1996.
[22] Rosenberg, A. E. and Soong, F. K., "Recent Research in Automatic Speaker Recognition," in Advances in Speech Signal Processing, Furui, S. and Sondhi, M. M., Eds. New York: Marcel Dekker, 1992, pp. 701-738.
[23] Murase, H. and Nayar, S. K., "Visual Learning and Recognition of 3-D Objects from Appearance," International Journal of Computer Vision, vol. 14, pp. 5-24, 1995.
[24] Zhang, Z. and Faugeras, O., 3D Dynamic Scene Analysis. Springer-Verlag, 1992.
[25] Swain, M. J. and Ballard, D. H., "Color Indexing," International Journal of Computer Vision, vol. 7, pp. 11-32, 1991.
[26] Johnson, D. H. and Dudgeon, D. E., Array Signal Processing: Concepts and Techniques. NJ: Prentice Hall, 1993.
[27] Lee, J., "Acoustic Beamforming in a Reverberant Environment," Dept. of Electrical Engineering and Computer Science, Cambridge: MIT, 1999, pp. 64.
[28] Dudgeon, D. E. and Mersereau, R. M., Multidimensional Digital Signal Processing. New Jersey: Prentice Hall Inc., 1984.
[29] Goodwin, M. M. and Elko, G., "Constant Beamwidth Beamforming," presented at the Proceedings of the 1993 IEEE ICASSP, 1993.
[30] Oppenheim, A. V. and Schafer, R. W., Discrete-Time Signal Processing. New Jersey: Prentice Hall, 1989.
[31] Parker, J. R., Algorithms for Image Processing and Computer Vision. New York: Wiley Computer Publishing, 1997.
[32] Gose, E., Johnsonbaugh, R., and Jost, S., Pattern Recognition and Image Analysis. NJ: Prentice Hall PTR, 1996.
[33] Rabiner, L. and Juang, B.-H., Fundamentals of Speech Recognition. New Jersey: Prentice Hall, 1993.
[34] NIST, "SCTK NIST Scoring Toolkit," version 1.2, NIST, 1998.
[35] USGS, "The Insignificance of Statistical Significance Testing," USGS Northern Prairie Wildlife Research Center, 1999.
[36] Gillick, L. and Cox, S., "Some Statistical Issues in the Comparison of Speech Recognition Algorithms," presented at ICASSP 89, 1989.
[37] Pallett, D. et al., "Tools for the Analysis of Benchmark Speech Recognition Tests," presented at ICASSP 90, 1990.
[38] Haykin, S., Adaptive Filter Theory. New Jersey: Prentice Hall, 1996.
[39] Signalogic, SIG32C-8 System User Manual. Texas, 1994.