Novel scheme of real-time direction finding and tracking of multiple speakers by robot-embedded microphone array

Daobilige Su†, Masashi Sekikawa†, Kazuo Nakazawa‡ and Nozomu Hamada‡

†Signal processing Lab, School of Integrated Design Engineering, Keio University
‡Department of System Design Engineering, Keio University
Hiyoshi 3-14-1, Yokohama 223-8522 Japan
{su, sekikawa}@hamada.sd.keio.ac.jp, {nakazawa, hamada}@sd.keio.ac.jp

Abstract. Interest in artificial robot audition has recently been growing as a means of developing human-robot interaction. The main purposes of an artificial audio system mounted on a mobile robot are localizing sound sources, separating the speech signal of a particular speaker such as the robot's master, and processing speech sources to extract useful information such as the master's spoken commands. This paper reports a novel speaker direction tracking algorithm and its realization as a real tracking system on a mobile robot. The basic approach belongs to the category of direction finding known as sparseness-based, which employs time-frequency decomposition and the disjointness between different speech signals. The novel points of the proposed source tracking are a reliable data selection from time-frequency cells and the application of mean shift tracking to the kernel density estimator derived from these reliable time-frequency components. A wheel-based mobile robot with a built-in audio processing system is developed. Experiments are conducted and demonstrate the ability to localize in real environments.

Keywords: robot audition, sound source localization, microphone array, kernel density estimator, mean shift tracking

1 Introduction

Sound source tracking for robot audition systems has drawn considerable attention recently [1]-[6].
The same issues have also been studied in a wide range of applications, such as automatic camera steering for surveillance and video conferencing, and acoustic discrimination between individuals in real multi-speaker environments. In this paper we are particularly interested in tracking a speaker's direction of arrival (DOA) with a microphone array mounted on a mobile robot, where both the sources and the sensors may move. In addition, the method should be sufficiently robust against environmental noise and other speakers' voices. That is, better directional sensitivity and better anti-interference ability are required. Under the assumption of fixed source locations, a large number of studies have been proposed. Among them, approaches based on the sparseness of time-frequency (T-F) components in speech signals are very attractive because of their capability to handle the underdetermined case, where the sources outnumber the sensors. This study is also based on the sparseness of speech signals in the T-F domain. In the context of a fixed sensor array, several methods of tracking moving sources have been proposed. The most popular approaches are adaptive beamforming tracking [4] and the particle filter or sequential Monte Carlo method [4][6]. These methods treat two kinds of localization issues: tracking the source position, such as exact position tracking in an environment, and tracking the source direction. The former problem can be solved by combining the estimates of individual microphone arrays. Another type of approach integrates two types of microphone arrays, one mounted on a mobile robot and the others on the room's walls [2]. Acoustic source tracking systems for mobile robots have been developed in the context of moving sensors and sources. This moving sensor/source problem differs from the fixed source/sensor problem especially in the amount of data available, because of real-time processing.
As is well recognized in general estimation problems, the use of reliable data promises an accurate result. Thus, deleting outliers from the observations is effective for statistical estimation algorithms. Basically, the T-F sparseness approach attempts to cluster the distribution of T-F cells, each of which is labeled with spatial cues such as direction angles. Outliers therefore disturb the clustering process, which then tends to fail in detecting source positions. T-F cell selection has been proposed in [7] and, by one of the present authors, in [8] and [9]. This paper presents a modified T-F cell selection related to [9]. The main contributions of this paper are twofold.
(a) Applying the mode-seeking mean-shift algorithm to the directivity likelihood, expressed as a kernel density estimator, for finding time-varying source directions.
(b) Implementing the above audition system on a mobile robot, and demonstrating and evaluating the tracking capability through real experiments.
The rest of this paper is structured as follows. In section 2, the proposed DOA estimation method for arbitrary array configurations is presented, and our direction tracking algorithm based on the mean shift algorithm and the kernel density estimator is proposed. Section 3 summarizes the implementation of our mobile robot and the signal processing system mounted on it. In section 4, tracking experiments and their results are demonstrated, and the paper is concluded in section 5.

2 Proposed method

2.1 Speaker's direction finding by arbitrary microphone arrangement

Observation Model: Consider an array of M microphones with an arbitrary configuration satisfying the spatial non-aliasing condition, under which the inter-sensor distance is bounded, for instance, a 4 cm inter-sensor distance for 8 kHz sampling.
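The 4 cm figure can be checked against the usual half-wavelength rule for avoiding spatial aliasing. The following is a minimal sketch, assuming a speed of sound of 340 m/s (a value not stated in the paper) and that the highest frequency of interest is the Nyquist frequency; the function name `max_sensor_spacing` is illustrative only.

```python
def max_sensor_spacing(fs_hz, c=340.0):
    """Largest inter-sensor distance (m) free of spatial aliasing.

    Assumes the half-wavelength rule d <= lambda_min / 2, with
    lambda_min taken at the Nyquist frequency fs / 2.
    c is an assumed speed of sound in m/s.
    """
    f_max = fs_hz / 2.0           # highest frequency present after sampling
    wavelength_min = c / f_max    # shortest wavelength in the signal
    return wavelength_min / 2.0   # half-wavelength bound


spacing = max_sensor_spacing(8000.0)
print(f"{spacing * 100:.2f} cm")  # prints 4.25 cm
```

Under these assumptions the bound is 4.25 cm, so a 4 cm spacing lies just inside it.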
The observation $x_m(\tau)$ at the m-th microphone ($m = 1, \dots, M$) is modeled by the following convolutive mixture of N source signals $s_i(\tau)$ ($i = 1, \dots, N$):

$$x_m(\tau) = \sum_{i=1}^{N} \sum_{k \ge 1} h_{mi}(k)\, s_i(\tau - k) \tag{1}$$

where $h_{mi}(k)$ is the impulse response from the i-th source to the m-th microphone. The method proposed in this paper is applicable to the underdetermined case where the sources outnumber the sensors, namely N > M. In the time-frequency domain obtained by the STFT (Short-Time Fourier Transform), the mixed observation $x_m(\tau)$ of (1) is represented as

$$X_m(k, l) = \sum_{i=1}^{N} H_{mi}(l)\, S_i(k, l) \tag{2}$$

where $X_m(k, l)$ and $S_i(k, l)$ are the STFTs of $x_m(\tau)$ and $s_i(\tau)$ respectively, $H_{mi}(l)$ is the DFT of $h_{mi}(k)$, l is the index of the discrete frequency bin, and k is the time-frame index. Let $\mathbf{r}_m = [x_m, y_m, z_m]^T$ ($m = 1, \dots, M$) denote the location of the m-th sensor in 3-D space, and assume without loss of generality that the first sensor is located at the origin ($\mathbf{r}_1 = \mathbf{0}$).

Time Delay and Phase Difference: A source direction vector, referred to as the propagation direction vector, is defined by

$$\mathbf{a}(\varphi, \theta) = \begin{bmatrix} \sin\theta \cos\varphi \\ \sin\theta \sin\varphi \\ \cos\theta \end{bmatrix} \tag{3}$$

where $\varphi$ ($-\pi < \varphi \le \pi$) denotes the azimuth angle of the source direction, $\theta$ ($-\pi/2 \le \theta \le \pi/2$) denotes the elevation angle, and $\mathbf{a}(\varphi, \theta)$ lies on the unit sphere. An acoustic signal with propagation direction vector $\mathbf{a}(\varphi, \theta)$ induces time delays of arrival $\delta_m$ ($m = 2, \dots, M$) between the m-th sensor and the first (reference) sensor as follows:

$$\delta_m = -\mathbf{r}_m^T \mathbf{a} / c \tag{4}$$

where c is the speed of sound. In vector-matrix form this reads

$$\boldsymbol{\delta} = -\frac{\mathbf{R}\,\mathbf{a}(\varphi, \theta)}{c} \tag{5}$$

where $\boldsymbol{\delta} = [\delta_2, \dots, \delta_M]^T$ and $\mathbf{R} = [\mathbf{r}_2, \dots, \mathbf{r}_M]^T$. Define the following phase difference vector from the observations (the frame index k is omitted for brevity):

$$\boldsymbol{\psi}(l) = [\psi_{12}(l), \dots, \psi_{1M}(l)]^T, \quad \psi_{1m}(l) = \angle X_m(l) - \angle X_1(l) \tag{6}$$

Then from (5) we have the following relationship:

$$\boldsymbol{\delta} = \frac{1}{\omega(l)}\, \boldsymbol{\psi}(l) \tag{7}$$

where $\omega(l) = \dfrac{2\pi l}{L} f_s$, $f_s$ is the sampling frequency, and $L$ is the number of DFT points.
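The chain from a direction to ideal phase differences in (3), (5) and (7) can be sketched numerically. This is only an illustration under stated assumptions: the sensor layout, the speed of sound of 340 m/s, the 512-point frame and the chosen bin are hypothetical values, and the function names are not from the paper.

```python
import numpy as np


def direction_vector(phi, theta):
    """Unit propagation direction vector a(phi, theta) as in (3)."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])


def delays(R, a, c=340.0):
    """Delays of sensors 2..M relative to sensor 1, as in (5)."""
    return -(R @ a) / c


def omega(l, fs, L):
    """Angular frequency (rad/s) of bin l for an L-point DFT."""
    return 2.0 * np.pi * l * fs / L


def phase_differences(R, a, l, fs, L, c=340.0):
    """Ideal phase-difference vector: (7) rearranged as psi = omega * delta."""
    return omega(l, fs, L) * delays(R, a, c)


# Hypothetical array: sensor 1 at the origin, sensors 2..4 placed
# 4 cm away along the x, y and z axes (rows of R are r_2, r_3, r_4).
R = 0.04 * np.eye(3)

a = direction_vector(np.deg2rad(30.0), np.deg2rad(60.0))
psi = phase_differences(R, a, l=64, fs=8000.0, L=512)
print(psi)  # one phase difference (rad) per non-reference sensor
```

With this spacing the phase differences stay inside $(-\pi, \pi)$ for every bin up to the Nyquist bin, which is what the non-aliasing condition is meant to guarantee.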
Phase Difference and propagation vector: Let us define a mapping

$$\mathcal{R}: \mathbf{a}(\varphi, \theta) \to \boldsymbol{\psi}(\varphi, \theta) \tag{8}$$

Specifically, $\mathcal{R}$ can be represented by

$$\mathcal{R}(\mathbf{a}(\varphi, \theta)) = -\frac{\omega(l)}{c}\, \mathbf{R}\, \mathbf{a}(\varphi, \theta) \tag{9}$$

The inverse operation of $\mathcal{R}$, that is

$$\mathcal{R}^{-1}: \boldsymbol{\psi}(\varphi, \theta) \to \mathbf{a}(\varphi, \theta) \tag{10}$$

can be obtained by our previous method proposed in [9], which exploits Gram-Schmidt orthogonalization in $\boldsymbol{\psi}$-space.

Reliable T-F cell selection: In real observations, due to the interaction of multiple sources, environmental noise and computational errors in the STFT, the estimated phase difference vector $\hat{\boldsymbol{\psi}}(k, l)$ (the hat denotes an estimated value) does not give a propagation vector $\hat{\mathbf{a}}$ on the unit sphere. When we set

$$\hat{\mathbf{a}}(k, l) = \mathcal{R}^{-1}[\hat{\boldsymbol{\psi}}(k, l)] \tag{11}$$

reliable T-F cells, denoted by $(k, l) \in A$, are selected by the following rule:

$$1 - \varepsilon < \lVert \hat{\mathbf{a}}(k, l) \rVert < 1 + \varepsilon \tag{12}$$

where $\varepsilon$ is a sufficiently small positive value. In the multiple-source case, all $\hat{\mathbf{a}}(k, l)$, $(k, l) \in A$, would form N clusters, each of which corresponds to one of the sources.

Kernel density estimator: Besides the cell selection above, the power threshold and the consistency criteria [8] are applied to determine a set of reliable propagation direction vectors. We then apply the kernel density estimator, or Parzen window technique, to this set of data; the local maximum points, or modes, of the resulting density function correspond to the source directions. The estimator is applied solely to the set of selected data denoted by $(\hat{\varphi}_{k l_i}, \hat{\theta}_{k l_i})$, $i = 1, \dots, I$, where k is the time frame index of the observation, $l_i$ is the frequency bin of the underlying reliable cell, and I is the number of T-F cells in the set A. The kernel density estimator with respect to the direction angles can be formulated as follows.
$$\hat{p}(\varphi_k, \theta_k) = \frac{1}{I} \sum_{i=1}^{I} \frac{1}{h_\varphi(l_i)\, h_\theta(l_i)}\, K\!\left( \frac{\varphi_k - \hat{\varphi}_{k l_i}}{h_\varphi(l_i)},\ \frac{\theta_k - \hat{\theta}_{k l_i}}{h_\theta(l_i)} \right) \tag{13}$$

where $h_\varphi(l_i)$ and $h_\theta(l_i)$ are the bandwidths of the 2-D kernel function $K(\cdot, \cdot)$ with respect to $\varphi$ and $\theta$ respectively, which can be represented by

$$h_\theta(l_i) = \left. \frac{h}{\omega(l_i)\, \lVert \cos\theta \cos\varphi\, \mathbf{r}_x + \cos\theta \sin\varphi\, \mathbf{r}_y - \sin\theta\, \mathbf{r}_z \rVert} \right|_{\varphi = \hat{\varphi}_{k l_i},\ \theta = \hat{\theta}_{k l_i}}, \quad h_\varphi(l_i) = \left. \frac{h}{\omega(l_i)\, \lVert -\sin\theta \sin\varphi\, \mathbf{r}_x + \sin\theta \cos\varphi\, \mathbf{r}_y \rVert} \right|_{\varphi = \hat{\varphi}_{k l_i},\ \theta = \hat{\theta}_{k l_i}} \tag{14}$$

where $\mathbf{r}_x$, $\mathbf{r}_y$ and $\mathbf{r}_z$ are the columns of $\mathbf{R}$, and the base bandwidth $h$ is determined by experiment. See [9] for more detail.

Fig. 1. Flow of the proposed system operation

2.2 Direction tracking by mean-shift algorithm

The DOA estimation problem discussed in 2.1 reduces to seeking the local maxima, or modes, of the density estimator (13). In our system the mean-shift algorithm is employed for this purpose because of its low computational cost. In general, the mean shift algorithm is an effective clustering and mode-seeking technique that requires no prior knowledge of the number of clusters and does not constrain the cluster distribution [10]. Our autonomous robot system needs to find and track the direction of a specific person, the robot master. The direction of the source is represented by its azimuth and elevation angles, and these time-varying values are recursively estimated by the mean shift algorithm. In our formulation the individual source directions correspond to the local maximum points of the kernel density estimator (13). Fig. 2(a) shows one example of the distribution of reliable T-F cells (blue dots) on the unit sphere of a(φ,θ), and Fig. 2(b) is an expanded image of the lower-left part of the sphere. These are obtained by applying the proposed DOA estimation algorithm to real-life observations in a noisy environment. The initial point and the successively updated mode estimates (red points) of the mean-shift algorithm are also indicated in Fig. 2(b).
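The selection rule (12) and the mode seeking applied to (13) can be sketched on synthetic data. This is a simplified illustration, not the system's implementation: it assumes a Gaussian kernel with one fixed bandwidth for both angles instead of the frequency-dependent bandwidths of (14), ignores the wrap-around of the azimuth at ±π, and uses hypothetical function names and test directions.

```python
import numpy as np


def direction_vector(phi, theta):
    """Unit vector a(phi, theta) of (3)."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])


def select_reliable(a_hat, eps=0.1):
    """Rule (12): keep vectors whose norm lies in (1 - eps, 1 + eps)."""
    norms = np.linalg.norm(a_hat, axis=1)
    return a_hat[(norms > 1.0 - eps) & (norms < 1.0 + eps)]


def to_angles(a):
    """Map direction vectors to rows of (phi, theta), inverting (3)."""
    unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    phi = np.arctan2(unit[:, 1], unit[:, 0])
    theta = np.arccos(np.clip(unit[:, 2], -1.0, 1.0))
    return np.column_stack([phi, theta])


def mean_shift(points, start, bandwidth=0.1, tol=1e-5, max_iter=100):
    """Climb from `start` to a mode of a Gaussian kernel density estimate."""
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        sq_dist = np.sum((points - x) ** 2, axis=1)
        weights = np.exp(-0.5 * sq_dist / bandwidth ** 2)
        total = weights.sum()
        if total == 0.0:              # no sample within reach of the kernel
            break
        x_new = weights @ points / total   # kernel-weighted mean of samples
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x


# Synthetic estimates: two noisy speakers plus short-norm outliers.
rng = np.random.default_rng(0)
true_dirs = [(0.5, 1.2), (-1.0, 1.4)]          # hypothetical (phi, theta)
clusters = [direction_vector(p, t) + 0.03 * rng.standard_normal((200, 3))
            for p, t in true_dirs]
outliers = rng.uniform(-0.5, 0.5, size=(100, 3))
a_hat = np.vstack(clusters + [outliers])

angles = to_angles(select_reliable(a_hat))
# Start from a "previous estimate" near the first speaker.
mode = mean_shift(angles, start=(0.6, 1.1))
print(mode)
```

Starting the recursion from the previous frame's estimate is what makes this a tracker: each frame needs only a few iterations as long as the speaker has not moved far relative to the bandwidth.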
Fig. 2(b) shows how the mean-shift algorithm tracks the robot master direction starting from a previous estimate. Here, the red point at the top is the previous direction, and the recursion converges to the local maximum, which should be the current master direction.

Fig. 2. Reliable T-F cell distribution (a) and mean shift recursion (b)

3 Implementation

3.1 Software Architecture

Programming Environment: In the realized signal processing system, Ubuntu Linux 10.04.4 LTS is used as the operating system, on which the driver for the multichannel synchronous-sampling A/D board is installed. On top of these, we use ROS (Robot Operating System) Electric [11] as the main programming architecture.

Software Architecture, ROS nodes and topics: In ROS, each node is a process that can publish/subscribe or act as a client/server to another node. Communication between nodes is carried by ROS topics. The software architecture of our system, represented by ROS nodes and topics, is shown in Fig. 3, in which rectangles represent ROS nodes and arrows represent ROS topics. The Audio Sensor, Actuator and Communication nodes deal with external devices, while the DOA, Tracking, Speaker Iden and Speech Recog nodes constitute the "brain" of the robot. The construction of the Speaker Iden node and the Speech Recog node is our long-term objective for future work.

Fig. 3. Software Architecture

The function of each node is described below. The Audio Sensor node is responsible for continuously acquiring the audio data from the microphone array and publishing it for other nodes. Each frame of audio data has a length of 8k at a 16 kHz sampling frequency. The DOA node subscribes to the data from the Audio Sensor node and estimates the DOA of the T-F cells with reliable cell selection. It then publishes these data for the Tracking node. The Tracking node adopts the mean shift algorithm for tracking the robot master.
In the multi-speaker case, it may track only the predefined robot master, who is determined by the Speaker Iden node. The Speaker Iden node determines the robot master. In the single-speaker case it tells whether the current speaker is the predefined robot master. In the multi-speaker case, it tells which speaker is the robot master. It then sends a message to the Tracking node indicating which speaker to track. The Speech Recog node takes the audio data from the Audio Sensor node and the master direction from the Tracking node in order to interpret voice commands from the robot master. In the single-speaker case, depending on whether a master is being tracked, it either interprets the voice command or stays idle. In the multi-speaker case, it interprets only the robot master's voice command, using a speech separation technique based on the DOA. The Actuator node subscribes to the Speech Recog node and the Tracking node to obtain the voice command and the speaker direction, so that the robot can carry out a simple task such as "move forward" or a more sophisticated task such as "follow me". The Communication node is reserved for future use.

3.2 Hardware System

The hardware of our robot platform consists of two parts, (a) audio signal processing and (b) robot controller, which are connected by a Bluetooth module.

(a) Audio signal processing: For signal capture, a tetrahedral microphone array with four omni-directional microphones is mounted, followed by an analog amplifier and a synchronized 16-channel analog-to-digital converter with 16-bit resolution and a 16 kHz sampling frequency.

(b) Robot controller: The motor control system of the robot is composed of several modules connected to an SH2A (Renesas Electronics R5F72167ADF, 200 MHz) CPU with ROM, SRAM, and SDRAM. The robot is steered by two powered wheels, and the driving motor is the Maxon A-max22 (6 W) with a reduction gear ratio of 25.6:1. Rotation angle control is performed by the rotary encoder system (HEDL5540, 500 CPT).
In this setup the motors are controlled solely by velocity. The battery mounted on the robot is rated at 28 V, 3.9 Ah, and lasts about 5 hours.

Fig. 4. Robotic Platform

4 Experiments and results

The scene of the audio source tracking experiment is shown in Fig. 5. Real-time tracking of two moving speakers is illustrated in Fig. 6. Other experimental results will be presented.

Fig. 5. Scene of experiments

Fig. 6. Experimental results for two sources real time tracking

5 Conclusions

An artificial robot audition system for detecting and tracking sound sources is proposed. The novel DOA estimation method presented here is applicable to arbitrary array configurations, and even to the underdetermined case, by exploiting the time-frequency approach. The modes of the proposed kernel density estimator for a set of selected data determine the corresponding source directions. The mode-seeking problem is then solved by the mean shift algorithm, which is shown to be suitable for real-time tracking. In the developed mobile robot, the audio signal processing system and the robot controller system are connected by a wireless link. To verify the effectiveness of our system in real environments, several experiments are conducted and the tracking ability is evaluated. As future work, speaker identification and speech recognition using the speech separation mechanism are required to realize a real-time human-robot communication system.

Acknowledgements

The authors would like to thank Dr. Toshiyuki Murakami and Dr. Yasue Mitsukura for their valuable suggestions on this study. The first author especially thanks the committee of the EMARO (European Master on Advanced Robotics) program for their support.

References

1. Byoungho Kwon, Youngjin Park and Youn-sik Park, "Sound Source Localization for Robot Auditory System Using the Summed GCC Method", International Conference on Control, Automation and Systems 2008, Oct. 14-17, 2008 in COEX, Seoul, Korea.
2.
Kazuhiro Nakadai, Hirofumi Nakajima, Masamitsu Murase, Hiroshi G. Okuno, Yuji Hasegawa and Hiroshi Tsujino, "Real-Time Tracking of Multiple Sound Source by Integration of In-Room and Robot-Embedded Microphone Arrays", International Conference on Intelligent Robots and Systems, October 9-15, 2006, Beijing, China.
3. Jie Huang, Noboru Ohnishi and Noboru Sugie, "Building ears for robots: sound localization and separation", Artif Life Robotics (1997), Vol. 1, pp. 157-163.
4. Jean-Marc Valin, François Michaud and Jean Rouat, "Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering", Robotics and Autonomous Systems, Vol. 55 (2007), pp. 216-228.
5. Kosuke Hosoya, Tetsuji Ogawa and Tetsunori Kobayashi, "Robot auditory system using head-mounted square microphone array", The 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, October 11-15, 2009, St. Louis, USA.
6. H. Asoh, I. Hara, F. Asano, K. Yamamoto, "Tracking human speech events using a particle filter", IEEE, trans. pp. 1153-1156, 2005.
7. S. Arberet, R. Gribonval, and F. Bimbot, "A Robust Method to Count and Locate Audio Sources in a Multichannel Underdetermined Mixture", IEEE Trans. SP, Vol. 58, No. 1, pp. 121-133, January 2010.
8. Ning Ding, Nozomu Hamada, "DOA Estimation of Multiple Speech Source from a Stereophonic Mixture in Underdetermined Case", IEICE Trans. Fundamentals, Vol. E95-A, No. 4, pp. 735-744, Apr. 2012.
9. Fujimoto, N. Ding, N. Hamada, "Multiple Sources' Direction Finding by using Reliable Component on Phase Difference Manifold and Kernel Density Estimator", pp. 2601-2604, IEEE ICASSP 12, Mar. 2012.
10. Yizong Cheng, "Mean Shift, Mode Seeking, and Clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 8, August 1995, pp. 790-799.
11. http://www.ros.org/wiki/