RiTA2012 final paper

Novel scheme of real-time direction finding and
tracking of multiple speakers by robot-embedded
microphone array
Daobilige Su†, Masashi Sekikawa†, Kazuo Nakazawa‡
and Nozomu Hamada‡
†Signal processing Lab, School of Integrated Design Engineering, Keio University
‡Department of System Design Engineering, Keio University
Hiyoshi 3-14-1, Yokohama 223-8522 Japan
{su, sekikawa}@hamada.sd.keio.ac.jp, {nakazawa, hamada}@sd.keio.ac.jp
Abstract. Interest in artificial robot audition has been growing as a means of
developing human-robot interaction. The main purposes of an artificial audio
system mounted on a mobile robot are to localize sound sources, to separate the
speech signal of a particular speaker such as the robot’s master, and to
process the speech to extract useful information such as the master’s
spoken commands. This paper reports a novel algorithm for tracking a speaker’s
direction and its realization as a real-time tracking system on a
mobile robot. The approach belongs to the sparseness-based category of direction
finding, which employs time-frequency decomposition and the disjointness of
different speech signals in that domain. The novel points of the proposed source
tracking are the selection of reliable data from the time-frequency cells and the
application of mean shift tracking to the kernel density estimator derived from
these reliable time-frequency components. A wheel-based mobile robot with a
built-in audio processing system has been developed. Experiments demonstrate its
ability to localize speakers in real environments.
Keywords: robot audition, sound source localization, microphone array, kernel
density estimator, mean shift tracking
1 Introduction
Sound source tracking for robot audition systems has attracted considerable
attention recently [1]-[6]. The same problem has also been studied in a wide range of
applications, such as automatic camera steering for surveillance and video conferencing,
and acoustic discrimination between individuals in real multi-speaker environments.
In this paper we are particularly interested in tracking a speaker’s direction of
arrival (DOA) with a microphone array mounted on a mobile robot, where both the
sources and the sensors may move. In addition, the method should
be sufficiently robust against environmental noise and other speakers’ voices; that is,
it requires good directional sensitivity and good interference rejection. Under
the assumption of fixed source locations, a large number of studies have been
reported. Among them, approaches based on the sparseness of time-frequency (T-F)
components in speech signals are attractive because they can handle the
underdetermined case where the sources outnumber the sensors. This study is also
based on the sparseness of speech signals in the T-F domain.
For fixed sensor arrays, several methods of tracking moving sources
have been proposed. The most popular approaches are adaptive beamforming [4]
and the particle filter, or sequential Monte Carlo method [4][6]. These methods treat two
kinds of localization problems: tracking the exact source position in an
environment, and tracking the source direction. The former
problem can be solved by combining the estimates of individual microphone arrays.
Another approach integrates two types of microphone arrays, one of which is
mounted on a mobile robot while the others are on the walls of the room [2]. Acoustic
source tracking on a mobile robot has been developed under the moving
sensor/source condition. This moving sensor/source problem differs
from the fixed source/sensor problem especially in the amount of data
available, because of the real-time processing requirement.
As is well recognized in general estimation problems, the use of reliable data
leads to more accurate results. Thus, removing outliers from the observations is
effective for statistical estimation algorithms. Basically, the T-F sparseness approach
attempts to cluster the T-F cells, each of which is labeled with spatial cues such as
direction angles. Outliers therefore disturb the clustering process
and tend to cause failures in detecting source positions. The idea of T-F cell selection has
been proposed in [7] and by one of the authors of this paper in [8] and [9]. This paper
presents a modified T-F cell selection related to [9].
The main contributions of this paper are twofold. (a) Applying the mode-seeking
mean-shift algorithm to the directivity likelihood, expressed as a kernel density
estimator, for finding time-varying source directions. (b) Implementing the above
audition system on a mobile robot, and demonstrating and evaluating its tracking
capability through real experiments.
The rest of this paper is structured as follows. Section 2 presents the proposed DOA
estimation method for arbitrary array configurations and our direction
tracking algorithm based on the mean shift algorithm and the kernel density
estimator. Section 3 summarizes the implementation of our mobile robot
and the signal processing system mounted on it. Section 4 presents the tracking
experiments and their results, and Section 5 concludes the paper.
2 Proposed method
2.1 Speaker’s direction finding by arbitrary microphone arrangement
Observation Model: Consider an array of M microphones with an arbitrary
configuration satisfying the non-spatial-aliasing condition, under which the
inter-sensor distance is bounded (for instance, a 4 cm inter-sensor distance for
8 kHz sampling). The observation $x_m(\tau)$ at the m-th microphone (m=1,…,M) is
modeled by the following convolutive mixture of N source signals $s_i(\tau)$ (i=1,…,N):
$$x_m(\tau) = \sum_{i=1}^{N} \sum_{k \geq 1} h_{mi}(k)\, s_i(\tau - k) \tag{1}$$
where $h_{mi}(k)$ is the impulse response from the i-th source to the m-th microphone. The
method proposed in this paper is applicable to the underdetermined case where the
sources outnumber the sensors, namely N>M.
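As a rough check of the non-spatial-aliasing condition mentioned above, the following sketch computes the largest inter-sensor distance for which the phase difference at the highest representable frequency stays within ±π. It assumes the usual half-wavelength criterion d < c/(2 f_max) with f_max = f_s/2 and a sound speed of 343 m/s; the function name `max_sensor_spacing` is hypothetical and does not come from the paper.

```python
def max_sensor_spacing(fs_hz: float, c: float = 343.0) -> float:
    """Upper bound on the inter-sensor distance (m) under the half-wavelength rule."""
    f_max = fs_hz / 2.0          # highest frequency representable at sampling rate fs
    return c / (2.0 * f_max)     # half of the shortest wavelength


if __name__ == "__main__":
    for fs in (8000.0, 16000.0):
        print(f"fs = {fs:.0f} Hz -> d < {100 * max_sensor_spacing(fs):.2f} cm")
```

Under these assumptions the bound at 8 kHz sampling is about 4.3 cm, which is compatible with the 4 cm example quoted above; at 16 kHz the bound halves.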
In the time-frequency domain obtained by the STFT (Short-Time Fourier
Transform), the mixed observations $x_m(\tau)$ of (1) are represented as
$$X_m(k,l) = \sum_{i=1}^{N} H_{mi}(l)\, S_i(k,l) \tag{2}$$
where $X_m(k,l)$ and $S_i(k,l)$ are the STFTs of $x_m(\tau)$ and $s_i(\tau)$ respectively,
$H_{mi}(l)$ is the DFT of the impulse response $h_{mi}$, l is the discrete frequency bin
index, and k is the time-frame index.
Let $\mathbf{r}_m = [x_m, y_m, z_m]^T$ (m=1, …, M) denote the location of the m-th sensor in 3-D
space, and assume without loss of generality that the first sensor is located at the
origin ($\mathbf{r}_1 = \mathbf{0}$).
Time Delay and Phase Difference: A source direction vector, referred to as the
propagation direction vector, is defined by
$$\mathbf{a}(\phi,\theta) = [\sin\theta\cos\phi,\ \sin\theta\sin\phi,\ \cos\theta]^T \tag{3}$$
where $\phi$ ($-\pi < \phi \le \pi$) denotes the azimuth angle of the source direction,
$\theta$ ($-\pi/2 \le \theta \le \pi/2$) denotes its elevation angle, and $\mathbf{a}(\phi,\theta)$ lies on
the unit sphere.
An acoustic signal with propagation direction vector $\mathbf{a}(\phi,\theta)$ induces time delays
of arrival $\tau_m$ ($m = 2, \cdots, M$) between the m-th sensor and the first (reference)
sensor as follows:
$$\tau_m = -\mathbf{r}_m^T \mathbf{a} / c \tag{4}$$
where c is the propagation speed of the acoustic signal. In vector-matrix form this is
$$\boldsymbol{\tau} = -\frac{\mathbf{R}\,\mathbf{a}(\phi,\theta)}{c} \tag{5}$$
where $\boldsymbol{\tau} = [\tau_2, \ldots, \tau_M]^T$ and $\mathbf{R} = [\mathbf{r}_2, \cdots, \mathbf{r}_M]^T$.
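The delay model of Eqs. (3)-(5) can be sketched numerically as follows. The array geometry (a reference sensor at the origin and three sensors 4 cm away along the coordinate axes), the sound speed of 343 m/s and the function names are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s


def direction_vector(phi: float, theta: float) -> np.ndarray:
    """Unit propagation direction vector a(phi, theta) of Eq. (3)."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])


def delays(R: np.ndarray, phi: float, theta: float, c: float = C_SOUND) -> np.ndarray:
    """Delays tau_2..tau_M relative to the reference sensor, Eq. (5)."""
    return -R @ direction_vector(phi, theta) / c


# Hypothetical 4-sensor array: rows of R are the positions r_2, r_3, r_4 in metres.
R = 0.04 * np.eye(3)
tau = delays(R, phi=0.0, theta=np.pi / 2)
print(tau)  # up to rounding, only the sensor on the x-axis sees a non-zero delay
```

For this direction the propagation vector points along the x-axis, so the delay of the x-axis sensor is -0.04/343 s and the other two delays vanish up to floating-point rounding.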
Define the following phase difference vector from the observations:
$$\boldsymbol{\varphi}(l) = [\varphi_{12}(l), \cdots, \varphi_{1M}(l)]^T, \quad \varphi_{1m}(l) = \angle X_m(l) - \angle X_1(l) \tag{6}$$
Then from (5) we have the following relationship:
$$\boldsymbol{\tau} = \frac{1}{\kappa(l)}\,\boldsymbol{\varphi}(l) \tag{7}$$
where $\kappa(l) = 2\pi f_s l / L$, $f_s$ is the sampling frequency, and L is the DFT length.
Phase Difference and Propagation Vector: Let us define a mapping
$$\mathrm{R} : \mathbf{a}(\phi,\theta) \to \boldsymbol{\varphi}(k,l) \tag{8}$$
Specifically, $\mathrm{R}$ can be represented by
$$\mathrm{R}(\mathbf{a}(\phi,\theta)) = -\frac{\kappa(l)}{c}\,\mathbf{R}\,\mathbf{a}(\phi,\theta) \tag{9}$$
The inverse operation of $\mathrm{R}$, that is
$$\mathrm{R}^{-1} : \boldsymbol{\varphi}(k,l) \to \mathbf{a}(\phi,\theta) \tag{10}$$
can be obtained by our previous method proposed in [9], which exploits Gram-Schmidt
orthogonalization in $\boldsymbol{\varphi}$-space.
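The round trip between a propagation vector and its phase difference vector can be illustrated as below. The inverse in the paper relies on the Gram-Schmidt-based method of [9]; this sketch substitutes an ordinary least-squares solve (`numpy.linalg.lstsq`) as a stand-in. The array geometry, sound speed, DFT length and function names are all assumptions made for illustration, with L read as the DFT length so that κ(l) is the angular frequency of bin l.

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s


def kappa(l: int, fs: float, L: int) -> float:
    """Angular frequency (rad/s) of DFT bin l: kappa(l) = 2*pi*fs*l/L."""
    return 2.0 * np.pi * fs * l / L


def forward_map(a: np.ndarray, R: np.ndarray, kappa_l: float,
                c: float = C_SOUND) -> np.ndarray:
    """Eq. (9): phase difference vector produced by propagation vector a."""
    return -(kappa_l / c) * (R @ a)


def inverse_map(phase: np.ndarray, R: np.ndarray, kappa_l: float,
                c: float = C_SOUND) -> np.ndarray:
    """Least-squares stand-in for the inverse map: solve -(kappa/c) R a = phase."""
    a_hat, *_ = np.linalg.lstsq(-(kappa_l / c) * R, phase, rcond=None)
    return a_hat


R = 0.04 * np.eye(3)                    # hypothetical sensor offsets r_2..r_4 (m)
k_l = kappa(l=64, fs=16000.0, L=512)    # bin 64 of a 512-point DFT is 2 kHz
a = np.array([0.6, 0.0, 0.8])           # a unit propagation vector
phase = forward_map(a, R, k_l)
a_hat = inverse_map(phase, R, k_l)
print(np.allclose(a_hat, a), np.max(np.abs(phase)) < np.pi)
```

The least-squares solution is not constrained to unit norm, so with noisy phases the recovered vector generally leaves the unit sphere. The check on the phase magnitude reflects the non-aliasing requirement: measured phases are only known modulo 2π, so the inversion is meaningful only while the true phase differences stay within ±π.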
Reliable T-F cell selection: In real observations, due to the interaction of multiple
sources, environmental noise and computational errors in the STFT, the estimated phase
difference vector $\hat{\boldsymbol{\varphi}}(k,l)$ (the symbol ^ denotes an estimated value) does not
give a propagation vector $\hat{\mathbf{a}}$ exactly on the unit sphere. When we set
$$\hat{\mathbf{a}}^{(k,l)} = \mathrm{R}^{-1}[\hat{\boldsymbol{\varphi}}(k,l)] \tag{11}$$
the reliable T-F cells, denoted by $(k,l) \in A$, are selected by the following rule:
$$1 - \varepsilon < \|\hat{\mathbf{a}}^{(k,l)}\| < 1 + \varepsilon \tag{12}$$
where $\varepsilon$ is a sufficiently small positive value.
In the multiple-source case, all $\hat{\mathbf{a}}^{(k,l)}$, $(k,l) \in A$, would generate N clusters, each of
which corresponds to one of the sources.
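The selection rule (12) amounts to a norm test on each estimated propagation vector. A minimal sketch follows; the tolerance value 0.05 and the three example vectors are hypothetical, since the text only requires ε to be a small positive value.

```python
import numpy as np


def select_reliable(a_hat: np.ndarray, eps: float = 0.05) -> np.ndarray:
    """Boolean mask over rows of a_hat (one row per T-F cell) satisfying rule (12)."""
    norms = np.linalg.norm(a_hat, axis=1)
    return (norms > 1.0 - eps) & (norms < 1.0 + eps)


# Three hypothetical cell estimates: the middle one lies far inside the unit sphere.
a_hat = np.array([[0.60, 0.00, 0.80],
                  [0.30, 0.20, 0.10],
                  [0.00, 0.72, 0.72]])
mask = select_reliable(a_hat)
print(mask.tolist())      # [True, False, True]
reliable = a_hat[mask]    # vectors passed on to the density estimation step
```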
Kernel density estimator: Besides the cell selection above, the power threshold and
the consistency criteria [8] are applied to determine a set of reliable propagation
direction vectors. We then apply the kernel density estimator, or Parzen window
technique, to this set of data; consequently, the local maximum points, or modes,
of the resulting density function correspond to the source directions. The estimator is
applied solely to the set of selected data denoted by
$(\hat{\phi}_{t_i}^{l_i}, \hat{\theta}_{t_i}^{l_i})$ (i=1,…,I), where the subscript $t_i$ is the time-frame
index of the observation (k in the STFT notation above), $l_i$ is the frequency bin of the
underlying reliable cell, and I is the number of T-F cells in the set A. The kernel
density estimator with respect to the direction angles can be formulated as follows:
$$\hat{p}(\phi_t,\theta_t) = \frac{1}{I}\sum_{i=1}^{I}\frac{1}{\epsilon(l_i)\,\delta(l_i)}\,K\!\left(\frac{\phi_t-\hat{\phi}_{t_i}^{l_i}}{\epsilon(l_i)},\ \frac{\theta_t-\hat{\theta}_{t_i}^{l_i}}{\delta(l_i)}\right) \tag{13}$$
where $\epsilon(l_i)$ and $\delta(l_i)$ are the bandwidths of the 2-D kernel function $K(\phi,\theta)$ with
respect to $\phi$ and $\theta$ respectively, which are given by
$$\epsilon(l_i) = \left.\frac{\hbar}{\kappa(l_i)\,\|\cos\theta\cos\phi\,\mathbf{r}_x + \cos\theta\sin\phi\,\mathbf{r}_y - \sin\theta\,\mathbf{r}_z\|}\right|_{\theta=\hat{\theta}_{t_i}^{l_i},\ \phi=\hat{\phi}_{t_i}^{l_i}},$$
$$\delta(l_i) = \left.\frac{\hbar}{\kappa(l_i)\,\|-\sin\theta\sin\phi\,\mathbf{r}_x + \sin\theta\cos\phi\,\mathbf{r}_y\|}\right|_{\theta=\hat{\theta}_{t_i}^{l_i},\ \phi=\hat{\phi}_{t_i}^{l_i}} \tag{14}$$
Here $\mathbf{r}_x$, $\mathbf{r}_y$ and $\mathbf{r}_z$ denote the columns of $\mathbf{R}$, and the bandwidth parameter
$\hbar$ is determined by experiment. See [9] for more detail.
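To make the estimator (13) concrete, the following sketch evaluates it for a set of selected direction angles. The kernel K is not specified above, so a product Gaussian kernel is assumed here. The synthetic samples, the constant bandwidths (used in place of the frequency-dependent values of (14)) and the function name `kde` are likewise illustrative assumptions, and azimuth wrap-around at ±π is ignored.

```python
import numpy as np


def kde(phi: float, theta: float, samples: np.ndarray,
        eps: np.ndarray, delta: np.ndarray) -> float:
    """Eq. (13) with a product Gaussian kernel (the kernel choice is an assumption).

    samples: (I, 2) array of selected (azimuth, elevation) estimates in radians.
    eps, delta: (I,) per-sample bandwidths for azimuth and elevation.
    """
    u = (phi - samples[:, 0]) / eps
    v = (theta - samples[:, 1]) / delta
    k = np.exp(-0.5 * (u ** 2 + v ** 2)) / (2.0 * np.pi)
    return float(np.mean(k / (eps * delta)))


rng = np.random.default_rng(0)
# Hypothetical data: two sources near (0.5, 1.0) and (-1.0, 1.2) rad with scatter.
samples = np.vstack([rng.normal([0.5, 1.0], 0.05, size=(200, 2)),
                     rng.normal([-1.0, 1.2], 0.05, size=(200, 2))])
bw = np.full(len(samples), 0.1)
at_source = kde(0.5, 1.0, samples, bw, bw)
between = kde(0.0, 1.1, samples, bw, bw)
print(at_source > between)
```

The density evaluated at a simulated source direction is much larger than the density between the two clusters, which is the property the mode search relies on.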
Fig. 1. Flow of the proposed system operation
2.2 Direction tracking by mean-shift algorithm
The DOA estimation problem discussed in Section 2.1 reduces to seeking the local
maxima, or modes, of the density estimator (13). In our system the
mean-shift algorithm is employed for this purpose because of its low computational
cost. In general, the mean shift algorithm is an effective clustering and mode-seeking
technique which requires no prior knowledge of the number of clusters and
does not constrain the shape of the cluster distribution [10]. Our autonomous robot
system needs to find and track the direction of a specific person, the robot master.
The direction of the source is represented by its azimuth and elevation angles, and
these time-varying values are recursively estimated by the mean shift algorithm. In our
formulation the individual source directions correspond to the local maximum points
of the kernel density estimator (13). Fig. 2(a) shows an example of the distribution of
reliable T-F cells (blue dots) on the unit sphere of a(φ,θ), and Fig. 2(b) is an expanded
view of the lower-left part of the sphere. These were obtained by applying the
proposed DOA estimation algorithm to real-life observations in a noisy environment.
The initial point and the successively updated mode estimates (red points) produced by
the mean-shift algorithm are also indicated in Fig. 2(b). It shows how the mean-shift
algorithm tracks the robot master direction starting from a previous estimate. Here, the
red point at the top is the previous direction, and the estimates converge to the local
maximum, which should be the current master direction.
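The recursion described above can be sketched as a plain mean-shift iteration that starts from the previous direction estimate and climbs to the nearest mode. This is an illustrative version only: it assumes a Gaussian kernel with a single isotropic bandwidth rather than the bandwidths of (14), it ignores azimuth wrap-around, and the data, parameter values and function name are hypothetical.

```python
import numpy as np


def mean_shift(start: np.ndarray, samples: np.ndarray, bandwidth: float = 0.1,
               tol: float = 1e-5, max_iter: int = 100) -> np.ndarray:
    """Move `start` uphill to the nearest mode of a Gaussian KDE over `samples`."""
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        d2 = np.sum((samples - x) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / bandwidth ** 2)       # Gaussian kernel weights
        x_new = (w[:, None] * samples).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:           # shift is negligible: converged
            return x_new
        x = x_new
    return x


rng = np.random.default_rng(1)
# Hypothetical frame: master near (0.5, 1.0) rad, interferer near (-1.0, 1.2) rad.
samples = np.vstack([rng.normal([0.5, 1.0], 0.05, size=(200, 2)),
                     rng.normal([-1.0, 1.2], 0.05, size=(200, 2))])
previous_estimate = np.array([0.4, 0.95])             # direction from the last frame
current_estimate = mean_shift(previous_estimate, samples)
print(current_estimate)
```

Because the starting point is the previous estimate, the iteration settles on the nearby cluster and is not attracted by the second speaker, which is the behaviour needed for following one designated source. A starting point far from every sample would make all weights vanish, so a practical implementation would need a guard for that case.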
Fig. 2. Reliable T-F cell distribution and mean shift recursion: (a) cells on the unit sphere; (b) expanded view with mean-shift updates
3 Implementation
3.1 Software Architecture
Programming Environment
In the realized signal processing system, Ubuntu Linux 10.04.4 LTS is used as the
operating system, on which the driver for the multichannel synchronous-sampling
A/D board is installed. On top of this, we use the Electric release of ROS (Robot
Operating System) [11] as the main programming framework.
Software Architecture: ROS nodes and topics
In ROS, each node is a process that can publish or subscribe to messages, or act as a
client or server to another node. Communication between nodes is carried out via ROS
topics. The software architecture of our system, expressed in terms of ROS nodes and
topics, is shown in Fig. 3, in which rectangles represent ROS nodes and arrows
represent ROS topics. The nodes for Audio Sensor, Actuator and Communication deal with
external devices, and the nodes for DOA, Tracking, Speaker Iden and Speech Recog
constitute the “brain” of the robot. The construction of the Speaker Iden node and the
Speech Recog node is our long-term objective for future work.
Fig. 3. Software Architecture
The function of each node is described below. The Audio Sensor node is
responsible for continuously acquiring audio data from the microphone array and
publishing it to other nodes. Each frame of audio data is 8k samples long at a 16 kHz
sampling frequency, i.e. about 0.5 s. The DOA node subscribes to the data from the
Audio Sensor node and estimates the DOA of the T-F cells with reliable cell selection.
It then publishes these data for the Tracking node. The Tracking node adopts the mean
shift algorithm for tracking the robot master. In the multi-speaker case, it may track
only the predefined robot master, which is determined by the Speaker Iden node. The
Speaker Iden node identifies the robot master. In the single-speaker case it tells
whether the current speaker is the predefined robot master; in the multi-speaker case,
it tells which speaker is the robot master. It then sends a message to the Tracking
node indicating which speaker it should track. The Speech Recog node obtains the audio
data from the Audio Sensor node and the master direction from the Tracking node in
order to interpret voice commands from the robot master. In the single-speaker case,
depending on whether a master is being tracked, it either interprets the voice command
or stays idle. In the multi-speaker case, it interprets only the voice command of the
robot master, using a speech separation technique based on DOA. The Actuator node
subscribes to the Speech Recog node and the Tracking node to obtain the voice command
and the speaker direction, so that the robot can carry out a simple task such as “move
forward” or a more sophisticated task such as “follow me”. The Communication node is
reserved for future use.
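The data flow between the nodes described above can be illustrated with a small in-process publish/subscribe sketch. It deliberately does not use the ROS API: `TopicBus`, the topic names and the placeholder messages are hypothetical, the frame length of 8000 samples is an assumed reading of "8k", and the node bodies stand in for the real DOA and tracking computations.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class TopicBus:
    """Minimal in-process stand-in for topic-based messaging (not the ROS API)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Any) -> None:
        for callback in self._subscribers[topic]:
            callback(message)


bus = TopicBus()
log: List[str] = []

# DOA node: consumes audio frames and publishes a placeholder direction estimate.
bus.subscribe("audio", lambda frame: bus.publish("doa", {"n_samples": len(frame)}))
# Tracking node: consumes DOA estimates and publishes a placeholder master direction.
bus.subscribe("doa", lambda est: bus.publish("master_direction", est["n_samples"]))
# Actuator node: consumes the tracked direction.
bus.subscribe("master_direction", lambda d: log.append(f"actuator got {d}"))

bus.publish("audio", [0.0] * 8000)   # the Audio Sensor node emits one frame
print(log)                           # ['actuator got 8000']
```

Each node only knows the topics it reads and writes, so a node such as Speaker Iden could later be added by subscribing it to an existing topic without changing the others.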
3.2 Hardware System
The hardware of our robot platform consists of two parts, (a) audio signal processing
and (b) the robot controller, which are connected by a Bluetooth module.
(a) Audio Signal Processing
For signal capture, a tetrahedral microphone array with four omni-directional
microphones is mounted on the robot, followed by an analog amplifier and a synchronized
16-channel analog-to-digital converter with 16-bit resolution and a 16 kHz sampling
frequency.
(b) Robot Controller
The motor control system of the robot is composed of several modules connected to
an SH2A CPU (Renesas Electronics R5F72167ADF, 200 MHz) with
ROM, SRAM, and SDRAM. The robot is steered by two powered wheels, and the driving
motor is the Maxon A-max22 (6 W) with a reduction gear ratio of 25.6:1. The rotation
angle is controlled using a rotary encoder system (HEDL5540, 500 CPT). In this setup
the motors are controlled solely by velocity. The battery mounted on the robot is
rated at 28 V, 3.9 Ah, and lasts about 5 hours.
Fig. 4. Robotic Platform
4 Experiments and results
The scene of the audio source tracking experiment is shown in Fig. 5. Real-time tracking
of two moving speakers is illustrated in Fig. 6. Other experimental results will be
presented.
Fig. 5. Scene of experiments
Fig. 6. Experimental results for two sources real time tracking
5 Conclusions
An artificial robot audition system for detecting and tracking sound sources has been
proposed. The novel DOA estimation method presented here is applicable to arbitrary
array configurations and even to the underdetermined case by exploiting the
time-frequency approach. The modes of the proposed kernel density estimator for a set of
selected data determine the corresponding source directions. The mode-seeking problem
is then solved by the mean shift algorithm, which proved suitable for real-time
tracking. The audio signal processing system and the robot controller system mounted
on the developed mobile robot are connected by a wireless link. To verify the
effectiveness of our system in real environments, several experiments were conducted
and the tracking ability was evaluated. As future work, speaker identification and
speech recognition using the speech separation mechanism are required for
realizing a real-time human-robot communication system.
Acknowledgements
The authors would like to thank Dr. Toshiyuki Murakami and Dr. Yasue
Mitsukura for their valuable suggestions on this study. The first author especially
thanks the committee of the EMARO (European Master on Advanced Robotics)
program for their support.
References
1. Byoungho Kwon, Youngjin Park and Youn-sik Park, “Sound Source Localization for Robot
Auditory System Using the Summed GCC Method”, International Conference on Control,
Automation and Systems 2008, Oct. 14-17, 2008 in COEX, Seoul, Korea.
2. Kazuhiro Nakadai, Hirofumi Nakajima, Masamitsu Murase, Hiroshi G. Okuno, Yuji
Hasegawa and Hiroshi Tsujino, “Real-Time Tracking of Multiple Sound Source by
Integration of In-Room and Robot-Embedded Microphone Arrays”, International
Conference on Intelligent Robots and System, October 9-15, 2006, Beijing, China.
3. Jie Huang, Noboru Ohnishi and Noboru Sugie, “Building ears for robots: sound localization
and separation”, Artif Life Robotics (1997), Vol. 1, pp. 157—163.
4. Jean-Marc Valin, François Michaud and Jean Rouat, “Robust localization and tracking of
simultaneous moving sound sources using beamforming and particle filtering”, Robotics and
Autonomous Systems, Vol. 55 (2007), pp. 216-228.
5. Kosuke Hosoya, Tetsuji Ogawa and Tetsunori Kobayashi, “Robot auditory system using
head-mounted square microphone array”, The 2009 IEEE/RSJ International Conference on
Intelligent Robotics and Systems, October 11-15, 2009 St. Louis, USA.
6. H.Asoh, I.Hara, F.Asano, K.Yamamoto, "Tracking human speech events using a particle
filter", IEEE, trans. pp.1153-1156, 2005.
7. S. Arberet, R. Gribonval, and F. Bimbot, "A Robust Method to Count and Locate Audio
Sources in a Multichannel Underdetermined Mixture, " IEEE Trans. SP, VOL. 58, NO. 1,
pp.121-133, JANUARY 2010.
8. Ning DING, Nozomu Hamada, “DOA Estimation of Multiple Speech Source from a
Stereophonic Mixture in Underdetermined Case”, IEICE Trans. Fundamentals, Vol.E95-A,
No.4, pp. 735-744, Apr. 2012.
9. Fujimoto, N. Ding, N. Hamada, “Multiple Sources’ Direction Finding by using Reliable
Component on Phase Difference Manifold and Kernel Density Estimator”, pp. 2601-2604,
IEEE ICASSP 12, Mar. 2012.
10. Yizong Cheng, “Mean Shift, Mode Seeking, and Clustering”, IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 17, No. 8, August 1995, pp. 790—799.
11. http://www.ros.org/wiki/