Three Ring Microphone Array for 3D Sound Localization and
Separation for Mobile Robot Audition
Yuki TAMAI1,2 , Yoko SASAKI1,2 , Satoshi KAGAMI2,1,3 , Hiroshi MIZOGUCHI1,2
1 Department of Mechanical Engineering, Tokyo University of Science
2641, Yamazaki, Noda-shi, Chiba, Japan
2 Digital Human Research Center, National Institute of Advanced Industrial Science and Technology
3F 2-41-6, Aomi, Koto-ku, Tokyo
3 PRESTO, Japan Science and Technology Agency
{j7503636@ed.noda.tus.ac.jp;y-sasaki@aist.go.jp;s.kagami@aist.go.jp;hm@rs.noda.tus.ac.jp}
Abstract— This paper describes a three ring microphone array that estimates the horizontal and vertical direction and the distance of sound sources, and separates multiple sound sources, for mobile robot audition. The microphone arrangement was simulated, and an optimized three ring pattern was implemented with 32 microphones. Sound localization and separation are achieved by Delay and Sum Beamforming (DSBF) and Frequency Band Selection (FBS). From on-line experiments on horizontal and vertical sound localization, we confirmed that one or two sound sources could be localized with an error of about 5 degrees in direction and 200 to 300 mm in distance for sources about 1 m away. The off-line sound separation experiments were evaluated by the power spectra of the separated sounds at each frequency, and we confirmed that an appropriate frequency band could be selected by DSBF and FBS. The system can separate 3 speech sources of different sound pressure without the quieter sources being drowned out.
Index Terms— Sound localization, Sound separation, Delay
and Sum Beamforming, Frequency Band Selection
I. INTRODUCTION
We are performing research centered on human-robot interaction[1]. Recently, research on robot audition has become active as personal computers have become faster, and multiple-microphone arrays for robot audition have been implemented on humanoid robots[2]. However, the accuracy of sound localization, sound separation and speech recognition in real environments still has room for improvement.
In this research, we propose a three ring microphone array that estimates the distance and direction of sound sources and separates sound sources for mobile robot audition. The microphone arrangement was simulated, and an optimized three ring pattern was implemented with 32 elements. In Section 2, the algorithms of sound localization and separation are described. Sound localization is achieved by the Delay and Sum Beamforming (DSBF) method[3]. Sound separation is achieved by the DSBF method and the Frequency Band Selection (FBS) method proposed in [4]. In Section 3, the composition of the three ring microphone array is described. In our microphone array, "ART-Linux" is used as a real-time operating system to keep a fixed cycle of 23 µs, and a PCI 128-channel simultaneous-input analog-to-digital (A/D) board is used for multi-channel simultaneous sampling. In Section 4, simulation results are described. Sound sensitivity maps at each frequency are computed for the arrangement of our three ring microphone array, and the frequency band whose characteristics can be used for sound localization is determined from the simulation results. In Section 5, experimental results are described. First, the frequency characteristics of the three ring microphone array are measured, and the ability of DSBF to focus on sound from only the intended direction is confirmed at each frequency. Next, horizontal sound localization results are evaluated over time for 1 and 2 sources. Vertical localization is evaluated by a 3D graph whose axes are the horizontal direction, vertical direction and power. Then, sound distance localization results are evaluated by the average error and the probability distribution. Finally, the performance of sound separation using the DSBF and FBS methods is evaluated by the power spectra of the separated sounds at each frequency. In Section 6, a summary of this research is given and future work is described.
II. ALGORITHM
A. Sound Localization
In this research, sound direction localization is achieved using the Delay and Sum Beamforming (DSBF) algorithm. Sound distance localization is achieved by triangulation using the localization results of the three circular microphone rings.
The DSBF method forms a strong directional characteristic toward the source by aligning the time shifts and amplitudes of the sound waves input from each microphone and summing them. Fig.1 shows an illustration of the DSBF algorithm. Aligning the phase of each signal amplifies the dominant sound and attenuates ambient noise.
The array steering of DSBF is as follows: let L_i be the distance from the focus to the i-th (i = 1, 2, ..., N) microphone, and let the minimum of L_i be described as L_min.
Authorized licensed use limited to: University of Glasgow. Downloaded on February 07,2023 at 12:28:44 UTC from IEEE Xplore. Restrictions apply.
Fig. 1. Illustration of the DSBF algorithm
Attenuation A_i and delay D_i of the i-th microphone are expressed as:

A_i = L_i / L_s    (1)

D_i = (L_i − L_min) / V    (2)

where L_s is the standard length and V is the speed of sound. Let X_i(t) be the sound wave input from the i-th microphone; the synthetic sound wave S(t) is then expressed as:

S(t) = Σ_{i=1}^{N} X_i(t + D_i)    (3)
Steering the focus over candidate directions yields the spatial spectrum. Our system sets the focus 2 m from the array center and samples 180 data points at 2-degree intervals.
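As a minimal illustration of equations (2) and (3), the delay-and-sum step can be sketched in Python. This is a sketch only: the function name, coordinate layout, and the integer-sample delay approximation are our assumptions, not the paper's implementation.

```python
import numpy as np

def dsbf(signals, mic_pos, focus, fs, v=340.0):
    """Delay-and-sum beamforming toward `focus` (sketch).

    signals : (M, T) array of microphone waveforms X_i(t)
    mic_pos : (M, 3) microphone coordinates in metres
    focus   : (3,) coordinates of the steering focus
    fs      : sampling rate in Hz
    v       : speed of sound V in m/s
    """
    L = np.linalg.norm(mic_pos - focus, axis=1)  # L_i: focus-to-mic distances
    D = (L - L.min()) / v                        # delays D_i of Eq. (2)
    shifts = np.round(D * fs).astype(int)        # integer-sample approximation
    T = signals.shape[1]
    out = np.zeros(T)
    for x, s in zip(signals, shifts):
        # S(t) = sum_i X_i(t + D_i): advance each channel by its delay
        out[:T - s] += x[s:]
    return out
```

Steering this focus over 180 directions and comparing the output power gives the spatial spectrum described above.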
Because the array has three circular microphone rings, sound distance localization is achieved by triangulation: from two sound direction estimates at different positions, the distance is estimated by calculating the cross point of the two direction vectors.
Moreover, our three ring microphone array can estimate the vertical direction, even though the microphones are arranged in a plane parallel to the ground. Sound vertical angle localization is achieved by the DSBF method taking the height of the sound source into consideration.
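The triangulation step above amounts to intersecting two bearing lines. A sketch in Python, assuming hypothetical ring-centre positions and bearings measured from the x-axis:

```python
import numpy as np

def triangulate(p1, theta1, p2, theta2):
    """Localize a source in 2D from two direction estimates (sketch).

    p1, p2         : positions of two ring centres (2-vectors)
    theta1, theta2 : estimated source bearings in radians
    Returns the cross point of the two bearing lines.
    """
    d1 = np.array([np.cos(theta1), np.sin(theta1)])
    d2 = np.array([np.cos(theta2), np.sin(theta2)])
    # Solve p1 + t1*d1 = p2 + t2*d2 for (t1, t2)
    A = np.column_stack([d1, -d2])
    t = np.linalg.solve(A, np.asarray(p2) - np.asarray(p1))
    return np.asarray(p1) + t[0] * d1
```

With nearly parallel bearings the system becomes ill-conditioned, which is consistent with the distance accuracy degrading beyond about 1 m.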
B. Sound Separation
Although sound from sources other than the focus is weakened by the DSBF method, it cannot be eliminated completely. In order to eliminate the frequency components of sound from sources other than the focus, the Frequency Band Selection (FBS) method is applied after the DSBF method emphasizes the focus. Fig.2 shows an illustration of the FBS method.
Fig. 2. Illustration of the FBS algorithm

The mixed sound shown in Fig.2 (2) is divided into the individual signals identified by DSBF (3). The solid line shows the frequency components of the target sound, and the broken line shows those of the other sound, i.e., the components attenuated by DSBF. (4) illustrates the comparison of the magnitudes of the frequency components of (d) and (e): the components of (d) and (e) are compared, and the attenuated components are eliminated. Frequencies that contain no target sound are eliminated completely. (5) (f) and (g) show the frequency spectra of (d) and (e) after applying the FBS method, respectively. Let the frequency components of (d), (e), (f) and (g) be X_d(ω_i), X_e(ω_i), X_f(ω_i) and X_g(ω_i), respectively. This process is expressed as:
X_f(ω_i) = X_d(ω_i) if X_d(ω_i) ≥ X_e(ω_i), 0 otherwise    (4)

X_g(ω_i) = X_e(ω_i) if X_e(ω_i) ≥ X_d(ω_i), 0 otherwise    (5)
By using the FBS method, multiple sound sources can in theory be separated perfectly under the assumption that the frequency components of the target sound and those of an interfering sound do not overlap. In many cases, the sparseness of human speech spectra makes this assumption a valid one.
III. COMPOSITION OF THE THREE RING MICROPHONE ARRAY
Fig.3 shows the three ring microphone array we developed. 8 microphones are arranged at equal intervals on the outer circumference (440 mm in diameter). Three rings of 200 mm in diameter are arranged inside the outer circle, with 8 microphones arranged on the circumference of each ring. The microphone amplifiers are installed on the substrate. The total number of microphones in the array is 32. "ART-Linux" is used as a real-time operating system to keep a fixed cycle of 23 µs, and a PCI 128-channel simultaneous-input analog-to-digital (A/D) board is used for simultaneous multi-channel sampling. The sampling rate is 22 kHz, and the resolution of the A/D board is 14 bits. The frame length for sound localization and sound separation is 4096 points, and the shift length for sound separation by FBS is 512 points. A Hamming window is used as the FFT window function.
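With the stated parameters (4096-point frames, 512-point shift, Hamming window), the FFT front end can be sketched as follows. This is only an illustration of the framing described above; the function name and the use of a real FFT are our assumptions.

```python
import numpy as np

FRAME = 4096   # frame length for localization and separation (points)
SHIFT = 512    # shift length for FBS separation (points)

def frames(x):
    """Split a waveform into Hamming-windowed frames and FFT each one."""
    win = np.hamming(FRAME)
    n = 1 + (len(x) - FRAME) // SHIFT
    return np.stack([np.fft.rfft(win * x[i * SHIFT : i * SHIFT + FRAME])
                     for i in range(n)])
```

At a 22 kHz sampling rate a 4096-point frame spans about 186 ms, and a 512-point shift about 23 ms.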
Fig. 3. Three ring microphone array

IV. SIMULATION
This section describes the beamforming simulation and the design of a microphone array suitable for a mobile robot. The array design was adjusted by analysing the simulated sound field distribution, which also indicates the effective frequency range for localization.

A. Equations of Sound Pressure
Let R_i be the distance from point C to the i-th microphone. The synthetic wave Q_C at point C is expressed as:

Q_C(t) = Σ_{i=1}^{N} (L_i / R_i) exp(j 2πf (t + (L_min + R_i − L_i) / V))    (6)

where f is the frequency of the signal. Let A_C be the amplitude of the synthetic wave at point C. A_C is expressed as:

A_C = |Q_C(t)| = sqrt(B_C^2 + C_C^2)    (7)

where

B_C = Σ_{i=1}^{N} (L_i / R_i) cos(2πf (L_min + R_i − L_i) / V)    (8)

C_C = Σ_{i=1}^{N} (L_i / R_i) sin(2πf (L_min + R_i − L_i) / V)    (9)

α = arctan(B_C / C_C)    (10)

Let P be sound pressure and P_0 be the standard sound pressure; in general, the sound pressure level SPL is expressed as SPL = 20 log(P / P_0). Let A_f be the amplitude of the synthetic wave at the focus. In the simulations, the sound pressure level SPL_C at point C is expressed as:

SPL_C = 20 log(A_C / A_f)    (11)

The sound pressure at the focus is therefore 0 dB.

B. Simulation Results
Fig.4 shows the microphone arrangement of our three ring microphone array, and the simulation results of the array are shown in Fig.5. Fig.6 shows the simulation of the sound pressure of an 8 channel circular microphone array. The X mark shows the position of the focus, and each white point shows the position of a microphone.

Fig. 4. Microphone arrangement
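The sensitivity maps discussed below follow directly from equations (6) to (11). A sketch of the computation in Python (a 2D geometry and the function name are our assumptions for illustration):

```python
import numpy as np

def spl_map(mic_pos, focus, points, f, v=340.0):
    """Simulated sound pressure level SPL_C of Eqs. (6)-(11) (sketch).

    mic_pos : (M, 2) microphone coordinates in metres
    focus   : (2,) beamformer focus
    points  : (P, 2) evaluation points C
    f       : frequency in Hz
    """
    L = np.linalg.norm(mic_pos - focus, axis=1)      # L_i
    def amplitude(c):
        R = np.linalg.norm(mic_pos - c, axis=1)      # R_i
        phase = 2 * np.pi * f * (L.min() + R - L) / v
        B = np.sum((L / R) * np.cos(phase))          # Eq. (8)
        C = np.sum((L / R) * np.sin(phase))          # Eq. (9)
        return np.hypot(B, C)                        # Eq. (7)
    Af = amplitude(np.asarray(focus))                # amplitude at the focus
    return np.array([20 * np.log10(amplitude(p) / Af) for p in points])  # Eq. (11)
```

Evaluating this over a grid of points C at each test frequency reproduces the kind of sensitivity map shown in Figs. 5 and 6.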
In Fig.5, the area of high sensitivity spreads widely in the range from 250 Hz to 750 Hz, so the accuracy of sound localization is expected to fall if a sound source frequency is within this range. At 1000 Hz, the area of high sensitivity forms a beam in the direction of the focus, so high localization accuracy can be expected in this range. Above 2000 Hz, the emphasis of a target sound is confirmed, although multiple side lobes are formed in various directions because of spatial aliasing.
In Fig.6, the area of high sensitivity is generally spread more widely than in Fig.5 because the intervals between microphones are very short. Only in the range from 1000 Hz to 2000 Hz is the emphasis of a target sound somewhat confirmed.
Fig. 5. Simulation results of the three ring microphone array

Fig. 6. Simulation results of the 8 channel circular microphone array

V. EXPERIMENTAL RESULTS

A. Frequency Characteristics
We conducted an experiment to confirm the frequency characteristics of the three ring microphone array in a real environment. A speaker is placed at 0 degrees, 1 m from the center of the microphone array. The speaker then generates sine waves (250, 500, 750, 1000, 2000 and 3000 Hz), and we calculate the frequency characteristics of the focused sound by DSBF at each position of 0, 22.5, 45, 67.5 and 90 degrees. The power spectrum of the input to a single microphone is 0 dB.
Fig.7 shows the frequency characteristics of the three ring microphone array. From the result, we confirmed that no suppression is seen at 250 Hz, that there is some effect at 500 Hz, and that frequency components above 750 Hz can be used for sound localization.

Fig. 7. Frequency characteristics of our system

B. Sound Localization
a) Horizontal Direction: The horizontal sound source localization experiments are evaluated using the time progress of sound pressure. In each experiment, sound sources are localized 100 consecutive times, for 1 or 2 sources, using the three ring microphone array and an 8 channel circular microphone array. From the 100 localization results, a 3D graph of the power in each direction over time was made. Each source generates music in the experiments; sound sources are localized after the frequency components from 1000 to 3000 Hz have been extracted with a band-pass filter.
Fig.8 (left) shows the result of horizontal localization of 1 sound source by the three ring microphone array, and Fig.8 (right) shows the result with 2 sound sources. In the case of 1 source, the highest peak always appears in the direction of the sound source, and the error is less than 3 degrees. In the case of 2 sources, the highest and the second highest peaks always appear in the directions of the 2 sources, and the error is less than 5 degrees. However, the signal-to-noise ratio (SNR) in directions other than those of the sources worsens by about 3 dB.

Fig. 8. Sound horizontal localization results by the three ring array
Fig.9 (left) shows the result of horizontal localization of 1 sound source by the 8 channel microphone array, and Fig.9 (right) shows the result with 2 sound sources. The localization accuracy is worse than that of the three ring array: the error is around 10 degrees, and the SNR in directions other than those of the sources worsens by about 5 dB, because the 8 channel circular microphone array is very small and a sufficiently large phase difference cannot be obtained.
Fig. 9. Sound horizontal localization results by the 8 channel circular array

b) Vertical Direction: A speaker as a sound source is placed 1 m from the microphone array, at a horizontal direction of 90 degrees and a vertical direction of 30 degrees. The source generates music, and sound localization is performed by the three ring microphone array.
Fig.10 shows the result of sound height localization. From the result, the error is less than 6 degrees, and the localization performance improves by about 3 dB compared with the case in which height is not considered.

Fig. 10. Sound height localization result

c) Distance: A speaker as a sound source is placed 500 mm from the microphone array. The sound source generates music, and the sound distance is localized 100 times. The same experiment is repeated at positions of 600, 700, ..., 1900 and 2000 mm from the microphone array, at intervals of 100 mm, and the average and maximum errors are calculated from the results.
Fig.11 shows the experimental results of sound distance localization. From the maximum error, the error is mostly less than 300 mm if the distance is less than 1 m. Moreover, from the average error, the average settles to the true value for a fixed source if the distance is less than 1.5 m. From the above, we confirmed that the distance could be localized in the neighborhood of about 1 m.

Fig. 11. Sound distance localization results

C. Sound Separation
Sound separation is evaluated by separating synthetic waves made from multiple sine waves. 2 speakers as sound sources are placed at position A, 1000 mm from the microphone array at an angle of 0 degrees, and at position B, 1000 mm away at 30 degrees. The source at A generates a synthetic wave made from sine waves of 480, 740, 1000, 1510, 2010 and 3010 Hz, and the source at B generates one made from sine waves of 620, 875, 1250, 1760, 2200 and 3300 Hz. Each source is separated using the DSBF method alone and the combination of DSBF and FBS.
Fig.12 shows the power spectrum of a single microphone. The power spectrum of A is larger than that of B because there is a power difference between the sound from A and the sound from B. Fig.13 shows the separation result of A by the DSBF method alone; compared with Fig.12, the difference between the power spectrum of A and that of B becomes larger. Fig.14 shows the separation result of B by the DSBF method alone; although the power spectrum of A is larger than that of B in Fig.12, the difference between A and B becomes small. Fig.15 shows the separation result of A by DSBF and FBS: the frequency components from B become almost as small as the surrounding noise, while the frequency components from A remain as they were. The separation performance is improved by about 30 dB compared with the result using DSBF alone. Fig.16 shows the separation result of B by DSBF and FBS; the separation performance is improved by 30 to 60 dB. From these results, we confirmed that the sound separation performance is greatly improved by using both DSBF and FBS, and that the FBS algorithm can be used even when the difference between the powers of multiple sound sources is very large.
Fig. 12. Power spectrum of a single microphone

Fig. 13. Separation result of A by DSBF

Fig. 14. Separation result of B by DSBF

Fig. 16. Separation result of B by DSBF and FBS
VI. CONCLUSIONS
In this paper, the 3D sound localization and separation performance of a three ring microphone array is evaluated. In the sound localization experiments, the horizontal direction could be localized with an error of less than 5 degrees. The localization performance could be improved by about 3 dB by considering the vertical direction of a sound source, in the case of a vertical direction of 30 degrees, and sound height localization itself could be achieved with an error of less than 6 degrees. Sound distance localization could be done with an error of less than 300 mm when the distance is less than about 1 m. In the sound separation experiments, the performance could be improved by more than 30 dB by using both the DSBF method and the FBS method.
For future work, our microphone array system will be mounted on a mobile robot for robot audition, and its sound localization and separation performance must be evaluated. The sound separation performance should then be evaluated quantitatively through speech recognition. In speech recognition, the distortion of high-frequency components caused by limited separation accuracy and by the frequency characteristics of the microphones influences the results, so the introduction of a compensating filter is to be investigated. Finally, we will make our three ring microphone array perform sound localization, sound separation and speech recognition simultaneously.
Fig. 15. Separation result of A by DSBF and FBS

REFERENCES
[1] Y. Tamai, S. Kagami, Y. Amemiya and H. Nagashima: "Circular Microphone Array for Robot's Audition", Proceedings of the Third IEEE International Conference on Sensors (SENSORS 2004), 2004.
[2] S. Yamamoto, K. Nakadai, H. Tsujino, T. Yokoyama and H. G. Okuno: "Improvement of robot audition by interfacing sound source separation and automatic speech recognition with missing feature theory", Proceedings of the 2004 IEEE International Conference on Robotics and Automation (ICRA 2004), pp. 1517-1523, 2004.
[3] D. E. Sturim, M. S. Brandstein and H. F. Silverman: "Tracking multiple talkers using microphone-array measurements", Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-97), 1997.
[4] T. Sawada, T. Sekiya, S. Ogawa and T. Kobayashi: "Recognition of the Mixed Speech based on Multi-Stage Audio Segregation", Proceedings of the 18th Meeting of Special Interest Group on AI Challenges, pp. 27-32, 2003.