Three Ring Microphone Array for 3D Sound Localization and Separation for Mobile Robot Audition

Yuki TAMAI1,2, Yoko SASAKI1,2, Satoshi KAGAMI2,1,3, Hiroshi MIZOGUCHI1,2
1 Department of Mechanical Engineering, Tokyo University of Science, 2641 Yamazaki, Noda-shi, Chiba, Japan
2 Digital Human Research Center, National Institute of Advanced Industrial Science and Technology, 3F 2-41-6 Aomi, Koto-ku, Tokyo
3 PRESTO, Japan Science and Technology Agency
{j7503636@ed.noda.tus.ac.jp; y-sasaki@aist.go.jp; s.kagami@aist.go.jp; hm@rs.noda.tus.ac.jp}

Abstract— This paper describes a three-ring microphone array that estimates the horizontal and vertical direction and the distance of sound sources, and separates multiple sound sources, for mobile robot audition. The arrangement of the microphones is simulated, and an optimized pattern with three rings is implemented with 32 microphones. Sound localization and separation are achieved by Delay and Sum Beamforming (DSBF) and Frequency Band Selection (FBS). On-line experiments on horizontal and vertical localization confirmed that one or two sound sources can be localized with an error of about 5 degrees and of 200 to 300 mm at a distance of about 1 m. Off-line sound separation experiments were evaluated using the power spectra of the separated sounds at each frequency, confirming that an appropriate frequency band can be selected by DSBF and FBS. The system can separate three speech sources of different sound pressures without the quieter sources being drowned out.

Index Terms— Sound localization, Sound separation, Delay and Sum Beamforming, Frequency Band Selection

I. INTRODUCTION

We are conducting research centered on human-robot interaction [1]. Recently, research on robot audition has become active as personal computers have grown faster, and multiple-microphone systems for robot audition have been implemented on humanoid robots [2].
However, the accuracy of sound localization, sound separation and speech recognition in real environments still has room for improvement. In this research, we propose a three-ring microphone array that estimates the distance and direction of sound sources and separates them for mobile robot audition. The arrangement of the microphones is simulated, and an optimized three-ring pattern is implemented with 32 elements. In Section 2, the algorithms for sound localization and separation are described. Sound localization is achieved by the Delay and Sum Beamforming (DSBF) method [3]. Sound separation is achieved by DSBF combined with Frequency Band Selection (FBS), which was proposed in [4]. In Section 3, the composition of the three-ring microphone array is described. In our microphone array, "ART-Linux" is used as a real-time operating system to keep a fixed cycle of 23 µs, and a PCI 128-channel simultaneous-input analog-to-digital (A/D) board is used for multi-channel simultaneous sampling. In Section 4, simulation results are described. Sound sensitivity maps at each frequency are computed for the three-ring arrangement, and the frequency band usable for sound localization is determined from the simulation results. In Section 5, experimental results are described. First, the frequency characteristics of the three-ring microphone array are measured, and the ability of DSBF to emphasize sound from only the focused direction is confirmed at each frequency. Next, horizontal localization results for one and two sources are evaluated over time. Vertical localization is evaluated with a 3D graph whose axes are horizontal direction, vertical direction and power. Then, distance localization results are evaluated by the average error and the probability distribution.
Finally, the performance of sound separation using the DSBF and FBS methods is evaluated through the power spectra of the separated sounds at each frequency. In Section 6, a summary of this research is given and future work is described.

II. ALGORITHM

A. Sound Localization

In this research, sound direction localization is achieved using the Delay and Sum Beamforming (DSBF) algorithm. Sound distance localization is achieved by triangulation using the localization results of the three circular microphone rings. DSBF forms a strong directional characteristic toward the source by aligning the time shifts and amplitudes of the sound waves input from each microphone and then adding them. Fig. 1 shows an illustration of the DSBF algorithm. Aligning the phase of each signal amplifies the dominant sound and attenuates ambient noise. The array steering of DSBF is as follows: let L_i be the distance from the focus to the i-th (i = 1, 2, ..., N) microphone, and let the minimum of the L_i be L_min.

Authorized licensed use limited to: University of Glasgow. Downloaded on February 07, 2023 at 12:28:44 UTC from IEEE Xplore. Restrictions apply.

Fig. 1. The image of the DSBF algorithm

The attenuation A_i and delay D_i of the i-th microphone are expressed as:

A_i = \frac{L_i}{L_s}   (1)

D_i = \frac{L_i - L_{\min}}{V}   (2)

where L_s is the standard length and V is the speed of sound. Let X_i(t) be the sound wave input from the i-th microphone; the synthetic sound wave S(t) is then:

S(t) = \sum_{i=1}^{N} X_i(t + D_i)   (3)

Steering the focus yields the spatial spectrum. Our system sets the focus 2 m from the array center and samples 180 data points at 2-degree intervals. Because the array has three circular microphone rings, sound distance localization is achieved by triangulation: from two direction estimates obtained at different positions, the distance is estimated by calculating the cross point of the two bearing vectors.
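As a concrete illustration of Eqs. (1)-(3) and the triangulation step, the following sketch (ours, not the authors' implementation) scans a 2 m focus circle at 2-degree steps and intersects two bearings. The ring geometry, sampling rate and all names are assumptions based on the hardware description in this paper; the per-channel gain of Eq. (1) is omitted, matching the plain sum in Eq. (3).

```python
import numpy as np

V = 340.0   # speed of sound [m/s] (assumed)
FS = 22000  # sampling rate [Hz], as in the paper's hardware

def ring_mics(n=8, diameter=0.44):
    """n microphones equally spaced on a circle (outer ring: 440 mm)."""
    a = 2 * np.pi * np.arange(n) / n
    return 0.5 * diameter * np.stack([np.cos(a), np.sin(a)], axis=1)

def dsbf_power(signals, mics, focus):
    """Mean power of S(t) = sum_i X_i(t + D_i), with D_i = (L_i - L_min)/V."""
    L = np.linalg.norm(mics - focus, axis=1)               # L_i
    shifts = np.round((L - L.min()) / V * FS).astype(int)  # Eq. (2) in samples
    n = signals.shape[1] - shifts.max()
    s = sum(x[k:k + n] for x, k in zip(signals, shifts))   # Eq. (3)
    return np.mean(s ** 2)

def spatial_spectrum(signals, mics, radius=2.0, step_deg=2):
    """Steer the focus around a 2 m circle at 2-degree intervals
    (180 points), as the paper's system does."""
    th = np.deg2rad(np.arange(0, 360, step_deg))
    foci = radius * np.stack([np.cos(th), np.sin(th)], axis=1)
    return np.array([dsbf_power(signals, mics, f) for f in foci])

def cross_point(p1, th1, p2, th2):
    """Distance by triangulation: intersect the bearing rays from two
    ring centres p1, p2 (bearing angles th1, th2)."""
    d1 = np.array([np.cos(th1), np.sin(th1)])
    d2 = np.array([np.cos(th2), np.sin(th2)])
    t, _ = np.linalg.solve(np.stack([d1, -d2], axis=1),
                           np.asarray(p2, float) - np.asarray(p1, float))
    return np.asarray(p1, float) + t * d1
```

The spatial spectrum peaks at the focus direction whose delays best align the channels; the source distance then follows from the crossing point of two such bearings.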
Moreover, our three-ring microphone array can estimate the vertical direction even though the microphones are arranged in a plane parallel to the ground: sound vertical-angle localization is achieved by the DSBF method with the height of the sound source taken into consideration.

B. Sound Separation

Although sound from sources other than the focus is weakened by the DSBF method, it cannot be eliminated completely. In order to eliminate the frequency components of sounds from sources other than the focus, the Frequency Band Selection (FBS) method is applied after DSBF has emphasized the focus. Fig. 2 shows an illustration of the FBS method. The mixed sound shown in Fig. 2 (2) is divided into the individual signals identified by DSBF (3). The solid line shows the frequency components of the target sound, and the broken line shows those of the other sound, i.e., the components attenuated by DSBF.

Fig. 2. The image of the FBS algorithm

(4) shows the comparison of the levels of the frequency components of (d) and (e): each frequency component of (d) and (e) is compared, and the attenuated components are eliminated. Frequencies containing no target sound are eliminated completely. (5): (f) and (g) show the frequency spectra of (d) and (e) after applying the FBS method, respectively. Let the frequency components of (d), (e), (f) and (g) be X_d(ω_i), X_e(ω_i), X_f(ω_i) and X_g(ω_i) respectively. The process is expressed as:

X_f(\omega_i) = \begin{cases} X_d(\omega_i) & \text{if } X_d(\omega_i) \ge X_e(\omega_i) \\ 0 & \text{otherwise} \end{cases}   (4)

X_g(\omega_i) = \begin{cases} X_e(\omega_i) & \text{if } X_e(\omega_i) \ge X_d(\omega_i) \\ 0 & \text{otherwise} \end{cases}   (5)

By using the FBS method, multiple sound sources can in theory be separated perfectly under the assumption that the frequency components of the target sound and those of the interfering sound do not overlap. In many cases, the sparseness of human voices in the frequency domain makes this a valid assumption.
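Eqs. (4)-(5) amount to a binary mask over FFT bins: each bin is kept only in whichever DSBF output is louder there and zeroed in the other. A minimal sketch (ours, not the authors' code; the function name and frame length are assumptions, the 4096-point Hamming window follows the paper's setup):

```python
import numpy as np

def fbs_separate(beam_d, beam_e, n_fft=4096):
    """Binary-mask separation of two DSBF outputs d and e; returns the
    masked spectra X_f, X_g of Eqs. (4) and (5)."""
    win = np.hamming(n_fft)                 # Hamming window, as in the paper
    Xd = np.fft.rfft(beam_d[:n_fft] * win)
    Xe = np.fft.rfft(beam_e[:n_fft] * win)
    Xf = np.where(np.abs(Xd) >= np.abs(Xe), Xd, 0.0)  # Eq. (4)
    Xg = np.where(np.abs(Xe) >= np.abs(Xd), Xe, 0.0)  # Eq. (5)
    return Xf, Xg
```

Because every bin is awarded entirely to one output, interfering components that DSBF only attenuated are removed completely, at the cost of losing bins where both sources genuinely overlap.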
III. COMPOSITION OF THE THREE-RING MICROPHONE ARRAY

Fig. 3 shows the three-ring microphone array we developed. Eight microphones are arranged at equal intervals on the outer circumference (440 mm in diameter). Three rings of 200 mm diameter are arranged inside the outer circle, and eight microphones are arranged on the circumference of each of these rings. The microphone amplifiers are installed on the substrate. The total number of microphones in the array is 32. "ART-Linux" is used as a real-time operating system to keep a fixed cycle of 23 µs, and a PCI 128-channel simultaneous-input analog-to-digital (A/D) board is used for simultaneous multi-channel sampling. The sampling rate is 22 kHz, and the resolution of the A/D board is 14 bits. The frame length for sound localization and separation is 4096 points, and the shift length for sound separation by FBS is 512 points. A Hamming window is used as the window function in the FFT.

Fig. 3. Three-ring microphone array

IV. SIMULATION

This section describes the beamforming simulation and the design of a microphone array suitable for a mobile robot. The array design was adjusted by analysing the sound field distribution, which also indicates the effective frequencies for localization.

A. Equations of Sound Pressure

Let R_i be the distance from point C to the i-th microphone. The synthetic wave Q_C at point C is then expressed as:

Q_C(t) = \sum_{i=1}^{N} \frac{L_i}{R_i} \exp\!\left( j 2\pi f \left( t + \frac{L_{\min} + R_i - L_i}{V} \right) \right)   (6)

where f is the frequency of the signal. Let A_C be the amplitude of the synthetic wave at point C:

A_C = |Q_C(t)| = \sqrt{B_C^2 + C_C^2}   (7)

where

B_C = \sum_{i=1}^{N} \frac{L_i}{R_i} \cos\!\left( 2\pi f \, \frac{L_{\min} + R_i - L_i}{V} \right)   (8)

C_C = \sum_{i=1}^{N} \frac{L_i}{R_i} \sin\!\left( 2\pi f \, \frac{L_{\min} + R_i - L_i}{V} \right)   (9)

and the phase of the synthetic wave is

\alpha = \arctan \frac{B_C}{C_C}   (10)

Let P be a sound pressure and P_0 the reference sound pressure; in general, the sound pressure level SPL is expressed as SPL = 20 \log_{10} (P / P_0). Let A_f be the amplitude of the synthetic wave at the focus. In the simulations, the sound pressure level SPL_C at point C is expressed as:

SPL_C = 20 \log_{10} \frac{A_C}{A_f}   (11)

The sound pressure at the focus is thus 0 dB.

B. Simulation Results

Fig. 4 shows the microphone arrangement of our three-ring microphone array, and the simulation results for this array are shown in Fig. 5. Fig. 6 shows the simulated sound pressure of an 8-channel circular microphone array. The X mark shows the position of the focus, and each white point shows the position of a microphone.

Fig. 4. Microphone arrangement

In Fig. 5, the area of high sensitivity spreads widely in the range from 250 Hz to 750 Hz, so the accuracy of sound localization is expected to fall when the source frequency is within this range. At 1000 Hz, the area of high sensitivity forms a beam in the direction of the focus, so high localization accuracy can be expected. In the range above 2000 Hz, the emphasis of a target sound is confirmed, although multiple side lobes are formed in various directions because of spatial aliasing. In Fig. 6, the area of high sensitivity is generally spread more widely than in Fig. 5 because the intervals between the microphones are very short; only in the range from 1000 Hz to 2000 Hz is some emphasis of a target sound confirmed.

Fig. 5. Simulation results of the three-ring microphone array

Fig. 6. Simulation results of the 8-channel circular microphone array

V. EXPERIMENTAL RESULTS

A. Frequency Characteristics

We conducted an experiment to confirm the frequency characteristics of the three-ring microphone array in a real environment. A speaker is placed at 0 degrees, 1 m from the center of the microphone array. The speaker generates sine waves (250, 500, 750, 1000, 2000 and 3000 Hz), and we calculate the frequency characteristics of the sound focused by DSBF at each position of 0, 22.5, 45, 67.5 and 90 degrees. The power spectrum of the input to a single microphone is taken as 0 dB. Fig. 7 shows the resulting frequency characteristics of the three-ring microphone array. From the result, we confirmed that no suppression is seen at 250 Hz, that there is some effect at 500 Hz, and that frequency components above 750 Hz can be used for sound localization.

Fig. 7. Frequency characteristics of our system

B. Sound Localization

a) Horizontal Direction: The sound source horizontal localization experiments are evaluated using the time progress of the sound pressure. In each experiment, sound sources are localized 100 consecutive times, for one or two sources, using the three-ring microphone array and an 8-channel circular microphone array. From the 100 localization results, a 3D time-progress graph of the power in each direction was made. Each source plays music; however, the sources are localized after the frequency components from 1000 to 3000 Hz have been extracted with a band-pass filter. Fig. 8 (left) shows the result of localizing one sound source horizontally with the three-ring microphone array, and Fig. 8 (right) shows the result with two sources. With one source, the highest peak always appears in the direction of the source, and the error is less than 3 degrees. With two sources, the highest and second-highest peaks always appear in the directions of the two sources, and the error is less than 5 degrees; however, the signal-to-noise ratio (SNR) in directions other than those of the sources worsens by about 3 dB.

Fig. 8. Sound horizontal localization results with the three-ring array
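The sensitivity-map simulation of Section IV (Eqs. (6)-(11)) can be sketched as follows. This is our illustrative code, not the authors' simulator: with the array focused on one point, the synthetic-wave amplitude A_C is evaluated at another point C and expressed in dB relative to the focus. The ring geometry and function names are assumptions.

```python
import numpy as np

V = 340.0  # speed of sound [m/s] (assumed)

def ring_mics(n=8, diameter=0.44):
    """8 microphones on the 440 mm outer ring (assumed geometry)."""
    a = 2 * np.pi * np.arange(n) / n
    return 0.5 * diameter * np.stack([np.cos(a), np.sin(a)], axis=1)

def amplitude_at(point, mics, focus, f):
    """A_C = sqrt(B_C^2 + C_C^2), Eqs. (7)-(9)."""
    L = np.linalg.norm(mics - focus, axis=1)   # focus distances L_i
    R = np.linalg.norm(mics - point, axis=1)   # distances R_i to point C
    phase = 2 * np.pi * f * (L.min() + R - L) / V
    B = np.sum(L / R * np.cos(phase))          # Eq. (8)
    C = np.sum(L / R * np.sin(phase))          # Eq. (9)
    return float(np.hypot(B, C))

def spl_at(point, mics, focus, f):
    """SPL_C = 20 log10(A_C / A_f), Eq. (11); 0 dB at the focus."""
    a_f = amplitude_at(focus, mics, focus, f)
    return 20 * np.log10(amplitude_at(point, mics, focus, f) / a_f)
```

Evaluating spl_at over a grid of points C at a fixed frequency reproduces one sensitivity map of the kind shown in Figs. 5 and 6; at the focus itself all phase terms coincide, so the level is exactly 0 dB.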
Fig. 9 (left) shows the result of localizing one sound source horizontally with the 8-channel circular microphone array, and Fig. 9 (right) shows the result with two sources. The localization accuracy is worse than that of the three-ring array: the error is around 10 degrees, and the SNR in directions other than those of the sources worsens by about 5 dB, because the 8-channel circular array is very small and a sufficiently large phase difference cannot be obtained.

Fig. 9. Sound horizontal localization results with the 8-channel circular array

b) Vertical Direction: A speaker is placed as a sound source 1 m from the microphone array, at a horizontal direction of 90 degrees and a vertical direction of 30 degrees, and plays music. Localization is performed with the three-ring microphone array. Fig. 10 shows the result of sound height localization. From the result, the error is less than 6 degrees, and the localization performance improves by about 3 dB compared with the case where height is not taken into consideration.

Fig. 10. Sound height localization result

c) Distance: A speaker is placed as a sound source 500 mm from the microphone array and plays music, and the sound distance is localized 100 times. The same experiment is repeated at 600, 700, …, 1900 and 2000 mm from the array, at intervals of 100 mm, and the average and maximum errors are calculated from the results. Fig. 11 shows the experimental results of sound distance localization. From the maximum error, the error is mostly less than 300 mm when the distance is less than 1 m. Moreover, from the average error, the average settles to the true value for a fixed source when the distance is less than 1.5 m. From the above, we confirmed that the distance can be localized in the neighborhood of about 1 m.

Fig. 11. Sound distance localization results

C. Sound Separation

Sound separation is evaluated by separating synthetic waves made from multiple sine waves. Two speakers are placed as sound sources: one at position A, 1000 mm from the microphone array at an angle of 0 degrees, and one at position B, 1000 mm away at 30 degrees. The source at A generates a synthetic wave made from sine waves of 480, 740, 1000, 1510, 2010 and 3010 Hz, and the source at B generates one made from sine waves of 620, 875, 1250, 1760, 2200 and 3300 Hz. Each source is separated using the DSBF method alone and using DSBF combined with FBS. Fig. 12 shows the power spectrum at a single microphone; the power spectrum of A is larger than that of B because there is a power difference between the two sources. Fig. 13 shows the separation result for A by DSBF alone: compared with Fig. 12, the difference between the power spectra of A and B becomes larger. Fig. 14 shows the separation result for B by DSBF alone: although the power spectrum of A is larger than that of B in Fig. 12, the difference between A and B becomes small. Fig. 15 shows the separation result for A by DSBF and FBS: the frequency components from B are almost as small as the surrounding noise, while the components from A remain as they were, and the separation performance improves by about 30 dB compared with DSBF alone. Fig. 16 shows the separation result for B by DSBF and FBS; here the separation performance improves by 30 to 60 dB. From these results, we confirmed that the performance of sound separation is greatly improved by using both DSBF and FBS, and that the FBS algorithm can be used even when the difference between the powers of the sound sources is very large.

Fig.
12. The power spectrum of a single microphone

Fig. 13. Separation result for A by DSBF

Fig. 14. Separation result for B by DSBF

Fig. 15. Separation result for A by DSBF and FBS

Fig. 16. Separation result for B by DSBF and FBS

VI. CONCLUSIONS

In this paper, the 3D sound localization and separation performance of a three-ring microphone array is evaluated. In the localization experiments, the horizontal direction could be localized with an error of less than 5 degrees. Localization performance improved by about 3 dB when the vertical direction of the source (30 degrees in our experiment) was taken into consideration, and sound height itself could be localized with an error of less than 6 degrees. Sound distance could be localized with an error of less than 300 mm when the distance was less than about 1 m. In the separation experiments, performance improved by more than 30 dB when both the DSBF method and the FBS method were used. In future work, our microphone array system will be mounted on a mobile robot for robot audition, and its localization and separation performance will be evaluated there. The separation performance also has to be evaluated quantitatively through speech recognition; because the distortion of high-frequency components caused by the separation accuracy and by the frequency characteristics of the microphones influences recognition results, the introduction of a compensating filter will be investigated. Finally, we will make our three-ring microphone array perform sound localization, sound separation and speech recognition simultaneously.

REFERENCES

[1] Y. Tamai, S. Kagami, Y. Amemiya and H. Nagashima, "Circular Microphone Array for Robot's Audition", Proceedings of the Third IEEE International Conference on Sensors (SENSORS 2004), 2004.
[2] S. Yamamoto, K. Nakadai, H. Tsujino, T. Yokoyama and H. G. Okuno, "Improvement of robot audition by interfacing sound source separation and automatic speech recognition with missing feature theory", Proceedings of the 2004 IEEE International Conference on Robotics and Automation (ICRA 2004), pp. 1517-1523, 2004.
[3] D. E. Sturim, M. S. Brandstein and H. F. Silverman, "Tracking multiple talkers using microphone-array measurements", Proceedings of the 1997 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-97), IEEE, 1997.
[4] T. Sawada, T. Sekiya, S. Ogawa and T. Kobayashi, "Recognition of Mixed Speech based on Multi-stage Audio Segregation", Proceedings of the 18th Meeting of the Special Interest Group on AI Challenges, pp. 27-32, 2003.