Article

Room Volume Estimation Based on Ambiguity of Short-Term Interaural Phase Differences Using Humanoid Robot Head †

Ryuichi Shimoyama 1,* and Reo Fukuda 2

1 College of Industrial Technology, Nihon University, Tokyo 275-8575, Japan
2 System Research Division, Image Information Systems Laboratory, Canon Electronics Inc., Tokyo 105-0011, Japan; fukuda.reo@canon-elec.co.jp
* Correspondence: shimoyama.ryuichi@nihon-u.ac.jp; Tel.: +81-47-474-2418
† This paper is an extended version of our paper published in Proceedings of the 23rd International Conference on Robotics in Alpe-Adria-Danube Region (RAAD), Smolenice, Slovakia, 3-5 September 2014.

Academic Editor: Huosheng Hu
Received: 16 June 2016; Accepted: 14 July 2016; Published: 21 July 2016

Abstract: Humans can recognize approximate room size using only binaural audition. However, sound reverberation is not negligible in most environments. Reverberation causes temporal fluctuations in the short-term interaural phase differences (IPDs) of sound pressure. This study proposes a novel method by which a binaural humanoid robot head can estimate room volume. The method is based on the statistical properties of the short-term IPDs of sound pressure. The humanoid robot turns its head toward a sound source, recognizes the sound source, and then estimates the ego-centric distance by its stereovision. By interpolating the experimentally obtained relations between room volume, average standard deviation, and ego-centric distance stored in a prepared database for various rooms, the robot estimates the room volume by binaural audition from the average standard deviation of the short-term IPDs at the estimated distance.

Keywords: room volume; estimation; short-term IPD; biologically inspired; binaural audition; humanoid robot
PACS: J0101

1. Introduction
An autonomous robot needs to know its actual location, orientation, and surrounding environment in order to turn around or otherwise move safely in unregulated environments. The robot can localize its position visually in a well-regulated environment, in which every object is recognized by its position. A monitoring system may send the robot much information on the surrounding environment, either manually or automatically, via networked sensors. In unregulated environments, such as those destroyed by disaster, or when sensors are malfunctioning, visual information is not enough. However, an additional acoustic sensing system could provide the robot with a redundant way of recognizing its location and changes in the environment.

It is well known that a humanoid robot can adapt easily to human living environments due to its physical shape, size, and appearance. The audition of humanoid robots has conventionally been an array system rather than a binaural system. As examples of array systems, multiple lightweight microphones are mounted on the heads of the humanoid robots ASIMO (Advanced Steps in Innovative Mobility) and HRP-2 (Humanoid Robotics Project-2) [1,2]. Multiple microphones are necessary for realizing multiple acoustical functions. In conventional design, the number of microphones and the algorithm are selected before developing the hardware. Since humans can recognize approximate room size by binaural hearing alone, this work examines the possibility of binaural audition for room volume estimation by a humanoid robot, in a reverse approach from hardware design. Binaural audition is a kind of biologically inspired approach.

Robotics 2016, 5, 16; doi:10.3390/robotics5030016; www.mdpi.com/journal/robotics

To estimate the volume of a room, the impulse room response has conventionally been used.
Reverberation time, which is related to room volume, is calculated from the damping characteristic of the impulse response measured by one microphone in a room [3,4]. Binaural room impulse responses can also be used [5-9]. The conventional room impulse response is measured actively, so the measurement is noisy and interferes with our living environments. Binaural audition is passive and does not interfere with our environments. Studies on sound source localization and on distance or trajectory estimation from binaural signals have been reported [10-12]. Sound reverberation is not negligible in reverberant environments such as enclosed spaces (except for an anechoic room), since reverberation comprises significant components of the received signal. The ratio of the energies of direct and reflected sound depends on the perceived distance of a sound source [13-17]. In psychophysical experiments, the effect of reflected sound on the received sound depended on the positions of the sound source and the receiver in a room. The absolute position or trajectory of the sound source in a room has usually been the focus of the estimation. Reverberation causes temporal fluctuations in short-term interaural phase differences (IPDs) and interaural level differences (ILDs) [8,11]. The statistical properties of the received sound signal have been studied for sound source location classification and distance estimation [11,18,19]. Hu et al. discussed the influence of the amplitude variation of non-stationary sound sources on the distribution pattern of IPDs and ILDs when the short-term Fourier transform is used [18]. Their method is invalid in an unknown room or a changing environment. Georganti et al. proposed a method for the binaural distance estimation of a sound source from a voice sound [11]. The accuracy of their estimation decreased for shorter time frames. Reverberation degrades the repeatability of measurement due to the temporal fluctuations in short-term IPDs and ILDs.
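The conventional route mentioned above, estimating reverberation time from the damping characteristic of a measured impulse response, can be sketched as follows. This is a minimal illustration using Schroeder backward integration on a synthetic impulse response; the function name, sampling rate, and decay constant are illustrative assumptions, not values from this paper.

```python
import math
import random

def rt60_schroeder(h, fs):
    """Estimate reverberation time RT60 [s] from a room impulse response h
    sampled at fs [Hz]: Schroeder backward integration of the squared
    response, then a line fit to the -5 dB..-25 dB part of the decay."""
    energy = [v * v for v in h]
    decay = list(energy)
    for i in range(len(decay) - 2, -1, -1):   # backward cumulative sum
        decay[i] += decay[i + 1]
    total = decay[0]
    curve = [(i / fs, 10.0 * math.log10(e / total))
             for i, e in enumerate(decay) if e > 0.0]
    pts = [(t, d) for t, d in curve if -25.0 <= d <= -5.0]
    n = len(pts)
    sx = sum(t for t, _ in pts)
    sy = sum(d for _, d in pts)
    sxx = sum(t * t for t, _ in pts)
    sxy = sum(t * d for t, d in pts)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # dB per second
    return -60.0 / slope

# Synthetic impulse response: noise with exponential amplitude decay
# exp(-t/tau); its energy decays at about 8.69/tau dB/s, so RT60 ~ 6.91*tau.
random.seed(0)
fs, tau = 8000, 0.05
h = [math.exp(-i / (fs * tau)) * random.uniform(-1.0, 1.0) for i in range(fs)]
```

For tau = 0.05 s the estimate should come out near 0.35 s, illustrating why an actively measured response is informative but intrusive compared with the passive approach proposed here.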
No algorithm based on unrepeatable data has yet produced satisfactory results. Though a shorter time frame is valid for conducting temporal measurements, unrepeatable data are inadequate for exact measurement. In this study, a novel method for estimating room volume is proposed. The proposed method is based on the statistical properties of the short-term IPDs of sound pressure and uses a binaural humanoid robot. The time frame adopted is less than 1 s; in such a short time frame, the accuracy of estimation decreases in the conventional methods. The temporal fluctuation in short-term IPDs in rooms is evaluated as the average standard deviation. The humanoid robot turns its head toward a sound source, recognizes the object that is the sound source, and then estimates the ego-centric distance to the object. Ego-centric means that the robot is always at the origin of the local coordinates. The relative distance between the robot at the origin of the local coordinates and the sound source in a room is utilized for estimating room volume. The absolute locations of the robot and the sound source in the room are not the focus of this study. The room volume is estimated from the average standard deviation of the short-term IPDs at the estimated distance by interpolating the data in a prepared database. The database contains the relations between room volume, average standard deviation, and ego-centric distance experimentally obtained for various rooms. When acoustical properties that relate only to the room volume can be detected, the robot can identify various environments. This method may be a solution for recognizing unknown or changed environments where no information on the environment is available.

In Section 2, the methods and procedures are presented. The ego-centric distance estimation by robot vision and the statistical properties of sound obtained by robot audition are described. A procedure for evaluating the statistical properties of IPDs is described.
Other procedures include localizing the sound source, rotating the robot head toward the sound source, and preparing the database. The experimental measuring system is also described, together with the measurements and volumes of the eight rooms used. The statistical properties of IPDs in a short-term frequency analysis are shown in Section 3, along with the prepared database and the results of room volume estimations in unknown rooms. The effect of surrounding obstacles on the room volume estimation is also described. In Section 4, the experimental results are discussed in comparison with the results of conventional methods. Finally, in Section 5, a summary is given.

2. Methods and Procedures

2.1. Ego-Centric Distance Estimation by Robot Vision

Figure 1 shows the geometry of a humanoid robot head and a loudspeaker in a reverberation room. The ego-centric distance D to the loudspeaker is estimated by robot vision provided by two cameras.

Figure 1. Geometry of humanoid robot head and loudspeaker in reverberative rooms of L [m] x W [m] x H [m] (an overhead view).

The robot turns its head toward the loudspeaker by binaural audition, recognizes the object, and then estimates the ego-centric distance D to the loudspeaker by stereovision. Short-term IPDs of the binaural sound signals are utilized for identifying the room volume. The geometry of the two cameras mounted in the robot head and the object is shown in Figure 2. The distance from the center point between the two cameras to the object is defined as the ego-centric distance. It should be emphasized that Figure 2 shows no absolute location in the room. The object, which is the sound source, is detected by each camera.

Figure 2. Geometry of the stereo cameras mounted in the robot's eyes and the object as a sound source.

The ego-centric distance D to the object is estimated from two visual angles. In Figure 2, the two visual angles for the object are θ1 and θ2, and distance D is given by

D = S / (tan θ1 + tan θ2)    (1)

where S is the distance between the two cameras. Each visual angle is calculated as the difference between the center of the graphical image and the center position of the object recognized in the same image. The accuracy of the ego-centric distance D decreases with decreasing distance between the two cameras. The absolute difference ε between θ1 and θ2 is expressed as

ε = |θ1 - θ2|    (2)

If θ1 = θ2, the object is positioned at the same distance from each camera. The robot can rotate its head horizontally to minimize Equation (2). The robot can also adjust its head vertically to match the center of the object to the vertical center of the image. To find the exact center position of the object, the object pattern is calculated by a geometry matching algorithm [20]. The center position of the object is defined as the center of the matched template. The error of the object position in the graphical image may significantly affect the ego-centric distance estimation. Several template patterns photographed at various distances are used, and the best matching template is selected automatically for calculating the precise object position in the graphical image. The estimated distance D is corrected at several measuring positions by comparing with true distances, so that the error of the estimated ego-centric distance is below 0.02 m over the range from 0.5 m to 1.5 m for ε < 0.5 degrees.
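The stereovision step above can be sketched directly from Equations (1) and (2). The baseline S = 0.05 m matches the camera interval reported in Section 2.4; the function names are illustrative.

```python
import math

def ego_centric_distance(theta1_deg, theta2_deg, S=0.05):
    """Ego-centric distance from the two visual angles (Equation (1)).
    S is the baseline between the two cameras in metres."""
    t1 = math.tan(math.radians(theta1_deg))
    t2 = math.tan(math.radians(theta2_deg))
    return S / (t1 + t2)

def angle_mismatch(theta1_deg, theta2_deg):
    """Equation (2): the head is re-oriented until this is < 0.5 degrees."""
    return abs(theta1_deg - theta2_deg)

# An object centred between the cameras at 1 m subtends equal angles:
# tan(theta) = (S/2) / D, so theta = atan(0.025 / 1.0), about 1.43 degrees.
theta = math.degrees(math.atan(0.025 / 1.0))
D = ego_centric_distance(theta, theta)  # recovers 1.0 m
```

The sketch also shows why accuracy degrades with a short baseline: for S = 0.05 m, the visual angles change by well under a degree per centimetre of distance at the far end of the 0.5 m to 1.5 m range, which is why the template-based centre detection and the distance correction described above are needed.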
2.2. Estimation of Statistical Properties of Sound by Binaural Audition

Two acoustical signals, x(t) and y(t), are detected with two microphones. The cross-power spectrum G12 is expressed as

G12 = X(f) Y(f)*    (3)

where X(f) and Y(f) are the respective spectra after discrete Fourier transform (DFT) processing and * indicates the complex conjugate [20]. The phase of G12 (the IPD) is expressed as

∠G12 = ∠(X(f) Y*(f))    (4)

Figure 3a shows a sample of the IPDs of the sound pressure measured in a room. The discontinuity of the IPDs is quantified by the standard deviation.

Figure 3. Method for evaluating the statistical properties of binaural signals: (a) method for calculating the frequency spectrum of the standard deviation from short-term IPDs of binaural signals, using a window of width Δf around each center frequency fi; and (b) frequency spectrum of the standard deviation and its average value S.D.AVE from fmin to fmax.

If 2m + 1 data of the discrete phase difference are included within the frequency width Δf, where m is an integer, the standard deviation σi at center frequency fi is given by the following equation:

σi = sqrt[ (1 / (2m + 1)) Σ_{j = i-m}^{i+m} (φj - φ̄i)² ]    (5)

where φj is the phase difference value at frequency fj, and φ̄i is the average of the phase differences at frequency fi. The frequency spectrum of the standard deviation is obtained as the center frequency is shifted over the measured frequency range (Figure 3b). The average value S.D.AVE over the frequency range from fmin to fmax is given by

S.D.AVE = (1 / n) Σ_{i = 1}^{n} σi    (6)

where n data of the discrete standard deviation are included within the frequency range from fmin to fmax. S.D.AVE is a steady indicator that corresponds to the reverberation condition in the room. The frequency spectrum of the standard deviation is not always necessary, since the average standard deviation value can be directly calculated from the short-term IPDs of binaural signals, as shown in Figure 3a.
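The chain of Equations (3) to (6) can be sketched in a few self-contained lines. This is a deliberately naive version, with an O(N²) DFT, no phase unwrapping near ±180 degrees, and a tiny signal length; a practical implementation would use an FFT over the 4800-sample frames described in Section 2.4. All function names are illustrative.

```python
import cmath
import math
import random

def dft(x):
    """Naive DFT, adequate for a short sketch (use an FFT in practice)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def ipd_degrees(x, y):
    """Phase of the cross-power spectrum G12 = X(f) Y(f)*, Equations (3)-(4)."""
    X, Y = dft(x), dft(y)
    return [math.degrees(cmath.phase(Xk * Yk.conjugate()))
            for Xk, Yk in zip(X, Y)]

def sd_spectrum(phi, m):
    """Sliding standard deviation of the IPDs over 2m+1 bins, Equation (5)."""
    sigma = []
    for i in range(m, len(phi) - m):
        window = phi[i - m:i + m + 1]
        mean = sum(window) / len(window)
        sigma.append(math.sqrt(sum((p - mean) ** 2 for p in window)
                               / len(window)))
    return sigma

def sd_ave(sigma):
    """Average standard deviation S.D.AVE, Equation (6)."""
    return sum(sigma) / len(sigma)

# Identical left/right channels give zero IPD at every bin, so S.D.AVE = 0;
# reverberation perturbs the two channels differently and raises S.D.AVE.
random.seed(1)
x = [random.uniform(-1.0, 1.0) for _ in range(32)]
```

Note that on real binaural data the IPDs wrap at ±180 degrees, so the windowed standard deviation should be computed with this ambiguity in mind; the short-term fluctuation caused by reverberation is exactly what S.D.AVE is designed to capture.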
The frequency spectrum of the standard deviation is utilized for selecting the adequate frequency range that is most sensitive to the change of room volume.

2.3. Procedures

In an actual environment, sound is not always generated in front of the robot. In addition, the sound source cannot be distinguished from among several objects by using only the visual image captured by the two cameras. Therefore, after localizing the sound source, the robot head is rotated toward the sound source, and the object nearest to the center of the cameras is assumed to be the sound source. Additional procedures are necessary for the robot head to rotate toward the sound source and to calculate the ego-centric distance and room volume. The proposed algorithm is shown in Figure 4.
Figure 4. Proposed algorithm. The left and right signals x(t) and y(t) are transformed by DFT; the phase of the cross-power spectrum gives the IPDs, which are used both for finding the sound source azimuth (rotating the robot head toward the source) and for calculating the frequency spectra of the standard deviation and the evaluation function S.D.AVE over a limited frequency range. Together with the ego-centric distance D from robot vision, the room volume is estimated by interpolating the database.

Four procedures are additionally needed as pre-processes:

- Sound is generated in the room.
- The sound source is localized with binaural signals by using the localization algorithm CAVSPAC [21]. The robot head is rotated toward the sound source. Though there is ambiguity in the sound source's position in terms of front or back, the robot can distinguish front from back by rotating its head several times.
- The object is recognized by robot vision. The robot adjusts its head angle horizontally and vertically to satisfy ε < 0.5 degrees in Equation (2). This threshold value was based on the results of preliminary measurements.
- The ego-centric distance is calculated by Equation (1). The error of the estimated ego-centric distance is below 0.02 m over the range from 0.5 m to 1.5 m after correction.

The object that may be the sound source is recognized and the ego-centric distance is estimated by the above procedures.

- Binaural sound signals are measured during the measuring time frame. The cross-power spectral phase is calculated by Equations (3) and (4).
- The frequency spectrum of the standard deviation is obtained by Equation (5).
- S.D.AVE is calculated for evaluating the reverberation by Equation (6).
- The room volume is estimated by referring to the database, which contains the relations between the average standard deviation, the ego-centric distance, and the room volume, experimentally defined in advance.

2.4. Experimental Measuring System

Figure 5 shows the appearance of the humanoid robot head (M3-Neony, Vstone).
The interval between the two is approximated as a rotated ellipsoidal shell of 0.11 m (Length) ˆ 0.05 m (Radius). An auto-focus cameras(Q2F-00008, is approximately 0.05 was m. Earphone-type microphones 4101, Bruel & Kjӕr) were added camera Microsoft) mounted in each robot’s eye.(Type The interval between the two cameras for sound detection. The interval between the two microphones is approximately 0.10 m. The robot is approximately 0.05 m. Earphone-type microphones (Type 4101, Bruel & Kjær) were added for sound head was fixed on the pan tilt unit (PTU-46-17P70T, FLIR), in which motion is controlled detection. The interval between the two microphones is approximately 0.10 m. The robot headby wasa workstation. The angular of rotating the tilt unit are is 0.013 degreesby in apan and 0.003 fixed on the pan tilt unit resolutions (PTU-46-17P70T, FLIR), inpan which motion controlled workstation. degrees in tilt. The angular resolutions of rotating the pan tilt unit are 0.013 degrees in pan and 0.003 degrees in tilt. Auto-focus camera Microphone Pan tilt Unit Figure Figure 5. 5. Appearance of humanoid robot head. An auto-focus camera was mounted in each robot’s eye. Earphone-type microphones eye. Earphone-type microphones are are added added for for sound sound detection. detection. The The robot robot head head was was fixed fixed on on the the pan pan tilt tilt unit, unit, in in which which motion motion is is controlled controlled by by aa workstation. workstation. The configuration of the experimental measuring system is shown in Figure 6. Robotics 2016, 5, 16 7 of 15 Robotics 2016, 5, 16 7 of 15 The configuration of the experimental measuring system is shown in Figure 6. Robot head Loudspeaker X-Stage NEXUS2693, B&K T3500,Dell PCI-4474,N.I. EH-150,Hitachi PC Figure 6. 6. Measuring Measuring system system configuration. configuration. 
Figure The The measurement measurement and data processing processing of the the acoustical acoustical signals signals were were conducted conducted on on aa workstation workstation (T3500, Dell) Dell)equipped equipped with a 24-bit resolution A/D converter (PCI-4474, N.I.). The frequency sampling with a 24-bit resolution A/D converter (PCI-4474, N.I.). The sampling frequency was 24frame kHz. One frame was 0.2 s, which included for each This was 24 kHz. One period was period 0.2 s, which included 4800 data for4800 eachdata channel. Thischannel. frame period frame period isshorter remarkably shorter than in that in the conventional methods.noise Broadband noise is remarkably than that adapted theadapted conventional methods. Broadband was radiated was radiated via the(0.19 loudspeaker (0.19mm(W) (H)ˆ×0.13 0.11mm(D)), (W)which × 0.13was m (D)), which was positioned atm. a via the loudspeaker m (H) ˆ 0.11 positioned at a height of 0.83 height of 0.83 m. The loudspeaker on the speaker stand fixed on the computer-controlled X-stage The loudspeaker on the speaker stand was fixed on the was computer-controlled X-stage and the position and positionbywas controlled by a sequencer Hitachi). of fmin to fmax3b shown was the controlled a sequencer (EH-150, Hitachi).(EH-150, The values of f minThe to fvalues max shown in Figure were in 3b and were11set at 4 respectively. kHz and 11 kHz, respectively. setFigure at 4 kHz kHz, Table Table 1 shows the room measurements and volumes used for preparing the database. Each Each room room volume was approximated using architectural drawings. Lecture rooms B and C were similar to each each other other in volume, and D and E were also similar to each other in volume. Table 1. Room measurements and volumes. 
Room            Room Measurement L [m] x W [m] x H [m]    Room Volume [m³]
Gymnasium       26.0 x 39.0 x 11.7                        119 x 10²
Lecture room A  20.1 x 14.7 x 3.3                         971
Lecture room B  14.3 x 17.6 x 3.0                         755
Lecture room C  14.1 x 17.1 x 3.0                         726
Lecture room D  9.0 x 13.6 x 3.0                          367
Lecture room E  8.5 x 13.6 x 3.0                          347
Lecture room F  5.7 x 7.6 x 2.6                           113
Elevator hall   7.8 x 3.6 x 3.0                           70

3. Experimental Results

3.1. Statistical Properties of IPDs in Short-Term Frequency Analysis

The statistical properties of the IPDs of sound pressure in a short-term frequency analysis are discussed next. The short-term IPDs measured in the reverberation rooms fluctuate temporally [11,18]. One room and one distance were tested first, with one loudspeaker located at a distance of 1 m from the robot head.
Figure 7a,b show two IPDs measured under the same conditions. Though they have almost the same properties, they are not identical. As shown in Figure 7c, which is a plot of the difference between the two IPDs, the difference fluctuated randomly over the full range of measured frequencies. Three rooms and several distances were tested next. The standard deviation was introduced for quantizing the fluctuations. The relation between the standard deviation and the ego-centric distance according to the three room sizes is shown in Figure 7d. The standard deviation increased with decreasing room size and increasing distance of the source. This indicates that short-term IPDs are ambiguous, unrepeatable, and unreliable for a distant sound source.

Figure 7. These figures show that the repeatability of short-term IPDs depends on the room volume. The standard deviation was proportional to the ego-centric distance to the loudspeaker: (a) first recording; (b) second recording; (c) difference between (a) and (b); and (d) standard deviation versus distance. (a,b) The same room and recording conditions, recorded during the first and second recordings, respectively.
Next, the repeatability of S.D.AVE calculated using Equation (6) is discussed. Figure 8 shows the relation between S.D.AVE and the ego-centric distance. The standard deviation of S.D.AVE, which was measured five times at each distance, was a maximum of 0.5 degrees at various distances from 0.5 m to 1.5 m. This value was markedly smaller than the fluctuation of the IPDs shown in Figure 7d. This implies that S.D.AVE is more repeatable and reliable than the IPDs.

Figure 8. Relation between S.D.AVE and the ego-centric distance to the loudspeaker. S.D.AVE is repeatable and steady over repeated measurements under the same conditions.

Figure 9 shows that the values of S.D.AVE were proportional to the ego-centric distance in the eight rooms of different volumes listed in Table 1. The slopes of S.D.AVE with respect to the ego-centric distance depended on the room volume. The S.D.AVE values of the short-term IPDs increased with decreasing room size when the distance was held constant. The values of S.D.AVE were compared in the eight rooms at distances of 0.5 m, 1.0 m, and 1.5 m, when the IPDs were measured repeatedly under the same conditions.
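Equation (6) is not reproduced in this excerpt, but S.D.AVE can be read as a two-step statistic: at each frequency bin within the analysis band, take the standard deviation of the IPD over successive short frames, then average those per-bin standard deviations across the band. A sketch under that reading (the band limits are illustrative placeholders, not the paper's values):

```python
import numpy as np

def sd_ave(ipd_frames, freq, f_lo=300.0, f_hi=8000.0):
    """Average standard deviation of short-term IPDs (S.D.AVE).

    ipd_frames: array of shape (n_frames, n_bins) holding the IPD
    [deg] of each 0.2-s frame; freq: bin frequencies [Hz]. The
    per-bin standard deviation over frames is averaged across the
    selected analysis band.
    """
    band = (freq >= f_lo) & (freq <= f_hi)
    sd_per_bin = np.std(ipd_frames[:, band], axis=0)  # fluctuation at each bin
    return float(np.mean(sd_per_bin))                 # average over the band
```

Because the averaging band is a free parameter, it can be tuned to the range most sensitive to room volume, which is the adjustable-frequency-range feature discussed in Section 4.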
Taking account of the repeated measurement fluctuations, these rooms can be categorized into four room sizes for the relatively long (1.5 m) ego-centric distance from the loudspeaker: small (elevator hall), middle (room F), intermediate (rooms D and E), and large (gymnasium and rooms A, B, and C).
The corresponding room volumes were 70 m³, 113 m³, 347–367 m³, and 726 m³ and above, according to the values of S.D.AVE. The room volumes are clearly distinguished in the smaller rooms (rooms D, E, F, and the elevator hall) with a distant sound source, due to sound reflection. Large rooms, of 726 m³ and above in volume (the gymnasium and rooms A, B, and C), were not as precisely distinguished by the present method.

Figure 9. Relation between S.D.AVE and the ego-centric distance to the loudspeaker in the eight rooms of different volumes listed in Table 1.

3.2. Estimation of Test Room Volume

Estimation of the room volume of test room G, which is not included in Table 1, is discussed next. Figure 10 shows a 3-D surface for interpolating the relations between the room volumes, the average standard deviations, and the ego-centric distances that were experimentally obtained in the eight rooms shown in Table 1.
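Because S.D.AVE grows roughly linearly with the ego-centric distance (Figure 8) at a rate set by the room volume (Figure 9), one simple way to approximate the interpolation of Figure 10 is to reduce each calibrated room to a slope and interpolate log-volume against slope. In the sketch below the volumes are a subset of Table 1, but the slope values are hypothetical placeholders; the real mapping would come from the measured database:

```python
import numpy as np

# Hypothetical calibration: (room volume [m^3], slope of S.D.AVE
# versus ego-centric distance [deg/m]). Slopes grow as the volume
# shrinks (Figure 9); these slope values are illustrative only.
CALIB = [(11900.0, 4.0), (971.0, 8.0), (726.0, 10.0),
         (367.0, 16.0), (113.0, 28.0), (70.0, 40.0)]

def estimate_room_volume(sd_ave_deg, distance_m):
    """Interpolate a room volume from S.D.AVE at a known distance."""
    slope = sd_ave_deg / distance_m            # S.D.AVE ~ slope * distance
    slopes = np.array([s for _, s in CALIB])
    log_vol = np.log10([v for v, _ in CALIB])
    order = np.argsort(slopes)                 # np.interp needs ascending x
    # Interpolate log-volume against slope; values outside the
    # calibrated range are clipped to the nearest endpoint.
    return float(10.0 ** np.interp(slope, slopes[order], log_vol[order]))
```

With the stereovision distance estimate fixed at, say, 1 m, a measured S.D.AVE of 10 degrees would fall on the 726 m³ calibration point of this toy table.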
The measurements were conducted in a relatively wide space at the center of each room, in which the desks and chairs had been moved to one wall. The height of rooms A to F and the elevator hall was approximately 3 m. The height of the binaural microphones of the robot head was 0.8 m. The distances to the side walls were longer than the distances to the floor or the ceiling in the measured rooms.

The room volume of test room G was estimated by using the relations shown in Figure 10. The volume of test room G was calculated as 1108 m³ (21 m × 11 m × 4.8 m) from the design layout. The estimated room volume was 1113 m³, and the value of S.D.AVE was 10.3 degrees at a distance of 1 m to the loudspeaker. The estimated room volume was in good agreement with the volume calculated from the room design drawing. The maximum error rate of the room volume estimation was 22%, in the largest room besides the gymnasium, lecture room A, when the IPDs were measured repeatedly in the same room and under the same recording conditions. Since the magnitude of the slope of room volume versus S.D.AVE increases with increasing room volume, the error of the room volume estimation will decrease with decreasing room volume, in the same way as shown for S.D.AVE in Figure 9.
The estimated room volumes approximately agreed with each other in several different rooms with the same actual volumes. Though the room volumes shown in Table 1 are different from each other, large rooms beyond 726 m³ were not as precisely distinguished, as mentioned in Section 3.1. The robot could acoustically categorize test room G as relatively large in terms of room volume.

Figure 10. Relations between the room volumes, the average standard deviations, and the ego-centric distances that were experimentally obtained in the eight rooms shown in Table 1. Room G, which was not included in Table 1, was a test room.

3.3. Effect of Surrounding Obstacles on the Room Volume Estimation

Several obstacles, such as furniture, desks, and chairs, are found in a real room. Sound reflects from their surfaces and diffracts around them. Curtains on windows may also absorb sound energy. The effect of obstacles on the present room volume estimation is discussed in this section.
sound The kind, size, and location ofon furniture in a room vary, so the estimation following two casesinare discussed thekind, The effect of obstacles the present room volume is simple discussed this section.forThe samelocation room. of furniture in a room vary, so the following two simple cases are discussed for the size, and First same room. is the case that no surrounding obstacles are set around the robot for a relatively wide space. The same loudspeaker was located at four positions from A to D at the center of the room, as shown First is the case that no surrounding obstacles are set around the robot for a relatively wide space. in Figure 11. The absolute positions of the loudspeaker are different in the room, and each angle and The same loudspeaker was located at four positions from A to D at the center of the room, as shown ego-centric distance for the loudspeaker is different. The estimated room volumes are shown in in Figure absolute room positions of values the loudspeaker different in the room, and each Figure11. 12. The The estimated volume were almostare constant. The statistical properties of theangle and ego-centric distance for pressure the loudspeaker is different. The estimated roomreflected volumes arethe shown short-term IPDs of sound were not remarkably influenced by the sound from in Figure 12. The obstacles estimated volume values constant. The statistical properties surrounding at room the center of the room.were This almost result supports that relative positions, not of the short-term Robotics 2016, 5, 16 IPDs of sound pressure were not remarkably influenced by the sound reflected11from of 15 the surrounding obstacles at the center of the room. This result supports that relative positions, not loudspeaker andand robot are appropriate for estimating the room absolute positions, positions,between betweenthethe loudspeaker robot are appropriate for estimating thevolume room in the present method. volume in the present method. 
Figure 11. One loudspeaker located at four positions, A to D, in the center of the room. The absolute positions of the loudspeaker in the room differed, and the angles and ego-centric distances to the loudspeaker also differed.

Figure 12. Estimated room volumes. The estimated room volume values were almost constant. That is, the present room volume estimation may not depend on the absolute position of the loudspeaker.

Second, the effect of surrounding obstacles, here a side wall, partitions, and a curtain, on the room volume estimation is discussed.
The sound from the loudspeaker located near the wall, the partitions, or the curtain was measured. The geometry of the robot, loudspeaker, and obstacles is shown in Figure 13. The loudspeaker was located at a distance of 1 m from the robot, in parallel with one side wall. The room volume was estimated while the robot and the loudspeaker were located at various distances from the wall, keeping the pair parallel with the wall. The sounds were additionally measured in the two cases where four partitions (1.2 m (L) × 1.95 m (H)) were placed in front of the wall or all of the partitions were covered with a curtain.

Figure 13. Geometry of the robot, loudspeaker, and obstacles: a wall, partitions, and partitions covered with a thin curtain.

Figure 14 shows: (a) the relation between the distance to the obstacle and the average standard deviation S.D.AVE; and (b) the relation between the distance to the obstacle and the estimated room volume. As shown in Figure 14a, S.D.AVE increased with decreasing distance when the distance to the side wall or the partitions was less than approximately 2 m.
This result means that sound reflected from the surface of obstacles may affect S.D.AVE near the obstacles. The S.D.AVE values versus the distances to the wall or the partitions have similar characteristics. In the case of the curtain, S.D.AVE increased with decreasing distance when the distance to the curtain was less than approximately 1 m, due to sound absorption by the curtain. When the partitions were covered with a thin curtain, the S.D.AVE values were lower than those in the uncovered case. In the same manner, as shown in Figure 14b, the room volume was more underestimated when the loudspeaker and robot were located closer than approximately 2 m to the side wall or the partitions. The room volume was underestimated only when the distance to the curtain was less than approximately 1 m. The room volume value was the same as that at the center of the room when the obstacles were farther than approximately 2 m from the robot and the sound source.
In particular, the effect of the curtain was negligible when the curtain was located farther than approximately 1 m, due to its sound absorption.

Figure 14. Effect of three types of obstacle: a wall, partitions, and the partitions covered with a thin curtain. (a,b) were measured in the same room: (a) the relation between S.D.AVE and the distance to the obstacle; and (b) the estimated room volume versus the distance to the obstacle. Red lines indicate the cases in which the loudspeaker and robot were located far from the side wall in a relatively wide space. The effect of the obstacle increases with decreasing distance for obstacles closer than approximately 2 m.

These results imply that the effect of sound reflected from the surface of an obstacle is not negligible in the present room volume estimation. The room volume is underestimated when an obstacle is located close to the loudspeaker and robot.

4. Discussion

In this section, the results of the present paper are discussed in comparison with the results in the references.
To the best of our knowledge, room volume has never been estimated passively by using a binaural robot head in conventional methods. We adopted the binaural cue of IPDs, which are ITDs transformed from the time domain to the frequency domain. The IPDs are intrinsically the same as the ITDs.

Binaural cues such as ILDs, ITDs, and the interaural coherence (IC) have been investigated for the auditory detection of spatial characteristics (distance, reverberation, and angle of the sound source). These binaural cues are affected by the reverberant energy as a function of the listener's location in the room and the source location relative to the listener [11,22]. Thus, the relations between these binaural cues and the absolute locations of the microphones and the source have been a point of contention. Hartmann et al. measured the ICs of sound pressures using a binaural head and torso in four rooms smaller than 210 m³ and one room of 3800 m³ [22]. They reported that the ICs markedly depended on the distance between the loudspeaker and microphones and on their absolute locations.

A specific feature of the present method is the short time frame for calculating the DFT. The time frame was set at 0.2 s. In conventional methods, the time frame is set longer than 1 s, since a shorter time frame leads to lower performance [10,11].
Though short-term IPDs are ambiguous, unrepeatable, and unreliable for long-distance sound sources, we found that the average standard deviation of the short-term IPDs was more repeatable and reliable than the IPDs themselves, even in reverberation rooms, as mentioned in Section 3.

Another feature is the adjustable frequency range, as shown in Figure 3b in Section 2. The average value of the spectral standard deviation depends on this frequency range. An adequate frequency range, one that is most sensitive to the change of room volume, can be selected. When the frequency range was adjusted adequately, the average standard deviation was proportional to the ego-centric distance to the source, as shown in Figures 8 and 9 in Section 3. Conventional research using the spectral standard deviation provides no solution for setting the frequency range. Georganti et al. reported that the standard deviation values of binaural room transfer functions (BRTFs) followed a distance-dependent behavior [13]. As shown in Figure 10 in Section 3, we found that the average standard deviation of short-term IPDs depended on not only the ego-centric distance to the source but also the room volume. Consequently, the adjusted average standard deviation of short-term IPDs depended only on the relative locations of the source and microphones, not on their absolute locations, as shown in Figure 12 in Section 3. In the present paper, a binaural cue that depends only on the relative locations of the sound source and the microphones, and not on their absolute locations, is realized by using short-term IPDs.

The effect of surrounding obstacles, such as a side wall, partitions, or a curtain, on the average standard deviation of short-term IPDs was discussed in Section 3. The effect of the sound reflected from the surface of an obstacle was not negligible in the present room volume estimation. The room volume was underestimated when the obstacle was located close to the loudspeaker and robot.
Both the microphones mounted on the robot head and the loudspeaker were set at a height of 0.8 m from the floor. The distance from the microphones to the ceiling was approximately 2.2 m in lecture rooms B, C, D, and E. The average standard deviations may be affected by the shorter distance to the floor in these rooms, and not by the greater distance to the ceiling. On the other hand, the room volumes of lecture rooms B and C were twice those of lecture rooms D and E. Corresponding to the room volume, the average standard deviations in rooms D and E were almost twice those in rooms B and C at a distance of 1.5 m, as shown in Figure 9 in Section 3. The heights of the ceiling were 11.7 m, 3.3 m, and 2.6 m in the gymnasium and lecture rooms A and F, respectively. It seems that the height of the ceiling may not greatly affect the average standard deviations in Figure 9, since the distance between the ceiling and the microphones was sufficient. These results indicate that the average standard deviation of short-term IPDs depends on the volume of the room. It should be emphasized that the effect of reflected sound not only from the floor and the ceiling but also from the far side wall may not be negligible for the average standard deviation.

Another realistic problem is background noise. It is very hard to find a room without people talking in it or other noise. This means that there are usually multiple sound sources in a room. If the sounds are sparse over time, such as human voices, the robot can continue to measure S.D.AVE and wait for quiet intervals. The minimum value of S.D.AVE may indicate the room volume, since a distant sound source will give a high S.D.AVE. In the case of continuous noise, the performance of this method will be lower.

Why does the average standard deviation of short-term IPDs depend on the room volume? One of the reasons may be that the fluctuation of waves over time in an acoustical field corresponds to the shape and size of the room.
Such a wave field is made by the synthesis of reflected waves and direct waves. An acoustical field may be similar to the water waves in a pool. The water surface around a floating object in the pool fluctuates in height over time, and the fluctuation may be higher in a small pool than in a large pool. 3-D simulations in room acoustics may provide an exact answer in the near future.

5. Conclusions

The room volumes of reverberation rooms were estimated from short-term IPDs by a binaural humanoid robot. The robot turned its head toward a sound source with binaural audition and recognized the object that was the sound source with robot vision. Then, the ego-centric distance to the object was estimated with stereovision. The average standard deviation was calculated using short-term IPDs obtained from the binaural signals. The relations between the room volumes, the average standard deviations, and the ego-centric distances were experimentally obtained in eight rooms. The volume of a test room was estimated by interpolating these relations. The results are summarized as follows:

(1) Short-term IPDs are ambiguous, unrepeatable, and unreliable for distant sound sources. The average standard deviation of short-term IPDs (S.D.AVE) is more repeatable and reliable than the IPDs, even in reverberation rooms.
(2) The proposed average standard deviation is proportional to the ego-centric distance to the sound source. The slope of S.D.AVE with respect to the ego-centric distance depends on the room volume.
(3) The average standard deviation of short-term IPDs depends on the volume of the room. The effect of the reflected sound not only from the floor and the ceiling but also from the far wall may not be negligible in the average standard deviation.
(4) The average standard deviation of short-term IPDs increases with decreasing distance near surrounding obstacles, such as a side wall, partitions, or a curtain.
Thus, the room volume is underestimated near the obstacles.
(5) For eight rooms having different room volumes, the robot could categorize them into four sizes of rooms, namely, small, middle, intermediate, and large, using the average standard deviation values of short-term IPDs.

Several assumptions were necessary for estimating the ego-centric distance to a sound source by stereovision, even when the object to be recognized was limited to one loudspeaker. The robot could not acoustically identify each of the eight rooms having different volumes. As is true for humans, the robot could only categorize the place where it was as a room of one of four sizes, namely, small, middle, intermediate, and large. The room volume was underestimated when an obstacle was located close to the loudspeaker and robot. This implies the possibility that a robot can acoustically detect the presence of an obstacle by the change of the estimated room volume when the robot moves toward the obstacle. Such an ability may be limited in actual performance. Nevertheless, it will be an additional way for a robot to recognize its surrounding environment, since the robot cannot acoustically localize the position of an obstacle that radiates no sound. Computer simulation of room acoustics will help to explain this phenomenon in more detail. Room volume estimation under multiple sound sources remains as future work.

Author Contributions: R.S. and R.F. conceived and designed the experiments; R.F. performed the experiments; R.S. and R.F. analyzed the data; R.F. contributed programming/making the robot head/analysis tools; and R.S. wrote the paper.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Nakamura, K.; Nakadai, K.; Asano, F.; Hasegawa, Y.; Tsujino, H. Intelligent sound source localization for dynamic environments. In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, MO, USA, 10–15 October 2009; pp.
664–669.
2. Asano, F.; Asoh, H.; Nakadai, K. Sound source localization using joint Bayesian estimation with a hierarchical noise model. IEEE Trans. Audio Speech Lang. Proc. 2013, 21, 1953–1965. [CrossRef]
3. Shabtai, N.R.; Zigel, Y.; Rafaely, B. Estimating the room volume from room impulse response via hypothesis verification approach. In Proceedings of the 2009 IEEE/SP 15th Workshop on Statistical Signal Processing, Cardiff, UK, 31 August–3 September 2009; pp. 717–720.
4. Kuster, M. Reliability of estimating the room volume from a single room impulse response. J. Acoust. Soc. Am. 2008, 124, 982–993. [CrossRef] [PubMed]
5. Kearney, G.; Masterson, C.; Adams, S.; Boland, F. Approximation of binaural room impulse responses. Proc. ISSC 2009, 2009, 1–6.
6. Jeub, M.; Schafer, M.; Vary, P. A binaural room impulse response database for the evaluation of dereverberation algorithms. In Proceedings of the 2009 16th International Conference on Digital Signal Processing, Santorini, Greece, 5–7 July 2009; pp. 1–5.
7. Larsen, E.; Schmitz, C.D. Acoustic scene analysis using estimated impulse responses. Proc. Signals Syst. Comp. 2004, 1, 725–729.
8. Shinn-Cunningham, B.G.; Kopco, N.; Martin, T.J. Localizing nearby sound sources in a classroom: Binaural room impulse responses. J. Acoust. Soc. Am. 2005, 117, 3100–3115. [CrossRef] [PubMed]
9. Shabtai, N.R.; Zigel, Y.; Rafaely, B. Feature selection for room volume identification from room impulse response. In Proceedings of the 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 18–21 October 2009.
10. Vesa, S. Binaural sound source distance learning in rooms. IEEE Trans. Audio Speech Lang. Proc. 2009, 17, 1498–1507. [CrossRef]
11. Georganti, E.; May, T.; Par, S.V.D.; Mourjopoulos, J. Sound source distance estimation in rooms based on statistical properties of binaural signals. IEEE Trans. Audio Speech Lang. Proc.
2013, 21, 1727–1741. [CrossRef]
12. Jetzt, J.J. Critical distance measurement of rooms from the sound energy spectral response. J. Acoust. Soc. Am. 1979, 65, 1204–1211. [CrossRef]
13. Bronkhorst, A.W.; Houtgast, T. Auditory distance perception in rooms. Nature 1999, 397, 517–520. [CrossRef] [PubMed]
14. Kuster, M. Estimating the direct-to-reverberant energy ratio from the coherence between coincident pressure and particle velocity. J. Acoust. Soc. Am. 2011, 130, 3781–3787. [CrossRef] [PubMed]
15. Georganti, E.; Mourjopoulos, J.; Par, S.V.D. Room statistics and direct-to-reverberant ratio estimation from dual-channel signals. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 42–47.
16. Eaton, J.; Moore, A.H.; Naylor, P.A.; Skoglund, J. Direct-to-reverberant ratio estimation using a null-steered beamformer. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 46–50.
17. Hioka, Y.; Niwa, K. Estimating direct-to-reverberant ratio mapped from power spectral density using deep neural network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 149–152.
18. Hu, J.S.; Liu, W.H. Location classification of nonstationary sound sources using binaural room distribution patterns. IEEE Trans. Audio Speech Lang. Proc. 2009, 17, 682–692. [CrossRef]
19. Nix, J.; Hohmann, V. Sound source localization in real sound fields based on empirical statistics of interaural parameters. J. Acoust. Soc. Am. 2006, 119, 463–479. [CrossRef] [PubMed]
20. Kido, K.; Suzuki, H.; Ono, T. Deformation of impulse response estimates by time window in cross spectral technique. J. Acoust. Soc. Jpn. (E) 1998, 19, 249–361. [CrossRef]
21. Shimoyama, R.; Yamazaki, K. Computational acoustic vision by solving phase ambiguity confusion. Acoust. Sci. Tech.
2009, 30, 199–208. [CrossRef]
22. Hartmann, W.M.; Rakerd, B.; Koller, A. Binaural coherence in rooms. Acta Acust. United Acust. 2005, 91, 451–462.
23. Shimoyama, R.; Fukuda, R. Room volume estimation based on statistical properties of binaural signals using humanoid robot. In Proceedings of the 23rd International Conference on Robotics in Alpe-Adria-Danube Region (RAAD), Smolenice, Slovakia, 3–5 September 2014; pp. 1–6.

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).