Design and Evaluation of Computer-Simulated Spatial Sound

S.H. Kurniawan¹, A.J. Sporka, V. Nemec and P. Slavik

¹ Department of Computation, UMIST, PO Box 88, Manchester, M60 1QD, UK

1 Introduction

The potential of Virtual Reality (VR), which has been an important and exciting field of HCI for many years, for people with disabilities has slowly been recognised. VR systems have been applied in the areas of education, training, rehabilitation, communication and access to information technology for people with disabilities (Colwell, Petrie, Kornbrot, Hardwick and Furner, 1998).

There is a wide range of applications of VR for blind and visually impaired users. These applications share a common feature: the substitution or enhancement of visual information with information in other modalities, i.e., audio or haptic/tactile/kinaesthetic. One important application of VR is to train blind users to navigate and move around in real environments, a practice known as orientation and mobility (O&M) training (Inman and Loge, 1999). O&M training is important because it helps blind people develop the skills and techniques to overcome travel difficulties created by blindness and to maximise their ability to move around in different environments, familiar or unfamiliar, independently, safely and confidently (The Royal Blind School, 2003).

Conventional O&M training involves instructing a blind trainee to approach a wall, or bringing a small obstacle near to the trainee's face to demonstrate the sound variation caused by the presence of an object (known as obstacle perception training), or taking trainees to various environments to train them to detect the sound variation caused by various factors, e.g., the floor texture, the room size, the location of the closest obstacle, etc. (Seki and Ito, 2003). This method is very time consuming and may pose some danger to the trainees (e.g., when training them to cross a busy road). This is an area where VR and virtual sound may be beneficial: rather than being exposed to a real environment, the trainee can stay in a virtual environment and learn to orientate and move around based on the virtual sounds he or she hears. However, this also means that the acoustic system used for the training must be able to produce sounds that are natural to the trainee's ears.

This paper reports on the design and evaluation of one component of an O&M training system for blind and visually impaired people: a spatial audio system capable of modelling the acoustic response of closed environments of varying sizes and surface textures (e.g., a small carpeted room vs. a large plastered hallway). Previous work on room perception includes Suzuki and Martens' (2001) study of subjects' ability to determine the presence of walls made of different materials in a virtual environment.

2 Spatial Sounds in Real and Virtual Environments

2.1 Spatial Sound

Sound is essentially the vibration of particles of a solid, liquid or gaseous medium around their equilibrium positions. If this vibration falls roughly in the range of 20 Hz to 20,000 Hz, the sound is called audible sound. The vibration of the particles causes small local adiabatic variations of pressure in the medium, referred to as the acoustic pressure. These pressure variations are propagated through the environment by means of waves of acoustic pressure.
In the real world, with obstacles between the source and the receiver, only part of the sound wave travels straight from the source to the receiver (and is hence called the direct sound). Signal 1 in Figure 1 is an example of a direct sound. The shape of the signal of a direct sound is unchanged, except for its intensity (due to the energy conservation law) and its temporal displacement or delay (due to the finite phase velocity of the sound waves). Other parts of the sound are reflected or diffracted by obstacles before reaching the receiver. In this case, what the receiver receives is the acoustic response of the environment to the original sound emitted by the source. The combination of the direct and indirect sound is called the spatial or spatialised sound, i.e., sound that carries the reverberation of the environment. Figure 1 illustrates the propagation of sound in a closed environment.

Figure 1. Propagation of sound in a closed environment. S = sound source, R = sound receiver. The diagram on the right-hand side is the corresponding IR diagram.

For any configuration of sound source, sound receiver and obstacles in the environment, it is possible to represent the acoustic response using an acoustic impulse response (IR) diagram, as shown on the right-hand side of Figure 1. Briefly, the IR describes the intensity and the time of arrival of all echoes of the emitted sound received by the sound receiver.

The phase velocity of the sound waves (i.e., the speed of sound) is different for different media (Kuttruff, 1979). For air, its magnitude is approximately:

c = 331.4 + 0.6θ m/s, where θ is the temperature in degrees Centigrade (1)

As a consequence of the energy conservation law, the amplitude of the sound pressure decreases with distance from the sound source. Besides the well-known "1/r² rule" (the intensity of sound decays with the square of the distance from the sound source), some of the sound energy is also dissipated as heat into the medium during the propagation itself.

Sound reflection occurs when a wave hits the surface of an obstacle, as depicted in Figure 2.a. In this case, a reflected wave originates from the place of impact. This reflected wave carries only a part of the energy of the original wave, as energy is lost during the interaction with the obstacle. The amount of energy lost is determined by the absorption coefficient α, which depends on the material of the surface, the frequency of the sound, and the angle of impact β. Sound diffraction occurs when the wavelength of the sound is similar to the dimensions of the obstacles in the environment; in this phenomenon, the sound is deviated around an obstacle, as Figure 2.b shows.

Figure 2. Indirect sounds: a. sound reflection; b. sound diffraction.
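To make these relationships concrete, the following minimal Python sketch computes the delay and relative intensity of a direct sound from equation (1) and the 1/r² rule. The function names are ours, purely for illustration; they are not part of the system described in this paper.

```python
def speed_of_sound(theta_celsius):
    """Approximate speed of sound in air, equation (1)."""
    return 331.4 + 0.6 * theta_celsius

def direct_sound(distance_m, theta_celsius=20.0):
    """Delay and relative intensity of the direct sound.

    The delay follows from the finite phase velocity; the intensity
    follows the 1/r^2 rule (relative to the intensity at 1 m).
    Dissipation into the medium is ignored in this sketch.
    """
    c = speed_of_sound(theta_celsius)
    return distance_m / c, 1.0 / distance_m ** 2

# A source 5 m away at 20 degrees Centigrade: the direct sound
# arrives after roughly 14.6 ms at 1/25 of the reference intensity.
delay_s, intensity = direct_sound(5.0)
print(f"delay = {delay_s * 1000:.1f} ms, relative intensity = {intensity:.3f}")
```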
2.2 The Human Auditory System

The human auditory system is capable of detecting the reverberation in the received sound and analysing the spatial information about the surrounding environment contained in it. This characteristic necessitates the incorporation of the acoustic response of the environment when creating virtual sound in VR systems.

As a consequence of the sound's interactions with the obstacles in an environment, the sound arriving at a listener contains multiple echoes of the original sound (a combination of sounds with varying delays and magnitudes of attenuation), together with information about their directions of arrival. These echoes can be divided into three major parts:

• The first audible echo received is interpreted by the human auditory system as the direct sound. Its direction of arrival provides the most important information about where the sound source is located; its intensity provides information about the distance of the sound source.

• The early echoes (arriving within roughly the first 100 ms) are processed separately by the human auditory system. Analysing their incoming directions and intensities allows the positions of the nearest obstacles in the environment to be determined (Funkhouser, Jot and Tsingos, 2002).

• The late reverberation gives overall information about the environment (the size of the environment, the textures of the floor and walls, etc.).

2.3 Spatial Sound in Virtual Environments

Modelling sound propagation in an environment can be done through an appropriate wave equation (Kuttruff, 1979). However, the solution of this equation is very complex even for simple scenes and virtually impossible for more complicated scenes where many obstacles are involved. Therefore, alternative ways to describe sound waves are needed. Generally, there are three approaches to solving the equation: numerical, geometrical and statistical.

2.3.1 The Numerical Approaches

These approaches give the solution of the wave equation by reducing the problem to estimating energy transfers among finite elements specified within the modelled scene. There are two methods within the numerical approaches:

• The finite and boundary element methods give the solution of the wave equation by spatial subdivision of the scene into distinct elements, for which the wave equation is expressed as a discrete set of linear equations. The underlying computation of these methods is very complex, and consequently the calculation requires a large memory capacity when performed on a computer. The computational complexity also increases with the frequency of the sound being modelled. Therefore, when precise estimation is required, these methods are suitable only for low-frequency energy transfers within simple scenes.

• The waveguide mesh is a regular array of elements, each connected to its neighbours by unit delays. Each element describes the sound energy of a finite part of the modelled environment, and each sound source and receiver is represented by one element of the mesh. The simulation itself is iterative: in each iteration, each element updates its energy status based on the previous energy status of all its neighbours, following the energy conservation law. The IR is then described by the development of the sound energy in the receiver element (Lokki, Savioja, Vaananen, Huopaniemi and Takala, 2002). A simplified sketch of this update scheme is given below.
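The following deliberately simplified Python sketch illustrates the neighbour-update idea on a 2D grid. It is our own illustration, not the implementation used by any of the cited systems: a real waveguide mesh propagates signed pressure samples through unit delays, whereas this sketch only redistributes scalar energy while conserving its total.

```python
import numpy as np

def mesh_ir(size=(20, 20), src=(5, 5), rcv=(14, 14), steps=200):
    """Simplified waveguide-mesh-style iteration on a 2D grid.

    Each cell holds the sound energy of a finite part of the room and,
    in every iteration, passes a quarter of it to each of its four
    neighbours; energy that would cross a wall is reflected back, so
    the total energy stays constant. The IR is read off as the energy
    observed at the receiver cell over time.
    """
    energy = np.zeros(size)
    energy[src] = 1.0           # impulse emitted by the source element
    ir = []
    for _ in range(steps):
        q = 0.25 * energy       # a quarter of each cell's energy
        new = np.zeros_like(energy)
        new[1:, :] += q[:-1, :]   # energy moving "down"
        new[:-1, :] += q[1:, :]   # energy moving "up"
        new[:, 1:] += q[:, :-1]   # energy moving "right"
        new[:, :-1] += q[:, 1:]   # energy moving "left"
        # reflect the portions that would leave through the walls
        new[0, :] += q[0, :]
        new[-1, :] += q[-1, :]
        new[:, 0] += q[:, 0]
        new[:, -1] += q[:, -1]
        energy = new
        ir.append(energy[rcv])    # energy at the receiver element
    return ir

ir = mesh_ir()
print(f"energy at the receiver after {len(ir)} steps: {ir[-1]:.4f}")
```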
2.3.2 The Geometrical Approaches

These approaches assume that the sound wavelengths are smaller than the size of the obstacles and are therefore valid only for sounds of high frequencies; however, their lower computational costs render their use feasible in VR systems. The key idea common to all geometrical approaches is the simulation of sound wave propagation through an investigation of the behaviour of its infinitesimal parts, the sound rays. The audibility of a sound source at the position of a listener is investigated by searching for rays that represent audible echoes of the emitted sound. There are two methods within the geometrical approaches:

• Ray tracing, similar to the well-known and widely used method of the same name in 3D computer graphics, is based on the concept of tracing sound rays. Each ray of the initial set of rays emanating from S is traced and compared with the positions of the sound receivers R1 and R2. Figure 3 shows the ray tracing process. The tracing is stopped when a limit is reached (e.g., the maximum order of reflection has been exceeded, the energy of the ray has fallen below a minimum level, or the ray has hit the receiver). This method is easy to implement but carries the risk of undersampling the space: as shown in Figure 3, due to an insufficient initial set of rays, R2 was wrongly judged not to receive any sound.

Figure 3. Ray tracing.

• Beam tracing is based on the concept of tracing sound beams. A beam is a cone defined by its apex (the sound source) and its base (a closed contour), as illustrated in Figure 4.a; it consists of all the rays that originate in the beam's apex and intersect the beam's base. Using this method, larger areas of the space are searched at once, as illustrated in Figure 4.b. The calculation of the reflection of a beam is more computationally expensive than that of a ray, but fewer beams are required to reach the same precision of calculation as the ray tracing method. Nonetheless, since the number of beams increases exponentially with the order of reflection, this method is usable only for the early echoes.

Figure 4. a. A single beam; b. Beam tracing. S = sound source, R1, R2 = sound receivers.

Consequently, it is impossible to use the geometrical approaches alone to simulate long reverberations, for which reflections of high orders (> 30) need to be taken into account, in reasonable time.

2.3.3 The Statistical Approaches

The human auditory system can only distinguish the early echoes individually; the late reverberation phase mainly provides information about the size of the environment. Therefore, it is possible to model the late reverberation phase using a statistical model in which the echoes contributing to the simulated IR are randomly generated. The requirements for these echoes are (Kuttruff, 1979):

• the temporal density of the reflections increases with the square of time;

• the intensity drops exponentially with time.

The statistical approaches are employed in most current spatial sound systems intended for use in VR (Funkhouser, Min and Carlbom, 1999).

2.3.4 The Convolution

The process of applying the IR to the sound signal for spatialisation is usually modelled as the convolution of the sound signal and the IR. The convolution of two discrete signals in digital signal processing is usually defined as:

(f₁ ∗ f₂)[t] = Σ_{u=0}^{t} f₁[u] · f₂[t − u], where f₁ and f₂ are the input signals (2)

As the acoustic IR is a list of echoes of the emitted sound, the convolution of the emitted sound signal with the acoustic IR can be thought of as the superposition of delayed and attenuated copies of the emitted sound signal.

3 The Spatial Sound System

The main function of the designed spatial audio system is to perform off-line (non-real-time) simulations of the sound propagation between a source and a receiver, taking into account the acoustic response of the environment. The system employs a hybrid sound propagation model consisting of a beam tracing algorithm (for the phase of the early echoes) and a statistical model (for the late reverberation).
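To make Sections 2.3.3 and 2.3.4 concrete, the following Python sketch generates a statistical late-reverberation tail with the two required properties and applies a sparse IR to a dry signal as a superposition of delayed, attenuated copies (equation (2)). It is a minimal illustration under stated assumptions, not the system's implementation; all function names and parameter values are ours.

```python
import numpy as np

FS = 44100  # sampling rate (Hz), matching the 44.1 kHz PCM stimuli

def statistical_late_ir(duration_s, decay_s=1.0, base_density=1000.0, seed=0):
    """Randomly generated late-reverberation echoes.

    The temporal density of the echoes grows with the square of time
    and their intensity decays exponentially, as required above.
    Returns a list of (time, amplitude) pairs.
    """
    rng = np.random.default_rng(seed)
    echoes, t = [], 0.1  # the tail starts after the early echoes (~100 ms)
    while t < duration_s:
        density = base_density * t ** 2          # echoes per second at time t
        t += rng.exponential(1.0 / density)      # Poisson-like arrival gaps
        amplitude = np.exp(-6.9 * t / decay_s)   # about -60 dB at t = decay_s
        echoes.append((t, amplitude * rng.choice((-1.0, 1.0))))
    return echoes

def spatialize(dry, echoes):
    """Superpose delayed, attenuated copies of the dry signal: the
    convolution of the signal with a sparse IR, equation (2)."""
    out = np.zeros(len(dry) + int(max(t for t, _ in echoes) * FS) + 1)
    for t, a in echoes:
        i = int(t * FS)
        out[i:i + len(dry)] += a * dry
    return out

# Direct sound, one early echo, and a synthetic statistical tail
# applied to 0.1 s of noise standing in for a dry recording.
dry = np.random.default_rng(1).standard_normal(4410)
ir = [(0.0, 1.0), (0.015, 0.5)] + statistical_late_ir(1.5)
wet = spatialize(dry, ir)
```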
The process of modelling the acoustic response of the environment to the emitted sound consists of two fundamental steps, as illustrated in Figure 5:

• The IR is computed to simulate the propagation of the sound from the source to the receiver in the environment. This step may also be considered an enumeration of the sound paths from the source to the receiver along which the echoes of the original sound are transmitted.

• The sound signal representing the acoustic activity of the sound source is convolved with the IR generated in the previous step. The result is the spatialised audio signal.

Figure 5. The process of simulating a spatialised audio signal.

4 User Evaluation

To test how well the algorithms used in the spatialisation process perform, the system was evaluated by a representative group of its prospective users.

4.1 The Stimuli Development

The sounds were recorded using a Philips SBC 3050 stereophonic microphone and a SoundBlaster 16 compatible sound card. Seven distinct sounds (guitar, flute, mobile phone ringing, human voice, cane tapping, glass tinkling and handclapping) were recorded in three different room conditions, coded small (S), medium (M) and large (L). The characteristics of these rooms are listed in Table 1. These recorded stimuli were simply called the recorded scenes.

The other stimuli are called the simulated scene stimuli. To create these stimuli, dry sounds (pure sounds, without the effects of the environment) were recorded in a music studio with a very short reverberation (less than 0.05 s) using an AKG C1000S microphone and a Midiman Delta 1010 sound card. The sounds were stored separately as a set of 44.1 kHz PCM files. Then, the effects of the environments were added using the designed spatial audio system. The addition process was performed in two steps. Firstly, a model of each real room was created in the ASE format (a 3D graphics file format). Secondly, a batch of separate task description files for each dry sound was combined with each room model. This batch was finally processed by the system to produce the simulated scene stimuli.

Table 1. The approximate characteristics of the real scenes

Environment     Dimensions        Surfaces                  Reverberation length
Bedroom (S)     4 × 4 × 2.5 m     Plaster, carpet, wood     0.2 s
Hallway (M)     8 × 3 × 5 m       Plaster, marble           1 s
Stairway (L)    12 × 12 × 10 m    Plaster, marble, tiles    3.5 s

4.2 The Evaluation Method

Nine registered blind participants (8 male, 1 female; mean age 29.3 years, S.D. 6.76) listened through headphones to 42 sound files (7 types of sound × 3 environments for each of the simulated and recorded scene groups). The sequence of the sounds was controlled so that no adjacent sounds shared any similarity (e.g., if the first sound was a simulated flute in a small room, then the next sound had to be from a recorded scene, and could be neither a flute sound nor in a small room). Each participant performed the evaluation with no other participant around. Each listened to one of two sets of sounds; the second set presented the sounds of the first set in reverse order. After listening to each sound, the participants answered three questions in writing:

1. What sound was it?

2. Was that sound more likely to be from a small (S), medium (M) or large (L) room?

3. Was that room more likely to be a real room (R) or simulated using a computer (C)?

4.3 Results and Analysis

The first question was intended to encourage the participants to listen carefully; its answers were therefore not analysed in this paper. The answers to the second question were scored 0, 0.5 or 1: a correct answer was scored 1; a score of 0.5 was given when the difference between the correct and the given answer was one room size (e.g., a participant answered S for a sound in an M room); and when the difference was two sizes (a participant answered S for L or vice versa), a score of 0 was given. The answers to the third question were scored 0 (wrong) or 1 (correct). This scoring scheme is sketched below.
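For concreteness, the scoring scheme can be written as a pair of small Python helpers (hypothetical functions, shown only to make the scheme explicit):

```python
def score_room_size(answered, correct):
    """Room-size question: 1 for a correct answer, 0.5 when off by
    one size category, 0 when off by two (S vs. L)."""
    sizes = "SML"
    distance = abs(sizes.index(answered) - sizes.index(correct))
    return {0: 1.0, 1: 0.5, 2: 0.0}[distance]

def score_nature(answered, correct):
    """Real-vs-simulated question: 1 if correct, 0 otherwise."""
    return 1.0 if answered == correct else 0.0

assert score_room_size("M", "M") == 1.0
assert score_room_size("S", "M") == 0.5
assert score_room_size("S", "L") == 0.0
assert score_nature("R", "C") == 0.0
```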
The one-way Analysis of Variance (ANOVA) revealed that, across all participants, the sums of scores for the room-size question were not significantly different between the recorded and simulated scene groups, with F(1,376) = 0.03, p = 0.862. This result suggests that the designed system was able, to a certain degree, to simulate various room sizes. Further analysis, displayed in graphical form in Figure 6.a, shows that the difference was not significant for any room size.

Figure 6.b shows that it was easiest to recognise the recorded scenes in the small room conditions: the participants were correct on 83% of occasions. In the medium room conditions, the participants seemed unsure whether the scenes were real or simulated, indicated by scores only slightly above 50% in both scene groups (assuming that random guessing carries a 50% probability of a correct answer). Finally, in the large room conditions, the participants were quite successful in recognising the simulated scenes (they were correct on 71% of occasions) but were less able to recognise the real scenes.

Focusing on the simulated scenes, it can be inferred from these results that the designed system was unable to simulate the large room conditions perfectly (hence the simulated and recorded scenes could be easily distinguished). By the same argument, the designed system appears able, to a certain degree, to simulate the small and medium room conditions quite well (hence the participants were unsure whether the scenes were recorded or simulated).

Figure 6. a. The % of correct answers for the room-size question. b. The % of correct answers for the nature-of-the-scenes question.
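For readers who wish to reproduce this kind of test, the following sketch shows how such a one-way ANOVA can be computed with SciPy. The score arrays are synthetic placeholders (the per-trial scores are not published here); only the reported statistics F(1,376) = 0.03 and p = 0.862 come from the actual study.

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic stand-ins for the per-trial room-size scores (0, 0.5 or 1):
# 9 participants x 21 trials per scene group gives 189 scores per group
# and hence the reported degrees of freedom, F(1, 376).
rng = np.random.default_rng(0)
recorded = rng.choice([0.0, 0.5, 1.0], size=189, p=[0.1, 0.2, 0.7])
simulated = rng.choice([0.0, 0.5, 1.0], size=189, p=[0.1, 0.2, 0.7])

f_stat, p_value = f_oneway(recorded, simulated)
df_within = len(recorded) + len(simulated) - 2
print(f"F(1,{df_within}) = {f_stat:.2f}, p = {p_value:.3f}")
```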
5 Conclusions and Further Work

The results of the user study indicate that the algorithms behind the designed spatial audio system were able to simulate the environments to a certain extent. The system was able, to a certain degree, to simulate the sound variation in different room sizes, as indicated by the lack of a significant difference between the sums of scores in the simulated and recorded scene groups. However, it seems that when the system simulated the large room condition, the difference between the reverberation of the simulated and recorded scenes was noticeable. Based on these results, we can speculate that the designed audio system is potentially useful as part of an O&M training suite for blind and visually impaired people, preferably to simulate sounds in small or medium room conditions.

Integrating this system into the training suite and testing the suite with its prospective users is the immediate follow-up work. Further studies are also needed to investigate how users distinguish between various room sizes and between simulated and recorded scenes.

6 Acknowledgement

We would like to thank Dominik Pecka for his willingness to lend us the music studio Fjördström, Prague, and to operate its equipment during the recording of the dry sounds.

7 References

Colwell C, Petrie H, Kornbrot D, Hardwick A, Furner S (1998) Haptic virtual reality for blind computer users. In: Proceedings of the 3rd International ACM Conference on Assistive Technologies (ASSETS '98). ACM Press, Marina del Rey, USA, pp 92-93

Funkhouser T, Jot JM, Tsingos N (2002) Sounds Good to Me! Computational Sound for Graphics, Virtual Reality, and Interactive Systems. SIGGRAPH 2002 Course Notes [online]. Available at: http://www.cs.princeton.edu/gfx/papers/funk02course.pdf

Funkhouser T, Min P, Carlbom I (1999) Real-time acoustic modelling for distributed virtual environments. In: Computer Graphics Proceedings, Annual Conference Series, SIGGRAPH 99, Los Angeles, CA, pp 365-374

Inman DP, Loge K (1999) Teaching orientation and mobility skills to blind children using simulated acoustical environments. HCI 2: 1090-1094

Kuttruff H (1979) Room Acoustics, 2nd ed. Applied Science Publishers Ltd., London, UK

Lokki T, Savioja L, Vaananen R, Huopaniemi J, Takala T (2002) Creating interactive virtual auditory environments. IEEE Computer Graphics & Applications 22: 49-57

Seki Y, Ito K (2003) Study on acoustical training system of obstacle perception for the blind. In: Craddock, McCormack, Rielly and Knops (eds), Assistive Technology - Shaping the Future (Proceedings of the 7th European Conference for the Advancement of Assistive Technology (AAATE), Dublin, Ireland, 31 Aug – 3 Sept 2003), pp 461-465

Suzuki K, Martens WL (2001) Subjective evaluation of room geometry in multichannel spatial sound reproduction: Hearing missing walls in simulated reverberation. In: Proceedings of the 12th International Conference on Artificial Reality and Telexistence (ICAT'01) [online]. Available at: http://vrsj.t.u-tokyo.ac.jp/ic-at/papers/01090.pdf

The Royal Blind School (2003) Orientation and mobility [online]. Available at: http://www.royalblindschool.org.uk/Departments/Mobility.htm