PAPERS Interpositional Transfer Function for 3D-Sound Generation* F. P. FREELAND, L. W. P. BISCAINHO, AES Member, AND P. S. R. DINIZ Universidade Federal do Rio de Janeiro, LPS–PEE/COPPE & DEL/POLI, 21945-970, Rio de Janeiro, RJ, Brazil The interpolation of head-related transfer functions (HRTFs) for 3D-sound generation through headphones is addressed. HRTFs, which represent the paths between sound sources and the ears, are usually measured for a finite set of source locations around the listener. Any other virtual position requires interpolation procedures on these measurements. In this work a definition of the interpositional transfer function (IPTF) and an IPTF-based method for HRTF interpolation, recently proposed by the authors, are reviewed in detail. The formulation of the interpolation weights is generalized to cope with azimuth measurement steps which change with elevation. It is shown how to obtain a set of reduced-order IPTFs using tools such as balanced model reduction and spectral smoothing. Detailed comparisons between the accuracy and the efficiency of the IPTF-based and the bilinear interpolation methods (the latter using the HRTFs directly) are provided. A practical set of IPTFs is built and applied to a computationally efficient interpolation scheme, yielding results perceptually similar to those attainable by the bilinear method. Therefore the proposed method is promising for real-time generation of spatial sound. 0 INTRODUCTION The interest in so-called 3D-sound techniques [1] has increased recently. Besides cinema and home theater, they can be found in applications ranging from a simple video game to sophisticated virtual reality simulators. Also, there is an increasing level of interaction between users and the virtual environment in which they are immersed. 
Contrasting with cases when the location and/or movement of sound sources is predefined (as in a moving-picture exhibition), which allow off-line digital signal processing, interactive systems require real-time processing [2] and may impose severe restrictions on the computational complexity. The present work addresses the problem of spatial sound generation with reduced computational complexity when the 3D sound for a single virtual source is supplied binaurally through headphones. In this case the direction of arrival of the sound can be controlled by filtering the original monaural signal through a proper set of previously measured head-related transfer functions (HRTFs) [3], [4]. These functions model the paths from the virtual sound source to the listener's ears. In practical systems only a finite number of HRTFs is available. They are usually obtained from their corresponding head-related impulse responses (HRIRs), measured under controlled conditions on a spherical surface of constant radius, only for a predefined set of elevation and azimuth angles with respect to the ears of a target head at its center. In order to place the sound source at an arbitrary position on this surface, it is necessary to estimate the HRTF associated with the desired source location from the set of measured ones via an interpolation scheme. In particular, a realistic rendering of moving sound requires a large set of HRTFs so that no noticeable discontinuity occurs along its path. As a consequence, in real-time systems the computational cost to perform HRTF interpolation must be kept to a minimum without degrading the final perceptual result. In this sense the tradeoff between accuracy and system complexity is critical when designing 3D audio systems.

*Manuscript received 2003 October 8; revised 2004 June 29. This work was presented in part at the AES 22nd International Conference in Espoo, Finland, 2002 June 15–17.

J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
The target here is to render faithful location of a virtual sound source moving rapidly under user control. One could think of applications such as virtual reality simulators, video games, or even films in which the perspective can be controlled by the user. When dealing with a single source,¹ a straightforward way to perform HRTF interpolation is the bilinear approach [7], [1], which consists of computing the HRTF associated with a given point on the reference sphere as a weighted mean of the measured HRTFs associated with the four nearest points that circumscribe the desired point. Consider that a set of HRIRs was measured for fixed angle steps θ_grid and φ_grid relative to azimuth and elevation, respectively. An estimate of the HRIR at an arbitrary coordinate (θ, φ) on the spherical surface, as shown in Fig. 1, can be obtained by

ĥ(k) = (1 − c_θ)(1 − c_φ) h_a(k) + c_θ(1 − c_φ) h_b(k) + c_θ c_φ h_c(k) + (1 − c_θ) c_φ h_d(k)   (1)

where h_a(k), h_b(k), h_c(k), and h_d(k) are the HRIRs associated with the four points nearest to the desired position. The parameters c_θ and c_φ are computed as

c_θ = (θ_C mod θ_grid) / θ_grid,   c_φ = (φ_C mod φ_grid) / φ_grid   (2)

where θ_C and φ_C are the relative angular positions defined in Fig. 1. Low-cost interpolation methods have been the subject of several works [8]–[11], which usually focus on the reduction of HRTF orders. Surprisingly, the way the HRTF interpolation is carried out is not frequently addressed.

¹ Some methods, such as [5], [6], provide alternatives to the HRTF-based model. They split the overall model into two parts, one related to the spectral characteristics, the other to the spatial ones. This interpretation can be more efficient than the direct use of HRTFs when dealing with a multisource case.
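As an illustration (not the authors' code; function and variable names are our own), Eqs. (1) and (2) can be sketched directly:

```python
import numpy as np

def bilinear_hrir(h_a, h_b, h_c, h_d, theta_C, phi_C, theta_grid, phi_grid):
    """Bilinear HRIR interpolation, Eqs. (1)-(2).

    h_a..h_d are the HRIRs at the four grid points surrounding the target;
    theta_C, phi_C are the relative angular positions of the target, and
    theta_grid, phi_grid the azimuth and elevation measurement steps."""
    c_theta = (theta_C % theta_grid) / theta_grid   # Eq. (2)
    c_phi = (phi_C % phi_grid) / phi_grid
    return ((1 - c_theta) * (1 - c_phi) * h_a       # Eq. (1)
            + c_theta * (1 - c_phi) * h_b
            + c_theta * c_phi * h_c
            + (1 - c_theta) * c_phi * h_d)
```

When the target coincides with a grid point the weights collapse to a single 1, so the measured HRIR is returned unchanged.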
In a recent work [12] the authors propose an alternative interpolation procedure, similar in reasoning to the bilinear method but based on a new auxiliary transfer function, called the interpositional transfer function (IPTF), which leads to lower complexity than that of the bilinear interpolation without perceptible losses in modeling accuracy. Since the IPTFs can be represented by models of reduced order without implying substantial deviations in the overall model, the proposed method exhibits reduced computational complexity. For the sake of completeness, the present work starts by reviewing the definition and use of IPTFs for the interpolation of HRTFs [12]. The computation of the interpolation weights is generalized for the case in which the azimuth measurement steps are elevation dependent. Then a practical procedure to obtain a complete set of low-order IPTFs is described. An interpolation scheme based on low-order IPTFs is carefully assessed against the bilinear method.

Fig. 1. Bilinear interpolation—graphical interpretation.

In the next section the IPTF is defined. In Section 2 the IPTF-based interpolation method is presented in detail. Comparisons between the performances of the IPTF-based and the bilinear interpolations are presented in Section 3. In Section 4 the simplification of IPTFs is achieved through balanced model reduction [13] with the aid of a spectral smoothing tool [14]. Section 5 describes and discusses the experimental results. Conclusions are drawn in Section 6.

1 INTERPOSITIONAL TRANSFER FUNCTION (IPTF)

A practical set of HRTFs is composed of pairs of functions, each pair corresponding to one predefined location of the sound source. As seen in Fig. 2(a), the sound paths from the sound source to the listener's nearer and farther ears are referred to as ipsi- and contralateral HRTFs, respectively. Fig. 2(b) shows the alternative representation for the HRTF pair proposed in [11].
In that work the authors show that it is possible to represent the pair of ipsi- and contralateral HRTFs associated with a given position as depicted in Fig. 2, where the ratio between contra- and ipsilateral HRTFs is called the interaural transfer function (IATF).² The ipsilateral HRTF is chosen to be modeled directly since it conveys no information about sound diffraction on the head, thus allowing lower order modeling than the contralateral function [11]. In addition, the IATF is a ratio between two similar transfer functions, which allows for low-order modeling. Therefore the IATF-based configuration yields a more efficient realization of the contralateral HRTF. Intuitively one can expect that HRTFs from near positions to the same ear should be very similar. This observation inspires the idea of defining an IPTF, which allows the calculation of the HRTF associated with a certain position as a cascade of the HRTF from a contiguous position and an incremental error function of reduced order. The IPTF can be defined as

IPTF_{i,f} = HRTF_f / HRTF_i   (3)

where HRTF_i and HRTF_f are the HRTFs associated with the initial and final points, respectively, as shown in Fig. 3. Note that all the vectors were drawn opposite to the direction of the true sound paths to make the generation of HRTF_f starting from HRTF_i clearer. In order to guarantee the stability of the IPTFs it suffices that HRTF_f and HRTF_i have minimum phase. This characteristic does not seem to affect the perception of the direction of arrival of the generated sound [15]. On the other hand, the difference between the time delays of left and right HRTFs—called the interaural time difference (ITD) [9]—is crucial for the perceptual determination of sound direction.

² In the original work the interaural transfer function is called ITF. Here the notation IATF is adopted to avoid ambiguity.

Fig. 2. HRTF modeling. (a) Direct. (b) IATF-based.
As a consequence, the inherent delays of the original HRTFs must be preserved, since they represent the propagation time from the sound source to the ears. In a practical implementation these initial delays could be extracted from the impulse responses and saved, whereas the remaining responses could be transformed into their minimum-phase versions. Then the saved delays would be associated back to the minimum-phase transfer functions. An alternative approach consists of deriving the minimum-phase versions of HRTF_f and HRTF_i, and then evaluating the relative delays of the original and minimum-phase versions based only on the linear component of their excess phase [2]. The availability of an efficient approximation for the IPTFs is assumed here. This topic will be specifically discussed in Section 4.

2 IPTF-BASED INTERPOLATION OF HRTFs

The IPTF-based interpolation follows the same basic principle as the bilinear method. However, in the former method just three HRTFs are used for interpolating another one. In addition, two of them can be indirectly approximated using reduced-order IPTFs, which further contributes to reducing the computational cost of the synthesis procedure. Due to the three-function implementation, the spherical surface must be divided into triangular regions. Once the triangular regions are chosen, the HRTF for a given point P will be calculated using the HRTFs related to the three vertices of that region. Fig. 4 depicts two different situations in the same quadrangular region of the reference sphere grid, for which measured HRTFs are available at points A, B, C, and D, whereas the HRTFs associated with point P are to be estimated. From Fig. 4(a) one can see that, given the arbitrary position P to be described, the vertices of the P region are denoted by A, B, and C, with A and B having the same elevation.

Fig. 3. IPTF-computed HRTF_f—graphical interpretation.
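The delay-plus-minimum-phase split discussed in Section 1 can be sketched as follows. This is our own illustration, not the paper's code: the threshold-based delay estimator is one possible heuristic, and real-cepstrum folding is one standard way to obtain a minimum-phase version; the paper does not prescribe a specific algorithm.

```python
import numpy as np

def split_delay_min_phase(h, threshold_db=-30.0):
    """Split an HRIR into an initial delay (in samples) and a minimum-phase
    remainder. The delay is taken as the first sample whose magnitude exceeds
    a threshold relative to the peak (an assumed heuristic). The minimum-phase
    part is obtained by real-cepstrum folding, which preserves the magnitude
    response while concentrating the energy at the start of the response."""
    mag = np.abs(h)
    delay = int(np.argmax(mag > mag.max() * 10.0 ** (threshold_db / 20.0)))
    n = len(h)
    log_mag = np.log(np.abs(np.fft.fft(h)) + 1e-12)
    cep = np.real(np.fft.ifft(log_mag))   # real cepstrum of the magnitude
    w = np.zeros(n)                       # cepstral folding window
    w[0] = 1.0
    w[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        w[n // 2] = 1.0
    h_min = np.real(np.fft.ifft(np.exp(np.fft.fft(w * cep))))
    return delay, h_min
```

An IPTF between two minimum-phase HRTFs can then be evaluated on the DFT grid simply as `np.fft.fft(h_min_f) / np.fft.fft(h_min_i)`, per Eq. (3).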
The coordinates of the nearest point (A in this example) will be associated with the original HRTF, whereas the coordinates of the other two points will be related to the IPTF-approximated HRTFs. Now the HRTF associated with P can be estimated through a weighted mean of the HRTFs associated with A, B, and C. It is important to note that all the points inside the shaded region are interpolated using the HRTFs related to these three points, of which two are approximations and the nearest one is an original HRTF. Another example is shown in Fig. 4(b), where another point P is inside a different ABC region. Notice that the vertices were relabeled to allow referring to the vertices with the same elevation in the shaded region as A and B, like before. In this new figure the nearest vertex is C, which will be associated with an original HRTF.

Fig. 4. Details of reference spherical surface. A, B, C, and D have known HRTFs; P is the position requiring an interpolated approximation to the HRTF.

According to the assignment of these vertices, an HRTF related to P can be described by

HRTF_P = w_A HRTF_A + w_B HRTF_B + w_C HRTF_C   (4)

where

w_C = Δφ / Δφ_grid   (5)

w_B = Δθ / Δθ_grid   (6)

w_A + w_B + w_C = 1   (7)

and the required angular distances are shown in Fig. 5 and defined by

Δφ = φ_P − φ_A   (8)

Δθ = θ_P − θ_X   (9)

Δθ_A = θ_P − θ_A   (10)

Δθ_AC = θ_C − θ_A   (11)

Δθ_grid = θ_B − θ_A   (12)

Δφ_grid = φ_C − φ_A.   (13)

Since

(Δθ_A − Δθ) / Δθ_AC = Δφ / Δφ_grid   (14)

one can get

Δθ = Δθ_A − (Δφ / Δφ_grid) Δθ_AC.   (15)

Denoting by I the vertex nearest to P, and by F_θ and F_φ the other two vertices that form the triangular region around P, according to Fig. 5, the IPTF-based interpolation can be written as

HRTF_P = α HRTF_I + β HRTF_{F_θ} + γ HRTF_{F_φ} = HRTF_I (α + β IPTF_{I,F_θ} + γ IPTF_{I,F_φ})   (16)

where IPTF_{I,F_θ} and IPTF_{I,F_φ} are defined by Eq. (3). The coefficients α, β, and γ are chosen according to the assignment of the labels I, F_θ, and F_φ. As an example, in Fig. 5 A, B, and C are equivalent to I, F_θ, and F_φ, respectively. Therefore Eq. (16) becomes

HRTF_P = w_A HRTF_A + w_B HRTF_B + w_C HRTF_C = HRTF_A (w_A + w_B IPTF_{A,B} + w_C IPTF_{A,C}).   (17)

It is worth mentioning that this triangular configuration is closely related to the formulation of the vector-based amplitude panning method [16] for virtual source positioning, in the specific case of three loudspeakers. This issue will not be discussed here. The block diagram of Fig. 6 illustrates the interpolation procedure for both the left and right channels (identified by superscripts l and r, respectively). In this diagram the HRTF and IPTF blocks represent the transfer functions without their inherent delays. In a simpler approximation procedure, the delay D_I^{l,r} of HRTF_I^{l,r} is merged with those of the IPTFs, d_{I,F_θ}^{l,r} and d_{I,F_φ}^{l,r}, so that the inherent delay and the minimum-phase version of the desired HRTF can be interpolated separately [3]. The resulting delay

Δ^{l,r} = α D_I^{l,r} + β (D_I^{l,r} + d_{I,F_θ}^{l,r}) + γ (D_I^{l,r} + d_{I,F_φ}^{l,r})   (18)

= (α + β + γ) D_I^{l,r} + β d_{I,F_θ}^{l,r} + γ d_{I,F_φ}^{l,r}   (19)

= D_I^{l,r} + β d_{I,F_θ}^{l,r} + γ d_{I,F_φ}^{l,r}   (20)

can be implemented using fractional-delay filters [17]. It is worth mentioning that the use of low-order IPTFs is the core of the simplification attainable by the IPTF-based interpolation. The gain in computational efficiency that results from applying a triangular configuration directly to the measured HRTFs is trivial and, with no further consideration, amounts roughly to 25% compared to the quadrangular geometry associated with the bilinear method. In fact, three-point linear combination methods were attempted earlier [2]. In addition to the triangular formulation, the possibility of designing IPTFs using low-order models is what makes the proposed method appealing with regard to computational complexity. This issue is discussed in greater detail in the following sections.
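The weight computation of Eqs. (5)-(7), with Δθ eliminated through Eq. (15), can be sketched as follows (an illustrative implementation with our own names, not the authors' code):

```python
def triangle_weights(theta_P, phi_P, theta_A, phi_A, theta_B, theta_C, phi_C):
    """Triangular interpolation weights w_A, w_B, w_C of Eqs. (5)-(7).

    A and B share the same elevation (phi_A); C sits on the adjacent
    elevation ring. Delta-theta is obtained from Eq. (15), so the
    auxiliary point X of Eq. (9) is not needed explicitly."""
    d_phi = phi_P - phi_A                  # Eq. (8)
    d_theta_A = theta_P - theta_A          # Eq. (10)
    d_theta_AC = theta_C - theta_A         # Eq. (11)
    d_theta_grid = theta_B - theta_A       # Eq. (12)
    d_phi_grid = phi_C - phi_A             # Eq. (13)
    d_theta = d_theta_A - (d_phi / d_phi_grid) * d_theta_AC   # Eq. (15)
    w_C = d_phi / d_phi_grid               # Eq. (5)
    w_B = d_theta / d_theta_grid           # Eq. (6)
    w_A = 1.0 - w_B - w_C                  # Eq. (7)
    return w_A, w_B, w_C
```

At each vertex the weights reduce to a single 1, so the interpolation is exact on the grid.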
3 IPTF-BASED VERSUS BILINEAR METHOD

In this section the performance of the IPTF-based interpolation method is compared to that of the bilinear method by evaluating their frequency responses for a defined range of azimuth and elevation angles. Initially only differences in geometry and in weight calculations are considered; that is, the IPTF-based interpolation takes the IPTFs in Eq. (3) as the direct ratio between the corresponding HRTFs without any simplification, which can be considered as a triangular interpolation applied to the HRTFs, in the form

HRTF_P = α HRTF_I + β HRTF_{F_θ} + γ HRTF_{F_φ}.   (21)

Fig. 5. Angular distances used to obtain parameters w_A, w_B, and w_C.

Fig. 6. Block diagram of IPTF-based interpolation procedure.

The underlying idea behind the tests consists in confronting the spatial frequency response surfaces (SFRSs) [18], [19] of the results from both interpolation methods. Each SFRS shows, for a given frequency, the dependence of the magnitude response on the changes in elevation and azimuth. The equations of the bilinear interpolation are originally defined for equally spaced grids in both θ and φ. In order to provide a fair comparison between the two methods, the range of evaluation of the bilinear interpolation was restricted to −20° < φ < 20°, where the available set of HRTFs is spaced uniformly (see Table 1). All simulations in this work used a set of 128-sample-long measured HRIRs sampled at 44.1 kHz, originated from [3]. They are available for elevation steps of 10° and variable azimuth steps, as shown in Table 1. Figs. 7–10 provide a visual comparison of the performances of the IPTF and bilinear methods. The first and second columns show the SFRSs related to the bilinear (|H_B(θ, φ, f)|) and the IPTF-based (|H_IPTF(θ, φ, f)|) interpolations, respectively. Different magnitude scales were adopted according to the dynamic range at each frequency bin.
Only left-ear HRTFs are considered, since the opposite functions can be inferred by symmetry. The frequency bins are chosen according to [18]. To improve readability, the third column illustrates the difference in dB between the two SFRSs in the same row, given by

Ψ(θ, φ, f) = 20 log_10 ( |H_IPTF(θ, φ, f)| / |H_B(θ, φ, f)| ).   (22)

This time a lighter region means small differences (near zero) between correspondent SFRSs. The same range was chosen for all figures in this column in order to make the comparison between different frequency bins easier. In spite of the occurrence of more noticeable differences around the azimuth angle of 90° (the region more strongly affected by head diffraction), the equivalence between the two methods is confirmed by the small differences observed. It can also be noticed that the worst cases are related to low-energy regions of the SFRSs. Of course, the mere use of the IPTFs, computed as the direct ratio between the corresponding HRTFs without any simplification, does not yield an improvement in computational complexity. Since the purpose of this work is to propose efficient spatial sound generation, the next step is to investigate whether the IPTF model can be reduced, as expected.

Table 1. Azimuth measurement steps.

Elevation (deg)   Azimuth resolution θ_grid (deg)
−20 to 20         5
−30 and 30        6
−40 and 40        45/7
50                8
60                10
70                15
80                30

Assuming that each HRTF directly used in the interpolation procedure requires N multiplications per sample (mps), the complexity of the bilinear interpolation is

C_bilinear = 4N + 12 mps.   (23)

On the other hand, if each approximated IPTF requires M mps, the IPTF-based interpolation scheme requires

C_IPTF-based = 2M + N + 7 mps.   (24)

Therefore the latter exhibits lower computational complexity if

M < (3N + 5) / 2.   (25)

For the HRTFs used in the simulations [3] (which have N = 128), that means M ≤ 194.
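The bookkeeping of Eqs. (23)-(25) is easy to check numerically (a small sketch with our own function names):

```python
def c_bilinear(n):
    """Eq. (23): bilinear interpolation cost in multiplications per sample."""
    return 4 * n + 12

def c_iptf(n, m):
    """Eq. (24): IPTF-based interpolation cost, with M mps per IPTF."""
    return 2 * m + n + 7

def iptf_wins(n, m):
    """Eq. (25): the IPTF-based scheme is cheaper iff M < (3N + 5) / 2."""
    return m < (3 * n + 5) / 2

# N = 128 (measured HRIR length): any M up to 194 gives a cheaper scheme.
assert iptf_wins(128, 194) and not iptf_wins(128, 195)
```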
On the other hand, it is also possible to take advantage of efficient approximations of HRTFs [9], which allow for model-order reduction. For example, assuming N = 33, which is the lowest order attained in [9], making M < 52 would suffice to turn the IPTF-based interpolation less expensive computationally. It should be noticed that since the IPTF-based interpolation procedure utilizes a triangular grid, the number of regions visited by the virtual sound source is increased. As a result, if any special procedure is necessary at the edge crossings,³ it can result in additional computational load. Of course, at high sampling rates (such as 44.1 kHz) and for not too fast movements, this overhead becomes irrelevant. The remaining issue is: how much can the IPTFs be simplified while keeping the approximation errors acceptable? The next section describes some tools to approximate a complete set of IPTFs.

³ More on this topic in Section 5.

4 ORDER REDUCTION OF IPTFs

4.1 Balanced-Model Truncation

In [8] the authors apply balanced-model truncation or reduction (BMR) to decrease the order of HRTFs. Since the measured HRTFs are in FIR form, the specialized BMR algorithm that approximates FIR filters by low-order IIR filters, described in [13], is used there. In the present work the IPTF is the ratio of two HRTFs, and thus it is originally an IIR system. Therefore the simpler formulation of [13] cannot be used. Nevertheless, the balanced-model reduction method, in its general formulation, is applicable to the IIR case and is found to be effective in reducing the order of the IPTFs. In the following, a review of balanced-model reduction is presented. The state-space representation of a system in which the grammians of its observability and controllability matrices (see the discussion leading to Eqs. (29) and (30) below) are both diagonal and equal is called a balanced model.
The main advantage of this type of model is the possibility of ordering the system states based on their significance to the representation of the system characteristics. This is a very important feature, which allows model simplification by direct truncation, that is, by discarding its "less important" states. Consider the state-space representation of an IPTF,

x(k + 1) = A x(k) + b u(k)
y(k) = c^T x(k) + d u(k)   (26)

where A, b, c, and d are the transition matrix, input vector, output vector, and feedforward coefficient, respectively; x(k) is the state vector at instant k, and u(k) and y(k) are the input and output signals at instant k, respectively.

Fig. 7. Comparison of SFRSs—I. (a) Bilinear. (b) IPTF-based. (c) Difference in dB.

Define

W_o = [c  A^T c  ⋯  (A^T)^i c  ⋯]^T   (27)

W_c = [b  A b  ⋯  A^i b  ⋯]   (28)

for i = 1, 2, 3, …, as the observability and controllability matrices of the system, respectively. Their respective grammian matrices are given by

P = W_c W_c^T   (29)

Q = W_o^T W_o.   (30)

If rank[A] = J, the rank-J Hankel matrix

H = W_o W_c = [h_1 h_2 ⋯ ; h_2 h_3 ⋯ ; ⋮ ⋮ ⋱]   (31)

Fig. 8. Comparison of SFRSs—II. (a) Bilinear. (b) IPTF-based. (c) Difference in dB.

can be constructed, where

h_i = c^T A^{i−1} b,   i = 1, 2, 3, …   (32)

are the samples of the system impulse response. Now H can be decomposed as

H = V_1 Γ V_2   (33)

where Γ is the matrix containing the singular values σ_1, …, σ_J of H along the first J positions of its main diagonal, by convention arranged in descending order of their magnitudes [20]. Let Σ be the upper left J × J submatrix of Γ. Since the square of each singular value of H, σ_i^2, is equal to a corresponding eigenvalue of the product PQ [21], [13], denoted by λ_i, for i = 1, 2, 3, …, then

PQ = K Σ^2 K^{−1}.   (34)

Fig. 9. Comparison of SFRSs—III. (a) Bilinear. (b) IPTF-based. (c) Difference in dB.

Hence one can show that the grammians P_b and Q_b of the desired balanced system must be such that

P_b = Q_b = Σ = diag(√λ_1, …, √λ_J) = diag(σ_1, …, σ_J).   (35)

Then the transformation that turns the original system (A, b, c, and d) into its balanced form (A_b, b_b, c_b, and d) through

A_b = T^{−1} A T,   b_b = T^{−1} b,   c_b = T^T c   (36)

is

T = S^{−1} U Σ^{1/2}   (37)

where S results from decomposing the matrix Q into the form

Q = S^T S.   (38)

Matrices U and Σ can be computed from the following decomposition:

S P S^T = U Σ^2 U^T   (39)

where

U U^T = I.   (40)

Note that Eq. (39) holds since S P S^T = S P Q S^{−1} is symmetric and has the same eigenvalues as the product PQ. Consider the case where, according to some error criterion, only the j < J larger singular values are judged to be sufficient to describe the IPTF. In this case matrix Σ can be written in the form

Σ = [Σ_1 0 ; 0 Σ_2]   (41)

where the 0 are null matrices and

Σ_1 = diag(σ_1, …, σ_j)   (42)

Σ_2 = diag(σ_{j+1}, …, σ_J).   (43)

Accordingly the balanced-model matrices can be decomposed into

A_b = [A_{1,1} A_{1,2} ; A_{2,1} A_{2,2}],   b_b = [b_1 ; b_2],   c_b = [c_1 ; c_2]   (44)

such that (A_{1,1}, b_1, c_1, and d) are the state-space matrices of the subsystem of order j that can approximate the original IPTF accurately. The last step, if desired, consists of converting the state-space representation back to the transfer function form.

4.2 Spectral Smoothing

The design of IPTFs through the balanced-model reduction method has some peculiarities. Since the target frequency response results from the ratio of two HRTFs, it may include a number of sharp spectral peaks and valleys at some locations. As a consequence, numerical problems may arise during the simplification procedure. On the other hand, psychoacoustics shows that frequency discrimination in human hearing varies with frequency in such a way that it becomes coarser as the frequency increases.
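The balanced-model machinery of Section 4.1 can be illustrated numerically on a small random stable system (our own sketch, not the paper's code or data; the truncated Gramian sums stand in for the infinite series of Eqs. (27)-(30)):

```python
import numpy as np

rng = np.random.default_rng(1)
J = 4
A = rng.standard_normal((J, J))
A *= 0.8 / np.max(np.abs(np.linalg.eigvals(A)))   # enforce stability
b = rng.standard_normal((J, 1))
c = rng.standard_normal((J, 1))

# Grammians as truncated sums, Eqs. (29)-(30)
P = np.zeros((J, J))
Q = np.zeros((J, J))
Ak = np.eye(J)
for _ in range(600):
    P += Ak @ b @ b.T @ Ak.T
    Q += Ak.T @ c @ c.T @ Ak
    Ak = A @ Ak

# Hankel matrix from impulse-response samples h_i = c^T A^(i-1) b,
# Eqs. (31)-(32); its squared singular values equal the eigenvalues of PQ.
h = [float(c.T @ np.linalg.matrix_power(A, i) @ b) for i in range(300)]
H = np.array([[h[i + j] for j in range(150)] for i in range(150)])
sigma = np.linalg.svd(H, compute_uv=False)[:J]
lam = np.sort(np.real(np.linalg.eigvals(P @ Q)))[::-1]
assert np.allclose(sigma ** 2, lam, rtol=1e-4)    # Eq. (34)

# Balancing transformation, Eqs. (37)-(39)
S = np.linalg.cholesky(Q).T                       # Q = S^T S, Eq. (38)
sig2, U = np.linalg.eigh(S @ P @ S.T)             # S P S^T = U Sigma^2 U^T
order = np.argsort(sig2)[::-1]                    # descending singular values
sig, U = np.sqrt(sig2[order]), U[:, order]
T = np.linalg.inv(S) @ U @ np.diag(np.sqrt(sig))  # T = S^-1 U Sigma^(1/2)
Ti = np.linalg.inv(T)
assert np.allclose(Ti @ P @ Ti.T, np.diag(sig), atol=1e-8)   # P_b, Eq. (35)
assert np.allclose(T.T @ Q @ T, np.diag(sig), atol=1e-8)     # Q_b, Eq. (35)
```

Truncation then keeps only the leading j states of the balanced realization (A_b, b_b, c_b, d), per Eqs. (41)-(44).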
To some extent the logarithmic Western musical scale is a consequence of this phenomenon [22], [23]. Furthermore, measured HRIRs tend to be less reliable at higher frequencies where, due to the sound attenuation, measurements have lower signal-to-noise ratios. Considering the aforementioned facts, one can think of applying a spectral smoother to the HRTF response in such a way that the degree of smoothness increases toward the higher frequencies of the IPTFs. For instance, the smoothed spectral response at any frequency f can be computed as the average spectral response within the range delimited by the frequencies

f_1 = f / √K,   f_2 = f √K   (45)

which are geometrically equidistant from f by a factor of √K. In the DFT domain with the dc frequency indexed as n = 0, if f is the frequency associated with the spectral bin n, the corresponding indices of the approximate limit frequencies can be obtained by

n_1 = ⌈n / √K⌉,   n_2 = ⌊n √K⌋   (46)

respectively, denoting ⌈·⌉ as the lowest integer greater than or equal to (·) and ⌊·⌋ as the greatest integer lower than or equal to (·). The value K = 2^a gives a window width of a octaves. For example, using K = 2^{1/3} one defines a one-third-octave range [9], [24]. However, a fixed value for K may not smooth the high-frequency range of the HRTFs sufficiently without degrading the low-frequency range. Then a linearly varying value K(n), given by

K(n) = 1 − n (1 − K) / (L/2 − 1)   (47)

was defined, where L is the FFT length. A constant K = 2^{1/6} was chosen, resulting in a one-sixth-octave window width at n = L/2 − 1. Two smoothing methods are considered, namely, the square-magnitude [9] and the complex frequency response [14] methods, where in each case an average is computed on a sliding window.

Fig. 10. Comparison of SFRSs—IV. (a) Bilinear. (b) IPTF-based. (c) Difference in dB.
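The variable-width window of Eqs. (45)-(47), together with the averaging rules of Section 4.2, can be sketched as follows (names are ours; for the geometric mean the principal complex root is taken, a choice the formulation leaves implicit):

```python
import math
import numpy as np

def window_limits(n, L, K_end=2 ** (1 / 6)):
    """Bin limits of Eq. (46) under the linearly varying K(n) of Eq. (47):
    no smoothing at dc (K(0) = 1), widening to a one-sixth-octave window
    at n = L/2 - 1 for K_end = 2**(1/6)."""
    K = 1 - n * (1 - K_end) / (L / 2 - 1)
    return math.ceil(n / math.sqrt(K)), math.floor(n * math.sqrt(K))

def smooth_sqmag(Ho, n1, n2):
    """Eq. (48): RMS of the magnitude response over the window."""
    return np.sqrt(np.mean(np.abs(Ho[n1:n2 + 1]) ** 2))

def smooth_complex(Ho, n1, n2):
    """Eq. (49): arithmetic mean of the complex response (keeps phase)."""
    return np.mean(Ho[n1:n2 + 1])

def smooth_geometric(Ho, n1, n2):
    """Eq. (50): geometric mean of the complex response over the window."""
    w = Ho[n1:n2 + 1]
    return np.prod(w) ** (1.0 / (n2 - n1 + 1))
```

For a flat response all three smoothers leave the spectrum unchanged, and at dc the window collapses to the single bin n = 0.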
In the square-magnitude method, smoothing is attained by computing the arithmetic average of the squared magnitude response in the window range and taking the square root of the result, that is,

|H_s(n)| = √[ (1/(n_2 − n_1 + 1)) Σ_{l=n_1}^{n_2} |H_o(l)|^2 ]   (48)

where H_o(n) and H_s(n) are, respectively, the original and the smoothed frequency responses. The complex frequency response method takes the phase information into account by computing the mean over the complex frequency response, which can be expressed as

H_s(n) = (1/(n_2 − n_1 + 1)) Σ_{l=n_1}^{n_2} H_o(l).   (49)

These two methods are based on the arithmetic mean, which treats peak and valley occurrences differently. An alternative solution is to replace the arithmetic mean by the geometric mean in Eq. (49). Thus Eq. (49) becomes

H_s(n) = [ Π_{l=n_1}^{n_2} H_o(l) ]^{1/(n_2 − n_1 + 1)}.   (50)

In each case n_1 and n_2 are expressed by Eq. (46). When using spectral smoothing in the design of low-order IPTFs it was verified that the smoother design given by Eq. (50) is more suitable. Smoothing was applied to the spectrum of each HRTF during the minimum-phase transformation algorithm [4], [24], prior to the log-magnitude operation (see [25, p. 784]).

4.3 Order-Reduction Algorithm

The techniques mentioned in the preceding are the basic tools to build a complete set of low-order IPTFs. In the end these IPTFs are used together with measured HRTFs to implement an efficient moving-sound system. The procedure to obtain a complete set of IPTFs consists of three stages:

• Each HRTF on the reference spherical surface grid is transformed into its smoothed minimum-phase version.
• For each smoothed HRTF one IPTF is defined with respect to each of its four contiguous HRTFs.
• Balanced-model reduction is applied to each previously defined IPTF in order to obtain a reduced-order model by which the IPTF can be well represented.

In fact, this strategy implies redundancy, since two IPTFs can be defined for each pair of contiguous HRTFs, depending on the choice of HRTF_f and HRTF_i in Eq. (3).
Obviously only one low-order IPTF per HRTF pair needs to be saved. However, the redundant functions allow the choice of the function that best matches the response of the original HRTF ratio. This flexibility also helps to deal with cases that exhibit numerical problems.

5 EXPERIMENTAL RESULTS

In Section 3 a graphical comparison between the performances of the IPTF-based and the bilinear interpolation methods was provided. In the following, the results of IPTF-based interpolation with reduced-order IPTFs are compared with those obtained through IPTFs with no order reduction. Once more, several SFRSs and their corresponding differences are used for visualization. This time elevation angles from −40° to 90° are considered, thus covering the entire region with measured HRTFs. Figs. 11–14 refer to left-ear HRTFs; right-ear surfaces can be obtained by simply inverting the azimuth axis. Each row contains the SFRSs related to interpolation through nonreduced-order (a) and tenth-order (b) IPTFs, as well as their differences in dB (c). The scales and frequency bins were chosen according to the criteria explained in Section 3. One can observe again that corresponding SFRSs are quite similar, exhibiting small differences. Even though some discontinuities appear in these differences, they can be considered negligible when compared to the overall energy at each frequency bin, thus attesting to the reliability of low-order IPTF modeling. Like before, the worst cases tend to occur around the azimuth angle of 90°. A complete set of tenth-order IIR IPTFs was obtained via the procedure described in Section 4.3, and all the examples of model-order reduction shown refer to this set. Each IPTF in this set corresponds to M = 21, satisfying the efficiency bound for N = 33 from Section 3 (M < 52). Thus the proposed interpolation method is more computationally efficient than the bilinear method.
Since the complexity of the HRTFs around the reference sphere varies with the location, one could consider using IPTFs of variable orders—lower order wherever possible and higher order wherever needed. As a consequence the processing time would vary with the position. In real-time systems the worst-case time should be considered. Based on the aforementioned results, it is feasible to use the IPTF-based interpolation in a simple moving-sound system restricted only to angular positioning of a single sound source. In this case the system receives successive samples of the monaural signal as well as the information related to the virtual source location on the reference sphere. Then, based on considerations about timing and position accuracy, the reference points where the virtual source must be located are defined. When performing HRTF interpolation through the IPTF-based method, the following procedure is followed. For each point P in the source trajectory, the nearest measured HRTF is found, as well as the two IPTFs that complete the HRTF triangle around P. Then the HRTF at P is interpolated according to Eq. (16) and used to process the audio signal.

Fig. 11. Comparison of SFRSs—I. (a) Using nonreduced IPTFs. (b) Using tenth-order IPTFs. (c) Difference in dB.

Finally it must be remembered that, using the proposed technique, switching between successive interpolated HRTFs involves the switching of IIR filters whenever the source crosses the boundary between triangles. With no special treatment this introduces spurious transients that may cause audible artifacts in the synthesized signal. IIR filters involve the feedback of past outputs. When the filter coefficients change, the saved outputs related to the old filter make no more sense to the new one, thus creating the transients.
This problem has been addressed successfully in [26], which proposes the redundancy of old and new filter states as a way to minimize these effects. This strategy should be included in the system. Discussions on other time-varying issues, such as the rate of position updates, will be addressed in future work.

A system with the characteristics described here was implemented. Performance evaluation through informal listening tests revealed that using the IPTF-based interpolation with the designed set of tenth-order IPTFs yields good results, comparable to those of a similar system based on bilinear interpolation.

Fig. 12. Comparison of SFRSs—II. (a) Using nonreduced IPTFs. (b) Using tenth-order IPTFs. (c) Difference in dB.

Considering N = 33 [9] and tenth-order IPTFs (M = 21), from Eq. (23) the bilinear method requires 144 mps, whereas from Eq. (24) the proposed method achieves 82 mps, saving up to 62 mps, or more than 40% of the computational effort required by the bilinear method. Notice, however, that comparing the triangular interpolations performed directly on HRTFs and through IPTFs, the gain in computational complexity is 2 × (N − M), which means 2 × 12 = 24 mps for the values suggested here. Therefore the use of IPTFs alone saves up to 24 mps over 106 mps, or more than 22%.⁴

⁴ Some audio samples can be found at www.lps.ufrj.br/sound/journals/JAES04.

Fig. 13. Comparison of SFRSs—III. (a) Using nonreduced IPTFs. (b) Using tenth-order IPTFs. (c) Difference in dB.

6 CONCLUSION

A recently proposed method for the interpolation of head-related transfer functions for binaural audio applications [12] was reviewed and extended. It combines a triangular interpolation scheme with an auxiliary function called the interpositional transfer function, which can lead to efficient implementations.
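The operation counts quoted above can be cross-checked with a few lines of arithmetic. The per-method counts are taken as reported from Eqs. (23) and (24), which are not reproduced in this excerpt; only the savings and percentages are recomputed here.

```python
N, M = 33, 21            # HRIR taps and IPTF coefficients (tenth-order IIR)
bilinear_mps = 144       # multiplications per sample, Eq. (23), as reported
iptf_mps = 82            # multiplications per sample, Eq. (24), as reported
tri_hrtf_mps = 106       # triangular interpolation directly on HRTFs

saving_total = bilinear_mps - iptf_mps       # 62 mps vs. the bilinear method
saving_iptf_only = 2 * (N - M)               # 24 mps: IPTFs vs. triangular HRTFs

print(100.0 * saving_total / bilinear_mps)       # ≈ 43%: "more than 40%"
print(100.0 * saving_iptf_only / tri_hrtf_mps)   # ≈ 22.6%: "more than 22%"
```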
Extensive simulations compared the IPTF-based and bilinear interpolations in terms of accuracy and computational complexity. A direct comparison between the bilinear interpolation and the IPTF method indicates that the latter is more efficient in terms of computational complexity. The IPTF method saves more than 40% of the operations required by the bilinear one, thanks to the use of a triangular geometry and also of a reduced-order model to represent the IPTFs, which alone accounts for up to 22% of the savings. For the order-reduction task, the balanced model reduction and spectral smoothing methods were proposed. The IPTF-based interpolation method was used in a system that generates moving spatial sound, and the results were encouraging. Further investigations include proposing more effective ways to simplify the IPTFs and carrying out systematic listening tests to confirm the objective and informal subjective results obtained here.

7 ACKNOWLEDGMENT

The authors wish to thank the Brazilian agency CNPq for funding this research and B. Gardner and K. Martin for kindly providing the set of HRTFs used in this work.

8 REFERENCES

[1] D. R. Begault, 3D Sound for Virtual Reality and Multimedia (Academic Press, Cambridge, MA, 1994).
[2] J. M. Jot, O. Warusfel, and V. Larcher, “Digital Signal Processing Issues in the Context of Binaural and Transaural Stereophony,” presented at the 98th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 43, p. 396 (1995 May), preprint 3980.
[3] B. Gardner and K. Martin, “HRTF Measurements of a KEMAR Dummy-Head Microphone,” Tech. Rep. 280, MIT Media Lab., Cambridge, MA (1994 May).
[4] C. I. Cheng and G. H. Wakefield, “Introduction to Head-Related Transfer Functions (HRTFs): Representations of HRTFs in Time, Frequency, and Space,” J. Audio Eng. Soc., vol. 49, pp. 231–249 (2001 Apr.).
[5] J. Chen, B. D. Van Veen, and K. E. Hecox, “External Ear Transfer Function Modeling: A Beamforming Approach,” J. Acoust.
Soc. Am., vol. 92, pp. 1933–1944 (1992 Oct.).
[6] V. Larcher, O. Warusfel, J. M. Jot, and J. Guyard, “Study and Comparison of Efficient Methods for 3D Audio Spatialization Based on Linear Decomposition of HRTF Data,” presented at the 108th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 48, p. 351 (2000 Apr.), preprint 5097.
[7] L. Savioja, J. Huopaniemi, T. Lokki, and R. Väänänen, “Creating Interactive Virtual Acoustic Environments,” J. Audio Eng. Soc., vol. 47, pp. 675–705 (1999 Sept.).
[8] J. Mackenzie, J. Huopaniemi, V. Välimäki, and I. Kale, “Low-Order Modeling of Head-Related Transfer Functions Using Balanced Model Truncation,” IEEE Signal Process. Lett., vol. 4, pp. 39–41 (1997 Feb.).
[9] J. Huopaniemi, N. Zacharov, and M. Karjalainen, “Objective and Subjective Evaluation of Head-Related Transfer Function Filter Design,” J. Audio Eng. Soc., vol. 47, pp. 218–239 (1999 Apr.).
[10] C. Avendano, R. O. Duda, and V. R. Algazi, “Modeling the Contralateral HRTF,” in Proc. AES 16th Int. Conf. (Rovaniemi, Finland, 1999 Apr.), pp. 313–318.
[11] G. Lorho, J. Huopaniemi, N. Zacharov, and D. Isherwood, “Efficient HRTF Synthesis Using an Interaural Transfer Function Model,” in Proc. EUSIPCO’2000 (Tampere, Finland, 2000 Sept.), pp. 2213–2216.
[12] L. W. P. Biscainho, F. P. Freeland, and P. S. R. Diniz, “Using Inter-Positional Transfer Functions in 3D-Sound,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (Orlando, FL, 2002 May).
[13] B. Beliczynski, I. Kale, and G. D. Cain, “Approximation of FIR by IIR Digital Filters: An Algorithm Based on Balanced Model Reduction,” IEEE Trans. Signal Process., vol. 40, pp. 532–542 (1992 Mar.).

Fig. 14. Comparison of SFRSs—IV. (a) Using nonreduced IPTFs. (b) Using tenth-order IPTFs. (c) Difference in dB.

[14] P. D. Hatziantoniou and J. N.
Mourjopoulos, “Generalized Fractional-Octave Smoothing of Audio and Acoustic Responses,” J. Audio Eng. Soc., vol. 48, pp. 259–280 (2000 Apr.).
[15] A. Kulkarni, S. K. Isabelle, and H. S. Colburn, “On the Minimum-Phase Approximation of Head-Related Transfer Functions,” in Proc. IEEE WASPAA’95 (New Paltz, NY, 1995 Oct.).
[16] V. Pulkki, “Virtual Sound Source Positioning Using Vector Base Amplitude Panning,” J. Audio Eng. Soc., vol. 45, pp. 456–466 (1997 June).
[17] T. I. Laakso, V. Välimäki, M. Karjalainen, and U. K. Laine, “Splitting the Unit Delay,” IEEE Signal Process. Mag., vol. 13, pp. 30–59 (1996 Jan.).
[18] C. I. Cheng and G. H. Wakefield, “Spatial Frequency Response Surfaces (SFRS’s): An Alternative Visualization and Interpolation Technique for Head-Related Transfer Functions (HRTF’s),” in Proc. AES 16th Int. Conf. (Rovaniemi, Finland, 1999 Apr.), pp. 147–159.
[19] K. Hartung, J. Braasch, and S. J. Sterbing, “Comparison of Different Methods for the Interpolation of Head-Related Transfer Functions,” in Proc. AES 16th Int. Conf. (Rovaniemi, Finland, 1999 Apr.), pp. 319–329.
[20] G. Strang, Linear Algebra and Its Applications, 3rd ed. (Saunders, Orlando, FL, 1988).
[21] K. Glover, “All Optimal Hankel-Norm Approximations of Linear Multivariable Systems and Their L∞-Error Bounds,” Int. J. Control, vol. 39, pp. 1115–1193 (1984).
[22] J. G. Roederer, The Physics and Psychophysics of Music–An Introduction, 3rd ed. (Springer, New York, 1994).
[23] D. Butler, The Musician’s Guide to Perception and Cognition (Schirmer, New York, 1992).
[24] J. Huopaniemi and J. O. Smith, III, “Spectral and Time-Domain Preprocessing and the Choice of Modeling Error Criteria for Binaural Digital Filters,” in Proc. AES 16th Int. Conf. (Rovaniemi, Finland, 1999 Apr.), pp. 301–312.
[25] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1989).
[26] V. Välimäki and T. I.
Laakso, “Suppression of Transients in Time-Varying Recursive Filters for Audio Signals,” in Proc. IEEE ICASSP’98, vol. 6 (Seattle, WA, 1998 May), pp. 3569–3572.

THE AUTHORS

Fábio P. Freeland was born in Rio de Janeiro, Brazil. He received an Electrical Engineering degree from the Federal University of Rio de Janeiro (UFRJ) in 1999 and an M.Sc. degree from COPPE/UFRJ in 2001. Since 1998 Mr. Freeland has been with the Signal Processing Laboratory (LPS/COPPE-UFRJ), where he developed his undergraduate project on audio restoration and his M.Sc. dissertation on spatial audio. He is currently pursuing a Ph.D. degree at COPPE/UFRJ and teaches at UERJ. His research interests include digital audio processing, auditory scene synthesis, and audio effects, among other topics.

Luiz Wagner Pereira Biscainho was born in Rio de Janeiro, Brazil. He received an M.Sc. degree in 1990 and a D.Sc. degree in 2000, both from the Federal University of Rio de Janeiro (UFRJ), Brazil. From 1985 to 1993 Dr. Biscainho worked in the telecommunications industry. Since 1993 he has been with the Department of Electronics and Computer Engineering of the Polytechnic (DEL/POLI) at UFRJ, where he is currently an associate professor of telecommunications. His main research interests include digital signal processing of music and audio. He is a member of the AES and the IEEE.

Paulo S. R. Diniz was born in Niterói, Brazil. He received an Electronics Engineering degree (cum laude) from the Federal University of Rio de Janeiro (UFRJ), Brazil, in 1978, an M.Sc. degree from COPPE/UFRJ in 1981, and a Ph.D. degree from Concordia University, Montreal, Que., Canada, in 1984, all in electrical engineering. Since 1979 Dr. Diniz has been with the Department of Electronic Engineering (the undergraduate department) at UFRJ.
He has also been with the Program of Electrical Engineering (the graduate studies department), COPPE/UFRJ, since 1984, where he is a professor. He has served as undergraduate course coordinator and as chair of the Graduate Department. He is one of the three senior researchers and coordinators of the National Excellence Center in Signal Processing. From 1991 to 1992 Dr. Diniz was a visiting research associate in the Department of Electrical and Computer Engineering, University of Victoria, B.C., Canada. He also holds a docent position at the Helsinki University of Technology, Finland. From 2002 January to 2002 June he was a Melchor Chair Professor in the Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN. His teaching and research interests include analog and digital signal processing, adaptive signal processing, digital communications, wireless communications, multirate systems, stochastic processes, and electronic circuits. He has published several refereed papers in these areas. Dr. Diniz was the technical program chair of the 1995 MWSCAS, held in Rio de Janeiro, Brazil. He has served on the technical committees of several international conferences, including ISCAS, ICECS, EUSIPCO, and MWSCAS. He has served as vice president for Region 9 of the IEEE Circuits and Systems Society and as chair of the IEEE DSP Technical Committee. He has also served as associate editor for the following journals: IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing from 1996 to 1999, IEEE Transactions on Signal Processing from 1999 to 2002, and Circuits, Systems and Signal Processing from 1998 to 2002. He was a distinguished lecturer of the IEEE Circuits and Systems Society from 2000 to 2001. Currently he is serving as a distinguished lecturer of the IEEE Signal Processing Society, and he received the 2004 Education Award from the IEEE Circuits and Systems Society.
He has also received the Rio de Janeiro State Scientist Award from the governor of Rio de Janeiro. Dr. Diniz wrote the books Adaptive Filtering: Algorithms and Practical Implementation, 2nd ed. (Kluwer Academic Publishers, 2002) and, with E. A. B. da Silva and S. L. Netto, Digital Signal Processing: System Analysis and Design (Cambridge University Press, Cambridge, UK, 2002). He is a fellow of the IEEE, cited for fundamental contributions to the design and implementation of fixed and adaptive filters and for electrical engineering education.