Interpositional Transfer Function for 3D-Sound Generation

advertisement
PAPERS
Interpositional Transfer Function for
3D-Sound Generation*
F. P. FREELAND, L. W. P. BISCAINHO, AES Member, AND P. S. R. DINIZ
Universidade Federal do Rio de Janeiro, LPS–PEE/COPPE & DEL/POLI, 21945-970, Rio de Janeiro, RJ, Brazil
The interpolation of head-related transfer functions (HRTFs) for 3D-sound generation
through headphones is addressed. HRTFs, which represent the paths between sound sources
and the ears, are usually measured for a finite set of source locations around the listener. Any
other virtual position requires interpolation procedures on these measurements. In this work
a definition of the interpositional transfer function (IPTF) and an IPTF-based method for
HRTF interpolation, recently proposed by the authors, are reviewed in detail. The formulation
of the interpolation weights is generalized to cope with azimuth measurement steps which
change with elevation. It is shown how to obtain a set of reduced-order IPTFs using tools such
as balanced model reduction and spectral smoothing. Detailed comparisons between the
accuracy and the efficiency of the IPTF-based and the bilinear interpolation methods (the
latter using the HRTFs directly) are provided. A practical set of IPTFs is built and applied to
a computationally efficient interpolation scheme, yielding results perceptually similar to those
attainable by the bilinear method. Therefore the proposed method is promising for real-time
generation of spatial sound.
0 INTRODUCTION
The interest in so-called 3D-sound techniques [1] has
increased recently. Besides cinema and home theater, they
can be found in applications ranging from a simple video
game to sophisticated virtual reality simulators. Also,
there is an increasing level of interaction between users
and the virtual environment in which they are immersed.
Contrasting with cases when the location and/or movement of sound sources is predefined (as in a moving picture exhibition), which allow off-line digital signal processing, interactive systems require real-time processing
[2] and may impose severe restrictions on the computational complexity.
The present work addresses the problem of spatial
sound generation employing reduced computational complexity when the 3D sound for a single virtual source is
supplied binaurally through headphones. In this case the
direction of arrival of the sound can be controlled by filtering the original monaural signal through a proper set of
previously measured head-related transfer functions
(HRTFs) [3], [4]. These functions model the paths from
the virtual sound source to the listener’s ears.
In practical systems only a finite number of HRTFs is
available. They are usually obtained from their corre*Manuscript received 2003 October 8; revised 2004 June 29.
This work was presented in part at the AES 22nd International
Conference in Espoo, Finland, 2002 June 15–17.
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
sponding head-related impulse responses (HRIRs) measured under controlled conditions on a spherical surface of
constant radius only for a predefined set of elevation
angles ␾ and azimuth angles ␪ with respect to the ears of
a target head at its center. In order to place the sound
source at an arbitrary position on this surface, it is necessary to estimate the HRTF associated with the desired
source location from the set of measured ones via an interpolation scheme. In particular, a realistic rendering of
moving sound requires a large set of HRTFs so that no
noticeable discontinuity occurs along its path. As a consequence, in real-time systems the computational cost to
perform HRTF interpolation must be kept to a minimum
without degrading the final perceptual result. In this sense
the tradeoff between accuracy and system complexity is
critical when designing 3D audio systems. The target here
is to render faithful location of a virtual sound source
moving rapidly under user control. One could think of
applications such as virtual reality simulators, video
games, or even films in which the perspective can be
controlled by the user.
When dealing with a single source,1 a straightforward
way to perform HRTF interpolation is the bilinear ap1
Some methods, such as [5], [6], provide alternatives to the
HRTF-based model. They split the overall model into two parts,
one related to the spectral characteristics, the other to the spatial
ones. This interpretation can be more efficient than the direct use
of HRTFs when dealing with a multisource case.
915
FREELAND ET AL.
PAPERS
proach [7], [1], which consists of computing the HRTF
associated with a given point on the reference sphere as a
weighted mean of the measured HRTFs associated with
the four nearest points that circumscribe the desired point.
Consider that a set of HRIRs was measured for fixedangle steps ␪grid and ␾grid relative to azimuth and elevation, respectively. An estimate of the HRIR at an arbitrary
coordinate (␪, ␾) on the spherical surface, as shown in Fig.
1, can be obtained by
ĥ共k兲 = 共1 − c␪兲共1 − c␾兲ha共k兲 + c␪共1 − c␾兲hb共k兲
+ c␪c␾hc共k兲 + 共1 − c␪兲c␾hd共k兲
(1)
where ha(k), hb(k), hc(k), and hd(k) are the HRIRs associated with the four points nearest to the desired position.
The parameters c␪ and c␾ are computed as
c␪ =
C␪ ␪ mod ␪grid
=
,
␪grid
␪grid
c␾ =
C␾ ␾ mod ␾grid
=
␾grid
␾grid
(2)
where C␪ and C␾ are the relative angular positions defined
in Fig. 1.
Low-cost interpolation methods have been the subject
of several works [8]–[11], which usually focus on the
reduction of HRTF orders. Surprisingly, the way the
HRTF interpolation is carried out is not frequently addressed. In a recent work [12] the authors propose an
alternative interpolation procedure, similar in reasoning to
the bilinear method, but based on a new auxiliary transfer
function, called interpositional transfer functions (IPTFs),
which leads to lower complexity than that of the bilinear
interpolation without perceptible losses in the modeling
accuracy.
Since the IPTFs can be represented by models of reduced order without implying substantial deviations in the
overall model, the proposed method exhibits reduced computational complexity.
For the sake of completeness, the present work starts by
reviewing the definition and use of IPTFs for the interpolation of HRTFs [12]. The computation of the interpolation weights is generalized for the case in which the azimuth measurement steps are elevation dependent. Then a
practical procedure to obtain a complete set of low-order
IPTFs is described. An interpolation scheme based on loworder IPTFs is carefully assessed against the bilinear
method.
Fig. 1. Bilinear interpolation—graphical interpretation.
916
In the next section the IPTF is defined. In Section 2 the
IPTF-based interpolation method is presented in detail.
Comparisons between the performances of the IPTF-based
and the bilinear interpolations are presented in Section 3.
In Section 4 the simplification of IPTFs is achieved
through balanced model reduction [13] with the aid of a
spectral smoothing tool [14]. Section 5 describes and discusses the experimental results. Conclusions are drawn in
Section 6.
1 INTERPOSITIONAL TRANSFER FUNCTION (IPTF)
A practical set of HRTFs is composed of pairs of functions, each pair corresponding to one predefined location
of the sound source. As seen in Fig. 2(a) the sound paths
from the sound source to the listener’s nearer and farther
ears are referred to as ipsi- and contralateral HRTFs, respectively. Fig. 2(b) shows the alternative representation
for the HRTF pair proposed in [11]. In that work the
authors show that it is possible to represent the pair of ipsiand contralateral HRTFs associated with a given position
as depicted in Fig. 2, where the ratio between contra- and
ipsilateral HRTFs is called interaural transfer function
(IATF)2. The ipsilateral HRTF is chosen to be modeled
directly since it conveys no information about sound diffraction on the head, thus allowing lower order modeling
than the contralateral function [11]. In addition, the IATF
is a ratio between two similar transfer functions, which
allows for low-order modeling. Therefore the IATF-based
configuration yields a more efficient realization of the
contralateral HRTF.
Intuitively one can expect that HRTFs from near positions to the same ear should be very similar. This observation inspires the idea of defining an IPTF, which allows
the calculation of the HRTF associated with a certain position as a cascade of the HRTF from a contiguous position and an incremental error function of reduced order.
The IPTF can be defined as
IPTFi,f =
HRTFf
HRTFi
(3)
2
In the original work the interaural transfer function is called
ITF. Here the notation IATF is adopted to avoid ambiguity.
Fig. 2. HRTF modeling. (a) Direct. (b) IATF-based.
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
PAPERS
where HRTFi and HRTFf are the HRTFs associated with
the initial and final points, respectively, as shown in Fig.
3. Note that all the vectors were drawn opposite to the
direction of true sound paths to make the generation of
HRTFf starting from HRTFi clearer.
In order to guarantee the stability of the IPTFs it suffices that HRTFf and HRTFi have minimum phase. This
characteristic does not seem to affect the perception of the
direction of arrival of the generated sound [15]. On the
other hand, the difference between the time delays of left
and right HRTFs—called interaural time difference (ITD)
[9]—is crucial for the perceptual determination of sound
direction. As a consequence, the inherent delays of the
original HRTFs must be preserved, since they represent
the propagation time from the sound source to the ears. In
a practical implementation these initial delays could be
extracted from the impulse responses and saved, whereas
the remaining responses could be transformed into their
minimum-phase versions. Then the saved delays would be
associated back to the minimum-phase transfer function.
An alternative approach consists of deriving the minimum-phase versions of HRTFf and HRTFi, and then
evaluating the relative delays of the original and minimum-phase versions based only on the linear component
of their excess of phase [2].
The availability of an efficient approximation for the
IPTFs is assumed here. This topic will be specifically
discussed in Section 4.
2 IPTF-BASED INTERPOLATION OF HRTFs
The IPTF-based interpolation follows the same basic
principle as the bilinear method. However, in the former
method just three HRTFs are used for interpolating another one. In addition, two of them can be indirectly approximated using reduced-order IPTFs, which further contributes to reduce the computational cost of the synthesis
procedure. Due to the three-function implementation, the
spherical surface must be divided into triangular regions.
Once the triangular regions are chosen, the HRTF for a
given point P will be calculated using the HRTFs related
to the three vertices of that region. Fig. 4 depicts two
different situations in the same quadrangular region of the
reference sphere grid for which measured HRTFs are
available for points A, B, C, and D, whereas the HRTFs
associated with point P are to be estimated. From Fig. 4(a)
one can see that, given the arbitrary position P to be described, the vertices of the P region are denoted by A, B,
Fig. 3. IPTF-computed HRTFf—graphical interpretation.
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
IPTFs FOR 3D-SOUND GENERATION
and C, with A and B having the same elevation. The
coordinates of the nearest point (A in this example) will be
associated with the original HRTF, whereas the coordinates of the other two points will be related to the IPTFapproximated HRTFs. Now the HRTF associated with P
can be estimated through a weighted mean of the HRTFs
associated with A, B, and C. It is important to note that all
the points inside the shaded region are interpolated using
the HRTFs related to these three points, of which two are
approximations and the nearest one is an original HRTF.
Another example is shown in Fig. 4(b), where another
point P is inside a different ABC region. Notice that the
vertices were relabeled to allow referring to the vertices
with the same elevation in the shaded region as A and B,
like before. In this new figure the nearest vertex is C,
which will be associated with an original HRTF.
According to the assignment of these vertices, an HRTF
related to P can be described by
HRTFP = wAHRTFA + wBHRTFB + wCHRTFC
(4)
where
wC =
⌬␾
⌬␾grid
(5)
wB =
⌬␪
⌬␪grid
(6)
wA + wB + wC = 1
(7)
and the required angular distances are shown in Fig. 5 and
defined by
⌬␾ = ␾P − ␾A
(8)
⌬␪ = ␪P − ␪X
(9)
⌬␪A = ␪P − ␪A
(10)
⌬␪AC = ␪C − ␪A
(11)
⌬␪grid = ␪B − ␪A
(12)
⌬␾grid = ␾C − ␾A.
(13)
Since
⌬␪A − ⌬␪
⌬␾
=
⌬␾grid
⌬␪AC
(14)
Fig. 4. Details of reference spherical surface. A, B, C, and D have
known HRTFs; P is the position requiring an interpolated approximation to the HRTF.
917
FREELAND ET AL.
PAPERS
one can get
⌬␪ = ⌬␪A −
⌬␾
⌬␪ .
⌬␾grid AC
(15)
Denoting by I the vertex nearest to P and by F␪ and F␾
the other two vertices that form the triangular region
around P, according to Fig. 5, the IPTF-based interpolation
can be written as
HRTFP = ␣HRTFI + ␤HRTFF␪ + ␥HRTFF␾
= HRTFI共␣ + ␤IPTFI,F␪ + ␥IPTFI,F␾兲
(16)
where IPTFI,F␪ and IPTFI,F␾ are defined by Eq. (3). The
coefficients ␣, ␤, and ␥ are chosen according to the assignment of the labels I, F␪, and F␾. As an example, in Fig.
5 A, B, and C are equivalent to I, F␪, and F␾, respectively.
Therefore Eq. (16) becomes
HRTFP = wAHRTFA + wBHRTFB + wCHRTFC
= HRTFA共wA + wBIPTFA,B + wCIPTFA,C兲.
(17)
It is worth mentioning that this triangular configuration
is closely related to the formulation of the vector-based
amplitude panning method [16] for virtual source positioning, in the specific case of three loudspeakers. This issue
will not be discussed here.
The block diagram of Fig. 6 illustrates the interpolation
procedure for both the left and right channels (identified
by superscripts l and r, respectively). In this diagram the
HRTF and IPTF blocks represent the transfer functions
without their inherent delays. In a simpler approximation
l,r
procedure, the delay of HRTFl,r
I , DI , is merged with those
l,r
l,r
of the IPTFs, dI,F␪, and dI,F␾, so that the inherent delay and
the minimum-phase version of the desired HRTF can be
interpolated separately [3]. The resulting delay
l,r
l,r
⌬l,r = ␣DIl,r + ␤共DIl,r + dI,F
兲 + ␥共DIl,r + dI,F
兲
␪
␾
(18)
l,r
l,r
+ ␥dI,F
= 共␣ + ␤ + ␥兲DIl,r + ␤dI,F
␪
␾
(19)
l,r
l,r
+ ␥dI,F
= DIl,r + ␤dI,F
␪
␾
(20)
can be implemented using fractional delay filters [17].
It is worth mentioning that the use of low-order IPTFs
is the core of the simplification attainable by the IPTFbased interpolation. The gain in computational efficiency
that results from applying a triangular configuration directly to the measured HRTFs is trivial and, with no further consideration, amounts roughly to 25% compared to
the quadrangular geometry associated with the bilinear
method. In fact, three-point linear combination methods
were attempted earlier [2]. In addition to the triangular
formulation, the possibility of designing IPTFs using loworder models is what makes the proposed method appealing with regard to computational complexity. This issue is
discussed in greater detail in the following sections.
3 IPTF-BASED VERSUS BILINEAR METHOD
In this section the performance of the IPTF-based interpolation method is compared to that of the bilinear
method by evaluating their frequency response for a defined range of azimuth and elevation angles.
Initially only differences in geometry and in weight calculations are considered, that is, the IPTF-based interpolation takes the IPTFs in Eq. (3) as the direct ratio between
the corresponding HRTFs without any simplification,
which can be considered as a triangular interpolation applied to the HRTFs, in the form
Fig. 5. Angular distances used to obtain parameters wA, wB,
and wC.
HRTFP ⳱ ␣HRTFI + ␤HRTFF␪ + ␥HRTFF␾.
(21)
Fig. 6. Block diagram of IPTF-based interpolation procedure.
918
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
PAPERS
IPTFs FOR 3D-SOUND GENERATION
The underlying idea behind the tests consists in confronting the spatial frequency response surfaces (SFRSs)
[18], [19] of the results from both interpolation methods.
Each SFRS shows, for a given frequency, the dependence
of the magnitude response on the changes in elevation and
azimuth.
The equations of the bilinear interpolation are originally
defined for equally spaced grids in both ␾ and ␪. In order
to provide a fair comparison between the two methods, the
range of evaluation of the bilinear interpolation was restricted to −20° < ␾ < 20°, where the available set of
HRTFs is spaced uniformly (see Table 1).
All simulations in this work used a set of 128-samplelong measured HRIRs sampled at 44.1 kHz, originated
from [3]. They are available for elevation steps of 10° and
variable azimuth steps, as shown in Table 1.
Figs. 7–10 provide a visual comparison of performances
of the IPTF and bilinear methods. The first and second
columns show the SRFSs related to the bilinear (|HB(␻, ␪,
␾)|) and the IPTF-based (|HIPTF(␻, ␪, ␾)|) interpolations,
respectively. Different magnitude scales were adopted according to the dynamic range at each frequency bin. Only
left-ear HRTFs are considered, since the opposite functions can be inferred by symmetry. The frequency bins are
chosen according to [18].
To improve readability, the third column illustrates the
difference in dB between the two SFRSs in the same row,
given by
␰共␻, ␪, ␾兲 = 20 log10
冉
冊
|HIPTF共␻, ␪, ␾兲|
.
|HB共␻, ␪, ␾兲|
(22)
This time a lighter region means small differences (near
zero) between correspondent SFRSs. The same range was
chosen for all figures in this column in order to make the
comparison between different frequency bins easier.
In spite of the occurrence of more noticeable differences
around the azimuth angle of 90° (the region more strongly
affected by head diffraction), the equivalence between the
two methods is confirmed by the small differences observed. It can also be noticed that the worst cases are
related to low-energy regions of the SFRSs.
Of course, the mere use of the IPTFs, computed as the
direct ratio between the corresponding HRTFs without any
simplification, does not yield an improvement in computational complexity. Since the purpose of this work is to
propose efficient spatial sound generation, the next step is
to investigate whether the IPTF model can be reduced, as
expected.
Table 1. Azimuth measurement steps.
Elevation ␾
(deg)
Azimuth resolution ␪grid
(deg)
−20 to 20
−30 and 30
−40 and 40
50
60
70
80
5
6
45/7
8
10
15
30
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
Assuming that each HRTF directly used in the interpolation procedure requires N multiplications per sample
(mps), the complexity of the bilinear interpolation is
Cbilinear ⳱ 4N + 12 mps.
(23)
On the other hand, if each approximated IPTF requires M
mps, the IPTF-based interpolation scheme requires
CIPTF-based ⳱ 2M + N + 7 mps.
(24)
Therefore the latter exhibits lower computational complexity if
M⬍
3N + 5
.
2
(25)
For the HRTFs used in the simulations [3] (which have N
⳱ 128), that means M ⱕ 194. On the other hand, it is also
possible to take advantage of efficient approximations of
HRTFs [9], which allow for model-order reduction. For
example, assuming N ⳱ 33, which is the lowest order
attained in [9], making M < 52 would suffice to turn the
IPTF-based interpolation less expensive computationally.
It should be noticed that since the IPTF-based interpolation procedure utilizes a triangular grid, the number of
regions visited by the virtual sound source is increased. As
a result, if any special procedure is necessary at the edge
crossings3, it can result in additional computational load.
Of course, at high sampling rates (such as 44.1 kHz) and for
not too fast movements, this overhead becomes irrelevant.
The remaining issue is how much can the IPTFs be
simplified so that the approximation errors are acceptable?
The next section describes some tools to approximate a
complete set of IPTFs.
4 ORDER REDUCTION OF IPTFs
4.1 Balanced-Model Truncation
In [8] the authors apply the balanced-model truncation
or reduction (BMR) to decrease the order of HRTFs. Since
the measured HRTFs are in FIR form, the specialized
BMR algorithm that approximates FIR filters by low-order
IIR filters described in [13] is used. In the present work the
IPTF is the ratio of two HRTFs, and thus it is originally an
IIR system. Therefore the simpler formulation of [13] cannot be used. Nevertheless, the balanced-model reduction
method, in its general formulation, is applicable to the IIR
case and is found to be effective in reducing the order of
the IPTFs. In the following, a review of the balancedmodel reduction is presented.
The state-space representation of a system in which the
grammians of its observability and controllability matrices
(see the discussion leading to Eqs. (29) and (30) below)
are both diagonal and equal is called a balanced model.
The main advantage of this type of model is the possibility
of ordering the system states based on their significance to
the representation of the system characteristics. This is a
very important feature, which allows model simplification
by direct truncation, that is, by discarding its “less important” states.
3
More on this topic in Section 5.
919
FREELAND ET AL.
PAPERS
Consider the state-space representation of an IPTF,
x共k + 1兲 = Ax共k兲 + bu共k兲
y共k兲 = cTx共k兲 + du共k兲
(26)
where A, b, c, and d are the transition matrix, input vector,
output vector, and feedforward coefficient, respectively.
x(k) is the state vector at instant k, and u(k) and y(k) are the
input and output signals at instant k, respectively.
Fig. 7. Comparison of SFRSs.—I. (a) Bilinear. (b) IPTF-based. (c) Difference in dB.
920
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
PAPERS
IPTFs FOR 3D-SOUND GENERATION
Define
i
Wo ⳱ [c ATc . . . (AT) c . . . ]T
and
Wc ⳱ [b Ab . . . Aib . . . ]
and
(27)
(28)
for i ⳱ 1, 2, 3, . . . , as the observability and controllability
matrices of the system, respectively. Their respective
grammian matrices are given by
(29)
P ⳱ WcWTc
Q ⳱ WTo Wo.
(30)
If rank [A] ⳱ J, the rank-J Hankel matrix
H = WoWc =
冤
h1 h2 · · ·
h2 h3
··
·
··
·
冥
(31)
Fig. 8. Comparison of SFRSs—II. (a) Bilinear. (b) IPTF-based. (c) Difference in dB.
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
921
FREELAND ET AL.
PAPERS
can be constructed, where
hi ⳱ c A
T
i−1
b,
i ⳱ 1, 2, 3, . . .
(32)
are the samples of the system impulse response.
Now H can be decomposed as
H ⳱ V1⌫V2
(33)
where ⌫ is the matrix containing the singular values
␴1, . . . , ␴J of H along the first J positions of its main
diagonal, by convention arranged in descending order of
their magnitudes [20]. Let ∑ be the upper left J × J submatrix of ⌫.
Since the square of each singular value of H, ␴2i , is
equal to a corresponding eigenvalue of the product PQ
[21], [13], denoted by ␭i, for i ⳱ 1, 2, 3, . . . , then
PQ ⳱ K∑2K−1.
(34)
Fig. 9. Comparison of SFRSs—III. (a) Bilinear. (b) IPTF-based. (c) Difference in dB.
922
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
PAPERS
IPTFs FOR 3D-SOUND GENERATION
Hence one can show that the grammians Pb and Qb of the
desired balanced system must be such that
Pb = Qb = 兺 = diag共公␭1, . . . , 公␭J兲
= diag共␴1, . . . , ␴J兲.
(35)
Then the transformation that turns the original system (A,
b, c, and d) into its balanced form (Ab, bb, cb, and d)
through
Ab ⳱ T−1 AT,
bb ⳱ T−1 b,
cb ⳱ TT c
(36)
is
T ⳱ S−1 U∑1/2
(37)
where S results from decomposing the matrix Q into the
form
Q ⳱ STS.
(38)
Matrices U and ∑ can be computed from the following
decomposition:
SPST ⳱ U∑2UT
(39)
where
UUT ⳱ I.
(40)
Note that Eq. (39) holds since SPST ⳱ SPQS−1 is symmetric and has the same eigenvalues as the product PQ.
Consider the case where, according to some error criterion, only the j < J larger singular values are judged to be
sufficient to describe the IPTF. In this case matrix ∑ can
be written in the form
兺=
冋 册
兺1
0
0
兺2
(41)
Accordingly the balanced-model matrices can be decomposed into
Ab =
兺1 = diag共␴1, . . . , ␴j兲
(42)
兺2 = diag共␴j+1, . . . , ␴J兲.
(43)
A1,1 A1,2
A2,1 A2,2
册
,
bb =
冋册
b1
b2
,
cb =
冋册
c1
c2
(44)
such that (A1,1, b1, c1, and d) are the state-space matrices
of the subsystem of order j that can approximate the original IPTF accurately.
The last step, if desired, consists of converting the statespace representation back to the transfer function form.
4.2 Spectral Smoothing
The design of IPTFs through the balanced-model reduction method has some peculiarities. Since the target frequency response results from the ratio of two HRTFs, it
may include a number of sharp spectral peaks and valleys
at some locations. As a consequence, numerical problems
may arise during the simplification procedure.
On the other hand, psychoacoustics shows that frequency discrimination in human hearing varies with frequency in such a way that it becomes coarser as the frequency increases. To some extent the logarithmic western
musical scale is a consequence of this phenomenon [22],
[23]. Furthermore, measured HRIRs tend to be less reliable at higher frequencies where, due to the sound attenuation, measurements have lower signal-to-noise ratios.
Considering the aforementioned facts, one can think of
applying a spectral smoother to the HRTF response in
such a way that the degree of smoothness increases toward
the higher frequencies of IPTFs. For instance, the
smoothed spectral response at any frequency f can be computed as the average spectral response within the range
delimited by the frequencies
f1 =
where 0 are null matrices and
冋
f
公K
,
f2 = f 公K
(45)
which are geometrically equidistant from f by a factor of
√K. In the DFT domain with the dc frequency indexed as
Fig. 10. Comparison of SFRSs—IV. (a) Bilinear. (b) IPTF-based. (c) Difference in dB.
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
923
FREELAND ET AL.
PAPERS
n ⳱ 0, if f is the frequency associated with the spectral bin
n, the corresponding indices of the approximated limit
frequencies can be obtained by
n1 =
公 n
n2 = n 公K
,
K
(46)
respectively, denoting ⭈ as the lowest integer greater than
or equal to (⭈) and ⭈ as the greatest integer lower than or
equal to (⭈). The value of K ⳱ 2a gives a window width of
a octaves. For example, using K ⳱ 21/3 one defines a
one-third-octave range [9], [24]. However, a fixed value
for K may not smooth the high-frequency range of the
HRTFs sufficiently without degrading the low-frequency
range. Then a linearly varying value K(n) given by
K共n兲 = 1 − n
1−K
LⲐ2 − 1
(47)
was defined, where L is the FFT length. A constant K ⳱
21/6 was chosen, resulting in a one-sixth-octave window
width at n ⳱ L/2 − 1.
Two smoothing methods are considered, namely, the
square magnitude [9] and the complex frequency response
[14] methods, where in each case an average is computed
on a sliding window. In the former, smoothing is attained
by computing the arithmetic average of the squared magnitude response in the window range and taking the square
root of the result, that is,
|Hs共n兲| =
冑
1
n2 − n1 + 1
n2
兺 |H 共l兲|
2
o
(48)
l=n1
where Ho(n) and Hs(n) are, respectively, the original and
the smoothed frequency responses. The latter method
takes into account the phase information by computing the
mean over the complex frequency response, which can be
expressed as
Hs共n兲 =
1
n2 − n1 + 1
n2
兺H 共l兲.
o
(49)
l=n1
These two methods are based on the arithmetic mean,
which treats peak and valley occurrences differently. An
alternative solution is to replace the arithmetic mean by
the geometric mean in Eq. (49). Thus Eq. (49) becomes
Hs共n兲 =
冑兿
n2−n1+1
n2
Ho共l兲.
(50)
l=n1
In each case n1 and n2 are expressed by Eq. (46).
When using spectral smoothing in the design of loworder IPTFs it was verified that the smoother design given
by Eq. (50) is more suitable. Smoothing was applied to the
spectrum of each HRTF during the minimum-phase transformation algorithm [4], [24] prior to log-magnitude operation (see [25, p. 784]).
4.3 Order Reduction Algorithm
The techniques mentioned in the preceding are the basic
tools to build a complete set of low-order IPTFs. In the end
these IPTFs are used together with measured HRTFs to
implement an efficient moving sound system.
924
The procedure to obtain a complete set of IPTFs consists of three stages.
• Each HRTF on the reference spherical surface grid is
transformed into its smoothed minimum-phase version.
• For each smoothed HRTF one IPTF is defined with
respect to each of its four contiguous HRTFs.
• Balanced-model reduction is applied to each previously
defined IPTF in order to obtain a reduced-order model
by which the IPTF can be well represented.
In fact, this strategy implies redundancy since two IPTFs
can be defined for each two contiguous HRTFs, depending
on the choice of HRTFf and HRTFi in Eq. (3). Obviously
only one low-order IPTF per HRTF pair needs to be saved.
However, redundant functions allow the choice of the
function that matches the response of the original HRTF
ratio best. This flexibility also helps to deal with cases that
exhibit numerical problems.
5 EXPERIMENTAL RESULTS
In Section 3 a graphical comparison between the performances of the IPTF-based and the bilinear interpolation
methods was provided. In the following, the results of
IPTF-based interpolation with reduced-order IPTFs are
compared with those obtained through IPTFs with no order reduction. Once more, several SFRSs and their corresponding differences are used for visualization. This
time elevation angles from −40° to 90° are considered,
thus covering the entire region with measured HRTFs.
Figs. 11–14 refer to left-ear HRTFs; right-ear surfaces can be obtained by simply inverting the azimuth
axis. Each row contains the SFRSs related to interpolation
through nonreduced-order (a) and tenth-order (b) IPTFs,
as well as their differences in dB (c). The scales and frequency bins were chosen according to the criteria explained in Section 3.
One can observe again that corresponding SFRSs are
quite similar, exhibiting small differences. Even though
some discontinuities appear in these differences, they can
be considered negligible when compared to the overall
energy of each frequency bin, thus attesting to the reliability of low-order IPTF modeling. Like before, the worst
case tends to occur around the azimuth angle of 90°.
A complete set of tenth-order IIR IPTFs was obtained
via the procedure described in Section 4.3, and all the
examples of model-order reduction shown refer to this set.
Each IPTF in this set corresponds to M ⳱ 21, satisfying
the efficiency bounds for N ⳱ 33 from Section 3 (M ⱕ
52). Thus the proposed interpolation method is more computationally efficient than the bilinear method.
Since the complexity of the HRTFs around the reference sphere varies with the location, one could consider
using IPTFs of variable orders—lower order wherever
possible and higher order wherever needed. As a consequence the processing time would vary with the position. In real-time systems the worst-case time should be
considered.
Based on the aforementioned results, it is feasible to use
the IPTF-based interpolation in a simple moving sound
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
PAPERS
IPTFs FOR 3D-SOUND GENERATION
system restricted only to angular positioning of a single
sound source. In this case the system receives successive
samples of the monaural signal as well as the information
related to the virtual source location on the reference
sphere. Then based on considerations about timing and
position accuracy, the reference points where the virtual
source must be located are defined.
When performing HRTF interpolation through the
IPTF-based method, the following procedure is followed.
For each point P in the source trajectory, the nearest measured HRTF is found as well as the two IPTFs that complete the HRTF triangle around P. Then the HRTF in P is
interpolated according to Eq. (16) and used to process the
audio signal.
Fig. 11. Comparison of SFRSs—I. (a) Using nonreduced IPTFs. (b) Using tenth-order IPTFs. (c) Difference in dB.
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
925
FREELAND ET AL.
Finally it must be remembered that using the proposed
technique, switching between successive interpolated
HRTFs involves the switching of IIR filters whenever the
source crosses the boundary between triangles. With no
special treatment this introduces spurious transients that
may cause audible artifacts in the synthesized signal. IIR
filters involve the feedback of past outputs. When the filter
coefficients change, the saved outputs related to the old
filter make no more sense to the new one, thus creating the
PAPERS
transients. This problem has been addressed successfully
in [26], which proposes the redundancy of old and new
filter states as a way to minimize these effects. This strategy should be included in the system.
Discussions on other time-varying issues, such as the
rate of position updates, will be addressed in future work.
A system with the characteristics described here was
implemented. Performance evaluation through informal
listening tests revealed that using the IPTF-based interpo-
Fig. 12. Comparison of SFRSs—II. (a) Using nonreduced IPTFs. (b) Using tenth-order IPTFs. (c) Difference in dB.
926
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
PAPERS
lation with the designed set of tenth-order IPTFs yields
good results, comparable to those of a similar system
based on bilinear interpolation. Considering N ⳱ 33 [9]
and tenth-order IPTFs (M ⳱ 21) from Eq. (23), the bilinear method requires 144 mps, whereas from Eq. (24) the
proposed method achieves 82 mps, saving up to 62 mps, or
more than 40% of the computational effort required by the
bilinear method. Notice, however, that comparing the tri-
IPTFs FOR 3D-SOUND GENERATION
angular interpolations performed directly on HRTFs and
through IPTFs, the gain in computational complexity is 2
× (N − M), which means 2 × 12 ⳱ 24 mps for the values
suggested here. Therefore the use of IPTFs alone saves up
to 24 mps over 106 mps, or more than 22%.4
4
Some audio samples can be found at www.lps.ufrj.br/sound/
journals/JAES04.
Fig. 13. Comparison of SFRSs—III. (a) Using nonreduced IPTFs. (b) Using tenth-order IPTFs. (c) Difference in dB.
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
927
FREELAND ET AL.
6 CONCLUSION
A recently proposed method for the interpolation of
head-related transfer functions for binaural audio applications [12] was reviewed and extended. It combines a triangular interpolation scheme with an auxiliary function
called interpositional transfer function, which can lead to
efficient implementations. Extensive simulations compared IPTF-based and bilinear interpolations in terms of
accuracy and computational complexity.
A direct comparison between the bilinear interpolation
and the IPTF method indicates that the latter is more efficient in terms of computational complexity. The IPTF
method saves more than 40% of the operations required by
the bilinear one, thanks to the use of a triangular geometry
and also a reduced-order model to represent the IPTFs—
which alone accounts for up to 22% of the savings. For the
order-reduction task the balanced-model reduction and
spectral smoothing methods were proposed. The IPTFbased interpolation method was used in a system that generates moving spatial sound, and the results were encouraging. Further investigations consist in proposing more
effective ways to simplify the IPTFs and the realization of
systematic listening tests to confirm the objective and informal subjective results obtained.
7 ACKNOWLEDGMENT
The authors wish to thank the CNPq Brazilian agency
for funding this research and B. Gardner and K. Martin for
kindly providing the set of HRTFs used in this work.
8 REFERENCES
[1] D. R. Begault, 3D Sound for Virtual Reality and
Multimedia (Academic Press, Cambridge, MA, 1994).
[2] J. M. Jot, O. Warusfel, and V. Larcher, “Digital
Signal Processing Issues in the Context of Binaural and
Transaural Stereophony,” presented at the 98th Convention of the Audio Engineering Society, J. Audio Eng. Soc.
(Abstracts), vol. 43, p. 396 (1995 May), preprint 3980.
PAPERS
[3] B. Gardner and K. Martin, “HRTF Measurements of
a KEMAR Dummy-Head Microphone,” Tech. Rep. 280,
MIT Media Lab., Cambridge, MA (1994 May).
[4] C. I. Cheng and G. H. Wakefield, “Introduction to
Head-Related Transfer Functions (HRTFs): Representations of HRTFs in Time, Frequency, and Space,” J. Audio
Eng. Soc., vol. 49, pp. 231–249 (2001 Apr.).
[5] J. Chen, B. D. Van Veen, and K. E. Hecox, “External Ear Transfer Function Modeling: A Beamforming Approach,” J. Acoust. Soc. Am., vol. 92, pp. 1933–1944,
(1992 Oct.).
[6] V. Larcher, O. Warusfel, J. M. Jot, and J. Guyard,
“Study and Comparison of Efficient Methods for 3D Audio Spatialization Based on Linear Decomposition of
HRTF Data,” presented at the 108th Convention of the
Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 48, p. 351 (2000 Apr.), preprint 5097.
[7] L. Savioja, J. Huopaniemi, T. Lokki, and R. Vään̈änen, “Creating Interactive Virtual Acoustic Environments,” J. Audio Eng. Soc., vol. 47, pp. 675–705 (1999
Sept.).
[8] J. Mackenzie, J. Huopaniemi, V. Välimäki, and I.
Kale, “Low-Order Modeling of Head-Related Transfer
Functions Using Balanced Model Truncation,” IEEE Signal Process. Letts., vol. 4, pp. 39–41 (1997 Feb.).
[9] J. Huopaniemi, N. Zacharov, and M. Karjalainen,
“Objective and Subjective Evaluation of Head-Related
Transfer Function Filter Design,” J. Audio Eng. Soc., vol.
47, pp. 218–239 (1999 Apr.).
[10] C. Avendano, R. O. Duda, and V. R. Algazi,
“Modeling the Contralateral HRTF,” in Proc. AES 16th
Int. Conf. (Rovaniemi, Finland, 1999 Apr.), pp. 313–318.
[11] G. Lorho, J. Huopaniemi, N. Zacharov, and D.
Isherwood, “Efficient HRTF Synthesis Using an Interaural
Transfer Function Model,” in Proc. EUSIPCO’2000
(Tampere, Finland, 2000 Sept.), pp. 2213–2216.
[12] L. W. P. Biscainho, F. P. Freeland, and P. S. R. Diniz, “Using Inter-Positional Transfer Functions in 3DSound,” in Proc. IEEE Int. Conf. on Acoustics, Speech and
Signal Processing (Orlando, FL, 2002 May).
[13] B. Beliczynsky, I. Kale, and G. D. Cain, “Approxi-
Fig. 14. Comparison of SFRSs—IV. (a) Using nonreduced IPTFs. (b) Using tenth-order IPTFs. (c) Difference in dB.
928
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
PAPERS
IPTFs FOR 3D-SOUND GENERATION
mation of FIR by IIR Digital Filters: An Algorithm Based
on Balanced Model Reduction,” IEEE Trans. Signal Process., vol. 40, pp. 532–542 (1992 Mar.).
[14] P. D. Hatziantoniou and J. N. Mourjopoulos, “Generalized Fractional-Octave Smoothing of Audio and
Acoustic Responses,” J. Audio Eng. Soc., vol. 48, pp.
259–280 (2000 Apr.).
[15] A. Kulkarni, S. K. Isabelle, and H. S. Colburn, “On
the Minimum-Phase Approximation of Head-Related
Transfer Functions,” in IEEE Proc. WASPAA’1995 (New
Paltz, NY, 1995 Oct.).
[16] V. Pulkki, “Virtual Sound Source Positioning Using Vector Base Amplitude Panning,” J. Audio Eng. Soc.,
vol. 45, pp. 456–466 (1997 June).
[17] T. I. Laakso, V. Välimäki, M. Karjalainen, and
U. K. Laine, “Splitting the Unit Delay,” IEEE Signal Process. Mag., vol. 13, pp. 30–59 (1996 Jan.).
[18] C. I. Cheng and G. H. Wakefield, “Spatial Frequency Response Surfaces (SFRS’s): An Alternative Visualization and Interpolation Technique for Head-Related
Transfer Functions (HRTF’s),” in Proc. AES 16th Int.
Conf. (Rovaniemi, Finland, 1999 Apr.), pp. 147–159.
[19] K. Hartung, J. Braasch, and S. J. Sterbing, “Comparison of Different Methods for the Interpolation of
Head-Related Transfer Functions,” in Proc. AES 16th Int.
Conf. (Rovaniemi, Finland, 1999 Apr.), pp. 319–329.
[20] G. Strang, Linear Algebra and Its Applications, 3rd
ed. (Saunders, Orlando, FL, 1988).
[21] K. Glover, “All Optimal Hankel-Norm Approximation of Linear Multivariable Systems and Their L⬁-Error
Bounds,” Int. J. Cont., vol. 39, pp. 1115–1117 (1984).
[22] J. G. Roederer, The Physics and Psychophysics of
Music–An Introduction, 3rd ed. (Springer, New York,
1994).
[23] D. Butler, The Musician’s Guide to Perception and
Cognition (Schirmer, New York, 1992).
[24] J. Huopaniemi and J. O. Smith, III, “Spectral and
Time-Domain Preprocessing and the Choice of Modeling
Error Criteria for Binaural Digital Filters,” in Proc. AES
16th Int. Conf. (Rovaniemi, Finland, 1999 Apr.), pp.
301–312.
[25] A. V. Oppenheim and R. W. Shafer, Discrete-Time
Signal Processing (Prentice Hall, Englewood Cliffs, NJ,
1989).
[26] V. Valimäki and T. I. Laakso, “Suppression of
Transients in Time-Varying Recursive Filters for Audio
Signals,” in Proc. IEEE ICASSP’98, vol. 6 (Seattle, WA,
1998 May), pp. 3569–3572.
THE AUTHORS
F. P. Freeland
L. W. P. Biscainho
Fábio P. Freeland was born in Rio de Janeiro, Brazil. He
received an Electrical Engineering degree from the Federal University of Rio de Janeiro (UFRJ) in 1999 and an
M.Sc. degree from COPPE/UFRJ in 2001.
Since 1998 Mr. Freeland has been with the Signal Processing Laboratory (LPS/COPPE-UFRJ), where he developed his undergraduate project on audio restoration and
his M.Sc. dissertation on spatial audio. He is currently
pursuing a Ph.D. degree from COPPE/UFRJ and teaches
at UERJ. His research interests include digital audio processing, auditory scene synthesis, audio effects, among
others.
●
Luiz Wagner Pereira Biscainho was born in Rio de
Janeiro, Brazil. He received an M.Sc. degree in 1990 and
a D.Sc. degree in 2000, both from the Federal University
of Rio de Janeiro (UFRJ), Brazil.
From 1985 to 1993 Dr. Biscainho worked for the telecommunications industry. Since 1993 he has been with the
Department of Electronics and Computer Engineering of
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
P. S. R. Diniz
the Polytechnic (DEL/POLI) at UFRJ, where he is currently an associate professor of telecommunications. His
main research interests include digital signal processing of
music and audio. He is a member of the AES and IEEE.
●
Paulo S. R. Diniz was born in Niterói, Brazil. He received
an Electronics Engineering degree (cum laude) from the
Federal University of Rio de Janeiro (UFRJ), Brazil, in
1978, an M.Sc. degree from the COPPE/UFRJ in 1981,
and a Ph.D. degree from the Concordia University, Montreal, Que., Canada, in 1984, all in electrical engineering.
Since 1979 Dr. Diniz has been with the Department of
Electronic Engineering (the undergraduate dept.) at UFRJ.
He has also been with the Program of Electrical Engineering (the graduate studies dept.), COPPE/UFRJ, since 1984,
where he is a professor. He has served as undergraduate
course coordinator and as chair of the Graduate Department. He is one of the three senior researchers and coordinators of the National Excellence Center in Signal
Processing.
929
FREELAND ET AL.
From 1991 to 1992 Dr. Diniz was a visiting research
associate in the Department of Electrical and Computer
Engineering, University of Victoria, B.C., Canada. He
also holds a docent position at the Helsinki University of
Technology, Finland. From 2002 January to 2002 June he
was a Melchor chair professor in the Department of Electrical Engineering, University of Notre Dame, Notre
Dame, IN. His teaching and research interests include analog and digital signal processing, adaptive signal processing, digital communications, wireless communications,
multirate systems, stochastic processes, and electronic circuits. He has published several refereed papers in some of
these areas.
Dr. Diniz was the technical program chair of the 1995
MWSCAS held in Rio de Janeiro, Brazil. He has been on the
technical committee of several international conferences
including ISCAS, ICECS, EUSIPCO, and MWSCAS. He
has served as vice president for region 9 of the IEEE
Circuits and Systems Society and as chair of the IEEE
DSP technical committee. He has also served as associ-
930
PAPERS
ate editor for the following Journals: IEEE Transactions
on Circuits and Systems II: Analog and Digital Signal Processing from 1996 to 1999, IEEE Transactions on Signal
Processing from 1999 to 2002, and Circuits, Systems and
Signal Processing from 1998 to 2002. He was a distinguished lecturer of the IEEE Circuits and Systems Society
from 2000 to 2001. Currently he is serving as a distinguished
lecturer of the IEEE Signal Processing Society and has received the 2004 Education Award from the IEEE Circuits
and Systems Society. He has also received the Rio de Janeiro
State Scientist award from the governor of Rio de Janeiro.
Dr. Diniz wrote the books ADAPTIVE FILTERING:
Algorithms and Practical Implementation, 2nd ed. (Kluwer Academic Publishers, 2002) and DIGITAL SIGNAL
PROCESSING: System Analysis and Design, with
E. A. B. da Silva and S. L. Netto (Cambridge University
Press, Cambridge, UK, 2002). He is a fellow of the IEEE
(for fundamental contributions to the design and implementation of fixed and adaptive filters and for electrical
engineering education).
J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September
Download