>> Mark Thomas: Hello, everyone. I'd like to introduce Archontis Politis, who has been
working with me for the past 12 weeks. Archontis joins us from Aalto University, in Finland,
where he works on problems in spatial audio capture and reproduction as part of his PhD
work, and it seemed only natural that we should give him an image-processing problem to be
working with here. So without any further ado, please, you have the floor.
>> Archontis Politis: Thank you, Mark. So, yes, I will jump straight into the presentation, and
what was my task here for this internship? So I was working on an audio acoustics problem
through a kind of graphics problem route, and that was quite interesting, because I haven't done
stuff like that before in a way. But yes, so the topic of the presentation is Applications of 3D
Spherical Transforms to Acoustics and Personalization of Head-Related Transfer Functions. So
before I go to the actual task and the method that we used, I will speak a little bit about the
motivation, what got us to do what we finally did. And the motivation for this work
was HRTF personalization. HRTFs, or head-related transfer functions, are direction-dependent
filters that model the acoustical response from a sound source to the listener's ears. Basically,
they are the transfer functions from a source at some distance from the listener to the listener's
eardrums, and they are crucial for high-quality spatial sound reproduction over headphones: if we
convolve a set of these filters for a certain direction with, for example, a monophonic recording,
we can create the sensation that the sound is coming from that specific direction. One property
of these filters is that they are highly individualized, so we have to measure them for a specific
person, and that person's set of HRTFs encodes important spatial cues, mainly cues related to
elevation, but also to the azimuth of the sound source. And this creates a problem, because it
means that ideally we have to measure a set of HRTFs for every individual for whom we want to
deploy an immersive application that uses full spatial sound over headphones, and measuring
HRTFs is a very lengthy and costly process. You have to build a room, an anechoic space like these
ones in the photo, and you also need some special apparatus that is specifically designed for the
task of measuring HRTFs around the listener. This is an example of the anechoic space here at
Microsoft Research, and the apparatus with a rotating arc that measures the HRTFs around the
head of the listener who sits in the middle. And the lower picture, I think, is some older setup in
either NASA or some Air Force facility, where they have a full grid of loudspeakers, and the
listener has to get inside that grid. So this is not really an option for general deployment of
immersive audiovisual applications. So the way the research world has treated this problem is
there are two or three different approaches. One of the things people have proposed is that you
model the HRTFs as a set of filters with a few parameters, and then you present a task to the user
in which they can tune these parameters in order to make these HRTFs match what they would
like to hear, or match their own. The second
approach is an approach based on numerical simulation, so you get some kind of a rough model
of the head and the ears, and then you put it into an acoustic simulator on a computer, and you try to
model the scattering effect that the head has on a sound wave up to 20 kilohertz, up to the
audible frequency range, and this is also a quite nice solution. The problem is that it's also
extremely lengthy. Normally, for most numerical simulation methods, it will take days to
compute the full audible frequency range. The third approach, and the one that relates mostly to
what we are doing, is to measure a database of HRTFs for many subjects and then try to find, for
a specific user, the set that would match them well, based on some of their features, mostly
morphological features. So what people have done is they have measured tens or hundreds of
subjects and kept a directory of anthropometric features, and then for any new user, all they need
to do is measure that user's anthropometric features, find the subjects in the database that are
closest to the user, and then pick their HRTFs or a set of HRTFs, hoping that these would match
the user quite well. And the problem with this approach up until now is that there is no direct
connection, or at least no very clear direct connection, between an HRTF and all these specific
anthropometric features. It's still unclear, and an open problem, how, for
example, these individual features affect specific features in the HRTF response. So even though
this approach has been studied quite a lot, it's still an open problem, and many times, the
relations that have been used are quite arbitrary. So the main motivation of this research was,
first, that we know the HRTFs clearly depend on the head shape and the head size in some way,
since basically they model the scattering of the sound around the head. What we were thinking
is that we could determine a similarity measure between heads in a database that is based directly
on the total head shape rather than on individual features. The way we approached this problem
was to see the head as a three-dimensional shape, or a surface or distribution, and perform a
spectral or harmonic decomposition on it, so as to find spatial frequency components that model
this head shape in some way and to determine similarity based on these spatial frequency
components. So the first step towards that was to consider one quite
popular spatial transform, popular in many fields including acoustics, and that's the spherical
harmonic transform, which is basically like a two-dimensional Fourier transform for data or
functions defined on the unit sphere. It has been used in acoustics for modeling loudspeaker
radiation patterns and microphone directivity patterns, for spherical array beamforming and
spatial filtering, which has been a very active field of research over the last decades, for
immersive 3D sound recording and reproduction with the family of methods popularized as
ambisonics, and finally for interpolation of HRTFs. Since HRTFs are measured around the
listener on a spherical grid, the spherical harmonic transform is a pretty natural way to first
compress all this data and also use the inverse transform to interpolate between the measurement
points. So here is a look at a little bit of the
math of the spherical harmonic transform. We get the spherical harmonic coefficients by
projecting our function onto the basis functions, which are also called the spherical harmonics.
These spherical harmonics depend on azimuth and elevation, with a normalization term that
keeps them orthonormal under integration over the unit sphere, and then, using these harmonic
coefficients, we can reconstruct our function or interpolate our data for any direction that we
want. The domain of the function is the unit sphere, so it depends on azimuth and elevation, and
even though theoretically the reconstruction is an infinite summation, for most practical
functions, especially in acoustics, it can be described perfectly well by a finite sum, so the
transform is practical in use. One example use of the spherical harmonic transform is HRTF
interpolation, as I mentioned, so these are HRTF magnitudes for the left ear of a user measured
here in MSR at 1 kilohertz, and this grid of points is actually the measurement grid, and we can
see this -- yes, the magnitude variation across different directions, and if we perform a spherical
harmonic transform on this data, we get the spherical harmonic spectrum. And using this spectrum,
by performing the inverse transform, we get a very nice, smooth surface at any direction we want
that models this HRTF directivity function very well.
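A minimal sketch of the analysis and interpolation step described above, assuming a least-squares fit of spherical harmonic coefficients from scattered measurement directions; the helper names, the grid variables, and the use of SciPy's sph_harm (azimuth first, then polar angle) are illustrative assumptions, not the implementation used in the talk.

```python
import numpy as np
from scipy.special import sph_harm  # sph_harm(m, n, azimuth, polar_angle)

def sh_matrix(order, azi, polar):
    """Stack the complex spherical harmonics Y_nm(azi, polar) for all
    n <= order and m = -n..n, one column per (n, m) pair."""
    cols = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            cols.append(sph_harm(m, n, azi, polar))
    return np.stack(cols, axis=-1)              # (num_points, (order + 1)**2)

def sht_fit(order, azi, polar, values):
    """Forward transform: least-squares SH coefficients from scattered samples."""
    Y = sh_matrix(order, azi, polar)
    coeffs, *_ = np.linalg.lstsq(Y, values.astype(complex), rcond=None)
    return coeffs

def sht_eval(order, coeffs, azi, polar):
    """Inverse transform: reconstruct (interpolate) the function at new directions."""
    return sh_matrix(order, azi, polar) @ coeffs

# Hypothetical usage with HRTF magnitudes measured at 1 kHz on a spherical grid:
# coeffs = sht_fit(15, azi, polar, mags)   # order 15 -> (15 + 1)**2 = 256 coefficients
# interp = sht_eval(15, coeffs, new_azi, new_polar).real
```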
>>: I have a question about that. The title of the plot says the parameter L equals 15 -- what does
that mean?
>> Archontis Politis: Yes. That's actually the order of the transform, meaning that we use more
and more harmonic components to capture the variation of the function across angles. And this
order is determined by a few factors. First of all, it's determined by how smooth our original, let's
say, HRTF is across different directions, but also, it's limited by how many measurements we
have done, so by kind of a sampling condition. And in this case, the order of the transform was
15, which means it used (15 + 1) squared = 256 coefficients, so basically, with 256 coefficients,
we could interpolate for any direction we wanted at this single frequency. So yes, here is also a quick
example of what these basis functions, the spherical harmonics, look like, and we can see that they
have more angular variation at increasing orders. Also, different basis functions, different
spherical harmonics, have different symmetry properties, meaning that they can capture different
kinds of properties of the shape or distribution that we are transforming. Of special interest
to our work here was the fact that the spherical harmonic transform is also quite popular in the
graphics community. It's used mainly for tasks that have to do with lighting of 3D objects and
rendering, but it has also been used for, for example, fast approximation of 3D heads, so this is
an example of a 3D head captured with laser scanning, and this is what the spherical harmonic
transform gives with an order of 25 and 676 coefficients. You can see that it captures most
features of the head in a smooth representation, and the same for the other head, too. One thing
we can note here is that very complex surfaces, like, for example, the back of the ear, cannot really
be captured by a transform like that. Because it's a two-dimensional transform, it can only capture
the outermost point of the surface of the head in each direction, and it cannot model radial
variations like that. And of even more interest to our work in this internship was the fact that the
spherical harmonic transform has been used, again in the graphics community, in the problem of
trying to detect similar 3D objects in a 3D model database: if we have a database of thousands of
3D models and we do a query with a new 3D model, the problem is to find similar models in the
database. As I mentioned, the spherical harmonic transform, as it is, has its limitations, in that it
cannot capture radial variations, so by itself it's not very well suited to this problem for complex
3D shapes. Essentially, if the origin is inside the 3D model, it requires that every point of the
model be visible from the origin, so that there are no occlusions or anything like that. So the way
people used it before for this similarity problem is that they took the 3D model, created
concentric spheres growing outwards from the center of the object, took the intersection of these
concentric spheres with the 3D model, and applied the spherical harmonic transform to each of
these spheres separately, ending up with a two-dimensional matrix of spherical harmonic
coefficients, which was then used as a template to find
the similarity between the 3D models. One other nice detail about this study and similar studies
is that it makes sense for this kind of problem not to use directly the spherical harmonic
coefficients but actually to use the energy of the coefficients for each order L, which has the
important property that it doesn't depend on the rotation of the object. So if, instead of the
coefficients directly, we use the spectral energies, we end up with a smaller vector, and that vector
doesn't change whether the object is rotated or not, and that's a very nice property for this similarity
matching problem. So finally, we saw that the spherical harmonic transform can be quite good
for our task, which also relates to this HRTF personalization task. However, the way that people
used it before is a bit ad hoc. It's in a way somewhere between a spatial representation and a
harmonic representation: it breaks down the object into these spheres, and then you have a
harmonic representation for each one of these spheres. What we thought is that we could go one
step further and use a full three-dimensional transform, so that instead of only capturing the
angular variation across each one of these spheres, we also try to capture the radial variation, and
we end up with a full three-dimensional transform that models both angular and radial frequency
components. So we focused on two 3D
spherical transforms. One is the spherical Fourier-Bessel transform, which has been used quite
a lot in physics and chemistry and, I think, also in some image processing problems, and the
second one is the spherical harmonic oscillator transform, which is not very well known by this
name. There are only a few works that refer to it like that, but I think it has been used quite a lot
in quantum physics or quantum engineering. So the first one, the spherical Fourier-Bessel
transform, just a few notes here. First of all, this transform is no longer defined, like the
spherical harmonic transform, only on the surface of a sphere, but is defined over 3D space. It
also has a radial component, and here the domain of integration is a solid sphere going from zero
up to some radius that we choose so that it completely encloses our shape. This radius of the
domain determines many properties of the wave functions, like the scaling of the radial functions
and the normalization term. We can also see that the basis functions for this transform contain
the spherical harmonics, so the angular components are still captured by the spherical harmonics,
while the radial component is captured by the spherical Bessel functions. This is an example of
what the basis functions look like. It is not very clear, but what we can say at least is that this is a
specific wave function, and apart from the angular variation, it's now a full three-dimensional
function. So it's defined in the full 3D space, and we can see here that it has variation both across
the angular dimension and across the radial dimension. The second transform is the spherical
harmonic oscillator transform,
which, similar to the spherical Fourier-Bessel transform, is also three-dimensional. The
difference with the spherical Fourier-Bessel is that for the radial component it uses the associated
Laguerre polynomials, and now the domain of integration is not a solid sphere but goes from zero
to infinity. However, the transform is still suitable for capturing and modeling shapes that are
concentrated in some region. Again, that's a picture of an example of the harmonic oscillator
transform basis functions. And next, after we implemented the basic form of these transforms,
we had to think of what we could use them for, and naturally, one good application is
compression and interpolation not only of data on a sphere but of data in 3D space. Data that are
sampled spherically are especially well suited for these transforms, and a natural application is
the 3D interpolation of near-field HRTFs. Near-field HRTFs refer to the fact that even though
HRTFs are known not to change, apart from a scaling factor, for source distances beyond 1 to 1.5
meters, that doesn't hold true for distances close to the head, so if we want to make a sound
appear like it's coming really close to the head, we actually have to measure HRTFs at different
distances from the head. So we have a radial variation, and this is something that can be
compressed and also expressed by use of the full 3D transforms, and once we have the harmonic
coefficients, it's very easy and fast to interpolate at any point we want. The second task, and this
is the task that motivated this
research, is that we can use these 3D transforms to capture the harmonic components of head
shapes: quickly get a 3D scan of a user by using, for example, the Kinect, and then, by use of the
Fourier-Bessel transform or the SHOT, get the harmonic components, or the spectrum, of the
user's head, compare it with the spectra of the heads in a database for which we have also
measured the HRTFs, and then just pull the HRTFs that correspond to the head closest to the
user's. And as I mentioned before, a nice property of the three-dimensional transforms, similar to
the spherical harmonic transform, is that we can use this spectral energy vector, which is
rotationally invariant, so the method will be robust, for example, if there is a head in the database
that is similar to the user's but, due to the measurement procedure, is rotated or non-aligned.
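A sketch of the rotation-invariant feature just described, under the assumption that the coefficients are stored in the usual (order, degree) ordering; the array layouts and function names are illustrative, not the code used in the project.

```python
import numpy as np

def sh_energy_spectrum(coeffs, order):
    """Per-order energies of spherical harmonic coefficients, laid out as
    [(n=0, m=0), (n=1, m=-1..1), (n=2, m=-2..2), ...]. Summing |c_nm|^2
    over m for each order n removes the dependence on object rotation."""
    energies = np.zeros(order + 1)
    start = 0
    for n in range(order + 1):
        block = coeffs[start:start + 2 * n + 1]
        energies[n] = np.sum(np.abs(block) ** 2)
        start += 2 * n + 1
    return energies

def energy_spectrum_3d(coeffs_3d, order):
    """For a 3D transform (Fourier-Bessel or SHOT) with coefficients indexed
    as coeffs_3d[k, n, m] (radial index k, order n, degree index 0..2n), keep
    the radial index and sum only over m: the resulting vector is still
    rotation invariant but encodes radial structure as well."""
    num_radial = coeffs_3d.shape[0]
    out = np.zeros((num_radial, order + 1))
    for n in range(order + 1):
        out[:, n] = np.sum(np.abs(coeffs_3d[:, n, :2 * n + 1]) ** 2, axis=-1)
    return out.ravel()
```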
>>: What about translation, if you hadn't properly found the center or if you misaligned the center
somehow, is it --
>> Archontis Politis: So translation is a problem. The transform is not robust to translation. Of
course, that depends on how much it is translated. Yes, it seems like you cannot have both. For
example, you can use normal three-dimensional FFT, and that will be kind of translation
invariant if you take the energy, but it won't be rotationally invariant at all, while these
transforms are rotationally invariant but not translation invariant. For this problem, for example,
in this study, we did some rough alignment of the heads first by detecting the ends of the ears, so
all of the heads were pretty much aligned to have the inter-aural axis here, and the center was put
in the middle of the inter-aural axis. So all of them had kind of a common reference point, and
we were hoping that after that the deviations would be small enough that the method would still
capture similarities between two heads without huge errors, for example. So after we
implemented the basic transforms, we wanted to apply them to the head scans, and we needed to
perform a series of steps to manage to do that, starting from a 3D mesh coming from laser scans.
So the way we did that is we sampled the head scans by first using a ray tracer that was shooting
rays in uniformly arranged directions all over the 3D sphere. Then we could find the collision
points of the rays with the head mesh, and based on these collision points, on a grid of concentric
spheres, we could determine whether points were inside the head or close to the boundary of the
head, and then use these sampled points as input to the transform. We used two cases of
sampling. One was a solid sampling, considering the head as a solid object, meaning that any
point that was detected inside the head had a binary value of 1, and all the other points had zero.
The second case was considering the head as a shell with some width, meaning that all the points
that were within some margin of the head surface had a value of 1, while the other points were
zero. Between these two conditions, we couldn't predict from the beginning which one would
perform best. Also, the shell sampling has the advantage that it's more robust to errors or
irregularities like holes, so it wouldn't care about those, while for the solid sampling, we need to
make sure that the mesh is closed in order to detect whether the points are within the head or
outside it.
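A minimal sketch of the sampling step described above, assuming roughly uniform ray directions from a Fibonacci lattice and a single boundary crossing per ray (the talk's ray tracer also handles rays that exit and re-enter the head); the first-hit radius per ray is assumed to come from the ray tracer, and all names and numbers are illustrative.

```python
import numpy as np

def fibonacci_directions(num_rays):
    """Roughly uniform ray directions over the sphere (Fibonacci lattice)."""
    i = np.arange(num_rays)
    z = 1.0 - 2.0 * (i + 0.5) / num_rays
    azi = np.pi * (1.0 + 5.0 ** 0.5) * i
    rho = np.sqrt(1.0 - z ** 2)
    return np.stack([rho * np.cos(azi), rho * np.sin(azi), z], axis=-1)

def occupancy(hit_radius, radii, shell_width=None):
    """Binary samples on concentric spheres. hit_radius[d] is the distance at
    which ray d first hits the mesh (assumed to come from the ray tracer).
    Solid sampling: 1 inside the head, 0 outside.
    Shell sampling: 1 within +/- shell_width of the boundary, 0 elsewhere."""
    R = radii[:, None]                  # (num_spheres, 1)
    B = hit_radius[None, :]             # (1, num_rays)
    if shell_width is None:
        return (R <= B).astype(float)                       # solid head
    return (np.abs(R - B) <= shell_width).astype(float)     # thin shell

# Hypothetical usage:
# dirs = fibonacci_directions(5000)                # ~5,000 rays, as in the talk
# radii = np.arange(0.001, 0.15, 0.001)            # ~1 mm radial steps
# samples = occupancy(hit_radius_per_ray, radii, shell_width=0.002)
```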
>>: Are you going to talk more about your sampling rates, like how dense from the center?
>> Archontis Politis: Yes. We tried a few things. So in the beginning, I took very high
sampling rates -- and I can actually show a little bit here, too. So we had a number of rays
shooting from the center, uniformly distributed, and these were around 15,000, and then we had
concentric spheres at about 1 millimeter difference in radius. We finally decided that we would
like to keep this radial resolution, but we reduced the angular resolution quite a lot by reducing
the rays to around 5,000, without significant changes, at least not seen in this study. That was
something we found important also for optimizing the transform, because if you do just a direct,
naive implementation of it, it's quite slow and costly, both in memory and in processing. But if
we use the fact that inside the transform there is a spherical harmonic transform, which becomes
just a single summation if you make sure that your rays are as uniformly distributed on the sphere
as possible, so this integration becomes just a sum, then you can get a good speed-up. And also,
there were a few other optimizations: for all the spheres that were completely inside the head,
because all the samples had just the value of 1 or 0, depending on the solid or shell condition, we
could very quickly determine the coefficients of the transform. Yes. So this is an example of what
it was seeing, basically. So that's a head mesh of my scary mentor, Mark, and this is an example
of the ray tracer collision points, where the different colors indicate how many times the ray has
exited the head and then entered it again, if it hits, for example, the neck or the ear. And this is a
case of the solid sampling: here we have a very coarse condition of only around 15 spheres, I
think, which captures just the very basic shape of the head, while here I think I had around 200
spheres, and we can see that it captures the head quite well. And for the case of the shell
sampling, so keeping only the points that are around the mesh, this is a case of 5,000 points, and
this is a case of 15,000 points. Again, that depends on how much detail you are trying to model,
so in the beginning, we were really trying to capture all the variations on the ear, too, but along
the way we decided that it's probably better to separate the two things: first try to capture just the
head shape, and then maybe try to capture the ear shape separately and use that somehow. So
that allowed us to reduce the number of sampling points quite a
lot. And this is an example of what these head spectra look like. This is the case for the SHOT,
this is the case for the Fourier-Bessel transform, and these are the representations after we have
computed the spectral energies, so these are the rotationally invariant vectors that we are actually
using to compare similarities between the heads. You can see that there are various periodicities.
This has to do with the way the basis functions are indexed in the transform: basically, every few
coefficients there comes a coefficient that corresponds to a wave function that varies only radially
and has no angular variation, and these coefficients integrate over the interior of the head, which
is solid, so they have quite high values, while the coefficients that have angular variation get
suppressed. So yes, first we get the spectra, then we compute these energy spectra, which are
also reduced in length, and then, to compute the similarity between the heads, we just take the
Euclidean distance between these energy spectra, the Euclidean distance between these vectors
for two different heads, and we have a metric that determines somehow how close they are in
shape.
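A sketch of the matching step just described: the Euclidean distance between the rotation-invariant energy-spectrum vectors, with the closest database heads returned. The array shapes and names are assumptions for illustration.

```python
import numpy as np

def nearest_heads(query_spectrum, database_spectra, k=3):
    """Rank database heads by Euclidean distance between energy-spectrum
    vectors. database_spectra has shape (num_subjects, spectrum_length);
    returns the indices and distances of the k closest heads."""
    dists = np.linalg.norm(database_spectra - query_spectrum[None, :], axis=1)
    order = np.argsort(dists)
    return order[:k], dists[order[:k]]

# Hypothetical usage: pick the HRTF set of the closest subject in the database.
# best, _ = nearest_heads(user_spectrum, head_spectra, k=1)
# personalized_hrtfs = hrtf_database[best[0]]
```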
>>: So before you cut to the results, I wanted to ask why you've chosen to evaluate these two
particular transforms, the SHOT and the Fourier-Bessel one -- are they both complete orthogonal
basis function sets?
>> Archontis Politis: Yes, yes.
>>: So then theoretically, if you used --
>> Archontis Politis: It wouldn't matter.
>>: Then it shouldn't matter, right? So are you going to be comparing them in terms of the
number of coefficients you need, or is there one you would expect --
>> Archontis Politis: There were a few reasons, and actually, the original motivation for that was
dropped midway. So the original idea was to go with the SHOT transform. It's novel and it has
some nice properties, and one of them was that you could have data on a rectangular grid, and
there was a perfectly defined transformation to make the SHOT work on data on a rectangular
grid without error, while in the case of the Fourier-Bessel transform, you would need to
interpolate this data from the rectangular grid to a spherical grid, which means that you would
lose some information there. So in that case, we were going with the SHOT, but also trying the
Fourier-Bessel transform, because it's more commonly used, in a way. But then we realized that
since we can define the sampling grid ourselves, and in a way that can speed up the computations
a lot, we used spherical sampling grids, so in that case it wouldn't matter anymore which one of
the two we used. But then I decided to use both of them, mainly because of this different
boundary condition. So the Fourier-Bessel transform is defined in a solid sphere without seeing
anything outside of it, and that sounds perfect for a head, for example, because you can enclose it
completely in a sphere, while for the SHOT, you have to consider the size of the shape, too, to be
sure that it's captured enough -- yes, that the transform works, basically. And that wasn't clear,
how it would work. So we decided to keep both of them. And then in terms of implementation,
at least after you had the math done right in some functions, there was no difference in applying
one or the other. Yes. And they were also quite comparable in terms of speed.
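A sketch of the two radial basis families discussed above, with normalization omitted: spherical Bessel functions for the Fourier-Bessel transform, and Gaussian-weighted associated Laguerre polynomials for the harmonic oscillator transform. The parameter choices here (the radial wavenumber k_ln, the scale sigma) are assumptions that would have to be matched to the domain radius and the size of the head.

```python
import numpy as np
from scipy.special import spherical_jn, eval_genlaguerre

def sfb_radial(order, k_ln, r):
    """Radial part of the spherical Fourier-Bessel basis: j_l(k_ln * r).
    k_ln would follow from the boundary condition at the domain radius R
    (e.g. k_ln = z_ln / R with z_ln a zero of j_l); normalization omitted."""
    return spherical_jn(order, k_ln * r)

def shot_radial(radial_index, order, r, sigma=1.0):
    """Radial part of the spherical harmonic oscillator basis (unnormalized):
    r^l * exp(-r^2 / (2 sigma^2)) * L_k^(l + 1/2)(r^2 / sigma^2), where L is
    the associated Laguerre polynomial and sigma sets the spatial scale."""
    x = (r / sigma) ** 2
    return r ** order * np.exp(-0.5 * x) * eval_genlaguerre(radial_index, order + 0.5, x)

# Either radial family, multiplied by the spherical harmonics Y_lm, gives one
# full 3D basis function; the forward transform projects the sampled occupancy
# values onto these bases with the r^2 volume weight of the spherical integral.
```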
>>: That would be the other consideration, if one of them is much faster than the other, but
they're kind of comparable?
>> Archontis Politis: Yes. It wasn't, in this case at least. If somebody looks at them in much
more detail, maybe there is stuff that can be optimized. Maybe the polynomials of one have some
nicer formula to compute than the other's, but nothing was apparent, at least. So yes, after we
managed to apply the transforms to the heads, we wanted to see if we got anything reasonable out
of them, and one way to do that was to try to reconstruct the heads back from the coefficients.
Since we could define this reconstruction on any kind of grid we wanted, this is an example of
trying to reconstruct the head in a horizontal plane passing through the ears, and we can see that
for both transforms the head shape is captured quite well, so it has a clear boundary going from
very close to 1 down to 0. Also, here I have plotted directly the ray tracer collision points with
the true mesh, and you can see that they are very well aligned along the boundary that the
transform gives. And you can also see that things like the pinna lobe here are also captured,
something that, for example, the spherical harmonic transform wouldn't be able to model. Then,
we could also reconstruct the full head on a 3D volumetric grid and get a 3D representation of it,
so that's an example of how the transform performs going from a very low-order representation,
using just a few coefficients, to a pretty high-order representation. And we can see that it goes
from looking like a rough ball to something that starts to get quite accurate with regard to the
head. All
right. Finally, after this work on the transforms and their application to the heads, we wanted to
see if it was at all useful for the problem of HRTF matching: using the head scan of a user, trying
to detect the closest head scan in the database, and then pulling the HRTFs and seeing if they
would match that user at all. However, we decided as a first step to check not the full HRTF but
the ITDs, mainly because the ITDs are better understood, in the sense that we know they depend
mainly on the head shape and the head size, while the HRTF magnitudes depend both on the head
shape and size and also quite strongly on the ear and the ear shape. So the magnitudes seemed to
need some more work in order to be separated into head-related features and ear-related features.
And the ITD anyway is one of the two major components of the HRTFs, so we thought that it
would be a nice first step. The ITD, the interaural time difference, models the difference in the
time it takes for a sound coming from a certain direction to reach one ear and the other. And it
looks something like this if you plot it in 3D space, measured for a single subject -- that's in
milliseconds, I think -- and it's almost zero in the medial plane, because there are no time
differences between the two ears, and it gets its maximum at the extreme lateral directions, to the
left and the right. So what we did to apply the method and evaluate it
was first to apply the transform to all the heads in the database and get the spectra for each head.
Then, since we also had the measured HRTFs for each head, we extracted the ITDs for each
subject. We determined similarity directly between the ITDs of all the subjects in the database
by taking the Euclidean distance between the ITDs over all directions for each pair of subjects,
and, having these two similarity metrics, one for the ITDs and one for the head spectra, we
constructed a distance matrix first for the heads and then for the ITDs themselves. We also used
as baselines two quite popular approaches to using non-personalized HRTFs. One was to use an
average ITD over the whole database, and the other was to use an ITD measured from an
anthropomorphic mannequin, meaning that you take a kind of doll that models kind of an average
person, you put microphones in its ears, you measure its HRTFs, and then you use these HRTFs
or ITDs for the user. So these are the baseline conditions of average and generic. The generic
corresponds to the mannequin ITDs, and for each subject, we also compared the ITD distances
between the subject's own ITDs and the average and the generic ones. Before I move to the
results, this is an example of what this similarity, head similarity,
looks like. So in this case, I plot the three most similar heads for two original heads, and we can
see that even though it's very hard to say visually anything about the transform itself, I think that
it seems like it's getting something about the shape and the size of the heads, at least. So about
the results, this is a pretty confusing plot, but what it shows, basically, is that for each subject in
the database -- I forgot to mention that we used 144 subjects in the database -- this plot shows the
ITD difference between their own ITDs and the ITDs that were returned by our method, which is
the head condition, the blue line; the ITD difference between the subject's own and the average
ITD; and the same for the subject's own ITDs and the mannequin, which is the HATS line. Also,
the bottom line, the purple one, shows the ITD distance between each subject and their closest
match in terms of ITD distance itself. So if we take one subject, which other subject is closest,
has the least ITD distance to that subject, and how much is that? We can see that this line
basically defines a lower bound on the performance we can get with our method, because
whether we select a person from the database randomly or by using any kind of algorithm, we can
never go below that line. For the rest of the lines, it's pretty hard to see what is going on. We can
say that the HATS, the generic head, doesn't seem to perform too well. It has the most variation,
the yellow line, while the average and the head conditions look pretty close. So in order to
evaluate that a bit further, we computed some scores that say, basically, for how many subjects
the method was performing better -- the blue line was better -- than the average, the orange line,
or the yellow, the generic ITD. And the scores were
basically this, so what they say is that for 64% of the subjects, the method based on the SHOT
transform was performing better than the average ITD, and for 71% of the subjects, it was
performing better than the generic ITD. The Fourier-Bessel transform looks quite close but
performs slightly worse, and the spherical harmonic transform, which we also included for
comparison and evaluation -- so just applying the more traditional approach of using the spherical
harmonic transform to determine the similarity of the heads -- performs significantly worse than
the full three-dimensional transforms, which is a good motivation to keep looking at these
transforms, even though they require some extra work compared to the simpler approach. One
comment about the average ITD is that there seems to be a high number of subjects here for
which the orange line is pretty low, so it seems that there are subjects inside the database that are very
close to the average ITD. And actually, for some cases, the orange line goes even below the
purple line, which means that there is not a single subject in the whole database that has a closer
ITD to that specific user than the average ITD. So it seems that it would be
advantageous to do some preliminary study on the ITDs or the HRTFs themselves to somehow
cluster the subjects within them, and that could probably give more information for the matching
problem, since it seems like there are subjects that are very average, in a way, very close to the
average HRTF. But then, there are also subjects that are very far from that, so they're highly
individualized, in a way. Some comments on future work or potential next steps: of course, the
ITD is one part of the HRTF. The second part is the HRTF magnitudes, but as I mentioned, this
doesn't work directly because of the effect of the pinna, so one idea was to apply this shape
similarity to the ear shapes too, and in that way, we could end up with two sets of similarity --
sorry, of spectra -- one for the head and one for the ear, and then somehow use these two
similarities to pick HRTF magnitudes from the database. However, that somehow requires some
kind of factorization or decomposition of the HRTF magnitudes into head-related and
pinna-related components. If there is a way to do that, then it also probably means that you can
pull the head-related part from one subject and the pinna-related part from a different subject, if
the method shows that this is a good way to go.
Finally, to conclude, this was a study on some 3D spherical transforms that have not been used
before in acoustics, and they seem to have interesting properties with potential for interpolation,
registration, and similarity finding or matching of 3D data. It looks to us -- and we took some
preliminary steps to validate this -- that they're suitable for some larger-scale applications, such as
3D interpolation of near-field HRTFs and HRTF personalization, and we got some promising
results on personalization of HRTFs by applying these transforms to head meshes and to ITDs.
And that should be it. Thank you very much, and I would like to thank especially the guys on the
team, my mentor Mark Thomas, Hannes, David, and Ivan, who is not here today, and also the
great interns who left already and left me alone here, Matt, Long and Supreeth. I hope they will
watch this presentation in the future, or probably not. Yes, anyway, thank you very much.
>>: So one quick question. You said you'd talked about, you'd thought about using Kinect or
whatever to do the 3D head scans. Did you actually do much of that, or did you just use the
scans from the head database that you already had?
>> Archontis Politis: Yes, no, this study didn't use these lower-quality scans. It was based on
pretty high-quality laser scans that were measured by the guys here. So that would be a natural
progression, to check it with lower-resolution scans, or scans of the same person but measured in
different conditions, with different setups and maybe different methods to generate the scan, and
hopefully the method should be robust to that. So the same person, measured with different
setups and methods, should, after some alignment for this translation problem based on some
very rough features, for example, be detected as very similar compared to other heads, hopefully.
But yes, we haven't checked that yet.
>>: Can you give an idea of how long it would take to process one head? Say you have
everything set up, and then I scan somebody.
>> Archontis Politis: Well, for the resolution that we used, at least for these results, that would
be around a minute per head, per transform. So the transforms took pretty much the same time;
using the same sampling grid, it was about a minute each. And this was a pretty high resolution --
maybe 1 millimeter for each sphere is not required -- but these things need to be checked
individually with some simpler cases, not just by doing a full head scan and seeing what is going
on, because it's probably pretty easy to derive sampling conditions for the angular variation, but
at least to me, it's not obvious how to derive the condition across the radial dimension. There is a
radius squared in the integration there and some other things that make it a bit more involved.
>>: You didn't mention the other motivation for using SHOT.
>> Archontis Politis: Oh, yeah, actually, that was the main one. How could I forget it.
>>: We could take it after we're done.
>> Archontis Politis: Yes, so to your question about why to use one or the other, SHOT sounded
great, because you combine it with head-related transfer functions, and you have a head SHOT,
and that narrows it down.
>>: We have no further questions. No one online?
>> Archontis Politis: Sorry?
>>: No one online has asked any questions?
>> Archontis Politis: I don't see anything here.
>>: Thanks. Thank you.