>> Ivan Tashev: Good afternoon, everyone, those who are in the room and those who are watching this talk remotely. Today we have Piotr Bilinski. He's a PhD candidate at INRIA, France, in the Spatio-Temporal Activity Recognition Systems research group, but today he's going to talk about his work during the last three months on HRTFs here at Microsoft Research. Without further ado, Piotr, you have the floor.

>> Piotr Bilinski: Thank you, Ivan, for the introduction. Good afternoon. Thank you all for coming to this presentation, those who are here and those who are watching it online. Today I will talk about my internship project: head-related transfer functions and their personalization using anthropometric features. This work was done together with my supervisor, Ivan Tashev, Jens Ahrens and Mark Thomas.

At the beginning I would like to say thanks for the wonderful summer. It's been a great pleasure being here. I thank Ivan as my supervisor for bringing me here and for the collaboration. I thank Jens Ahrens for co-supervising me, and also Mark Thomas. I would like to thank people from the Interactive Entertainment Business Division, especially Alex Kipman for funding my internship project and Jeremy Sampson for helping with the data collection. I would like to thank John Platt and Misha Bilenko from the machine learning team for their consultations on machine learning. I would like to thank Andrzej Pastusiak for his consultation about the TLC library and Chris Burges for letting me use his boosted decision trees. Also I would like to thank Jasha Droppo from the CSRC team for the consultations.

Here is the overview of my talk. I will talk about HRTFs: what is an HRTF and why do we need them? I will talk about the data collection we did and the anthropometric feature extraction I worked on. I will discuss whether universal HRTFs exist, then HRTF recommendation via learning from user studies, and then the two approaches we propose, one based on sparse representation and another based on neural networks. Finally, I will compare all of the proposed techniques, and I will conclude and talk about future directions.

So to start, what is an HRTF? HRTF stands for head-related transfer function, which represents the acoustical transfer function between a sound source and the entrance of the blocked ear canal. The HRTF describes the complex frequency response as a function of the sound source position, meaning azimuth and elevation. In this particular work I will focus on modeling the magnitudes, as a solution for the phases already exists. Here is a sample HRTF in the horizontal plane. On the horizontal axis we have the azimuth of the sound source and on the vertical axis we have frequency. The plot shows the magnitudes of the Fourier transform of the impulse responses on a decibel scale. We can see that at low frequencies, below 1000 Hz, HRTFs are independent of the direction of the sound source, but they are direction-dependent at higher frequencies, above 2000 to 3000 Hz, and the differences can be on the order of 30 decibels.

Why do we care? Why do we want HRTFs? Basically, we would like to have 3-D audio over headphones, and imposing an HRTF onto a non-spatial audio signal and playing back the result over headphones evokes the perception of a virtual 3-D auditory space.
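To make the definition above concrete, here is a minimal sketch of how the plotted magnitudes could be obtained from a measured head-related impulse response. This is an illustration only, assuming NumPy; the 512-point FFT size matches the numbers quoted later in the talk, but the function and variable names are hypothetical, not the project's actual pipeline.

```python
# Minimal sketch: HRTF magnitude in dB from a head-related impulse response.
# Names here are hypothetical illustrations, not the talk's actual code.
import numpy as np

def hrtf_magnitude_db(hrir, n_fft=512):
    """Magnitude of the HRTF in dB for one measured direction.

    hrir : 1-D array, the head-related impulse response.
    """
    spectrum = np.fft.rfft(hrir, n=n_fft)               # complex frequency response
    return 20.0 * np.log10(np.abs(spectrum) + 1e-12)    # dB, guarded against log(0)

# Example: a dummy 256-tap impulse response for one azimuth/elevation.
rng = np.random.default_rng(0)
hrir = rng.standard_normal(256) * np.exp(-np.arange(256) / 32.0)
mag_db = hrtf_magnitude_db(hrir)
print(mag_db.shape)  # (257,) frequency bins from DC to Nyquist
```

Stacking one such magnitude vector per direction gives exactly the azimuth-by-frequency image described above.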
In other words, having someone's HRTF allows us to control the perception of sound source localization over headphones. So there are many potential applications. It can be used in games, where we would like to know where the opponent is and how to move. Maybe event streaming or music performances, also virtual reality, or, like in the movie Minority Report, where we would like not only to see what other people see but also to hear what other people hear.

So why don't we have it? We don't have it because HRTFs are highly individual, and using HRTFs other than the user's own can significantly impair the results. If we choose an HRTF incorrectly, the perception of sound source location might also be incorrect. Also, measurement of HRTFs is an expensive process; it requires specialized equipment and an anechoic chamber. What do HRTFs depend on? They are highly dependent on human anthropometric features: on ear features like the pinna height and pinna width, on head features like head depth, head height, et cetera, and also on torso and shoulder features. What can we do? There are indications that HRTFs are correlated with human anthropometric features, so we try to select and synthesize HRTFs for a given person from a database.

In order to start working, we first had to collect data. Let's start with the audio data. Here at Microsoft Research there is an anechoic chamber. We asked participants to enter and sit in a chair in the middle of an arc. The participant's head is stabilized so that it is in the middle of the arc. We plug microphones into the participant's ears. On the arc there are 16 evenly distributed loudspeakers, and the arc moves to 25 different positions. We play specific measurement signals from which we can calculate the HRTFs. As you can see in this image, we have 16 azimuths and 25 elevations, which gives us 400 directions. Some directions we don't have, so we extrapolate them using spherical harmonics; then we have 32 elevations and 16 azimuths, which gives us 512 different directions. For all 512 directions we have HRTFs.

Once we have the audio data, I would like to extract anthropometric features, and to do that we first do the head scans. We have a different room where we ask the participant to sit on a chair and put on a swimming cap so we can capture the skull geometry. The participant sits on a rotating chair and there are two capturing units; the participant sits at an angle of approximately 90 degrees from these capturing units. Each capturing unit has two cameras, so from each unit we get a cloud of points representing the human head. As an example, we have a participant, we rotate the chair, we take pictures from different angles, and then we can align the images and create a 3-D head model of the person. There are lots of preprocessing steps, like image alignment, filling of holes and smoothing of the mesh, and I would like to thank Jeremy for his help with this. Then we get a 3-D head model of the person. Once we have this model, I would like to extract features from it. The model is represented by a cloud of points and 3-D triangles, and from this we can extract contours of the head; those are the features I extracted from the head.
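As a rough illustration of the extrapolation step, the sketch below fits spherical-harmonic coefficients to the measured 16 x 25 grid by least squares and evaluates them on the full 16 x 32 grid. The SH order, the colatitude coverage of the arc, and all names are assumptions; the actual procedure used in the project is not detailed in the talk.

```python
# Hedged sketch: least-squares spherical-harmonic extrapolation of one
# frequency bin's HRTF magnitudes from 400 measured to 512 directions.
import numpy as np
from scipy.special import sph_harm

def sh_basis(order, azimuth, colatitude):
    """Complex spherical-harmonic basis matrix, one row per direction."""
    cols = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            cols.append(sph_harm(m, n, azimuth, colatitude))
    return np.stack(cols, axis=1)

# Measured grid: 16 azimuths x 25 elevations (assumed colatitude coverage).
az, col = np.meshgrid(np.linspace(0, 2 * np.pi, 16, endpoint=False),
                      np.linspace(0.2, 2.6, 25))
rng = np.random.default_rng(1)
mags = rng.standard_normal(az.size)   # stand-in for 400 measured dB values

B = sh_basis(6, az.ravel(), col.ravel())            # 400 x 49 basis (order 6)
coeffs, *_ = np.linalg.lstsq(B, mags, rcond=None)   # least-squares SH fit

# Evaluate on the full 16 x 32 grid, including the unmeasured elevations.
az_f, col_f = np.meshgrid(np.linspace(0, 2 * np.pi, 16, endpoint=False),
                          np.linspace(0.0, np.pi, 32))
B_full = sh_basis(6, az_f.ravel(), col_f.ravel())
mags_full = (B_full @ coeffs).real.reshape(32, 16)  # 512 directions
print(mags_full.shape)
```

Repeating the fit per frequency bin fills in the missing elevations for the whole HRTF set.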
I implemented several algorithms which automatically extract these features. I'm not going to explain step by step how we do each of them; it would take too much time. I will just say that we extract features from the head, like head width and head length, features from the neck, and features related to the pinna. From the head scans we also extract features from the ear, such as features depicting the pinna height and width, among others. In addition, we also take measurements by hand from the participants. We have features like interpupillary distance, measured using, for example, a pupillometer. We use a measuring tape and other measuring devices to capture information about the shoulders, torso, head and neck.

You may notice that some features are both extracted from the head scans and measured by hand, and that is intentional. There are two reasons for this. First, in the 3-D model the neck is sometimes simply not visible, so we cannot extract the width or depth of the neck, and sometimes the boundary between head and neck is not clear. The second reason is that we need some features to scale the distances in model dimensions to real-world dimensions. Since the cameras are not fixed, the chairs are not fixed, and the participants on the chairs are rotating, we need this scaling from image pixels to real-world dimensions.

Here are some screenshots of the software I wrote over the summer to extract the anthropometric features I described, shown for two different participants. Here are some more samples. We also ask each participant to fill out a short questionnaire with 12 different questions about gender, age, race, height, weight, et cetera. Not all of the answers are guaranteed: some participants might not be willing to provide all of the details, like age or weight, so we also have other, less personal questions that are correlated with the original ones. In total we collected HRTFs from 115 people, and for 36 of those 115 people we have full measurements: head features from the head scans, ear features from the 3-D head scans, measurements taken by hand, and questionnaires. In total we have 93 anthropometric features per person.

Up to this point, I created a dataset and algorithms for anthropometric feature extraction. There are scripts for data extraction and validation of the measurements and questionnaires. There are also many converters: participants come from different regions, so some enter data in feet, some in meters, and there are converters for weight, shoe size, et cetera.

Finally, we can move on to the topic of HRTF recommendation. My first question is: is there a universal HRTF that can be one-size-fits-all? I took the head and torso simulator (HATS) from the Bruel and Kjaer company; this is a mannequin with a removable artificial mouth and pinna. They built this mannequin with an average human head size, so it's supposed to be suitable for kids and adults, for females and males. I wanted to see how far the HRTF of this HATS model is from people's HRTFs. I'm using the log spectral distortion, the most commonly used distance in the literature, and I compare the log spectral distortion between a person's own HRTF and the HRTF from the HATS model.
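Since log spectral distortion (LSD) is the distance used throughout the rest of the talk, here is a sketch of one common textbook variant of it, assuming magnitude spectra stored as NumPy arrays; the exact frequency range and averaging used in the study are not specified in the talk.

```python
# Minimal sketch of log spectral distortion between two magnitude responses.
import numpy as np

def lsd(mag_a, mag_b):
    """LSD in dB between two sets of linear magnitude responses.

    mag_a, mag_b : arrays of the same shape, e.g. (directions, freq_bins).
    Per direction: RMS over frequency of the dB difference; the result is
    then averaged over all directions.
    """
    diff_db = 20.0 * np.log10((mag_a + 1e-12) / (mag_b + 1e-12))
    return np.sqrt(np.mean(diff_db ** 2, axis=-1)).mean()

# Example with dummy data: 512 directions x 257 frequency bins.
rng = np.random.default_rng(2)
own = np.abs(rng.standard_normal((512, 257))) + 0.1
hats = np.abs(rng.standard_normal((512, 257))) + 0.1
print(f"LSD: {lsd(own, hats):.2f} dB")
```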
And I would just like to mention that the perceptual meaning of log spectral distortion is not clear. Here are the results. We have results for the straight-ahead direction, when the sound comes from in front of the person, and for all directions around the person, that is, all 512 HRTFs. We also created the perfect and the worst classifiers: in the perfect classifier I don't look at the anthropometric features at all; I always choose the closest HRTF in log spectral distortion. Such a classifier doesn't exist in practice; it just shows the range of results we can obtain. We see that the HATS model gives results very close to the worst classifier we can create. The conclusion is that the HATS model is not suitable and we cannot create one universal HRTF for everyone.

If we cannot create one universal HRTF, let's try to select one from our database. That's our goal in this part: to identify the best HRTF for a given person from our database. There are, however, two problems: first, to select the best HRTF we need an HRTF distance; and second, as I already said, the perceptual meaning of LSD is not clear. So the idea was: let's do user studies and learn from them how people rank HRTFs, so we can find the correlation between people's anthropometric features and their personal HRTF rankings.

So we designed user studies. Here's one of our participants. We provide them a laptop with headphones; the headphones have a head tracker. We designed the experiment as HRTF comparisons. We asked people to compare two HRTFs at a time. The participant can switch as many times as he wants between A and B and then has to state his preference, from strong preference for A through slight preference to strong preference for the other one. It is a slider, so he can put it wherever he wants. In the training phase we showed participants 12 different pairs of stimuli covering the range of HRTFs, and in the testing phase 156 pairs to compare. I would just like to note that we used [indiscernible] speech for this listening experiment.

>>: [indiscernible] better based on what it's supposed to be in a given direction or just better for a particular reason?

>> Piotr Bilinski: They are supposed to decide whether the sound is coming from the screen, so they are evaluating, say, the straight-ahead direction.

>>: We intentionally didn't tell them what properties to look for, because the priorities that people set on different properties might be individual, so we just asked them to give their general impression of whether one sounds better than the other.

>>: They are doing this in order to measure some kind of spatial difference?

>>: Yeah. Ambience or whatever, and they can just say, oh, it sounds nice [indiscernible] and they give a judgment including the final [indiscernible]

>>: [indiscernible] true, and also it would be hard to maintain consistency because people might use different properties to make their decisions, so we discussed many options, and for each option we saw there were arguments supporting it and arguments against it, so eventually we just had to go for one and [indiscernible] whatever choice.

>>: Is there an assumption that if you use the wrong HRTF it would just sound bad regardless of directionality?

>>: Yes, yes. The two main perceptual dimensions are the localization and the timbre, and they are not necessarily independent.
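A hedged sketch of what the perfect and worst reference classifiers amount to: with the listener's own HRTF known, simply pick the database entries with the smallest and the largest LSD. The database layout is an assumption, and lsd() is the variant sketched earlier.

```python
# Sketch of the "perfect" and "worst" reference classifiers from the talk.
import numpy as np

def lsd(mag_a, mag_b):
    diff_db = 20.0 * np.log10((mag_a + 1e-12) / (mag_b + 1e-12))
    return np.sqrt(np.mean(diff_db ** 2, axis=-1)).mean()

def perfect_and_worst(own_hrtf, database):
    """database: list of (person_id, magnitude array) pairs.

    Returns ((best_lsd, best_id), (worst_lsd, worst_id)); the span between
    them is the range any realistic recommender must fall into.
    """
    scored = sorted((lsd(own_hrtf, mags), pid) for pid, mags in database)
    return scored[0], scored[-1]

rng = np.random.default_rng(3)
db = [(i, np.abs(rng.standard_normal((512, 257))) + 0.1) for i in range(115)]
own = np.abs(rng.standard_normal((512, 257))) + 0.1
best, worst = perfect_and_worst(own, db)
print(best, worst)
```

Of course, the perfect classifier is an oracle: it requires the very HRTF we are trying to avoid measuring, which is why it serves only as a bound.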
One can influence the other. Hearing the sound source from the right direction doesn't mean that it's a good set of HRTFs, because the timbre might be too dull or too bright. We wanted to give people the freedom to input their priorities as they wanted.

>> Piotr Bilinski: Yes, so it's up to the people to decide what they prefer and how they judge. So how do we choose the stimuli for the experiment? We take all of the available HRTFs and [indiscernible] after them, and we cluster them: for training into three groups and for testing into 12 different groups, and for each cluster we select a representative person. Obviously, the people selected for training and testing are different. So for training we select three HRTFs and for testing 12 HRTFs, and then we ask each participant to compare the full matrix of selected HRTFs, with each pair of HRTFs compared twice. So with, for example, 12 selected HRTFs, we have 12 x 12 + 12 for the [indiscernible], which gives us 156 comparisons.

So how do we select representative HRTFs that give us the range of different HRTFs? For this we use log spectral distortion. I said the perceptual meaning of log spectral distortion is not clear, but it definitely contains some true information: if the distance is large, the HRTFs should sound different, and if the distance is small, they should sound similar. Here we want to select representative HRTFs which somehow cover the full range of HRTFs, and that's why we believe that using LSD for the selection of the representative HRTFs is justified. We can arrange this distance in this form, and then we can apply a clustering algorithm like K-means, and that's what we actually do.

We had 23 people who participated in our experiment. For every participant and every pair of stimuli we calculated the difference between the responses, because every participant has to respond twice to the same pair of HRTFs. The responses lie on a scale from -2 to 2, so if a participant first strongly prefers A and later strongly prefers B, that gives the maximum difference between his responses. This lets us see how consistent each participant's responses are. Zero means they replied exactly the same for the same pair of stimuli; four means they replied in entirely opposite directions. We can see that there is some inconsistency within participants' responses. It might be that they were tired. It might be that their brain wanted to hear a difference when there was actually no difference, because in this experiment they were listening for differences. It also might be that people participated with different levels of engagement; one participant actually fell asleep. We asked him what the reason was: it was not the comfortable chair and it was not the experiment, but the fact that he had been at a party until 4 A.M. So participants took part with different levels of engagement, and that's why there is some inconsistency in the data. Some participants spent about one hour and 20 minutes on this experiment, and others only 20 minutes.

We also plotted the same results in a different form, showing the representative, that is, the selected, HRTFs. This shows how many times one HRTF is preferred over the other HRTFs, so the range is from 0 to 24, and we can see that some HRTFs are strongly preferred by a given person and some are not.
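The selection of representatives could look like the sketch below. Because LSD is, up to a constant factor, the Euclidean distance between log-magnitude spectra, running K-means on flattened log-magnitude vectors clusters in LSD space; choosing the member nearest each centroid then yields one representative per cluster. The data layout is a stand-in, and k = 12 matches the testing phase described above.

```python
# Sketch: select k representative HRTFs by K-means in (scaled) LSD space.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Stand-in for flattened log-magnitude spectra, one row per person.
log_mags = rng.standard_normal((115, 2048))

k = 12                                  # 12 representatives for the testing phase
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(log_mags)

# Representative of each cluster = the real HRTF closest to its centroid.
reps = []
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(log_mags[members] - km.cluster_centers_[c], axis=1)
    reps.append(int(members[np.argmin(dists)]))
print(sorted(reps))
```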
This leads us to the conclusion that this representation provides better data for the analysis, and we can create a ranking from it. That was the idea: let's learn a person's HRTF ranking from the user studies. I have already mentioned ranking several times, and some of you probably thought about search engines like Bing and Google; indeed it is a similar kind of ranking, and that's how we treated this problem, as a learning-to-rank task.

How does it work? We have 23 people, and each person is described by 93 anthropometric features. For each person I indirectly ranked the 12 representative HRTFs, and each HRTF is represented by 29 Mel-frequency cepstral coefficients. For the ranking we use boosted decision trees. What I did was create a ranking formula, a way to create a ranking from our experiment. In short, the idea is to assign high rank values rarely and low values more often, so that high values go only to the best HRTFs, and only to a few of them. To evaluate our results we follow the metrics from the learning-to-rank domain and use the normalized discounted cumulative gain (NDCG) metric, which ranges from zero to one, here treated like a classification problem, and here are the results. However, I believe that for this audience log spectral distortion would be more appropriate, so we also evaluated the results with it. We can see that the results are better than the HATS model, which was 13.77, and we can also compare them to the perfect and the worst classifiers. However, I believe a much better way to evaluate this technique would be another user study, and that's what I will wish for later. This shows that it is already better [indiscernible] than the HATS model. This technique also gives us information about which features are more important for this task and which features are entirely uncorrelated; there are features related to the head width and the pinna. Excuse me.

Now I will talk about synthesis, where we try to generate the HRTF for a person. We propose two approaches: one based on sparse representation and the other on neural networks. In the sparse representation approach the goal is the synthesis of the HRTF using anthropometric features, and the idea is to model a person's anthropometric features as a sparse linear combination of anthropometric features from other people, and to assume that the person's HRTFs are in the same relation as the anthropometric features. We have a full range of people in our database and we would like to synthesize the HRTF for this girl. We'd like to combine a few people and say that her anthropometric features are a combination of the anthropometric features of these people, and ideally we would like to use only one person, the closest person, or maybe two or three people, and not use anyone else. So we would like to create a sparse representation. That's the idea: learn a sparse vector alpha from the anthropometric features and apply it to the HRTFs. Here is the problem definition. It is a minimization problem where we minimize the sum of the squared error over all anthropometric features, with an added regularizer, and we solve this minimization problem using the lasso technique. I would just like to note that we learn the sparse representation over people, so we are selecting people and not features, as is usually done.
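A sketch of this formulation: with y the new person's 93 anthropometric features and X a matrix whose columns are the training people's features, solve min over alpha of ||y - X alpha||^2 + lambda * ||alpha||_1 with the lasso, then apply the same sparse weights alpha to the training people's HRTFs. The regularization weight, the dummy data and all names are assumptions.

```python
# Hedged sketch of the sparse-representation HRTF synthesis.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n_people, n_feats = 35, 93                          # e.g. people with full measurements
X = rng.standard_normal((n_feats, n_people))        # columns: training people's features
hrtfs = rng.standard_normal((n_people, 512, 257))   # their log-magnitude HRTFs

# Synthetic "new person": a genuinely sparse mixture of two training people.
true_alpha = np.zeros(n_people)
true_alpha[0], true_alpha[-1] = 0.7, 0.3
y = X @ true_alpha + 0.01 * rng.standard_normal(n_feats)

# Lasso over *people* (not features): min ||y - X a||^2 + lambda ||a||_1.
model = Lasso(alpha=0.05, fit_intercept=False, max_iter=50_000).fit(X, y)
alpha = model.coef_                                 # sparse weights over people
print("non-zero people:", np.flatnonzero(alpha))

# Apply the same sparse combination to the HRTFs.
synth_hrtf = np.tensordot(alpha, hrtfs, axes=1)     # shape (512, 257)
print(synth_hrtf.shape)
```

The key design choice, as noted in the talk, is that sparsity is imposed over people, so the synthesized HRTF is a blend of only a few real, measured HRTFs.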
Once we learned these weights from the anthropometric data, we applied them to the HRTFs. We again computed the results in the log spectral distortion distance, and you can see that the results are now much better than the other techniques, and are actually very close to the perfect classifier we created. Again, you can see the distance for the straight-ahead direction, for the left and the right separately, and for all directions together; in all cases the results are very close to the perfect classifier.

So that was the first technique, and we also have another one based on neural networks. The idea is the same, to synthesize an HRTF using anthropometric features, but here we try to map anthropometric features directly to HRTFs using neural networks. I was using radial basis function neural networks, which contain an input layer, a hidden layer and an output layer, with radial basis functions in the hidden layer. After this mapping we get these results, which are actually even better than the sparse representation; they are also very close to the perfect classifier.

Let's compare them. We already have several techniques, so let's see which one is the best. Here are all the techniques we created: the perfect classifier as a reference, the sparse representation, the neural networks, and learning to rank. We also tried ridge regression, which I haven't mentioned before; it is like the sparse representation but without the sparsity constraint. And we have the HATS model and the worst classifier. Using log spectral distortion, we see that the sparse representation is mostly preferred, especially in the frequencies which matter most, from 0.5 to 8 kilohertz. The neural networks also perform very well, and these two techniques are very close to each other. We also believe that the performance of the sparse representation can be further improved with feature selection.

We also evaluated the results using user studies. We ran a small user study with seven participants and asked them to compare their own HRTFs with the selected and synthesized ones. The procedure is similar to before; as a distraction we also present other people's HRTFs. And here are the results. You can see five different techniques: learning to rank, sparse representation, ridge regression, neural networks and HATS. On the left, -2 means that the person strongly prefers his own personal HRTF, and towards the right means the person prefers the synthesized HRTF. Obviously, we would like to have everything from 0 to the right. We see that the technique giving the worst results is HATS, and the second worst is ridge regression, which is quite natural. So let's remove these two techniques for a second and analyze the other techniques: learning to rank, sparse representation and neural networks. From this plot we can see that the neural networks actually work the best. Also, one person first said that he preferred his own HRTF, but when he compared the same pair a second time, he said that he strongly preferred the generated one. So actually, I would say the neural networks are entirely on the positive side.
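A minimal sketch of such an RBF network under stated assumptions: Gaussian basis functions centered on the training people, with the linear output layer solved in closed form by ridge-regularized least squares. The talk does not specify the centers, kernel width, or training procedure, so those choices here are illustrative.

```python
# Sketch: radial-basis-function network from anthropometric features to HRTFs.
import numpy as np

def rbf_design(X, centers, width):
    """Gaussian RBF activations; one row per sample, one column per center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

rng = np.random.default_rng(6)
X_train = rng.standard_normal((35, 93))            # anthropometric features
Y_train = rng.standard_normal((35, 512 * 257))     # flattened HRTF magnitudes (dB)

centers = X_train                                  # one RBF per training person
width = np.median(np.linalg.norm(X_train[:, None] - X_train[None, :], axis=-1))

Phi = rbf_design(X_train, centers, width)          # hidden-layer activations
# Output-layer weights by ridge-regularized least squares (closed form).
lam = 1e-3
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(centers)), Phi.T @ Y_train)

x_new = rng.standard_normal((1, 93))               # a new person's features
hrtf_pred = (rbf_design(x_new, centers, width) @ W).reshape(512, 257)
print(hrtf_pred.shape)
```

Note that one network predicts the full set of directions at once, which matches the later Q&A point that both ears and all directions come out of a single model.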
For the learning to rank, that's the blue color, one person said first "I slightly prefer my own" and then "I slightly prefer the generated one." A second participant said "I don't see a difference" and then "I slightly prefer my own." For the sparse representation, two people said that they preferred their own. One was consistent: he twice said that he preferred his own HRTF. The other said he slightly preferred his own and later said he slightly preferred the generated one, so again there is inconsistency in the results. We should probably run some more user studies to try to understand this, but what we can see here is that the neural networks were the best, and all of the techniques are actually quite close to each other and work relatively well.

>>: It looks like the sparse representation creates almost exactly the same HRTF, which is more precise, while the neural networks create one better than their own, because the mass of responses has shifted towards the right.

>>: There's one thing to keep in mind. The 0 means the subject has no preference. It doesn't mean they sound identical. They can both sound good, or both sound bad, or both sound bad in a different way.

>>: Yeah, but the reference is always their own.

>>: Yeah, but no preference doesn't mean they don't hear a difference. It just says that they both sound equally pleasant, one maybe in terms of [indiscernible] and the other with a more natural timbre. That is also included in the 0.

>> Piotr Bilinski: But making the assumption that your own is relatively good, this shows they both sound relatively good.

>>: We don't have an explanation yet for why we can't create an HRTF that sounds better than people's own, assuming that their own are the best ones and that this is what their auditory system is calibrated to, but I'm sure that Piotr will mention this point in future work.

>>: And the other thing is that you can't measure HRTFs such that they are directly useful for auralization; you always have to do some sort of compensation or equalization of microphones and loudspeakers and all these things, and we haven't managed to find an automatic way that spits out the perfect equalizer or calibrated HRTF. So there is some manual tuning involved, and you can never say that what you are doing is the truth. So it can happen that there is some flaw in our equalization and that the synthesis algorithm happens to correct for it.

>>: It's also based on initial impressions. It doesn't take into account listener fatigue. In the same way, a lot of people might turn up the contrast on the TV straight away, but after a while they realize that it starts to look unnatural, and the same thing can happen here. Sometimes things that boom in your face sound more impressive, but not necessarily more realistic; it might get irritating after a while.

>> Piotr Bilinski: That's actually what was happening. A lot of people first decided that they preferred some HRTF, and then after some time they said, maybe for a short while it's very nice, I would like to have it, but their second decision went against it.

>>: And they sometimes prefer one: we repeatedly present the same pair of HRTFs, and sometimes they say I prefer this one and sometimes they prefer the other one. That suggests that either they are roughly equally good or users are unreliable, maybe because their priorities shift.
We can't really say what is going on, so this is all subject to future work. There are certainly indications that subjects were overwhelmed, especially with comparing those 156 pairs of HRTFs. It's so fatiguing and takes such a long time that even I caught myself, and I noticed that my priorities shifted: sometimes I put a higher weight on the externalization and sometimes more on the timbre. So it's something we'll definitely be working on.

>> Piotr Bilinski: That is definitely true. There's still a lot that should be investigated. Now, the conclusion. We created a new dataset with HRTFs and anthropometric measurements. Over the summer we created algorithms for anthropometric feature extraction. We created four different techniques for HRTF personalization and recommendation. We evaluated our techniques using both log spectral distortion and user studies, and the results are encouraging. Based on the results, the best technique is the sparse representation when using the log spectral distance, and the neural networks based on the user studies. As future work, we should definitely collect more data and run more extensive user studies to assess the proposed techniques. We should collect more data to cover a wider range of people: more females, more kids, more elderly people. For the sparse-representation-based approach it would be very nice to add feature selection, so we can find useful and easier-to-measure features that give good results and remove all the useless features. For the learning to rank, maybe it's also a good idea to learn the ranking of HRTFs from the LSD distance, and we can definitely also try the direction of matrix factorization for the recommendation of HRTFs. Thank you for your attention. Are there any questions?

[applause]

>> Ivan Tashev: Are there any other questions by chance?

>>: It seems like you are generating these things [indiscernible] put in some features and it spits out an HRTF, right? That is the synthesis goal. But it seems like these things are only really evaluated by people in pairs. My understanding of your classifier is that there's nothing that says the features for the left ear here and the features for the right ear there actually have to agree in any way, so your HRTFs could be sort of anthropometrically inconsistent. Can you ever, for example, mess up one person, give them their real left ear and the wrong right ear, and see what happens? I mean, does that matter? [laughter]

>> Piotr Bilinski: Actually, what we are doing is the synthesis of both ears at the same time.

>>: From one classifier?

>> Piotr Bilinski: Yes.

>>: You put [indiscernible]

>> Piotr Bilinski: They would be equally bad or equally good.

>>: Both features go in, and two ears come out?

>> Piotr Bilinski: Yes. All the anthropometric features go in, and it says this is the HRTF that you should use. It generates both the left and the right ear.

>>: [indiscernible]

>> Ivan Tashev: How many numbers is one HRTF?

>> Piotr Bilinski: How many numbers is one HRTF?

>>: Yes.

>> Piotr Bilinski: We have 512 directions, and for each direction we have 512 values.

>> Ivan Tashev: So it's a substantial amount. So you have a neural network with 96 inputs and you managed to synthesize 500 x 500, a quarter of a million points.

>>: [indiscernible]

>>: I mean, at the end of the day the subspace is a lot smaller than that.
[indiscernible] the neural network has [indiscernible]

>> Piotr Bilinski: They are basically very close.

>>: Given [indiscernible] generate [indiscernible]

>>: No. Actually, no. You can generate in one direction and apply the weights to all of the other directions.

>> Piotr Bilinski: I mean, for example, when we are using one technique, say the sparse representation, we learn the weights and we don't care which direction it is; we apply the same weights to the HRTFs for all directions. And the same for the neural networks; the idea is that we treat all the directions in the same spirit.

>>: [indiscernible] all the measurement locations, not only just [indiscernible] directions [indiscernible]

>> Piotr Bilinski: Yes. All the directions, not only one, because to evaluate we need all of the directions.

>>: [indiscernible] directions not future [indiscernible]

>> Ivan Tashev: It's just a ballpark number.

>>: Yeah, I know. But I imagine two classifiers: one where you say, these are the anthropometric features and this is the azimuth and elevation, give me the HRTF; and another where you say, give me all of the HRTFs for these anthropometric features.

>> Piotr Bilinski: All of the HRTFs.

>>: So there are fewer parameters to learn in the first case than in the second case. In one case there are only 512; in the other it's 512 squared.

>> Ivan Tashev: Yes.

>> Piotr Bilinski: Basically, we could also just create 512 separate classifiers for the generating algorithms.

>> Ivan Tashev: [indiscernible] are completely independent. They are related [indiscernible]

>>: But if you use them as features then you still have the same network sharing all of the weights, and it could learn the relationship between the features.

>> Piotr Bilinski: They are not exactly the same. That's why we learn them separately.

>> Ivan Tashev: More questions? If not, let's thank our speaker. Thank you, Piotr.

>> Piotr Bilinski: Thank you.

[applause]