>> Zhengyou Zhang: Okay. So let's get started. Welcome to the [indiscernible] public seminar. I am Zhengyou Zhang, [indiscernible] the multimedia, interaction and communication group. This is part of our group's public seminar series. It's a tradition we started a while back, and the idea is to give internal talks before we present to the external world, so we think this is a good chance for us to practice as well. Today we have four mini talks. Those talks will be presented in about two weeks at the [indiscernible] international workshop on multimedia [indiscernible] processing. Dinei will give the first one, Sanjeev will give the next two talks, and I will give the last one. So Dinei, just go.

>> Dinei Florencio: All right. Thanks for the introduction. I'll be presenting our work on crowdsourcing for determining the region of interest in video, and this is joint work with Flavio Ribeiro at the University of Sao Paulo.

So first, crowdsourcing has attracted a lot of attention, but what exactly is crowdsourcing? It's essentially the idea of using a crowd, a large number of people, to achieve some purpose. And that purpose could be, like, funding, or it could be like [indiscernible] some knowledge. Crowdsourcing has been there for a while, but with the advent of the internet, it's much easier to achieve and to reach a much higher number of people. Very successful examples of that are wikis, like Wikipedia, which is a huge phenomenon, and crowdfunding, where you appeal to the community and say we want to fund some specific project, and then you collect a small amount from a large number of people.

We're more interested in particular in what we want to call human computation, which is the idea of using humans to do processing tasks which are not easy for computers to do, and in particular, trying to do that with crowdsourcing. So what are the benefits of actually doing that? First, the scalability. You can get very large crowds, and recruit and dismiss them very quickly. So if you need, say, a thousand people for five minutes each, it would essentially be a huge operation to do this in real life. If you do that through crowdsourcing, using the web, you can get the thousand people and dismiss them in five minutes very easily. Typically, the cost of that ends up below market wages, in particular because of this ease of recruiting and dismissing. And you can recruit a very diverse work force. So for a number of reasons it's a very simple, scalable, cost-effective way of getting people to work on particular tasks for you.

The most widespread, most common platform for crowdsourcing is the Amazon Mechanical Turk, which was launched in 2005. It has about half a million people registered, and at any time there are like 50 to 100 thousand tasks offered out there, available for people to take and do. The typical human intelligence tasks, or HITs as they call them, are typically very simple tasks, presented in a page, and require typically one or two minutes to perform. And typically, you're going to be paying like five to 25 cents for that task.
So Microsoft does now have an internal offering in crowdsourcing, which is what we call the Universal Human Relevance System, which we've been using for internal things. We're actually using that for a lot of work related to search and to some other things. But that's a recent offering from Microsoft and it's being mostly used internally right now.

So we have done some previous work on crowdsourcing, and that was essentially the idea of using crowdsourcing to do MOS studies. When doing mean opinion scores, like subjective testing of audio and other things, we need to ask people. Typically, you bring people to the lab, do some experiment, ask their opinions and so on. So in [indiscernible] previous work, one applied to audio and speech, the other applied to images, we essentially tried to reproduce that without the controlled environment of the lab. So essentially just recruit workers from the Mechanical Turk, have them rate the files the same way that someone would rate them at the lab, except that you do not have the control: you do not know if they are at the right distance from the screen, if they have the correct headphones and so on. And then find ways to filter that process and get the results you would get from more controlled studies. The results were very encouraging. We can essentially get pretty much the same quality that you would get in a lab study, and much, much faster and at lower cost.

However, that's a very batch sort of processing. The idea was: there's a study in the lab, and I ask you to do that study using crowdsourcing. But it's like a task, and the end of the task is the end, which is a number. Our vision was more tightly integrated. It was: can we use human computation as a processing block? So you have some algorithm, and there's some particular piece of that algorithm which is hard for the computer to do. Can we then use crowdsourcing for that? For that, the delay has to be much smaller, and the task has to be one that fits into this process, like a processing sort of task.

So in order to try to go in [indiscernible], we essentially have this particular work, which is to try to find the region of interest in video. So when you have a video codec -- we haven't done that part, so that's in future work here -- if you have a video codec where you're going to use a region of interest, you're going to take the parts of the video which are more relevant, the parts of the video where people actually look, and put more bits there as opposed to the rest of the video. Then you need to know which parts of the video are those parts of interest. The problem is that doing that by automatic means is very hard, and you're going to see a few examples later. It's very hard to figure out what's important or not in a piece of video.

So one way of doing that is asking people. If you ask people to actually watch a movie, and that's the traditional way it's done, you get people, put the movie in front of them, put on an eye tracker, and you see where they're looking. You say people are looking here, so that's the place most people are looking at, so that's where we have to put the bits.
However, that requires the same thing: bringing people into the lab and then having them with an eye tracker, which costs like $20,000 or so, and so on.

>>: So for that application, does the screen have to be big enough? Because if it's a small screen, can you figure out where I'm looking? It probably doesn't matter.

>> Dinei Florencio: Yes and no. So the bigger the angle of view, the more your human visual system's resolution will decay across it, so yes, there would be more difference. If what you're looking at is very small, then probably the resolution of your eye all fits in the fovea and there isn't much to do. Yes. But typically, for HDTV kinds of scenarios and even for most PC viewing, there's actually almost a 30 degree field of view, which is very significant.

So essentially what we wanted was: can you do that for a video? Say you have a YouTube video, for example. Somebody uploads a video, and then you want to optimize the coding for that thing. So I got that video; I can, for example, crowdsource this thing, get what people are looking at, and then insert that into your video coding and recode that video with that particular result.

So let me do an experiment. The problem is, okay, I need to know what people are looking at. And the problem is, in the lab we use an eye tracker, but people at home don't have an eye tracker and I cannot provide them with an eye tracker. So how do I make that work? How do I make you, at home, figure out where you're looking and how do I force you to tell me? So essentially, we say okay, we could ask people to point with their mouse, like "I'm looking here," right? But then how do I force you to actually do that and how do I guarantee that you do?

So essentially what we tried to do was this. If this is a video, and I want to know where you're looking, we can do an experiment, and probably the people in the back, exactly for the reasons Sanjeev was mentioning, are not going to get the same impression. But what I want you to do is to look at this particular eye of the squirrel, right? So look at that particular eye, and don't look anywhere else. Look at the right eye of the squirrel. If you look there and I do this, what happened is the image changed. You can see that the image changed, but it didn't hurt your eyes. Right? Now, let's do the same thing and look at this other squirrel's eye, and keep looking at only that squirrel's eye, and I'm going to toggle between the same two slides that I did before. So if you're looking at that, it's like you don't want to look there, right?

So essentially, the idea of what we're doing here is we are sort of simulating the visual blur that's typical of the human visual system, which has this gradual blur. Wherever you're looking, you have full resolution, and the resolution sort of diminishes away from that. So if we simulate the same response with a filter, so that I sort of blur the image more as you go away from the point you're looking at, and you look at the right point, then it doesn't make that much difference. However, if you're trying to look somewhere else, then it actually bothers you. And then by bringing the mouse to that position, you would actually be able to look at that and it wouldn't sort of hurt your eyes. So it's a very intuitive thing.
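As a rough illustration of the gaze-contingent blur just described, here is a minimal sketch in Python. It assumes an H x W x 3 frame and a single fixation point, and it blends one pre-blurred copy of the frame with the original using a weight that decays with distance from the fixation point, which anticipates the realtime approximation described next; the blur kernel size, the exponential falloff, and the `radius` parameter are illustrative choices, not values from the actual experiment.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def foveate(frame, fx, fy, radius=80.0):
    """Gaze-contingent blur around a fixation point (fx, fy).

    Instead of a truly spatially varying blur, blend one pre-blurred copy
    of the frame with the original, weighted by distance from the fixation
    point. A sketch only; parameters are assumptions for illustration.
    """
    img = frame.astype(np.float32)                   # H x W x 3 frame
    blurred = uniform_filter(img, size=(10, 10, 1))  # fixed ~10x10 box blur

    h, w = img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(xx - fx, yy - fy)

    # alpha ~ 1 (sharp) near the fixation point, decaying toward 0 (blurred).
    alpha = np.exp(-np.maximum(dist - radius, 0.0) / radius)[..., None]

    out = alpha * img + (1.0 - alpha) * blurred
    return out.astype(frame.dtype)
```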
So the problem is, if we could do that in realtime, that would be very nice; you could just move the mouse. Except, for the constraints that we actually have for the system, which essentially means you have to play that in a Flash player, it doesn't work, because I would have to do this filtering in realtime. So what we actually do is an approximation of that, which is actually what we saw in the previous slide: we compute the unblurred original image and a blurred image, blurred with a 10x10 box, and then we use an exponential blend between those two images. So we're not progressively blurring more; we're essentially doing alpha blending between the blurred and the unblurred image. And we can actually do that in realtime as long as the video is not too big. So for 640x360 at 24 frames per second, you can get that on a 2.5 gigahertz machine. We actually apply that, and we measure the frame rate and the frame drops on the users' machines. It turns out that about 10% of the workers experienced some frame drops. There's a limit where we actually say, if you have too many frame drops, just stop; your computer is not fast enough, we can't do this task. But even one single drop -- and that happened to about one user, as I recall; otherwise, less than 10% of users got any frame drops at all.

So we essentially paid like 25 cents to do a HIT, which turns out to be like five dollars per hour. And we have to filter those results. First, we have to filter for random results, someone just doing junk. And we have to filter for distractions: a lot of times, you start looking at something else and just don't move the mouse, or you look somewhere else. That also happens with the eye tracker. On our eye tracker experiment, the video is not using the full screen, and sometimes people will look outside the LCD display, which is actually expected.

So we also looked at how good the results we get actually are. How quickly can people follow the point of interest with the mouse? So we did an experiment, which is essentially this: we asked them to track that ball, and then we moved the ball to random places and see how long they take to get there. What it's trying to do is estimate the delay: after you move your eye to a point of interest, how long does it take for you to actually move the mouse there? In this case, there's no question; we know there's only one thing to look at on the screen, so you know what you are supposed to do.

So these are the results that we get. You can see here that the black line is actually the ground truth. People essentially take an almost reasonably constant delay. Some people overshoot, some people undershoot the motion to the correct place, right. And when I average them and compensate for the average delay, so you shift by the average delay and then you average out the traces, it turns out that the accuracy is actually very impressive. We can see that the number of people who overshoot and undershoot is probably about the same, so when you average, you get very good results.

>>: Just a quick question. So the ground truth result is that red ball?

>> Dinei Florencio: Yes.

>>: [inaudible] mouse to do it?
>> Dinei Florencio: Yes. So when the circle moves here, you're going to have to move the mouse quickly. So you're trying to move the mouse as quickly as possible to the point you're looking at. And some people will --

>>: The eyes are quicker than the mouse.

>> Dinei Florencio: Yes, yes. So essentially, this is -- yeah. There are experiments that show that we spend about 100 to 150 milliseconds to actually look at some other point. So when you have something -- so you see something moving here and then your eyes sort of try to track that, and it takes about 100 to 150 milliseconds. So it turns out that what's going on here is that your eye first sees that ball, and then you move the mouse, right. So you're taking another, like, 300 to 400 milliseconds extra to actually move the mouse there.

>>: [inaudible].

>> Dinei Florencio: This one? So that one is about like 15 subjects.

>>: And the same with [inaudible].

>> Dinei Florencio: No, but the screen, like, you don't want to have your finger on it. Then it would require people to have a touch screen at home, right, which is not always the case.

Another experiment we wanted to do was like, okay, that was sort of different in the sense that I moved the ball to a random place and then you have to track it. But, you know, in a movie, a lot of times things are moving. So can you actually track the thing when it's moving? So this is the experiment we did. The ball now is just moving at constant speed and then changing directions. And I would say, okay, how can you compensate, because now you know the ball is going to move in that direction, can you actually compensate, right? And in this experiment, as you can see here, people did much better. So first, the average delay is actually 70 milliseconds [indiscernible] there's no [indiscernible] delay. And again, when you average the results, you get very, very good results.

Okay. So those two are like toy examples. Can we get a better example? Which might still be a toy example, but more useful, more typical of the scenarios we're looking at. So we did another experiment with a movie trailer, the Ice Age 3 trailer. It's a two-and-a-half-minute trailer, and we did two experiments. We did the same experiment that I was describing before with [indiscernible] on the web, where people actually track with the mouse, and we had 40 workers running that. And then we got like 12 volunteers and actually brought them into the lab with an eye tracker, and some of the volunteers are actually here. So for comparison, bringing the people to the lab and so on in this particular case took about four hours of total person time, and the Turk experiment cost about ten dollars. So thanks to the people from the usability lab for the help with this.

So here's the thing. What I'm going to show: I'm going to try to synchronize the two. The upper one will have the Mechanical Turk results and the bottom one has the eye tracking results. The balls, each of these balls, represent where someone is actually looking. Okay. So I think most of you have actually seen the trailer, so I'm going to stop the trailer. But I wanted to show one thing here. Actually, I will go back to that in a sec. So let me show what else is interesting here. So this is one particular frame, except I'm doing a different plot of the circles.
So you can actually see what people are looking at more easily. In this particular frame, as you saw from the story, [indiscernible] it's like they're fighting for the nut, right? And because of the story line, the story line starts with him and then switches to what they're fighting for there. And in this particular frame, there's still a change between them, and some people are going to be looking here, some people are going to be looking at the other squirrel, and some people are going to be looking at the nut.

And essentially -- so when people try to do region-of-interest estimation automatically with a computer, essentially you're looking at the frame, right? You cannot look at the story. So you're never going to be able to figure out exactly where the focus of the story is shifting, right. And that's actually important for two reasons. First, we can refine and use this as an upper layer on something -- okay, there's something here which is interesting -- and for what exactly is interesting in that region, you can use some computer vision, some salience estimation, to figure out what exactly people are looking at.

So these are some results on the X and Y position, on exactly that clip, using the Mechanical Turk and using the eye tracker, and you can see a number of things. The first one is that the Mechanical Turk results seem cleaner. If you look at this and you look at the noise in each of these, you're going to conclude that the eye tracker has more noise. This actually has two reasons. First, the eye tracker does have more noise. Second, moving your mouse is harder than moving your eyes; moving your eyes is much more natural.

And that actually brings me to one point I wanted to show in that video, which is that the only place where there actually is a significant difference is -- hold on. So if you look at this particular frame, and I don't want to spend the time on the other one, the story tells that the squirrel is looking at something, and then you can see that the people on the eye tracker try to see what the squirrel is looking at. But they look, and there is nothing, right? So it turns out that people who point with the mouse, when they look at something with their eyes, they keep looking at different places; but when you look at something and there's nothing there, you don't move the mouse there. So that was the one place that is significantly different, which is not necessarily bad, but essentially, when you look at something, then you look and there's nothing, you don't move the mouse there. So that was one of the conclusions.

So essentially, we can do the region of interest determination at reasonably low cost and very quickly, and we don't have to assemble and disassemble everybody. The quality overall is good, and there are some differences, as I just pointed out, particularly in these instinctive saccades, when you look at something and come back, a lot of times interactions and stuff. And there's a small delay in moving the mouse. And as I said before, it can be combined with the salience modeling. So that's essentially the thing. And if you have any questions.
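Before the questions, here is a minimal sketch of the kind of aggregation described above: each worker's mouse trace is shifted by an estimated response delay and then combined across workers. The uniform sampling at the video frame rate, the single per-worker delay estimate, and the use of a median are assumptions for illustration, not the exact pipeline used in the study.

```python
import numpy as np

def aggregate_traces(traces, delays, fps=24):
    """Combine per-worker mouse traces into one gaze estimate per frame.

    traces: list of (T, 2) arrays of mouse positions, one sample per frame.
    delays: per-worker response delay in seconds (e.g., estimated from a
            calibration clip such as the moving-ball test).
    Returns a (T, 2) array of aggregated positions.
    """
    shifted = []
    for trace, delay in zip(traces, delays):
        shift = int(round(delay * fps))
        if shift > 0:
            # Advance the trace so sample t approximates where the worker
            # was looking at frame t (compensating for the mouse lag).
            comp = np.roll(trace, -shift, axis=0)
            comp[-shift:] = trace[-1]      # hold the last observed position
        else:
            comp = trace.copy()
        shifted.append(comp)

    # A median across workers is a simple way to resist stray traces; the
    # study also filters out random or distracted workers beforehand.
    return np.median(np.stack(shifted), axis=0)
```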
>>: [inaudible] studios reformat the wide screen format to four-to-three, they have to [indiscernible] region of interest. I don't think [indiscernible], because in many scenes the director wants you to look at something, and everything else is in the background, and it may not be in the center. So there, the region of interest is much bigger, right? The four-to-three is a [indiscernible] of the 16-to-9. So can those scenarios be automated? I guess it's mostly done manually currently?

>> Dinei Florencio: Yes, so it could be automated. It's probably not the prime target for that, because typically, when you're coding a DVD, like an actual commercial DVD product, paying a professional to actually see what's the best framing is probably more reliable and probably gives better results than trying to [indiscernible] that. So I think the applications are the ones which are lower volume, where you're going to use this for a lower-volume kind of thing, instead of printing, like, millions of DVD copies.

>>: [indiscernible] the region of interest is bigger, do automated techniques work? So I think here you're trying to go to [indiscernible] region of interest.

>> Dinei Florencio: It should, because essentially, you can't really look everywhere; you typically just look at one place. So what's going to happen is, if you have -- and this is one of the examples of what you're talking about, which is hard to frame. Like, there's a dialogue and the director put them in the very opposite corners, right. What would happen is that if you use this, some people are going to be looking here and they're going to put the mouse there, and other people are going to be looking at the other person. So in the dialogue, they would be moving back and forth, right. So you would know that some people are looking at this side of the screen and some people are looking at that side of the screen. Therefore, it's hard to cut either one, right. But the final decision, what I was saying about the professional kind of thing, the final decision, like okay, people are looking at both sides of the screen, what do I do then? That decision requires some professional sort of judgment. But yeah, you would know that people are looking at both sides. Which means if everyone is looking at one side, then you know you can chop off the other side, right? But if not, then it doesn't tell you what to do.

>>: So [indiscernible] definitive results from eye tracking, is this automated results for like background segmentation, motion detection?

>> Dinei Florencio: Yeah, so typically you can get some results from [indiscernible] and there's a lot of research on trying to get that. The problem with those automated salience detection, region-of-interest things is that they can't follow the story line, and that's what I was trying to say here, right. A lot of times, all they can do is say, well, this looks relevant, and they do a lot of focal kinds of things. So they know that everything that's blurred -- if this is blurred, it's probably not of interest, right? But then they would not be able to actually tell you where the focus of interest is as the story progresses, right. So that's one of the things. But as I said, one of the things that would be relevant is to combine that with some lower level techniques to actually see, in this region, what is of interest.

>>: [inaudible].
>> Dinei Florencio: So that was -- so when I did those two experiments, we were trying to figure out, okay, when there's a saccade, what is the delay, and can we track? And what we concluded was that exactly those two times are different. So when you're tracking something, you can actually compensate for that. We would need better modeling of what's actually going on. And then, yes, if you actually do the modeling -- we didn't do that, but you would [indiscernible] -- from the speed you're moving the mouse, I know you're not tracking, you're just moving somewhere else, and I assume that the delay is going to be the delay that we got, like the 500 millisecond delay, right? If you're moving the mouse continuously, then I know you're actually tracking something, so the delay is probably 70 milliseconds or so. But we haven't actually done it. If the delay were more uniform, it would be easy to just shift, but we didn't. The other thing is that the eye tracker results also have a delay, because they have to do some filtering, since the initial results are very noisy, and then they do some smoothing, which means it also has a delay. And we didn't do the experiment with the balls and so on with the eye tracker, so we don't know what that delay is, but we know there is a delay. Okay?

>>: Thank you, Dinei.

[applause]

>> Zhengyou Zhang: I'm just about to introduce Dinei. Dinei is an associate here and has done a lot of work from audio, video, and image to security. And he was a technical [indiscernible] of this year, 2011. So next, Sanjeev will talk. Sanjeev is a principal software architect here, and he was [indiscernible] manager of the Windows Media Group in charge of audio [inaudible].

>> Sanjeev Mehrotra: Yes. Hello. Good morning. I'm going to start with this talk, the first one, and the first one is regarding, basically, audio spatialization. The technology that I'm going to present here is basically a method to do low complexity audio spatialization that allows you to arbitrarily place sounds at various locations within a given environment. The main idea of audio spatialization is that you want to create a virtual environment in a real environment. So I'm in this environment, but when I hear something, I want to hear it like I'm in some other, virtual environment.

This type of technology is very useful, and it's actually becoming more and more useful these days. One example is multiparty video conferencing, where you have four, five people, you want to place them around the table, and you can watch them around the table, but you should also be able to hear them from where they're seated at the table. Another example, of course, is immersive games, where you want the player to feel like they're in another environment, and the sounds that they hear should be coming from particular locations.

And here's a simple example. You have a table, and the bottom circle is essentially the listener. Those two red dots are the listener's ears, left ear and right ear, and you essentially have, say, three sources, labeled 1, 2, 3, seated around the table, and essentially you want to find the sound at two virtual locations, which are going to be the left ear and the right ear.
And, of course, this is the virtual environment you wish to recreate, and the real environment you're using to play back can be different. So in the simple case, you could just have a single listener listening on headphones; this is the simplest case, and that's the real environment. You can also have more complicated environments, where you essentially have, let's say, two loudspeakers and the listener is sitting far away. So there is essentially the forward problem of spatializing the audio to these virtual locations, and then there's sort of the inverse problem to that, which is: given the real environment, how do you play back the sounds so that they sound correct at these real locations? And essentially, this is audio cross-talk cancellation, where the sound from this loudspeaker is going to go to both ears, traveling through some path to each ear, and you want to kind of invert that, you want to cancel it, when you do the spatialization.

So the overall process, when you want to create this virtual environment in a real environment, is sort of to invert the real environment and do the forward transfer for the virtual environment. And both of these are going to use similar technologies, except in one case you're doing the inverse and in the other you're doing the forward.

And the main thing here is that people think of the sound as kind of two components. The sound is going to travel through the room: there's the room response, which is how the sound travels from one point to another point. This is kind of the room impulse response, where the sound travels from one point to another, and there are multiple components there: the direct path, and then the sound is going to reflect off the walls and it's going to keep reflecting and decaying, and the late reflections, so you essentially have the room reverberation. And the other portion, of course, is how a sound from some location differs when it goes to the left ear or to the right ear. This is the head related impulse response, or the head related transfer function. And you can definitely measure both of these. The head related transfer function is typically always measured, and the room impulse response can be either measured or just modeled given the room geometry. And this is the way people have typically been doing it: breaking it up into the room impulse response and the head related transfer function.

And Wei-Ge and Zhengyou had presented some work a couple of years ago where, instead of breaking it up into two components, the room impulse response and the head related impulse response, you can alternatively just measure the combined response using a dummy head. So you can place a dummy head in a particular room, put microphones in the ears, play sound from various locations, and essentially measure the impulse response to each of the two ears. And this is similar to how you measure HRTFs, which is you place a listener in a particular location and you put the sound at a certain distance from the listener, at various angles, and just kind of measure what you're hearing.
And so the difference, of course, is that in this case, the combined response is not just a function of the source location and the orientation, but it's also a function of the listener position. But, of course, we can make a simplification, which is to assume that only the relative distance and orientation between the source and listener matter. So if the sound source is located some distance D, at an angle theta from the listener, we kind of assume that this is similar to -- so both of these have a similar response. And we can probably relax this simplification also, but this makes it easier to measure; it means you have to measure at a smaller number of positions. Otherwise, you have to also change the listener position. Yes?

>>: Question. This relationship, so if we are talking about the basic head related transfer function, where you basically combine the room response, how, I mean, how good is this approximation?

>> Sanjeev Mehrotra: So I think it holds true for, of course, the direct path, right? In HRTF, you actually measure without a room, essentially, right? And probably for the late reflections and the eventual room reverberation, it doesn't matter much either. So I think it matters the most for probably the earlier reflections. But I think if you're concerned primarily with locating where the sound source is, that's more related to the HRTF, right? And it's kind of like you're moving and the sound source is moving, so that's the effect you would get. So if you shift two meters to the right and I also shift two meters to the right, probably how you hear me within this room won't change that much, unless you're very close to a wall or something.

>>: [inaudible].

>> Sanjeev Mehrotra: So basically, what we're trying to do is minimize the number of positions at which we have to capture this response.

>>: I know, but there's three things. Direct path --

>> Sanjeev Mehrotra: Yes.

>>: -- reverberation, you can capture in other ways.

>> Sanjeev Mehrotra: Right.

>>: So the only thing that really matters from the room itself is the early reflections, but those are not consequently changed.

>> Sanjeev Mehrotra: Right. So what's being measured here is for a given location in a given room. So, I mean, you can definitely measure it; let's say for this room, I can measure, you know, me moving around the room as well, right? But I think, if the room is pretty large, then it matters more if you're closer to the actual reflecting surfaces, right?

So, of course, the main issue with measuring the HRTF and this combined head and room impulse response is that you can place the listener at a particular location, but you can only measure this by moving the source to various locations. So let's say I place a source here and I measure the response. But then, of course, the source can be at a different location, right? So the natural thing is you can do interpolation from the points that you've measured close by, right? And there are actually theoretical results that show how the spacing should be done so that the interpolation is sort of perfect using some sort of sinc interpolator. So you can actually limit the number of measurements pretty significantly.
But then the interpolation that's needed to do the perfect reconstruction is essentially a sinc interpolator. So there are many simplifications you can do to the interpolation. One is, of course, you can just interpolate the impulse responses in the time domain. But doing that without any other processing is not good, because there are different delays, right. If two responses are off by 180 degrees, you essentially just cancel them to zero. So the way people typically do it is they try to align the two impulse responses that you're going to interpolate between, and then once you do the alignment, you can do the linear interpolation. And doing temporal alignment in the time domain has several issues. One, of course, is that it assumes that the response is linear phase, meaning that all frequency components are shifted by exactly the same amount. And there's, of course, complexity in determining the alignment amount.

So we're essentially going to do frequency domain interpolation. For each of these impulse responses, we can take the transform, right, and do interpolation in the frequency domain. And when you're doing interpolation in the frequency domain, let's say you're interpolating this point from these four points, you can do just sort of a bilinear interpolation: you can interpolate along the radial direction and then interpolate along this angular direction. And the reason for using polar coordinates instead of Cartesian coordinates for doing the interpolation is that we want to essentially scale the power of the impulse response, because power is going to decrease with the square of the distance. So this coordinate system allows you to make sure that the power is correct, depending on the distance that you're at.

And the interpolation is going to be done independently on the magnitude and the phase of the impulse response. The phase interpolation is going to assume -- I mean, phase interpolation is sometimes difficult, because you don't really know what the true phase, I mean the true delay, has been. We kind of make the assumption that it's a minimum phase and causal system. And so the interpolation is kind of straightforward. For the magnitude, you first do the angular interpolation, then do the interpolation in the other coordinate, and the main thing is that you want to make sure that the power decays with the square of the distance. So the main point here is just that in this coordinate system, we can scale the power correctly.

And so the overall scheme is pretty straightforward, and the results I'm going to show are this: for a given distance, you measure the actual response at three different angles, okay? And you interpolate the middle one using the other two. And then we're going to compare the true measured response with the interpolated response. And so here are the results. This is for the left ear, this is for the right ear. The left column is essentially the impulse response; the right is the frequency response. And basically, the true response is this dark blue line, the frequency domain interpolation technique is this pink line, and the time domain interpolation technique is this light blue line. And the main thing to see here is that, from the time domain response, the time domain interpolation looks like it's working pretty decently. But if you actually look at the frequency response, doing the frequency domain interpolation gives you much better results, actually. And if you compare the SNR between the true impulse response and the interpolated one, the frequency domain interpolation technique gives you almost nine times the signal to noise ratio of doing the temporal domain interpolation. And yes?

>>: I notice in this case, the sample points are rather sparse.

>> Sanjeev Mehrotra: Yes.

>>: [inaudible] assume part of that is measurement difficulty?

>> Sanjeev Mehrotra: Right, right.
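To make the interpolation step concrete, here is a minimal sketch, assuming two time-domain responses measured at neighboring angles at the same distance. Magnitude and phase are blended separately in the frequency domain, and a separate radial step scales power with the inverse square of distance; the phase unwrapping and the simple linear weighting are illustrative simplifications of the scheme described above, which assumes a minimum-phase, causal response.

```python
import numpy as np

def interp_response(h_a, h_b, w=0.5, n_fft=None):
    """Interpolate an impulse response between two measured angles.

    h_a, h_b: time-domain impulse responses measured at two angles (same
    distance). w: angular interpolation weight in [0, 1]. Magnitude and
    phase are interpolated separately in the frequency domain, avoiding
    the cancellation that misaligned time-domain averaging can cause.
    """
    n_fft = n_fft or max(len(h_a), len(h_b))
    Ha, Hb = np.fft.rfft(h_a, n_fft), np.fft.rfft(h_b, n_fft)

    mag = (1 - w) * np.abs(Ha) + w * np.abs(Hb)
    # Unwrap before blending so the phase interpolation is meaningful.
    phase = (1 - w) * np.unwrap(np.angle(Ha)) + w * np.unwrap(np.angle(Hb))

    return np.fft.irfft(mag * np.exp(1j * phase), n_fft)

def scale_for_distance(h, d_measured, d_target):
    """Radial step: scale amplitude by d_measured/d_target so that power
    falls off with the square of the distance."""
    return h * (d_measured / d_target)
```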
And the main thing to see here is that, you know, time domain interpolation, you know, has -- so basically, from the time domain response, it looks like it's working pretty decently. But if you actually look at the frequency response, doing, of course, the frequency domain interpolation gives you much better results, actually. And if you actually compare the SNR between the true impulse response and the interpolated one, the frequency domain interpolation technique gives you almost nine times the signal to noise ratio as doing the temporal domain interpolation. And yes? >>: I notice in this case, the example point is rather sparse. >> Sanjeev Mehrotra: >>: Yes. [inaudible] assume part of that is measurement difficult? >> Sanjeev Mehrotra: Right, right. 19 >>: I mean, instead of moving the [indiscernible] just rotate the basically either the human body or the dummy head. You can have a much [indiscernible] revolution. >> Sanjeev Mehrotra: Right. So that's one way of doing it, right. So I think to get sort of a perfect reconstruction, the theoretical results so that you need to at least measure almost to the rate, like five degrees or something like that. And so although I presented the results for the spatialization, you can also, you know, apply the same interpolation techniques for the cross-talk cancellation, because that also involves measuring the room response and the HRTF. And the final thing is that, you know, if you're already in the frequency domain, it's very easy to actually do the spatialization itself using overlap-add methods for convolution. Okay. >>: [inaudible]. >> Sanjeev Mehrotra: So I actually have a demo to kind of rotate, but we haven't done any user studies, but I have a demo to kind of simulate moving around. So basically, like the listener's in a particular location and let's say the sound source moves in a particular pattern around the room. And ->>: [inaudible]. >> Sanjeev Mehrotra: So you need to wear headphones. I mean, I can play it here, but I don't think you'll get the effect. But if anyone's interested, I can show it offline. Just I need to get the headphones for that. Yeah? >>: So main reason for doing this combined thing is to reduce the number of measurements? What is the main reason you're doing it combined rather than separate [inaudible] responses. >> Sanjeev Mehrotra: One of the main things is it gives you more realistic sensation of the actual room, rather than modeling the room. >>: Supposing you measure, you know, measured at a point, a [indiscernible] microphone, omni direction microphone sound source in the room. You just move 20 the sound source around, measure so you'd get a room impulse response? >> Sanjeev Mehrotra: Right. Oh, so you're saying measure the room response and use HRTF separately? Yeah, so another reason to do the combined thing is there's a complexity reduction when you do the combined impulse response. And if you look at the -- I think one of the main things is once you do the combined response, you can actually, you can actually break up the filter pretty easily into a very short top filter, which is independent of position and direction. So that means -- so the short tap filter is dependent on position direction, and the tail of the filter is essentially independent of position and direction. So it gives you a big complexity savings when you're actually doing the convolution. >>: Isn't the short one just the HRTF? >>: Yeah the head is already really short. 
>> Sanjeev Mehrotra: Yeah, but the room response you can also break up, right? So you can break the room response into the long tail and the short part.

>>: [inaudible].

>> Sanjeev Mehrotra: Okay. So the next one is a depth map codec scheme, and this is essentially a very low complexity realtime codec for coding depth maps, and the main advantage of this is -- yes?

>>: [inaudible].

>> Sanjeev Mehrotra: I saw that. So the main advantage of this is that the complexity is extremely low. You can encode and decode a frame at a rate of over 100 frames per second. And the other advantage is it gives you compression ratios which are actually significantly better than JPEG 2000 lossless mode as well as JPEG XR lossless mode. That's not to say that this is necessarily the best depth map codec, but it's a very simple, buffer-less codec; it requires no frame buffering.

And so depth maps are becoming more and more common in media processing. There are examples of cameras such as Kinect and time-of-flight cameras, and depth maps themselves are very useful for many processing tasks, such as foreground/background segmentation, activity detection, and also for rendering alternate viewpoints. So you capture a view from a particular location, you capture the depth map, and you can use that to reconstruct alternate viewpoints. But the main drawback is that if you're actually going to transmit a depth map, a depth map is essentially 16 bits per pixel, and it's very, very costly to transmit. So we wanted to develop a depth map codec which is extremely low complexity and doesn't require any frame delay, so no prediction across frames. And you don't necessarily want to lossy code a depth map in the same way you lossy code an image, because the final thing that you're consuming is not the depth map itself; the depth map is perhaps being used to do some other tasks. So instead of doing lossy compression, the simple thing is just to do lossless to near-lossless compression, so that you don't necessarily lose fidelity when you're doing the task that you want to do. And, of course, we wanted to make sure that it gives you a much better compression ratio than existing schemes.

And so here's an example of a depth map, and each pixel is essentially 16 bits. But the thing to note is, of course, there's very, very high correlation across pixel values. And another main thing is that the sensor accuracy actually decreases with depth. So if you're measuring a depth which is twice as far as some other depth value, then the error in the measurement is going to be two to -- I mean, four times as large as the error you would get at the closer values. So the further out you go, the less accurate the sensor gets. And this essentially means that if the sensor is less accurate, why code it at full fidelity, right? So essentially, you can quantize the values that are further out without losing any accuracy.

So the codec essentially consists of three components. One component takes advantage of the last fact that I said, which is that the sensor accuracy decreases with distance.
And if you actually choose A and B, you know, slightly, you know, off, off, then you can actually do some minimal quantization of the depth values also. So like if A is smaller than it needs to be, you're actually going to, you know, end up quantizing the depth value. The main thing is you don't need to know actually how this works, but the main thing is there's going to be a parameter which you can control, which controls whether you're lossless with respect to sensor accuracy or whether you've introduced some amount of loss. Some small amount of quantization loss. And the other thing, of course, was that there's very high correlation between pixels. So you can actually do a one-stop simple prediction. And then once you do the prediction, you end up with long strings of zeroes. But the long strings of zeroes, you don't want to pre-decide a distribution on zeroes or anything. So a very effective technique to code the output of that is to use an adaptive run length Golomb-Rice code. And this is a very low complexity, lossless entropy coding. So I'll just talk a little bit about the adaptive RLGR code. So I mean, the details are that, you know, this is an adaptive scheme to code. If you have a long string, there's many zeroes and then there's a level and then there's a long string of zeroes and then there's a level. But sometimes, you may have a long string of non-zero values as well. So sometimes you -- and so the way it kind of works is you operate in one of two modes. A run mode or an on-run mode. So in a run mode, you expect there to be long strings of zeroes and there's a parameter that tells you how long of a string of zeroes you expect. And in the no-run mode, you kind of are just coding levels themselves. So the no run mode, each symbol just gets coded using the Golomb-rice code. And in the run mode, you know, you code the string of zeroes using some symbol and then you code the level. And the other nice thing about this is that you don't pre-decide your distribution of zeroes, and you kind of learn it as you goal. And so when I'm going to show is I'm going to basically show you the compression ratios in encoding, deep coding time for this codec. And I'm going to code two sets of depth maps. One set is slightly more complicated because it actually has a 23 large range of depth values. So this left one, even the background has values. In this one, in the right one, this is a simpler set. And the it's simpler is that a lot of the background values are actually out of sensor range. So they're actually call awl coming up as zero. So that less information to code here. And I'm numeric coded. coding, depth reason the there's going to compare five methods for coding. One is basically just el lossless. There's absolutely no loss between the original and the And the way you do that is that you essentially don't do the inverse okay? Next one is this is essentially -- essentially lossless, with respect to sensor fidelity. And then there's two more, which are slightly lossy. They're pretty close to lossless, but slightly lossy. And then I'm going to compare the compression ratio with JPEG 2000 lossless mode, which -- so I tried many different codecs. JPEG 2000 lossless. JPEG XR loss less, you know. P&G and many different things. Actually it turned oust out of all those, this was the one that actually performs the best. And so here is the compression ratio for this set one, which is the one that's more complicated. So this is the numerically lossless coding. 
So you get about a 5-to-1 compression, numerically lossless. And this is essentially lossless with respect to sensor fidelity; you get about 6-to-1. And these are slightly lossy, and with slight quantization, you can get pretty good compression ratios of 15-to-1. As a comparison, the JPEG 2000 compression ratio is under 4, and this is the lossless mode of JPEG 2000. These are median results, and even the max, 90th percentile results are not much worse than this. And here's the encoder speed in milliseconds and the decoder speed in milliseconds.

And this is for set two. Set two is much simpler to code, and in this set, you can see that even numerically lossless, you can get about 13-to-1 compression, and even with slight loss, you can get 35-to-1 compression. JPEG 2000 lossless is under ten. This set is actually very easy to code; in about seven milliseconds, you can encode and decode a frame.

And then, if you do numerically lossless coding, any processing you do with the depth map is going to be identical to the uncoded one. But what I want to show here is that even if you do, let's say, lossy or near-lossless coding of the depth map, the processing is unaffected by this small amount of quantization. The way to do that is that we capture a view from multiple viewpoints. So we have, essentially, many cameras which are capturing the same scene from many viewpoints, and this is one given viewpoint and this is the corresponding depth map, and all the cameras are calibrated. So this is the original viewpoint, this is the uncoded depth map, and this is the coded depth map with a slight amount of loss. And this is an alternate viewpoint. So basically, this RGB image and this depth map from the viewpoint before are used to reconstruct another viewpoint. And this is the original that you capture from that other viewpoint, this is the reconstruction you get using an uncoded depth map, and this is the reconstruction you get using a coded depth map. And the results you get from using the coded depth map versus the uncoded depth map are pretty much identical.

And this just shows you the PSNR when you're reconstructing five different viewpoints; they're kind of concatenated back to back, actually. So this is viewpoint one, viewpoint two, viewpoint three, viewpoint four, viewpoint five. And this essentially compares the original capture with the reconstruction you get, basically using lossless and some amount of loss. Basically, what you can see is that even if you code the depth map with some small quantization, you effectively don't hurt the reconstruction at all. You get the same reconstruction as you would if you used an uncoded depth map.

>>: [inaudible].

>> Sanjeev Mehrotra: This is the PSNR between the original capture and the reconstructed one. And the reconstruction is done using another viewpoint.

>>: [inaudible].

>> Sanjeev Mehrotra: Yes, for the RGB.

>>: [inaudible] is better than lossless?

>> Sanjeev Mehrotra: Yes. Yeah, 750, yeah, slightly better. I mean, but, let's put it this way, the reconstruction itself is not 100% accurate, right? So you could go either way when you code it, right?

>>: [inaudible].

>> Sanjeev Mehrotra: Yes.

>>: Yeah, the ground truth is measured too. So that error is probably larger.

>> Sanjeev Mehrotra: Yeah, right.
And this is comparing the reconstruction using an uncoded depth map with the reconstruction using a coded depth map, at various quantizations. With near lossless, it's almost the same, and even with this much quantization, the PSNR between the reconstructions is not that bad. And basically, the conclusion is that it's a very low complexity codec, and it gives you better compression than existing lossless coding schemes. Okay? Yes?

>>: So I can kind of accept the fact that the depth values are [inaudible] using different things, but when you take the inverse of that, and you [inaudible].

>> Sanjeev Mehrotra: Well, actually, yeah, they're pretty much identical. That's the thing. So, I mean, the data coming from the Kinect sensor actually has identical values for neighboring pixels.

>>: [inaudible] slight moderation or is it more coming away from it, or are they perfectly linearly related to each other? [indiscernible].

>> Sanjeev Mehrotra: So it won't be a linear relationship, but it may be -- I mean, it may be a monotonic relationship.

>>: So they looked better on the original samples than the [inaudible] samples, right?

>> Sanjeev Mehrotra: So you're saying basically the inverse will amplify the difference between two successive values, yeah.

>>: But the sensor [indiscernible] is not [inaudible].

>> Sanjeev Mehrotra: So the thing is, I think the Kinect sensor itself is probably doing some sort of filtering too.

>>: At longer range, you can see the effect. [inaudible] within a foot or two, the quantization error is almost the same. No? What is the curve?

>>: [inaudible].

>> Sanjeev Mehrotra: The main thing is there's also some amount of filtering going on with the data coming from the sensor itself.

>>: I know this is a fairly [indiscernible] you guys get a [indiscernible].

>>: We have also [indiscernible].

>>: What is the noise level of the new sensor compared with the old sensor?

>>: [indiscernible].

>>: Same noise characteristics?

>>: Kinect would be different.

>>: Time of flight?

>>: Time of flight is different.

>>: And if moving to the [indiscernible], I mean, [indiscernible] is that noise extends?

>>: For Kinect, for time of flight, probably not [indiscernible].

>>: You mean the noise level?

>>: No, no.

>> Sanjeev Mehrotra: I think there are other ways to do quantization, even if you don't do the inverse coding. Even without the inverse coding. The inverse coding gives you an additional 30% or so.

>>: I think the inverse coding is still basically a benefit, because if you're far away, the importance of accuracy is less.

>> Sanjeev Mehrotra: I think essentially it's like a nonlinear quantization, right? So the further out you go, the [indiscernible] get larger, right?

>>: That was my question, you try to [inaudible].

>> Sanjeev Mehrotra: No. But I imagine you would probably get -- I mean, the gains may not necessarily be as large, because many pixels can be well predicted from the top, I mean, from the previous pixel. So only probably the boundary pixels, like if somebody -- and even from frame to frame --

>>: [inaudible].

>> Sanjeev Mehrotra: Right. But what I'm saying is that the only thing remaining after this pixel differencing is essentially the edges, right? And the edges are likely to move anyway from frame to frame, right? Right. And the background is already pretty well zeroed out. So there may not be as much gain by doing frame differencing.
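For reference, here is a minimal sketch of the first two stages of the depth codec pipeline described earlier, the inverse-depth mapping and the one-step prediction, with the adaptive RLGR entropy stage omitted. The constants `a` and `b` and the row-wise prediction direction are assumptions for illustration, not the values used in the actual codec.

```python
import numpy as np

def quantize_inverse_depth(depth_mm, a=50000.0, b=0.0):
    """Map 16-bit depth to quantized inverse depth: q = round(a / depth + b).

    Coding a/depth instead of depth matches the sensor's behaviour: the
    measurement error grows with distance, so far-away values get coarser
    steps without losing real fidelity. The talk only states that A and B
    are chosen to match (or slightly relax) the sensor's own accuracy.
    """
    q = np.zeros_like(depth_mm, dtype=np.int32)
    valid = depth_mm > 0                 # 0 = out of sensor range
    q[valid] = np.round(a / depth_mm[valid] + b).astype(np.int32)
    return q

def predict_residual(q_row):
    """One-step prediction within a row: code each pixel as the difference
    from the previous pixel. On smooth depth data this yields long runs of
    zeros, which the adaptive run-length Golomb-Rice coder then compresses."""
    residual = np.empty_like(q_row)
    residual[0] = q_row[0]
    residual[1:] = q_row[1:] - q_row[:-1]
    return residual
```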
So there may not be as much gain by doing frame differencing. >>: [inaudible]. >> Sanjeev Mehrotra: I think there's many ways things like this can be improved. But the main thing again is like, you know, this gives you already like pretty good recompression ratios and gives you encode and decode frame rates of close to 100-plus frames per second, and that's very important for any realtime process -- realtime application such as if you're using this for video conferencing, you don't want to do complicated things, right? You want to get as good com protection as you can for very, very low complexity. >> Zhengyou Zhang: [applause]. Okay. So let's thank Sanjeev again. 28 >> Zhengyou Zhang: So I'm going to talk about the ViewMark system. This is interactive video conferencing system for mobile devices. And this is joint work with my summer intern, Shu Shi from UIUC. So nowadays, it's very common, you have to join remote meetings from your home, your office, your [indiscernible]. And ideally, you want to have a similar experience as a face-to-face with remote participants. So this means you want to see what you want to see, you want to see who is talking. You want to see what's the reaction from other people. And from the audio aspect, you want to hear voice coming from different directions, so spatial audio, like what Sanjeev mentioned earlier. And also, you want to share documents and objects with other people. So that's the ideal case. So this motivates some high end [indiscernible] or HP [indiscernible] room. But this is very expensive, it's -- each room cost about a half million dollar, U.S. dollars. And another device which is much more affordable is the ViewMark Microsoft roundtable. This device has five cameras which produce panoramic video. So if you put the device at the center of the table, then the remote person can see everyone around the table. And also, there is six circular microphone around, and there's a microphone processing can determine who is talking so the device can send it to the remote party, active speaker video. So here is interface of the roundtable. So this is actually speaker window and this is the panoramic video. And here you can share screens with other people. And more and more, people are now using mobile device to communicate with others. And since mobile device is always with you, it's very easy to be, to reach out or to be reached. And with iPhone face time, more and more people are now use video for communication. But as a video on the mobile device still Pretty limited. So if you try to join a remote meeting room, it's probably pretty poor quality. So this is probably the best you can get. It's very hard to see speaker's face and the content on the display is probably very hard and you cannot see other people talking unless you ask other people to move the phone around. Okay. So it's not very immersive. Active speaker has to be done manually by asking other people, and there's no data presentation sharing with iPhone face 29 time. So to improve the experience of video conferencing, we develop this ViewMark system. And here I will show first the ViewMark system with the roundtable. So the mobile device is connected to a roundtable, and roundtable has been modified so we can produce spatial audio. So it's not -- standard roundtable give you only single channel audio. And we modify so we can get spatial audio. And here is the interface of the shoe mark on the mobile device. 
You have this remote video; as I mentioned earlier, a RoundTable can produce 360-degree panoramic video, but because of the screen size on the mobile device, you can see only a fraction of it. And I will cover this later, how to [indiscernible] panoramic video. And you have a [indiscernible] video which can be minimized to reduce the computation cost on the mobile device, and you can use a portion of the screen to share data with [indiscernible]. And on the right here, this is a screenshot with bookmarks. So when you feel a [indiscernible] is interesting, you can bookmark the location, and later you can come back quickly, and it sits [indiscernible] here. So what we do here is, since the remote RoundTable can give you panoramic video and you see only a portion of it, one way is to use finger touch to move around and determine which part you are interested in watching. Another way is to leverage the inertial sensor on the mobile device. Basically, you can move the mobile device around to see different parts of the remote room. And then if you say, okay, this is a location I probably want to watch later, you can tap on it and create a bookmark here, and you can create another one, okay. So you can create as many bookmarks as you like. And later on, if you want to go back to that location, you just double tap on any of the bookmarks and you switch -- you get the live view immediately from that viewpoint. Also, our system can do data sharing. If you are in portrait mode, then the data is shared in the lower part of the screen. And if you rotate to landscape mode, then you get a full screen view of the screen shared from the remote meeting room. And the data sharing software was provided by Amazon Asia. And here, I will show you a video of the system in action. It's a live recording of the demo. >>: This is my roundtable and I'm presenting some slides here. Some of my colleagues are joining remotely. We'll see what the experience with the mobile device looks like. >> Zhengyou Zhang: So I'm on the remote side with the mobile device, so you get the video. >>: You see a video from the remote room. As you saw earlier, in the remote room there's a roundtable, which gives us 360-degree panoramic video, so I can pan around to see different people, okay? Here I see Rajesh, and I can see further [indiscernible], okay? So I can bookmark different locations. >> Zhengyou Zhang: I don't know why I went back to [indiscernible]. >>: And later on, if I want to see Rajesh, I can double tap and immediately switch to Rajesh. And this is the video of myself, just to check how I look in the meeting, with the spatial audio. >> Zhengyou Zhang: So we cannot demonstrate spatial audio here. You can hear the voices coming from different directions because of our spatialized audio. >>: Actually, it's very immersive in the meeting. And next, the data collaboration part, the data sharing part. So here is data shared from the remote [indiscernible], and if we take a look at the full view, we see a full screen of the remote desktop. I can move it around and, you can see, it's fairly easy. >> Zhengyou Zhang: You're seeing the [indiscernible]. Anyway, so that's the demo. Okay. So we also developed a version that connects ViewMark with a new device called a Spintop. The Spintop is shown here at the bottom. It can spin with a programmable motor, which can be controlled from the computer.
So you can see here, we just put a laptop on top of the Spintop device; it has a camera and you have a screen. So basically, it can serve as a proxy for a remote participant. And the system works exactly the same way. Like I said, the interface is the same, everything is the same, except now the panning is not choosing a particular view of a panoramic video. Now, by panning on the mobile device, we physically control the Spintop device. So you can move it around remotely, okay, to control the device. This gives the remote person a lot of control. So in summary, the ViewMark system gives you a two-way interactive audio-video conferencing experience. It's immersive with binaural spatialized audio, and the mobile user can control the video stream to view different parts of the remote room. You can bookmark viewpoints of interest and switch views quickly. You can also have the active speaker view provided by the RoundTable device. And you can share screens and displays. And looking forward, the system can be improved in a number of ways. For example, currently the ViewMarks are arranged sequentially, depending on when you tap them. Ideally, we want the ViewMarks arranged in a way consistent with the physical locations. Okay? A second possibility is to do face detection and active speaker detection as well, so you can have suggestions from the system about who the active or possible speakers are. Then you don't need to tap on the device; it can be automatically populated as bookmarks on the mobile device. And now, if you move from the mobile device, a small smartphone, to a tablet which has a bigger screen, you probably want a better UI design. And here's the acknowledgment. We have received quite a lot of help from different people. And thank you for your attention. [applause] >> Zhengyou Zhang: I just have two minutes for questions. >>: You mentioned using face detection [inaudible]. What if, by the time you use the detection and locate the location, then you come back to that view, the person has moved? >> Zhengyou Zhang: That's possible. Then you need to update, yeah. If you can track the persons, you can update the location automatically. Because what the ViewMark stores is the location; in the Spintop case, it's just the angle of the device. So it can be updated depending on the face detection and tracking. >>: So the Spintop is [indiscernible]. >> Zhengyou Zhang: Yeah, yeah, it's like a proxy, right, a physical proxy of the remote person. So you can put a Spintop at the end of the meeting room table. Okay. So thank you very much.
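A minimal sketch of the interaction model described above, assuming hypothetical class and method names (this is illustrative only, not the actual ViewMark implementation): the same pan gesture, driven by touch or the inertial sensor, either moves a crop window over the RoundTable panorama or rotates the Spintop motor, and a bookmark simply stores the pan angle so a double tap can jump back to that viewpoint.

```python
# Illustrative sketch only (hypothetical names, not the ViewMark code).
from dataclasses import dataclass, field
from typing import Dict, Protocol


class ViewBackend(Protocol):
    def set_pan(self, angle_deg: float) -> None: ...


@dataclass
class PanoramaViewport:
    """Crops a view from the 360-degree panorama (software panning)."""
    panorama_width_px: int = 3600   # hypothetical panorama resolution
    view_width_px: int = 640
    pan_deg: float = 0.0

    def set_pan(self, angle_deg: float) -> None:
        self.pan_deg = angle_deg % 360.0

    def crop_bounds(self) -> tuple:
        left = int(self.pan_deg / 360.0 * self.panorama_width_px)
        return left, left + self.view_width_px


@dataclass
class SpintopMotor:
    """Physically rotates the Spintop; panning becomes a motor command."""
    pan_deg: float = 0.0

    def set_pan(self, angle_deg: float) -> None:
        # In a real system this would send a rotation command to the motor.
        self.pan_deg = angle_deg % 360.0


@dataclass
class ViewController:
    backend: ViewBackend
    bookmarks: Dict[str, float] = field(default_factory=dict)
    current_pan: float = 0.0

    def pan_by(self, delta_deg: float) -> None:
        """Called from a touch drag or the device's inertial sensor."""
        self.current_pan = (self.current_pan + delta_deg) % 360.0
        self.backend.set_pan(self.current_pan)

    def add_bookmark(self, name: str) -> None:
        """Single tap: remember the current viewpoint."""
        self.bookmarks[name] = self.current_pan

    def jump_to(self, name: str) -> None:
        """Double tap on a bookmark: switch to that viewpoint immediately."""
        self.pan_by(self.bookmarks[name] - self.current_pan)


if __name__ == "__main__":
    # RoundTable case: panning just moves the crop window over the panorama.
    ctrl = ViewController(backend=PanoramaViewport())
    ctrl.pan_by(90.0)
    ctrl.add_bookmark("Rajesh")
    ctrl.pan_by(120.0)
    ctrl.jump_to("Rajesh")
    print("panorama crop:", ctrl.backend.crop_bounds())

    # Spintop case: the same gestures drive the physical motor instead.
    proxy = ViewController(backend=SpintopMotor())
    proxy.pan_by(45.0)
    print("motor angle:", proxy.backend.pan_deg)
```

The backend abstraction mirrors the point made in the talk: the interface and the bookmarks stay the same whether panning selects a view of the panorama or physically turns the Spintop.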