>> Zhengyou Zhang: Okay. So let's get started. Welcome to the
[indiscernible] public seminar. I am Zhengyou Zhang, [indiscernible] the
Multimedia, Interaction and Communication group.
So this is part of our group's public seminar series. This is a tradition
started a while ago; the idea is to give internal talks before we present to the
external world. So we think this is also a good chance for us to practice.
So today we have four mini talks. Those talks will be presented in about two
weeks at the [indiscernible] international workshop on multimedia [indiscernible]
processing. Dinei will give the first one, Sanjeev will give the next two talks,
and I will give the last one.
So Dinei, just go.
>> Dinei Florencio: All right. So thanks for the introduction. So I'll be
presenting our work on crowdsourcing for determining a region of interest in
video, and this is joint work with Flavio Ribeiro at the University of Sao Paulo.
So first, like, crowdsourcing has attracted a lot of attention, and
exactly what is crowdsourcing? It's essentially the idea of
actually using a crowd, using a large number of people, to actually achieve some
purpose. And that purpose could be, like, funding or could be like
[indiscernible] some knowledge in. And essentially, crowdsourcing has
been there for a while, but with the advent of the internet, it's much easier
to reach a much higher number of people.
So very successful examples of that are wikis and things. Like, Wikipedia is a
very huge phenomenon. And crowdfunding, where you actually appeal to the community
and say we want to fund some specific project or anything, and then you
actually collect like a small amount from a number of people.
We're more interested in particular in what we call human computation,
which is the idea of using humans to do processing tasks which are not easy for
computers to do. And in particular, we're trying to do that with
crowdsourcing.
So what are the benefits of actually doing that? First, the scalability. You
can actually get very large crowds and recruit and dismiss them very quickly.
So if you need like a thousand people
for five minutes each, it's like, it will be essentially a huge operation to do
this in real life. If you can actually do that through crowdsourcing, using
the web, you can actually get the thousand people and dismiss them in five
minutes very easily.
Typically, the cost of that is below market wages, in particular because
of this ease of recruiting and dismissing.
And you can recruit, like, a very diverse workforce. So there's a number of
reasons it's a very simple, scalable, cost-effective way of actually getting
people to work on particular tasks for you.
So the most widespread, the most common platform for crowdsourcing is the
Amazon Mechanical Turk, which was launched in 2005. It has about half a million
people registered. And at any time, there are like 50 to 100 thousand tasks
offered out there, available for people to actually take and do.
So the typical human intelligence tasks, as they call them, HITs, are typically
very simple tasks. Preferably 15 in a page, and they typically require like one or
two minutes to perform. And typically, you're going to be paying like five to
25 cents per task.
So Microsoft does have now an internal offering in crowdsourcing, which is what
we call, like, the universal human relevance system, which actually we've been
using right now for internal things. So we're actually using that for a lot of
work relating to search and relating to some of the other things. But that's a
recent offering from Microsoft and is being mostly used internally right now.
So we have done some previous work on crowdsourcing, and that's essentially
been the idea of using crowdsourcing to actually do MOS studies. So when
doing like mean opinion scores, like subjective testing of audio and other
things, we actually need to ask people. So typically, you will bring
people to the lab, do some experiment, ask their opinions and so on.
So we're trying to [indiscernible] to previous work, one applied to audio and
speech, the other one applied to image, essentially to reproduce that without
the controlled environment of the lab. So essentially just recruit workers
from the Mechanical Turk, have them rate the files, the same way that someone
would rate at the lab, except that you do not have the control. You do not
know if they are the right distance from the screen, if they have the correct
headphones and so on. And then find ways to actually filter that process and
get the results you would get from more controlled studies.
So essentially, the results were very encouraging. We can essentially get
pretty much the same quality that you would get from a lab study, and much,
much faster and at a lower cost.
So however, that's a very batch sort of processing. So the idea was there's a
study in the lab, and you try to do that study using crowdsourcing. But it's
like a task, and at the end of the task, what you get is a number.
So our vision in that was more tightly integrated. So it was like can we use
human computation as a processing block. So have like some algorithm and
there's some particular piece of that algorithm which is hard for the computer
to do, and can we then use like crowdsourcing for that.
And then essentially, for that, the delay has to be much smaller, and the task
has to be a task that can actually, that fits into this process, like a
processing sort of task.
So in order to try to go in [indiscernible], we essentially have this
particular work, which is to try to find the region of interest in video.
So when you have a video codec -- we haven't done that part, and so that's what's
in future work here. So if you have like a video codec where you're actually
going to use a region of interest, you're going to use the parts of the video
which are more relevant, the parts of the video where people actually look,
and then put more bits there as opposed to the rest of the video, then you
need to know which parts of the video are those parts of interest.
So the problem is that doing that by automatic means is very hard. And
essentially, you're going to see a few examples later. But it's very hard
to actually figure out what's important or not in a piece of video.
So one way of doing that is asking people. So if you ask people to actually
watch a movie -- and that's the traditional way it's done -- you actually get
people, put the movie in front of them, put on an eye tracker, and you see
where they're looking. You say people are looking here, so that's the place
most people are looking at. So that's where we have to put the bits.
However, that requires the same thing, bringing people in the lab and then
having them with an eye tracker, which costs like $20,000 or so, and then going
out.
>>: So that application, does the screen have to be big enough? Because if
it's a small screen, can you figure out where I'm looking? It probably doesn't
matter.
>> Dinei Florencio: Yes, and no. So the bigger the angle of view, right, the
bigger that, the more your human vision system will decay. So yes, there would
be more difference. If what you're looking at is very small, then probably the
resolution of your eye all fits in the fovea and there isn't much to do. Yes.
But typically, so for HDTV kind of scenarios and even for most of the PC
viewing thing, there's actually almost like a 30 degree field of view, which is
actually very significant.
So essentially what we wanted was like, well, can you do that for a video.
Say you have, like, a YouTube video, for example. So somebody uploads a
video, and then you want to optimize the coding for that thing.
So I got that video, I can actually, for example, crowdsource this thing and
get what people are looking at and then insert that into your video coding and
recode that video with that particular result of that.
So let me do an experiment. The problem is like okay, I need to know what
people are looking at. And the problem is in the lab, we use an eye tracker,
but people at home don't have an eye tracker and I cannot provide them with an eye
tracker. So how do I make that work? How do I make you at home figure out
where you're looking, and how do I force you to tell me?
So essentially, we say okay, we could ask people to point with their mouse,
like I'm looking here, right? But then it's like how do I force you to
actually do that and how do I guarantee that you do?
So essentially what we tried to do was like, okay, if this is like a
video, and I want to know where you're looking, say we can do an
experiment -- and probably, like, the people in the back, exactly for the reasons
Sanjeev was mentioning, the people in the back are not going to get the same
impression.
But what I want you to do is to look at this particular eye of the squirrel,
right. So when you look at that particular eye, and don't look anywhere else.
So look at the right eye of the squirrel. If you look in there and I do this,
what happens is the image changes. You can see that the image changed, but
it didn't hurt your eye. Right?
Now, let's do the same thing and look at this squirrel eye, and keep looking at
only that squirrel eye, and I'm going to toggle between the same two slides
that I did before. So essentially, if you're looking at that, it's like you
don't want to look there, right?
So essentially, the idea is what we're doing here is we're sort of simulating the
visual blur that's typical for the human visual system, which has like this
gradual blur. So wherever you're looking, you have like full resolution,
and the resolution sort of diminishes away from that point.
So if we simulate the same response with a filter, so essentially, if I sort of
blur the image as you go from the point you're looking at, and you look at the
right point, then it doesn't make that much difference. However, if you're
trying to look somewhere else, then it actually bothers you. And then by
bringing them out to this position, you would actually be able to actually look
at that and wouldn't sort of hurt your eyes. So it's a very intuitive thing.
So the problem is, if we could do that in realtime, then that would
be very nice. You could move the mouse. Except, for the constraints that we
actually have for the system, which essentially means you have to play that in
a Flash player, it actually doesn't work, because I would have to do this
filtering in realtime. So what we actually do is an approximation of that,
which is actually what we saw in the previous slide: we actually
compute the unblurred original image and a blurred image, blurred with a
10x10 box, and then we use like an exponential blend between
those two images.
So essentially, we're not progressively blurring more. We're essentially doing
alpha mapping between the blurred and the unblurred thing. So essentially we
can actually do that in realtime as long as the video is not too big. So for a
640x360 video at 24 frames per second, we can actually get that on a 2.5 gigahertz
machine. And we actually apply that, and we actually measure the frame rate and the
frame drops for the users in this thing. So it turns out that about like 10% of the
workers experience some frame drops. So there's a limit where we actually say, wow,
if you have too many frame drops, just stop; your computer is not fast
enough, we can't do this task.
But even one single drop -- and that happens to about like one user, as I recall --
but otherwise, like, less than 10% of users got any frame drops at all. So
essentially we paid like 25 cents to actually do a HIT, which turns out to be like
five dollars per hour.
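A rough sketch of the approximation described above follows; the 10x10 box matches the blur mentioned in the talk, but the exponential falloff constant and the NumPy/OpenCV usage are illustrative assumptions rather than the exact filter used in the study.

    # Blend a sharp frame with a box-blurred copy; the blur weight grows
    # exponentially with distance from the current mouse (gaze) position.
    import numpy as np
    import cv2  # assumed available for the box filter

    def foveate(frame, gaze_xy, box=10, falloff=80.0):
        """frame: HxWx3 uint8 video frame; gaze_xy: (x, y) mouse position in pixels."""
        blurred = cv2.blur(frame, (box, box))            # 10x10 box blur
        h, w = frame.shape[:2]
        ys, xs = np.mgrid[0:h, 0:w]
        dist = np.sqrt((xs - gaze_xy[0]) ** 2 + (ys - gaze_xy[1]) ** 2)
        alpha = 1.0 - np.exp(-dist / falloff)            # 0 at the gaze point, -> 1 far away
        alpha = alpha[..., None]                         # broadcast over color channels
        out = (1.0 - alpha) * frame + alpha * blurred    # alpha map between the two images
        return out.astype(np.uint8)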
And we have to filter those results. First, we have to filter for random
results, someone just doing any junk. And we have to filter for distractions.
So a lot of times, you start looking at something else and just don't move the
mouse, or you look somewhere else. That also happens with the eye tracker. So
in our eye tracker experiment, actually the video is not using the full screen,
and sometimes people will look outside the LCD display, which is actually expected.
So we also wanted to see, like, how good results do we actually get? Can people,
how quickly can they follow the point of interest with the mouse?
So we actually did an experiment, which is essentially this.
So we asked them to actually track that ball and then we moved the ball at
random places and see how long they actually take to actually get there. What
it's trying to do is estimate the delay. So how long after you move your eye
to a point of interest, how long it takes for you to actually move the mouse
there.
In this case, there's no question; we know there's only one thing to look at on the
screen, so you actually know what you are supposed to do.
So these are the results that we actually get. So you can see here
that the black bar, the black line, is actually the ground truth. So people
essentially take sort of a reasonably constant delay. Some people
overshoot, some people undershoot this thing, the motion to the correct place,
right. And when I actually average them and compensate for the average
delay -- so you shift by the average delay and then you average out the
things -- it turns out that the accuracy is actually very impressive. So we can
actually see that the number of people who overshoot and undershoot is probably
about the same. So when you average, you probably get very good results.
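A minimal sketch of that post-processing, shifting every worker's mouse trace by the estimated average delay and then averaging across workers; the array shapes, frame rate, and lag-search range are assumptions for illustration, not the study's actual analysis code.

    import numpy as np

    def delay_compensated_average(traces, target, max_lag=24):
        """traces: list of (T, 2) mouse-position arrays, one per worker.
        target: (T, 2) ground-truth ball positions (the black line above).
        Shift all traces by the average estimated delay, then average them."""
        def best_lag(tr):
            errs = [np.mean(np.sum((tr[lag:] - target[:len(target) - lag]) ** 2, axis=1))
                    for lag in range(max_lag)]
            return int(np.argmin(errs))

        avg_lag = int(round(np.mean([best_lag(tr) for tr in traces])))
        shifted = [tr[avg_lag:] for tr in traces]           # compensate the common delay
        n = min(len(s) for s in shifted)
        return np.mean([s[:n] for s in shifted], axis=0)    # overshoots and undershoots average out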
>>: Just a quick question. So the ground truth result is that red ball?
>> Dinei Florencio: Yes.
>>: [inaudible] mouse to do it?
>> Dinei Florencio: Yes. So like when the circle moves here, you're going to have
to move the mouse quickly. So you're trying to move the mouse as quickly as
possible to the point you're looking at. And some people will --
>>: The eyes are quicker than the mouse.
>> Dinei Florencio: Yes, yes. So essentially, this is -- yeah. So there's
experiments that actually show that we spend about like 100 to 150 milliseconds
to actually look at some other point. So when you actually have something -- so
you see something moving here and then your eyes sort of try to track that.
And it takes about 100 to 150 milliseconds.
So it turns out that what's going on here is like your eye first see that ball,
and then you move the mouse, right. So you're taking another, like, 300 to 400
milliseconds extra to actually move the mouse there.
>>: [inaudible].
>> Dinei Florencio: So that one is about like 15 subjects. And the same with
this one.
>>: [inaudible].
>> Dinei Florencio: No, but the screen, like you don't want to have your
finger on it. Then it would require people to have the touch screen at home,
right, which is not always the case.
Another experiment we wanted to do was like okay. So that was sort of
different in the sense that I moved the ball to a random place and then you
have to track. But if you can actually, you know, in a movie, a lot of times
things are moving. So can you actually track the thing when it's moving.
This is the experiment. So this is the experiment we did. So the ball now is
just moving at constant speed and then changing directions. And I would say
wow, okay, how can you actually compensate, because now you know the ball is
going to move in that direction, can you actually compensate, right?
So in this experiment, actually, as you can see here, it's like the people did
much better. So the first, the delay, the average delay is actually 70
milliseconds [indiscernible] there's no [indiscernible] delay.
And again, when you average the results, you actually get very, very good
results on the thing. Okay. So those two are like toy examples.
Can we get a better example? Something that might still be a toy example, but
more useful, more typical of the scenarios we're looking at.
So we did another experiment with a movie trailer, which is the Ice Age 3
trailer. It's a two-and-a-half-minute trailer. And then we did two experiments.
We did the same experiment that I was describing before with actually [indiscernible]
on the web. People actually track with the mouse. And we actually had 40
workers actually running that. And then we get like 12 volunteers and actually
bring them into the lab with an eye tracker and some of the volunteers are
actually here.
So for comparison, bringing the people to the lab and stuff in this particular
case took about four hours of total person time. And the Turk experiment costs
about ten dollars. So thanks to the people from the usability lab for the help
with this.
So here's the thing. So what I'm going to show is I'm going to try to synchronize
the two. But the upper one will have the Mechanical Turk results and the
bottom one has the eye tracking results.
So the balls, each of these balls represent where someone is actually looking
at.
Okay. So I think most of you actually have seen the trailer so I'm going to
stop the trailer. But I wanted to show, actually, one thing here. Actually, I
will go back to that in a sec. So let me show what else is interesting here.
So this is one particular frame, except I'm doing a different plot of the
circles, so you can actually see more easily what people are looking at.
So in this particular frame, as you saw from this story [indiscernible] it's
like fighting for the nut, right? And then because of the story line,
actually, the story line starts with him and then switches like what they're
fighting for in there.
And in this particular frame, there's still like a change between them, and
some people actually going to be looking here, some people are going to be
looking at the other squirrel. Some people are going to be looking at the nut.
And essentially like any -- so when people try to do region of interest
automatic computer thing, essentially you're trying to look at the frame,
right?
So you cannot look at the story. So essentially, it's like you're never going
to be able to figure out exactly where the story's shifting, where the focus of
the story is shifting, right.
So that's actually important because of two reasons. First, we can actually
refine and use this as an upper layer on something -- okay. There's something
here which is interesting. And for what exactly is interesting in that one, you
can actually use some computer vision, some salience estimation, to actually
figure out what exactly people are looking at.
So these are some results of the X and Y position, on exactly that
clip, using the Mechanical Turk and using the eye tracker, and you can
see a number of things. The first one is the Mechanical Turk results seem
cleaner.
So if you look at this and you look at the noise, essentially, regarding each of
these, you're going to conclude that the eye tracker has more noise. So this
actually has two reasons. First, the eye tracker does have more noise.
Second, it's like moving your mouse is harder than actually moving your eyes.
So moving your eyes is much more natural.
And that actually brings me to one point I wanted to show on that video,
which is so the only place that actually there is a difference, significant
difference is -- hold on.
So if you look at this particular frame -- I don't want to spend the time on
the other one -- it's like the story actually tells that the squirrel
is looking at something, and then you can actually see that the people on the
eye tracker, they try to see what the squirrel is looking at. But they look,
and there is nothing, right? So it turns out that people who actually point
with the mouse, when they look at something, they keep looking at
different places. When you look at something and there's nothing there, you
actually don't move the mouse there.
So that actually was the one place, this is one place actually significantly
different, which is not necessarily bad, but essentially it's like when you
look at something and there's nothing there, you don't move the mouse
there. So that was one of the conclusions.
So essentially, we can actually do the region of interest determination at
reasonably low cost and very quickly. So we don't have to assemble and
disassemble everybody. The quality overall is good, and there are some
differences, as I just pointed out, particularly in these instinctive saccades,
when you actually look at something and come back to the thing, a lot of times
interactions and stuff.
And there is the small -- there's a small delay on moving the mouse. And as I
said before, it can actually be combined with all the salience modeling. So
that's essentially the thing. And if you have any questions.
>>: [inaudible] studios reformat wide screen format to four to three, they
have to [indiscernible] region of interest. I don't think [indiscernible]
because in many scenes, the director wants you to look at something, and
everything else is in the background, and it may not be in the center.
So there, the region of interest is much bigger, right? The four to three is
a [indiscernible] of the 16 to 9. So can those scenarios be automated? I
guess it's mostly done manually currently?
>> Dinei Florencio: Yes, so it could be automated. It's probably not the
prime target for that, because typically, when you're actually coding a DVD, like
an actual commercial DVD product, actually paying like a professional to
actually see what's the best framing is probably more reliable and probably
gives better results than actually trying to [indiscernible] that. So I think all
the applications which are either like lower volume, so you're going to use
this in like a lower volume kind of thing, instead of printing, like, millions
of DVD copies.
>>: [indiscernible] the region of interest is bigger, do automated techniques
work? So I think here you're trying to go to [indiscernible] region of
interest.
>> Dinei Florencio: It should, because essentially, you can't really look.
You typically just look at one place. So what's going to happen is if you have
-- and this is one of the examples on what you're talking about, which is hard
to frame. Like there's a dialogue and the director put them like in the very
opposite corners, right. So what would happen is that if you actually use
this, some people are going to be looking here and they're going to put the
mouse there. And then other people are going to be looking at the other person.
So in the dialogue, they would be moving back and forth, right. So you would
know that some people are looking at this side of the screen. Some people are
looking at that side of the screen. Therefore, I cannot -- it's hard to cut
either one, right.
But the final decision, what I was saying about the professional kind of thing,
the final decisions, like okay, people are looking at both sides of the screen.
What do I do then? That decision requires some professional sort of thing.
But yeah, you would know that people are looking at both sides. Which means if
everyone is looking at one side, then you know you can chop off the other side,
right? But if not, then it doesn't tell you what to do.
>>: So [indiscernible] definitive results from eye tracking, versus automated
results for like background segmentation, motion detection?
>> Dinei Florencio: Yeah, so typically, and typically you can get some results
from [indiscernible] and there's a lot of research on sort of trying to get
that. The problem of that, of those automated salience detection region of
interest things is that they don't -- they can't follow the story line, and
that's what I was trying to say here, right.
So a lot of times, like all they can do is they can say, well, this looks like
relevant, and they do a lot of focal kind of thing. So they know that
if something is blurred, it's probably not of interest, right?
But then they would not be able to actually tell you where the focus of
interest is, as the story progresses, right. So and that's one of the things.
But as I said, one of the things that would be relevant, actually, is to
combine that with some lower level techniques to actually see, in this region,
what is of interest.
>>:
[inaudible].
>> Dinei Florencio: So that was -- so when I did those two experiments, we're
trying to figure out, okay, so when there's a saccade, what is the delay and
can we track? And what we concluded was exactly those two times are different.
So when you actually, when you're tracking something, you can actually
compensate for that. So we would need better modeling of what's actually going
on.
So and then yes. If you actually do the modeling, you would actually -- we
didn't do that, but you would [indiscernible], but from the speed you're moving the
mouse, I know you're not looking, you're just moving somewhere else, and I
assume that the delay is going to be the delay that we got, like the 500
millisecond delay, right? If you're actually moving the mouse continuously,
then I know you're actually tracking something so the delay is probably 70
milliseconds or so delay.
But we haven't actually done it. So we -- if the delay was like more uniform,
then it was easy to just shift, but we didn't.
The other thing is like the eye tracker results also has a delay because they
actually have to do some filtering, because the initial results is very noisy.
And then they do some smoothing, which means it also has a delay, and it's also
hard to -- we didn't do the experiment of the balls and stuff with the eye
tracker with so we don't know what the delay is, but we know there is a delay.
Okay?
>>:
Thank you, Dinei.
[applause]
>> Zhengyou Zhang: I'm just about to introduce Dinei. Dinei is an associate
here and has done a lot of work from audio, video, image, and to security. And
he was a technical [indiscernible] of this year, 2011.
So next, Sanjeev will talk. Sanjeev is a principal software architect here, and
he was [indiscernible] manager of the Windows Media Group in charge of audio
[inaudible].
>> Sanjeev Mehrotra: Yes. Hello. Good morning. I'm going to start with this
talk, the first one. And the first one is regarding basically audio
spatialization. And the technology that I'm going to present here is basically
a method to do low complexity audio spatialization and allow you to arbitrarily
place sounds at various locations within a given environment.
And the main idea of audio spatialization is you want to create a virtual
environment in a real environment. So I'm in this environment, but I want to
-- when I hear something, I want to hear like I'm in some other virtual
environment.
And this type of technology is very useful for -- and it's actually becoming
more and more useful these days, because one example is multiparty video
conferencing, where you have four, five people, you want to place them around
the table, and, you know, you can watch them around the table, but you should
also be able to hear them from where they're seated at the table.
And another example, of course, is immersive games, where you want the player
to feel like they're in another environment, and the sounds that they hear
should be coming from particular locations.
And here's a simple example. You have a table and so the bottom circle is
essentially the listener. And there's those two red dots are the listener's
ears, left ear and right ear, and you essentially have, say, three sources, one
labeled 1, 2, 3, and they're seated around the table, and essentially, you
know, you want to find the sound at two virtual locations, which are going to
be the left ear and the right ear. And, of course, you know, so this is the
virtual environment you wish to recreate, and, you know, the real environment
you're using to play back can be different, of course. So in the simple case,
you could just have a single listener listening on headphones, and this is the
simplest case, of course, to, you know, that's the real environment.
And you can also have more complicated environments, where you essentially
have, let's say, two loudspeakers and, you know, the listener is sitting far
away. And this is essentially, like, you know, the forward problem of spatializing
the audio to these virtual locations, and then the sort of inverse problem
to that is: given the real environment, how do you play back the sounds so
that it sounds correct at these real locations.
And essentially, you know, this is audio spatialization, and then this can be,
like, cross-talk cancellation, where you want to cancel, you know. Because
this loudspeaker is going to go to both ears, and it's going to travel through
some path and you want to kind of invert that path.
So the overall process, when you want to create this virtual environment in a
real environment, is sort of to, you know, invert the real environment and do
the forward transfer of the virtual environment. And both of these,
essentially, are going to use, you know, similar technologies, except in one
case, you're doing the inverse, and in the other, you're doing the forward.
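As a rough illustration of the "invert the real environment" step, here is a per-frequency regularized inversion of the two-speaker-to-two-ear transfer matrix; this is a textbook formulation, and the regularization constant is an assumption, not necessarily what is used in this work.

    import numpy as np

    def crosstalk_canceller(H, eps=1e-3):
        """H: (F, 2, 2) complex array; H[f, i, j] is the measured transfer from
        loudspeaker j to ear i at frequency bin f. Returns C with C[f] ~ inv(H[f]),
        so playing C[f] @ desired_binaural[f] through the speakers approximately
        delivers the desired signals at the two ears."""
        C = np.empty_like(H)
        I = np.eye(2)
        for f in range(H.shape[0]):
            Hf = H[f]
            # regularized inverse: (H^H H + eps I)^-1 H^H
            C[f] = np.linalg.solve(Hf.conj().T @ Hf + eps * I, Hf.conj().T)
        return C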
And the main thing here is that the sound, it's essentially, there's -- people
think of this as kind of two components. So, you know, the sound is going to
travel through the room. You know, there's the room response, which is how the
sound travels from one point to another point. And, you know, this is
essentially, you know, this is kind of the room impulse response, where the
sound travels from one point to another and there's multiple components there.
There's the direct path, and then the sound is going to reflect off the walls and
it's going to keep reflecting and decaying. And with the late reflections, you
essentially have like the room reverberation.
And the other portion, of course, is the, you know, how a sound from some
location differs when it goes to the left ear or to the right ear. And this is
the head related impulse response or the head related transfer function. And
typically, you can, you know, you can definitely measure both of these. So the
head related transfer function is always typically measured, and the room
impulse response can be either measured or it can be kind of just modeled given
the room geometry.
And this is the way people have been typically doing it, breaking it up into
the room impulse response and the head related transfer function. And Wei-Ge
and Zhengyou had presented some work a couple of years ago. Instead of breaking it
up into two components, the room impulse response and the head related impulse
response, you can alternatively just measure the combined response using a
dummy head. So you can place, like a dummy head in a particular room, put
microphones in the ears and play sound from various locations and you can
essentially measure the impulse response to each of the two ears.
And this is similar to how you measure HRTFs, which is you place a listener in
a particular location and you kind of put the sound at a certain distance from
the listener, and at various angles, and just kind of measure what you're
hearing.
And so the difference, of course, is that in this case, the combined response
is not just a function of the source location and the orientation, but it's
also a function of the listener position.
But, of course, we can make a simplification, which is to assume that only
the relative distance between -- relative distance and orientation between the
source and listener matter. So if the sound source is located some distance,
D, at an angle theta from the listener, we kind of assume that this is similar
to -- so both of these have similar response. And we can probably relax this
simplification also, but this makes it easier to measure -- it makes you have
to measure at a fewer number of positions. Otherwise, you have to also change
the listener position.
Yes?
>>: Question. This relationship, so if we are talking about a basic head
related transfer function, where you basically combine the room response, how, I
mean, how does this [indiscernible] or, I mean, how good is this approximation?
>> Sanjeev Mehrotra: So I think it holds true for, of course, the direct path,
right? So it's -- in HRTF, you actually measure without a room, essentially,
right? And probably for the late reflections and the eventual room
reverberation, it doesn't matter much either. Because, you know, basically
just -- yeah. So I think it just matters the most for probably the earlier
reflections.
But I think if you're concerned primarily with locating where the sound source
is, that's more related to the HRTF, right? And it's kind of like essentially,
you're moving and the sound source is moving so that's kind of the effect you
would get. So if you shift two meters to the right and I also shift two meters
to the right, you know, probably how you hear me within this room won't change
that much, unless you're like very close to a wall or something.
>>:
[inaudible].
>> Sanjeev Mehrotra: So basically, what we're trying to do is we're trying to
minimize the number of positions that we have to capture this response.
>>: I know, but there's three things. Direct path.
>> Sanjeev Mehrotra: Yes.
>>: Reverberation, you can capture in other ways.
>> Sanjeev Mehrotra: Right.
>>: So the only thing that really matters, for the room itself, is the early
reflections, but those are not consequently changed.
>> Sanjeev Mehrotra: Right. So what's being measured here is
for a given location in a given room. So, I mean, you can definitely measure
it for, like let's say for this room, I can measure, you know, me moving around
the room as well, right?
But I think, you know, if the room is pretty large, then, you know, I think it
matters more if you're closer to, you know, the actual reflecting surfaces,
right?
So, of course, the main issue with measuring HRTF in this combined head and
room impulse response is that you can only -- you can place the listener at a
particular location, but you can only measure this, you know, by moving the
source at various locations.
So let's say I measure, you know, I place a source here and I measure the
response. And but then, of course, the source can be at a different location,
right?
So the natural thing is you can do interpolation from the points that you've
measured close by, right? And there's actually theoretical results to show,
you know, how the spacing should be done so that the interpolation is sort of
perfect using some sort of sinc interpolator. So you can actually, you know,
limit the number of measurements pretty significantly.
But then the interpolation that's needed to do the reconstruction is
essentially a sinc interpolator to do the perfect reconstruction. So there are
many simplifications you can do to the interpolation.
So one is, of course, you can just interpolate the impulse responses in time
domain. But doing that without any other processing is not good, because
there's different delays, right. So you may essentially, you know, if two
responses are off by 180 degrees, you essentially just cancel them to zero.
So the way people typically do it is they try to align the two impulse responses
that you're going to interpolate. And then once you do the alignment, you
can do the linear interpolation.
And doing temporal alignment in the time domain has several issues. One, of
course, is that it assumes that the response is linear phase, meaning that all
frequency components are shifted by exactly the same amount. And there's, of
course, complexity in determining the alignment amount.
So we're essentially going to do frequency domain interpolation. So for each
of these impulse responses, we can take the transform, right, and do
interpolation in the frequency domain. And, of course, when you're doing
interpolation in the frequency domain, you can -- let's say you're interpolating
this point from these four points. You can just do sort of like a bilinear
interpolation. You can interpolate along the radial direction and then
interpolate along this angular direction.
And the reason for using polar coordinates instead of using cartesian
coordinates for doing the interpolation is that we want to essentially scale
the power of the impulse response, because power is going to essentially
decrease with the square of the distance. So when you're in this coordinate system, it
allows you to make sure that the power is correct, depending on the distance
that you're at.
And so the interpolation is going to be done independently on the magnitude and
the phase of the impulse response. And the phase interpolation is going to
assume -- I mean, phase interpolation is sometimes difficult, because you don't
really know what the true phase -- I mean the true delay has been. We kind of
make assumptions that it's minimum phase and causal system.
And so the interpolation is kind of straightforward. For the magnitude, you
first, you know, do the angular interpolation, then do the interpolation in the
other coordinate, and then the main thing is that you want to make sure that
the power, you know, decays with the square of the distance.
So the main thing here is just that in this coordinate system, we can scale the
power correctly.
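A small sketch of this interpolation for one frequency response follows, interpolating magnitude and phase separately along the angular direction and then scaling the magnitude so that power falls off with the square of the distance; the simple unwrapped-phase handling here is an assumption standing in for the minimum-phase treatment described above.

    import numpy as np

    def interpolate_response(H1, H2, th1, th2, d_meas, th, d):
        """H1, H2: complex frequency responses measured at angles th1, th2 and
        distance d_meas. Returns an interpolated response at angle th, distance d."""
        w = (th - th1) / (th2 - th1)                      # angular interpolation weight
        mag = (1 - w) * np.abs(H1) + w * np.abs(H2)       # interpolate magnitudes
        ph = (1 - w) * np.unwrap(np.angle(H1)) + w * np.unwrap(np.angle(H2))
        mag *= d_meas / d                                 # power ~ 1/d^2, so magnitude ~ 1/d
        return mag * np.exp(1j * ph)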
And so the overall scheme is pretty straightforward, and essentially, the thing
-- the results I'm going to show is that we measure -- so for a given distance,
I'm going to show results. So you measure the actual response at three
different angles, okay? And you interpolate the middle one using the other
two. And then we're going to compare the true measured response with the
interpolated response.
And so here's the results. So this is for the left ear. This is for the right
ear. And the left column is essentially the impulse response. The right is
the frequency response. And basically, I show the true response is this dark
blue line. And using the frequency domain interpolation technique is this pink
line, and using the time domain interpolation technique is this light blue
line.
And the main thing to see here is that, you know, time domain interpolation,
you know, has -- so basically, from the time domain response, it looks like
it's working pretty decently. But if you actually look at the frequency
response, doing, of course, the frequency domain interpolation gives you much
better results, actually. And if you actually compare the SNR between the true
impulse response and the interpolated one, the frequency domain interpolation
technique gives you almost nine times the signal to noise ratio as doing the
temporal domain interpolation.
And yes?
>>: I notice in this case, the sample points are rather sparse.
>> Sanjeev Mehrotra: Yes.
>>: [inaudible] assume part of that is measurement difficulty?
>> Sanjeev Mehrotra: Right, right.
>>: I mean, instead of moving the [indiscernible], just rotate basically
either the human body or the dummy head. You can have a much [indiscernible]
resolution.
>> Sanjeev Mehrotra: Right. So that's one way of doing it, right. So I think
to get sort of a perfect reconstruction, the theoretical results show that you
need to at least measure almost at that rate, like five degrees or something
like that.
And so although I presented the results for the spatialization, you can also,
you know, apply the same interpolation techniques for the cross-talk
cancellation, because that also involves measuring the room response and the
HRTF. And the final thing is that, you know, if you're already in the
frequency domain, it's very easy to actually do the spatialization itself using
overlap-add methods for convolution.
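For instance, a minimal overlap-add convolution in the frequency domain might look like this; the block size and FFT handling are illustrative assumptions, not the implementation used in this work.

    import numpy as np

    def overlap_add_convolve(x, h, block=1024):
        """Convolve a long signal x with an impulse response h block by block."""
        L = len(h)
        nfft = 1 << int(np.ceil(np.log2(block + L - 1)))    # FFT size >= block + L - 1
        Hf = np.fft.rfft(h, nfft)
        y = np.zeros(len(x) + L - 1)
        for start in range(0, len(x), block):
            seg = x[start:start + block]
            yseg = np.fft.irfft(np.fft.rfft(seg, nfft) * Hf, nfft)
            y[start:start + len(seg) + L - 1] += yseg[:len(seg) + L - 1]   # overlap-add the tails
        return y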
Okay.
>>:
[inaudible].
>> Sanjeev Mehrotra: So I actually have a demo -- we haven't done any user
studies, but I have a demo to kind of simulate moving around. So basically,
like, the listener's in a particular location and let's say the sound source
moves in a particular pattern around the room. And --
>>: [inaudible].
>> Sanjeev Mehrotra: So you need to wear headphones. I mean, I can play it
here, but I don't think you'll get the effect. But if anyone's interested, I
can show it offline. Just I need to get the headphones for that. Yeah?
>>: So main reason for doing this combined thing is to reduce the number of
measurements? What is the main reason you're doing it combined rather than
separate [inaudible] responses.
>> Sanjeev Mehrotra: One of the main things is it gives you more realistic
sensation of the actual room, rather than modeling the room.
>>: Supposing you measure, you know, measured at a point, an [indiscernible]
microphone, an omnidirectional microphone, and a sound source in the room. You
just move the sound source around and measure, so you'd get a room impulse
response?
>> Sanjeev Mehrotra: Right. Oh, so you're saying measure the room response
and use HRTF separately? Yeah, so another reason to do the combined thing is
there's a complexity reduction when you do the combined impulse response. And
if you look at the -- I think one of the main things is once you do the
combined response, you can actually, you can actually break up the filter
pretty easily into a very short top filter, which is independent of position
and direction.
So that means -- so the short tap filter is dependent on position direction,
and the tail of the filter is essentially independent of position and
direction.
So it gives you a big complexity savings when you're actually doing the
convolution.
>>:
Isn't the short one just the HRTF?
>>:
Yeah the head is already really short.
>> Sanjeev Mehrotra: Yeah, so but the room response, you can also break it up
too, right? So you can break the room response into the long tail and the
short part.
>>:
[inaudible].
>> Sanjeev Mehrotra: Okay. So the next one is a depth map codec scheme, and
this is essentially a very low complexity realtime codec for coding depth maps
and the main advantage of this is -- yes?
>>:
[inaudible].
>> Sanjeev Mehrotra: I saw that. So the main advantage of this is that the
complexity is extremely low. You can encode and decode a frame in -- at the
rate of almost over 100 frames per second. And the other advantage is it gives
you compression ratios which are better than -- which are actually
significantly better than JPEG 2000 lossless mode as well as JPEG-XR lossless
mode. That's not to say that this is necessarily the best depth map codec, but
it's a very simple, buffer-less codec -- it requires just no frame buffering.
And so depth maps are very -- are becoming more and more common in media
processing, and, you know, there's examples of cameras such as Kinect and time
of flight cameras, and depth maps themselves are very useful for many tasks.
Processing tasks such as foreground/background segmentation, activity
detection, and also for rendering alternate viewpoints. So you capture a view
from a particular location and you capture the depth map and you can use that
to reconstruct alternate viewpoints.
But the main drawback is that if you're actually going to transmit a depth map,
a depth map is essentially 16 bits per pixel. And it's very, very costly to
transmit a depth map. And we wanted to kind of develop a depth map codec which
is extremely low complexity, doesn't require any frame delay. So no prediction
across the frames.
And although, I mean, you don't want to necessarily lossy code a depth map in
the same way you lossy code an image, because the final thing that you're
consuming is not the depth map itself, but the depth map is perhaps being used
to do some other tasks, and so instead of doing lossy compression, the simple
thing is just to do lossless to near-lossless compression so that you don't
necessarily lose fidelity when you're doing the task that you want to do.
And, of course, we wanted to make sure that it gives you much better
compression ratio than existing schemes.
And so here's an example of a depth map and each pixel is essentially 16 bit.
But the thing to note is, of course, there's very, very high correlation across
pixel values. And another main thing is that the sensor accuracy actually
decreases with depth. So if you're actually measuring the depth, which is
twice as far as some other depth value, then the error in the measurement is
going to be two to -- I mean, four times as large as the measurement error
that you would get for the closer values.
So the further out you go, the less accurate the sensor gets.
And this essentially means that, you know, if the sensor is less accurate, why,
you know, code it at full fidelity, right? So essentially, you can
quantize the values that are further out without losing any accuracy.
So the codec essentially consists of three components. One component takes
advantage of the last fact that I said, which is that the sensor
accuracy decreases with distance. So instead of coding the actual depth, you
code the inverse of the depth. And essentially, you know, you pick A and B. If you
pick A and B properly, you can actually maintain the full fidelity of the
sensor in your coding.
And if you actually choose A and B, you know, slightly off,
then you can actually do some minimal quantization of the depth values also.
So like if A is smaller than it needs to be, you're actually going to, you
know, end up quantizing the depth value. The main thing is you don't need to
know actually how this works, but the main thing is there's going to be a
parameter which you can control, which controls whether you're lossless with
respect to sensor accuracy or whether you've introduced some amount of loss.
Some small amount of quantization loss.
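A small sketch of this inverse-depth mapping; the functional form q = round(A / z + B) follows the description above, while the specific choice of A and B (and the handling of out-of-range zeros) is left as a parameter and is an assumption of this illustration.

    import numpy as np

    def encode_inverse_depth(z, A, B):
        """z: HxW uint16 depth map (0 = out of sensor range).
        Codes q = round(A / z + B); picking A and B to match the sensor's own
        resolution keeps this lossless with respect to sensor fidelity, while a
        smaller A introduces a small, controlled quantization."""
        zf = z.astype(np.float64)
        q = np.zeros(z.shape, dtype=np.int32)
        valid = zf > 0
        q[valid] = np.rint(A / zf[valid] + B)
        return q

    def decode_inverse_depth(q, A, B):
        z = np.zeros(q.shape, dtype=np.float64)
        valid = q != 0
        z[valid] = A / (q[valid] - B)
        return z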
And the other thing, of course, was that there's very high correlation between
pixels. So you can actually do a simple one-tap prediction. And then once
you do the prediction, you end up with long strings of zeroes. But the long
strings of zeroes, you don't want to pre-decide a distribution on zeroes or
anything. So a very effective technique to code the output of that is to use
an adaptive run length Golomb-Rice code. And this is a very low complexity,
lossless entropy coding. So I'll just talk a little bit about the adaptive
RLGR code.
So I mean, the details are that, you know, this is an adaptive scheme to code.
If you have a long string, there's many zeroes and then there's a level and
then there's a long string of zeroes and then there's a level. But sometimes,
you may have a long string of non-zero values as well. So sometimes you -- and
so the way it kind of works is you operate in one of two modes. A run mode or
a no-run mode. So in the run mode, you expect there to be long strings of
zeroes and there's a parameter that tells you how long of a string of zeroes
you expect. And in the no-run mode, you kind of are just coding levels
themselves. So in the no-run mode, each symbol just gets coded using the
Golomb-Rice code. And in the run mode, you know, you code the string of zeroes
using some symbol and then you code the level.
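A much-simplified sketch of this flavor of coding follows: a one-tap left-neighbor prediction produces residuals, which are then coded with an adaptive run-length Golomb-Rice style encoder. The adaptation rules and bit layout here are illustrative assumptions, not the exact RLGR code used in the codec.

    def rice(value, k):
        """Golomb-Rice code of a non-negative int: unary quotient, then k remainder bits."""
        bits = "1" * (value >> k) + "0"
        if k:
            bits += format(value & ((1 << k) - 1), "0" + str(k) + "b")
        return bits

    def zigzag(v):
        """Map signed prediction residuals to non-negative integers."""
        return (v << 1) if v >= 0 else (-v << 1) - 1

    def rlgr_encode(residuals, k_run=2, k_level=2):
        """residuals: prediction residuals (each pixel minus its left neighbor)."""
        out, i, n = [], 0, len(residuals)
        while i < n:
            m = 1 << k_run                          # expected run of zeroes
            run = 0
            while run < m and i + run < n and residuals[i + run] == 0:
                run += 1
            if run == m:                            # a complete run: one bit, expect longer runs
                out.append("0")
                i += m
                k_run = min(k_run + 1, 8)
            elif i + run >= n:                      # trailing zeroes at the end of the frame
                out.append("0")
                break
            else:                                   # partial run, then a nonzero level
                out.append("1" + format(run, "0" + str(max(k_run, 1)) + "b"))
                out.append(rice(zigzag(residuals[i + run]), k_level))
                i += run + 1
                k_run = max(k_run - 1, 0)           # expect shorter runs
        return "".join(out)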
And the other nice thing about this is that you don't pre-decide your
distribution of zeroes; you kind of learn it as you go. And so what I'm
going to show is basically the compression ratios and the encoding and decoding
times for this codec. And I'm going to code two sets of depth maps. One set is
slightly more complicated because it actually has a large range of depth
values. So in this left one, even the background has depth values. In this
one, in the right one, this is a simpler set. And the reason it's simpler is
that a lot of the background values are actually out of the sensor range. So
they're actually all coming up as zero. So there's less information to code here.
And I'm going to compare five methods for coding. One is basically just
numerically lossless. There's absolutely no loss between the original and the
coded. And the way you do that is that you essentially don't do the inverse
coding, okay?
The next one is essentially lossless with respect to sensor fidelity. And then
there's two more, which are slightly lossy. They're pretty close to lossless,
but slightly lossy. And then I'm going to compare the compression ratio with
JPEG 2000 lossless mode, which -- so I tried many different codecs: JPEG 2000
lossless, JPEG XR lossless, you know, PNG and many different things. Actually
it turned out that of all those, this was the one that actually performs the
best.
And so here is the compression ratio for this set one, which is the one that's
more complicated. So this is the numerically lossless coding. So, you know,
you get about a 5-to-1 compression, numerically lossless. And this is, you
know, essentially lossless with respect to sensor fidelity. You get about
6-to-1. And these are slightly lossy. And with slight quantization, you can
get pretty good compression ratios of 15-to-1.
And as a comparison, JPEG 2000 compression ratio is about -- is under 4. This
is the lossless mode of JPEG 2000. And these are median results, and even the
max, 90th percentile results are not much worse than this.
And, you know, here's the encoder speed in millisecond. Decoder speed in
milliseconds. And this is for set two. So set two is much simpler to code.
And in this set, you can see that even numerically lossless, you can get about
13-to-1 compression. And, you know, even with slight loss, you can get 35-to-1
compression. And JPEG 2000 lossless is under ten.
And this set is actually very easy to code. So, you know, in about seven
milliseconds, you can encode and decode a frame.
And then so if you do numerically lossless, then any processing you do with a
depth map is going to be identical to the uncoded one. But what I want to show
here is that even if you do, let's say, lossy coding of the depth map or near
lossless coding of the depth map, what I want to show you is that processing is
unaffected by this small amount of quantization.
So the way to do that is that we capture a view from multiple viewpoints. So
we have essentially, I guess, many cameras which are capturing the same scene
from many viewpoints, and this is the viewpoint, you know, one given view point
and this is a depth map corresponding, and all the cameras are calibrated. So
this is the original viewpoint. This is the uncoded depth map and this is the
coded depth map with a slight amount of loss. And this is an alternate
viewpoint. So basically, the viewpoint before. So this, this RGB map and this
depth map are used to reconstruct another viewpoint. And this is the original
that you capture from that other viewpoint and this is the reconstruction you
get using an uncoded depth map and this is the reconstruction you get using a
coded depth map.
And the results you get from using the coded depth map versus uncoded depth map
are pretty identical.
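A bare-bones sketch of that kind of depth-based reprojection, under a pinhole camera model with known calibration; there is no hole filling or z-buffering here, and the variable names and matrix conventions are assumptions for illustration rather than the actual rendering pipeline used for these results.

    import numpy as np

    def reproject(rgb, depth, K_src, K_dst, R, t):
        """Warp an RGB view with its per-pixel depth (source camera frame)
        into another calibrated viewpoint given by rotation R and translation t."""
        h, w = depth.shape
        ys, xs = np.mgrid[0:h, 0:w]
        valid = depth > 0
        z = depth[valid].astype(np.float64)
        pix = np.stack([xs[valid], ys[valid], np.ones(z.size)], axis=0)
        pts = np.linalg.inv(K_src) @ pix * z              # back-project to 3D
        pts2 = R @ pts + t.reshape(3, 1)                  # move into the destination camera
        proj = K_dst @ pts2
        u = np.round(proj[0] / proj[2]).astype(int)
        v = np.round(proj[1] / proj[2]).astype(int)
        out = np.zeros_like(rgb)
        ok = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (proj[2] > 0)
        out[v[ok], u[ok]] = rgb[valid][ok]                # nearest-pixel splatting
        return out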
And this kind of just shows you the PSNR when you're reconstructing five
different viewpoints. So this, you know, and they're kind of concatenated back
to back, actually. So this is viewpoint one, viewpoint two, viewpoint three,
viewpoint four, viewpoint five. And this essentially compares the original
capture with the reconstruction you get. And this is basically using lossless
and some amount of loss. Basically, what you can see is even if you code the
depth map, you know, to some small quantization, you effectively don't hurt the
reconstruction at all. You get the same reconstruction as you would if you use
an uncoded depth map.
>>:
[inaudible].
>> Sanjeev Mehrotra: This is a PSNR between the original capture and the
reconstructed. And reconstruction is done using another view point.
>>: [inaudible].
>> Sanjeev Mehrotra: Yes, for the RGB.
>>: [inaudible] is better than lossless.
>> Sanjeev Mehrotra: Yes. Yeah, 750, yeah, slightly better. I mean, but it's, I
mean, there's no -- I mean, let's put it this way, the reconstruction itself is
not 100% accurate, right? So, I mean, you could go either way when you code
it, right?
>>: [inaudible].
>> Sanjeev Mehrotra: Yes.
>>: Yeah, ground truth is measured too. So that area is probably larger.
>> Sanjeev Mehrotra: Yeah, right. And this is showing you the -- this is
comparing the reconstruction using an uncoded depth map with a reconstruction
using a coded depth map. And, you know, at various quantizations. So, you
know, with near lossless, it's almost the same. And even with this, you know,
even with this, this much amount of quantization, the results are, you know,
the PSNR between the reconstruction is not that bad.
And basically, the conclusion is that, you know, it's very low complexity
codec, and it gives you better compression than existing, you know, lossless
coding schemes. Okay? Yes?
>>: So I can kind of accept the fact that the depth values are [inaudible]
using different things, but when you take the inverse of that, and you
[inaudible].
>> Sanjeev Mehrotra: Well, actually, yeah, they're pretty much identical.
That's the thing. So I mean, the Kinect sensor, data coming from the Kinect
sensor actually has identical values for neighboring pixels.
>>: [inaudible] slight moderation or is it more coming away from it, or are
they perfectly linearly related to each other? [indiscernible].
>> Sanjeev Mehrotra: So it won't be a linear relationship, but it may be -- I
mean, it may be a monotonic relationship.
>>: So they looked better on the original samples than the [inaudible]
samples, right?
>> Sanjeev Mehrotra: So you're saying basically the inverse will amplify the
difference between two successive values, yeah.
>>:
But the sensor [indiscernible] is not [inaudible].
>> Sanjeev Mehrotra: So the thing is, I think, the Kinect sensor itself is
probably doing some sort of filtering too.
>>: At longer range, you can see the effect. [inaudible] within a foot or
two, it's, the quantization error is almost the same. No? What is the curve?
>>:
[inaudible].
>> Sanjeev Mehrotra: The main thing is there's also some amount of filtering
going on with the data coming from the sensor itself.
>>:
I know this is a fairly [indiscernible] you guys get a [indiscernible].
>>:
We have also [indiscernible].
>>:
What is the noise level new sensor compared with the old sensor?
>>:
[indiscernible].
>>:
Same noise characteristics?
>>:
Kinect would be different.
>>:
Time of flight?
>>:
Time of flight is different.
>>: And if moving to the [indiscernible], I mean, [indiscernible] is that
noise extends?
>>:
For Kinects, for time of flight, probably not [indiscernible].
>>:
You mean the noise level?
>>:
No, no.
>> Sanjeev Mehrotra: I think there's other ways to do quantization, even if
you don't do the inverse coding. Even without the inverse coding. Inverse
coding gives you an additional 30% or so.
>>: I think the inverse coding still basically benefits, because if you're far
away, the importance of accuracy is less.
>> Sanjeev Mehrotra: I think essentially it's like a nonlinear quantization,
right? So the further out you go, the [indiscernible] get larger, right?
>>:
That was my question, you try to [inaudible].
>> Sanjeev Mehrotra: No. But imagine you would probably get -- I mean, the
gains may not necessarily be as large, because many pixels can be pretty well
predicted from the top, I mean, from the previous pixel. So only probably the
boundary pixels, like if somebody -- and even from frame to frame --
>>: [inaudible].
>> Sanjeev Mehrotra: Right. But what I'm saying is like the only thing
remaining after this pixel differencing is essentially the edges, right? And
the edges are likely to move anyway from frame to frame, right? Right. So and
the background is already pretty well zeroed out. So there may not be as much
gain by doing frame differencing.
>>:
[inaudible].
>> Sanjeev Mehrotra: I think there's many ways things like this can be
improved. But the main thing again is like, you know, this gives you already
like pretty good compression ratios and gives you encode and decode frame
rates of close to 100-plus frames per second, and that's very important for any
realtime process -- realtime application such as if you're using this for video
conferencing, you don't want to do complicated things, right? You want to get
as good compression as you can for very, very low complexity.
>> Zhengyou Zhang: Okay. So let's thank Sanjeev again.
[applause]
>> Zhengyou Zhang: So I'm going to talk about the ViewMark system. This is an
interactive video conferencing system for mobile devices. And this is joint
work with my summer intern, Shu Shi from UIUC. So nowadays, it's very common
that you have to join remote meetings from your home, your office, your
[indiscernible]. And ideally, you want to have a similar experience as
face-to-face with remote participants. So this means you want to see what you
want to see, you want to see who is talking. You want to see what's the
reaction from other people.
And from the audio aspect, you want to hear voices coming from different
directions, so spatial audio, like what Sanjeev mentioned earlier. And also,
you want to share documents and objects with other people. So that's the ideal
case. So this motivates some high end [indiscernible] or HP [indiscernible]
room.
But this is very expensive; each room costs about half a million U.S. dollars.
And another device which is much more affordable is the Microsoft roundtable. This device has five cameras which produce panoramic video. So if you put the device at the center of the table, then the remote person can see everyone around the table. And also, there are six microphones arranged in a circle, and the microphone array processing can determine who is talking, so the device can send the active speaker video to the remote party.
So here is the interface of the roundtable. So this is the active speaker window and this is the panoramic video. And here you can share screens with other people.
And more and more, people are now using mobile devices to communicate with others. And since the mobile device is always with you, it's very easy to reach out or to be reached. And with iPhone FaceTime, more and more people are now using video for communication.
But video on the mobile device is still pretty limited. So if you try to join a remote meeting room, it's probably pretty poor quality. So this is probably the best you can get: it's very hard to see the speaker's face, the content on the display is probably very hard to see, and you cannot see other people talking unless you ask someone to move the phone around.
Okay. So it's not very immersive. Active speaker selection has to be done manually by asking other people, and there's no data or presentation sharing with iPhone FaceTime.
So to improve the experience of video conferencing, we developed this ViewMark system. And here I will first show the ViewMark system with the roundtable. So the mobile device is connected to a roundtable, and the roundtable has been modified so we can produce spatial audio. A standard roundtable gives you only single-channel audio, and we modified it so we can get spatial audio.
And here is the interface of the ViewMark on the mobile device. You have this remote video; as I mentioned earlier, a roundtable can produce 360-degree panoramic video, but because of the screen size on the mobile device, you can see only a fraction of it. And I will cover later how to [indiscernible] panoramic video. And you have a [indiscernible] video which can be minimized to reduce the computation cost on the mobile device, and you can use a portion of the screen to share data with [indiscernible].
And on the right here, this is a screen shot with bookmarks. So when you feel a [indiscernible] is interesting, you can bookmark the location, and later you can come back quickly, and it sets [indiscernible] here.
So what we do here is, since it's remote, the roundtable can give you panoramic video, and you see only a portion of it. So one way is to use the finger touch -- use the touch to move around, and to determine which party you are interested in watching.
Another way is to leverage the inertial sensor on the mobile device. Basically, you can move the mobile device around to see different parts of the remote room. And then if you say, okay, this is a location I want to watch probably later, you can tap on it and create a bookmark here, and you can create another one, okay. So you can create as many bookmarks as you like.

And then later on, if you want to go back to that location, you just double tap on any of the bookmarks and you switch -- you get the live view immediately from that viewpoint.
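A minimal sketch of the interaction just described, using made-up names (this is not the actual ViewMark code): the viewer keeps one pan angle into the panorama, updates it from touch drags or the inertial sensor, and stores bookmarks as saved angles that a double tap jumps back to.

class PanoramaViewer:
    """Toy model of the interaction described above (not the ViewMark code):
    a single pan angle into the 360-degree panorama, updated by touch drags
    or by the device's inertial sensor, with bookmarks stored as saved angles."""

    def __init__(self, fov_deg=60.0):
        self.pan_deg = 0.0       # center of the visible window, in degrees
        self.fov_deg = fov_deg   # how much of the panorama fits on the phone screen
        self.bookmarks = []      # saved pan angles, in the order they were tapped

    def drag(self, delta_deg):
        """Finger touch: shift the visible window left or right."""
        self.pan_deg = (self.pan_deg + delta_deg) % 360.0

    def point_device(self, yaw_deg):
        """Inertial sensor: aim the phone at the part of the remote room to watch."""
        self.pan_deg = yaw_deg % 360.0

    def add_bookmark(self):
        """Tap: remember the current viewpoint; returns the bookmark index."""
        self.bookmarks.append(self.pan_deg)
        return len(self.bookmarks) - 1

    def jump_to(self, index):
        """Double tap on a bookmark: switch back to that viewpoint immediately."""
        self.pan_deg = self.bookmarks[index]

    def visible_window(self, panorama_width_px):
        """Which slice of the panorama to display: (center pixel, width in pixels)."""
        center_px = int(self.pan_deg / 360.0 * panorama_width_px)
        width_px = int(self.fov_deg / 360.0 * panorama_width_px)
        return center_px, width_px

In this sketch a bookmark is just a saved angle, so switching views only changes which slice of the panorama is displayed; the real system may handle this differently.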
Also, our system can do data sharing. If you are in portrait mode, then the data is shared in the lower part of the screen. And if you switch to landscape mode, then you get a full screen of the screen shared by the remote meeting room.
And the data sharing software was provided by Amazon Asia. And here, I will
show you a video of the system in action. So it's a live recording of the
demo.
>>: This is my roundtable and I'm presenting some slides here. Some of my colleagues are joining remotely. We'll see what the experience with the mobile device looks like.
>> Zhengyou Zhang: So I'm on the remote side with the mobile device. So you get the video.
>>: You see a video from the remote room. As you saw earlier, in the remote room there's a roundtable, which gives us 360-degree panoramic video, so I can pan it around to see different people, okay? Here I see Rajesh, and I can see further [indiscernible], okay? So I can bookmark different locations.
>> Zhengyou Zhang: I don't know why I went back to [indiscernible].
>>: And later on, if I want to see Rajesh, I can double tap and immediately switch to Rajesh. And this is the video of myself, just to check how I look in the meeting, with the spatial audio.
>> Zhengyou Zhang: So we cannot demonstrate spatial audio. So here you can hear the voices coming from different directions because of our spatialized audio.
>>: Actually, very immersive in the meeting. And next, I will show you the data collaboration part, the data sharing part. So this is data from the remote [indiscernible], and let's take a look at a full view here. So here, we see it's a full screen of the remote desktop. I can move around and you can see, it's fairly easy.
>> Zhengyou Zhang: You're seeing the [indiscernible]. Anyway, so that's the demo. Okay.
Okay. So we also developed connecting ViewMark with a new device called a Spintop. The Spintop is shown here on the bottom. It can spin with a programmable motor, which can be controlled from the computer. So you can see here, we just put the laptop on top of the Spintop device with a camera, and you have a screen. So basically, it can serve as a proxy of a remote participant.
And the system works exactly the same way; like I said, the interface is the same, everything is the same, except now the panning is not choosing a particular view of the panoramic video. Now, panning on the mobile device physically controls the Spintop device. So you can move it around remotely, okay, to control the device. It just gives the remote person a lot of control.
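That design point, with the same panning UI driving either the panorama or the physical Spintop, could look roughly like the sketch below; the backend classes and the motor call are hypothetical names introduced purely for illustration, not APIs from the system.

class PanoramaBackend:
    """Roundtable case: a pan just selects a window into the 360-degree video."""
    def __init__(self, viewer):
        self.viewer = viewer                 # e.g. the PanoramaViewer sketched earlier

    def pan_to(self, angle_deg):
        self.viewer.point_device(angle_deg)  # purely an image-side operation


class SpintopBackend:
    """Spintop case: the same pan physically rotates the device in the remote room."""
    def __init__(self, motor):
        self.motor = motor                   # hypothetical motor controller object

    def pan_to(self, angle_deg):
        self.motor.rotate_to(angle_deg % 360.0)  # assumed motor call, for illustration


def on_pan_gesture(backend, angle_deg):
    """The UI calls the same entry point in both setups; only the backend differs."""
    backend.pan_to(angle_deg)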
So in summary, the ViewMark system gives you a two-way interactive audio-video conferencing experience. It's immersive, with binaural spatialized audio, and the mobile user can control the video stream to view different parts of the remote room. You can bookmark viewpoints of interest and switch views quickly. You also have the active speaker view provided by the roundtable device. And you can share screens and displays.
And looking forward, the system can be improved in a number of ways. For example, currently the ViewMarks are arranged sequentially, depending on when you tap on them. Ideally, we want to have the ViewMarks arranged in a way consistent with the physical locations. Okay?
And a second possibility is to do face detection and active speaker detection as well, so you can have suggestions from the system about who the active speakers, or possible speakers, are. Then you don't need to tap on the device; it can be automatically populated as bookmarks on the mobile device.
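A hedged sketch of how that suggestion step might look; the face detector and the audio direction estimate are placeholders rather than an existing API. Detected faces become suggested bookmarks, and the one closest to the audio direction is flagged as the likely active speaker.

def suggest_bookmarks(panorama_frame, detect_faces, speaker_direction_deg=None):
    """Sketch of the suggestion step described above. `detect_faces` and
    `speaker_direction_deg` stand in for whatever face detector and
    microphone-array direction estimate the system would actually use."""
    width = panorama_frame.shape[1]             # assumes an image array (height, width, ...)
    suggestions = []
    for (x, y, w, h) in detect_faces(panorama_frame):
        angle = (x + w / 2.0) / width * 360.0   # face center mapped to a pan angle
        suggestions.append({"angle_deg": angle, "active": False})
    if speaker_direction_deg is not None and suggestions:
        # mark the face closest (on the circle) to the audio direction as the likely speaker
        closest = min(suggestions, key=lambda s: abs(
            (s["angle_deg"] - speaker_direction_deg + 180.0) % 360.0 - 180.0))
        closest["active"] = True
    return suggestions                          # each entry could become a suggested bookmark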
And now, if you move from the mobile device, a small smartphone, to a tablet which has a bigger screen, you probably want to have a better UI design.
And here are the acknowledgments. We have received quite a lot of help from different people. And thank you for your attention.
[applause].
>> Zhengyou Zhang: I just need two minutes for questions.
>>: You mentioned using face detection [inaudible]. What if, by the time you use the detection and locate the location, then you come back to that view, the person has moved?
>> Zhengyou Zhang: That's possible. Then you need to update, yeah. If you can track the persons, you can update the location automatically. Because what the ViewMark sees is the location; in the Spintop case, it's just the angle of the device. So it can be updated depending on the face detection and tracking.
>>: So the Spintop is [indiscernible].
>> Zhengyou Zhang: Yeah, yeah, it's like a proxy, right, a physical proxy of the remote person. So you can put a Spintop at the end of the meeting room table. Okay. So thank you very much.