>> Michael Cohen: Morning, it is my pleasure to welcome Li Zhang here. I’ve known him for almost 15 years, probably since he arrived at the University of Washington across the lake as a graduate student. He did a lot of great work in computational photography back then, continuing to this day. I guess about five or six years ago we lost him to the other UW, which is the University of Wisconsin. He’s going to talk to us today about some new image sensors that are coming out and how to use those sensors both for understanding the world and for producing great images, I guess.
>> Li Zhang: okay, yeah…
>> Michael Cohen: I’ll leave…
>> Li Zhang: Thank you Michael for the introduction and thank you for having me here. So let me start the talk by showing you a very interesting picture. This is a picture taken in 2005, when the Pope was announced. That was eight years ago, and there’s another comparison picture like this. I guess for researchers in this field, when we see this we get very excited, because mobile devices are really a part of our life. We’re using them to take pictures, so if we can do anything on the camera, or to the pictures that are taken, we may very likely impact our daily life.
Okay, so as a researcher I’m interested in several questions in this area. One is, usually when we want to take good pictures we use a big camera, right? But now most likely we’re going to use small cameras. Can we improve the imaging performance of these small cameras to make them more comparable to this? Okay, so that’s one type of question, for example dynamic range and low light performance. The second question is, once we are able to very easily take pictures, we’re going to have a lot of pictures. How do we browse and navigate through these images?
Okay, so there’s a third question, which is how do we extract some knowledge from these pictures, like doing face recognition or more general object segmentation and recognition, okay? I have done a little bit of work in these areas, so I will discuss these works with you. I will start with the first one: high dynamic range images. High dynamic range images are important, and you can use iPhones these days to take HDR images. As we all know, the way it works is you take a succession of pictures with different exposure times. So this longer, wider bar means longer exposure. We merge them and we get this nicer looking picture, okay.
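To make the merging step concrete, here is a minimal sketch of how an exposure bracket can be combined into one radiance estimate. It assumes the frames are already aligned and linear (gamma removed); the hat weighting and saturation threshold are illustrative choices, not the exact weights used in any particular HDR pipeline.

```python
import numpy as np

def merge_exposures(images, exposure_times, saturation=0.95, eps=1e-6):
    """Merge an exposure bracket into a single HDR radiance map.

    images: list of float arrays in [0, 1], assumed linear (gamma removed) and aligned.
    exposure_times: list of exposure times in seconds, one per image.
    Pixels near saturation are down-weighted so clipped highlights in the
    long exposures do not corrupt the estimate.
    """
    numerator = np.zeros_like(images[0], dtype=np.float64)
    denominator = np.zeros_like(images[0], dtype=np.float64)
    for img, t in zip(images, exposure_times):
        # Simple hat weighting: trust mid-tones, distrust the clipped/noisy ends.
        w = np.clip(1.0 - np.abs(2.0 * img - 1.0), 0.0, 1.0)
        w[img > saturation] = 0.0
        numerator += w * img / t       # each frame estimates radiance = pixel / exposure
        denominator += w
    return numerator / (denominator + eps)
```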
So we can see that the reason we want the longer exposure is that this part is so dark, if we use a short exposure we cannot see too much. But if we use a long exposure, then when the scene is moving we’re going to have blur here. Okay, so then that becomes a problem. Not too many works have addressed this issue, I mean addressed the blur issue, when we reconstruct high dynamic range images, okay.
Okay, so the issue is, you know, we have a lot of noise in this dark region. So maybe an alternative is this: we just take a succession of short exposure pictures and maybe we can somehow remove the noise by aggregating all the measurements here. Okay, so this is another alternative. Then the question is, which is better? Shall we do this, trying to remove the blur here, or shall we remove the noise in this sequence? Because when you use a short exposure time, within a certain period of time you can take more pictures, okay. So there’s this design question: which scheme should we use? Should we take a sequence of short, noisy exposures, or should we, you know, take this type of sequence?
I think in 2010 we had a paper that did an analysis comparing these two schemes. Our conclusion is that this tends to work better in practice: you want to estimate the motion between these frames and then remove the noise. There is some more general analysis from the [indiscernible] group which also compares this type of image capture with other coded capture imaging, for example flutter shutter and other coded capture schemes, to see which is better and in which scenario, okay.
So I will show you some examples of my work. So for example, is there a way to dim the light?
>> Michael Cohen: Uh, yeah.
>> Li Zhang: No, no that’s fine, ha, ha, yeah.
>> Michael Cohen: it’s a trade-off between the people there.
>> Li Zhang: Oh I see, u-huh.
>> Michael Cohen: So you just speak and the…
[laughter]
>> Li Zhang: Oh, okay, ha, ha. So for example, in this low light condition we take a picture of this birthday cake, and if we just amplify the pixel intensities we tend to get this, okay. So the idea is, how about we take a very short sequence of noisy images? This is like a [indiscernible] mapping result. When you do this your hand may very likely be shaking, but since you take a very short sequence you can recover all these nice details.
Okay, this is one example. Let me show you another example. So this is the noise level, and there’s a motion; this motion is smoother, okay. So this demonstrates the concept: we can capture a sequence of images and still reconstruct a high dynamic range image. Okay, the question is, is the problem solved? When you think about it, not quite. Because what does that mean? If you want to take high dynamic range video on a cell phone, then your cell phone has to, you know, work at this very high frame rate. And usually that means you’re going to consume a lot of power, okay.
So then we thought, okay, we take a high frame rate because we want to avoid the motion, but sometimes the scene is not moving. So maybe we want to do this image sampling in a more adaptive way, meaning that if there’s no motion we want to slow down, we can use a longer exposure time, okay. That’s another work we did: use motion estimation to control the exposure time. Usually the exposure time is controlled by the brightness. If it’s bright enough you use a short exposure, if it’s dark then you use a long exposure. But we can instead use motion to control the exposure time, okay.
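A rough sketch of this kind of motion based exposure control is shown below, assuming OpenCV is available for the flow estimate. The 30 fps frame interval, the 95th-percentile statistic, and the one-pixel blur budget are illustrative assumptions, not values from the talk.

```python
import numpy as np
import cv2

def choose_exposure(prev_gray, curr_gray, t_min=1e-3, t_max=3.3e-2, target_blur_px=1.0):
    """Pick the next exposure time from the apparent motion between two frames.

    Estimates dense optical flow (Farneback) between the previous two frames,
    takes a robust (95th percentile) flow magnitude in pixels per frame, and
    sets the exposure so the expected blur stays near `target_blur_px` pixels.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    speed = np.percentile(np.linalg.norm(flow, axis=2), 95)  # px per frame
    frame_dt = 1.0 / 30.0                                    # assumed frame interval
    speed_px_per_s = speed / frame_dt
    if speed_px_per_s < 1e-3:
        return t_max                     # static scene: expose as long as allowed
    return float(np.clip(target_blur_px / speed_px_per_s, t_min, t_max))
```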
I think maybe three years ago there were some cameras that could do this, but only for static shots. So far we haven’t found video cameras that can do this. So we built a prototype to experiment with this concept. I will show you how it works. This is like a regular video as we’re moving. You can see that once we move, with this constant exposure, things get blurred, okay. But here, once it starts moving, things get noisier. You see the difference? Here, once it moves, we have an underlying motion detection. You get a noisy reconstruction and a noisy capture because we reduce the exposure time. Here the exposure time is constant, so you get the blur, okay.
So the idea is okay, maybe from this type of measurement we can get a better image reconstruction.
Yes?
>>: Is the motion estimated from, you know, like the inertial…
>> Li Zhang: No, it’s sort of image based, like Lucas-Kanade type of motion estimation.
>>: Right, but it has to be fast enough to drive the camera and this is already a high speed capture.
>> Li Zhang: Right, so we didn’t do this on a phone. We have a laptop connected to a Point Grey camera. So the motion estimation is done on the laptop, yeah. So, any other questions?
Okay, so then you can see that at a particular moment, if this thing is moving fast it’s blurry here, and in this capturing scheme it’s noisy. In order to evaluate this scheme we did a synthetic experiment. Suppose we have a panoramic image and we have a simulated camera moving this way, okay. First, if you look at the dashed red line, that’s the signal to noise ratio in the input, meaning that we use a constant short exposure, so every frame is very noisy. The signal to noise ratio is very low.
Then we can apply some video denoising method. For example we used the Liu and Freeman method, I think from maybe three years ago. You can bump up the signal to noise ratio to this, so it works quite nicely. But if you use our adaptive sampling, originally it’s here, static. That means the signal to noise ratio is pretty high. Then you start to move, and when moving fast you reduce the exposure time and suddenly the signal to noise ratio becomes low. Then once you come here the motion stops. So you see that the dashed black line is the signal to noise ratio in this adaptive capture, in this motion based exposure control capture, okay.
Then based on this you can do some video denoising and we can get a better signal to noise ratio. Considering for these good input images we get a very high signal to noise ratio, we can get some improvement like this.
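For reference, the signal to noise ratio plotted in these curves can be computed as below against the clean ground truth. This is the standard definition in decibels and may differ in detail from the exact metric used in the paper.

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB of `estimate` against a clean `reference`."""
    reference = reference.astype(np.float64)
    signal_power = np.mean(reference ** 2)
    noise_power = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(signal_power / noise_power)
```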
>>: So this is all single frame results, right?
>> Li Zhang: Well actually it’s sort of aggregating multiple frames.
>>: So the solid lines are aggregating?
>> Li Zhang: Yeah, yeah, the key benefit we can get is, you know, we have very good frames here, very good frames here. So we can compute an optical flow, and if we give more weight here, more weight here, then when we do the reconstruction we can get higher improvement. We can get better…
>>: Why does the…
>> Li Zhang: Reconstruct…
>>: Black dashed line not go all the way down to where the red dashed line is?
>> Li Zhang: Actually it depends on the speed. Here it touches the red line, yeah.
>>: So that means it’s still exposing more in that first state?
>> Li Zhang: That means here the motion is really fast. So this is a conservative way of setting the exposure to make it very short.
>>: So you [indiscernible] motion in scene…
>> Li Zhang: Uh-huh.
>>: That the objects move around…
>> Li Zhang: Uh-huh.
>>: [indiscernible].
>> Li Zhang: Yeah, if everything is moving then you probably end up, so that’s the worst case. You
probably end up with something like this. Because we keep…
>>: [inaudible]
>> Li Zhang: Right, right, right, right, yeah. Okay, so here are some comparisons. Again these are synthetic examples so that we can have ground truth. This is the constant exposure time, it’s blurry, and this is the noisy input. We can apply this CBM3D method on each frame to get the noise removed. This is the newer Liu and Freeman method to remove the noise. This is our method; you can see that, in terms of sharpness, our method is probably more comparable to this. This one is almost the same, slightly blurrier. This is the ground truth, and if we compare the signal to noise ratio, this is the input, the red solid line is the input. We have the two others, the CBM3D method and the Liu and Freeman method, somewhere here, and our method can do better, okay.
This is another example. In this case I think both the Liu and Freeman method and the CBM3D method over smooth the result a little bit. In this result the difference is more clear: our result is more comparable to the ground truth. The key reason is that our method switches between spatial denoising and temporal denoising. If the motion estimation is unreliable we use a single frame; otherwise we use the temporal information. So again, this is the comparison, and this black line is our result, okay.
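A minimal sketch of that switching idea: blend a single-frame (spatial) estimate with a motion-compensated (temporal) estimate using a per-pixel confidence derived from the warping residual. The Gaussian falloff and threshold value below are illustrative, not the weighting actually used in the paper.

```python
import numpy as np

def blend_spatial_temporal(spatial_est, temporal_est, warp_residual, tau=10.0):
    """Per-pixel blend of a spatial (single-frame) and a temporal (multi-frame)
    denoising estimate, driven by how well the motion-compensated neighbour
    frame matched the current frame.

    warp_residual: absolute difference between the current frame and the
    neighbour warped by the estimated flow; large values mean the flow is
    unreliable there. `tau` is an illustrative soft threshold in gray levels.
    """
    conf = np.exp(-(warp_residual.astype(np.float64) / tau) ** 2)
    return conf * temporal_est + (1.0 - conf) * spatial_est
```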
Okay, so the last question is, is this all? Not really. Can you think of any issues with this? There are some issues. First of all, when we use the motion to control exposure we can only measure the previous two frames, right, and then use that motion to predict the next frame, but this prediction is always delayed. If you’re standing still or moving slowly and suddenly you move fast, then, you know, the next frame is wrong, okay.
So that’s the main problem. In the end we always have some blurry frames in the result. In order to enhance them we have to somehow do some registration between a sharp frame and a blurry frame.
>>: But when you say there’s the delay…
>> Li Zhang: Uh-huh.
>>: I mean IMUs can run at two hundred hertz or something like that, right, so.
>> Li Zhang: Uh-huh.
>>: And most cell phones have some sort of inertial measurement built in…
>> Li Zhang: Uh-huh.
>>: Right.
>> Li Zhang: Right, right, right, but even if the camera is absolutely static there could be scene motion.
>>: That’s true.
>> Li Zhang: So that kind of motion needs to be estimated from the frames.
>>: Uh-huh.
>> Li Zhang: Yeah, so okay, then you have to do some registration between the blurry and the sharp image. That becomes a tricky optical flow question because the brightness constancy is violated. If you just use a regular flow, then you’re going to somehow move these pixels to match with these blurred intensity patterns, and the flow will be wrong.
Okay, so we addressed this problem in a paper at last CVPR. The basic idea is we want to compute a flow such that if you use that flow to blur this one, the blurred version will match up with this guy. Okay, so that’s the basic idea. It works quite nicely for these interior regions, but near the occlusion boundaries, the depth discontinuities, it’s very hard, it’s very hard. So then I think we can keep improving and writing papers, but in terms of having a really robust system I think this is probably not the way to go, okay.
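The data term behind such a blur-aware flow can be sketched as follows: warp the sharp frame along fractions of the candidate flow, average the warps to synthesize the blur that flow would produce, and compare against the captured blurry frame. This only illustrates the forward model under those assumptions; the actual paper’s optimization and occlusion handling are not reproduced here.

```python
import numpy as np
import cv2

def blur_by_flow(sharp, flow, n_samples=8):
    """Synthesize the motion blur a candidate flow field would produce.

    The sharp frame is warped along fractions of the flow (spanning the
    exposure interval) and the warps are averaged. If the flow is correct,
    this synthetic blur should match the captured blurry frame.
    """
    sharp32 = sharp.astype(np.float32)
    h, w = sharp32.shape[:2]
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    acc = np.zeros_like(sharp32)
    for i in range(n_samples):
        alpha = i / float(n_samples - 1)
        map_x = xs + alpha * flow[..., 0].astype(np.float32)
        map_y = ys + alpha * flow[..., 1].astype(np.float32)
        acc += cv2.remap(sharp32, map_x, map_y, interpolation=cv2.INTER_LINEAR)
    return acc / n_samples

def flow_data_cost(sharp, blurry, flow):
    """Residual between the synthetic blur and the captured blurry frame."""
    return float(np.sum((blur_by_flow(sharp, flow) - blurry.astype(np.float32)) ** 2))
```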
So then we came up with this idea, a sort of randomized exposure scheme. If you look at this, this is an illustration on a ten pixel camera. You have a ten pixel camera, and let me just explain how it works. For example at time A you only read three pixels. The three pixels are: pixel one, okay, you read pixel one, and pixel one has been exposed for two frames, so this is its exposure time, okay. Then you also read pixel six; pixel six has been exposed for four frames, okay, that’s its exposure time. You also read pixel nine and it has a short exposure time, okay. The pixels that you’re not reading at this frame are going to continue to expose. For example this one, pixel two, you don’t read it so it continues to expose, okay.
Then you come to the next time instant. At this frame we’re going to read out pixel one with this exposure time, and pixel two with this exposure time. Does that make sense? And also pixel six with this exposure time, okay. As another example, at this frame we’re going to read out pixel three with this exposure time, and I guess pixel seven with this long exposure time.
Okay, so the basic idea is that at each time you’re going to select a subset of pixels and read them out, and yes?
>>: Except the colors don’t mean anything…
>> Li Zhang: Yeah the color doesn’t mean anything.
>>: To distinguish one segment from the next.
>> Li Zhang: Right, right…
>>: Okay.
>> Li Zhang: Right, right, ha, ha.
>>: Parts of…
>> Li Zhang: Yeah.
>>: So there are no gaps in any of the horizontal rows?
>> Li Zhang: Right, right, right.
>>: And then the black line indicates readout and then immediately you start excluding…
>> Li Zhang: Right…
>>: Everything.
>> Li Zhang: Right, right, so…
>>: But the colors indicate exposure time, no?
>> Li Zhang: Sort of a…
>>: [inaudible]
>> Li Zhang: It’s a color, well…
>>: [inaudible]
>> Li Zhang: You can say that…
>>: [inaudible]
>> Li Zhang: Yellow means, yeah yellow means eight frames…
>>: Okay.
>> Li Zhang: And the Cyan or the Cream is four frames.
>>: Okay.
>> Li Zhang: Yeah.
>>: Sorry, so for pixel three it seems like sometimes you expose it very long and sometimes very short.
>> Li Zhang: Uh-huh, right.
>>: Um…
>>: [inaudible]
>>: It’s just random.
>> Li Zhang: Yeah it’s random, yeah…
>>: I see.
>> Li Zhang: Right now it’s random.
>>: Okay.
>> Li Zhang: Any other questions? Okay, so from this, it’s like we measure the sum of photons in this interval and the sum of photons within this interval, right. We get these sums, a sort of reduced number of measurements, and then we want to recover the number of photons in this cell, and this cell, and in this cell. We can do that; we can recover these high speed frames. Also, because we have this [indiscernible] of long exposures and short exposures, we can probably recover the high dynamic range ones, okay.
So there are at least some potential advantages of doing this. One is it has nearly 100% light throughput; we’re not wasting any light. Potentially we can simultaneously recover HDR and high speed video. We’re not constantly reading everything out, so it can reduce the power consumption. Also, I think probably the very nice thing is it can be implemented on a single chip, because it only requires this partial read out.
So I was talking to some people at CVPR last week. This sensor does not exist right now, so everything we do is simulation. But for the next generation of CMOS sensors we might be able to do this. Right now we can control per row, not per pixel yet, okay. This is like the first experiment we did. We haven’t figured out what the optimal pattern is yet. Right now we just use four exposure times, one, two, four, eight, and randomly permute them and assign them to each pixel, okay.
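A small sketch of how such a random per-pixel schedule could be generated offline follows. It only produces the readout pattern (which frames each pixel is read at within a fifteen-frame cycle) and says nothing about how a real sensor driver would consume it.

```python
import numpy as np

def random_exposure_schedule(height, width, exposures=(1, 2, 4, 8), seed=0):
    """Assign each pixel the four exposure lengths (in frames) in a random
    order, as in the random permutation described in the talk.

    Returns, for every pixel, the frame indices at which it is read out
    within one 1+2+4+8 = 15-frame cycle.
    """
    rng = np.random.default_rng(seed)
    readout_times = np.empty((height, width), dtype=object)
    for y in range(height):
        for x in range(width):
            order = rng.permutation(exposures)      # e.g. [4, 1, 8, 2]
            # Cumulative sums give the frame at which each exposure ends.
            readout_times[y, x] = np.cumsum(order)
    return readout_times
```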
So, okay, how do we do the reconstruction? We have these measurements, and those measurements are constraints, right. But the number of constraints is less than the number of pixels we want to recover, so we need some additional regularization. Okay, we do this sort of block matching within the space time volume. Let’s say we consider a four by four by four space time patch. If we can say, okay, these two are very similar, that gives us a regularization, right. Typically when we work with regular videos this sort of space time volume is not very robust, because the motion is very fast; the temporal sampling cannot compare with the spatial one. But here we have a much higher temporal sampling rate; for example we can be talking about two hundred frames per second. That’s why this space time volume makes more sense in this scenario, okay.
So then if you look at this, we want to compare this volume with this volume. It’s a little bit tricky because of the sampling. We don’t exactly know the pixel values here, right, when we compare this with this. So how can we do the comparison to say this block is similar to this block? Essentially the problem is like this. We have this yellow bar, okay. This yellow bar is this one, and we consider here this magenta bar and the yellow bar, and the green one here. So these are, is that right? Yeah, I think this green, magenta, blue, yellow, so this is this row, okay. We want to compare these four pixels with these four pixels, okay. How are we going to do this?
Okay, so we cannot do it directly. Instead we create some sort of virtual sample, meaning: what would the measurement be here? Can we somehow reconstruct an approximate sample at this location, okay? When we construct this we really need to consider that this measurement, this sample, is for the whole eight frames, right. So we have to consider what the virtual measurement would be for these whole eight frames, okay? What you can do is some sort of weighted blending. You take half of this pixel, okay, you take the total value here and half of the value here, okay. You do a weighted average to create a virtual sample for this location. Does that make sense?
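The weighted blending just described can be written down as follows for a single pixel: blend its actual readouts in proportion to how much each exposure interval overlaps the interval we want a virtual sample for. The interface here is hypothetical; it is only meant to make the overlap weighting explicit.

```python
import numpy as np

def virtual_sample(measurements, target_start, target_end):
    """Approximate what one pixel would have measured over
    [target_start, target_end) by blending its actual readouts, weighting
    each by the amount its exposure interval overlaps the target interval.

    measurements: list of (start_frame, end_frame, value) for one pixel,
    where `value` is the average intensity over that exposure interval.
    """
    total_weight = 0.0
    blended = 0.0
    for start, end, value in measurements:
        overlap = max(0.0, min(end, target_end) - max(start, target_start))
        if overlap > 0:
            blended += overlap * value
            total_weight += overlap
    return blended / total_weight if total_weight > 0 else None
```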
So then you can compare these two values. If you do that you can compute a score to measure the difference between these two blocks, okay. Then you can say, for each block, what are the most similar blocks within this volume? That will give you some regularization for this whole process. I will skip the details and show you some results.
So this is our coded sampling input, and this is our reconstruction result. This is a video; let me just go through each block. This is the ground truth and this is the so-called four frame exposure. What does that mean? We have a regular video and the exposure time is four frames long, okay. This is a regular video where the exposure time is only one frame long, but for this one it’s going to be noisy, so applying denoising to this frame we get the denoised one frame result, okay. Let me play the video here.
[video]
So you can see that these motions are sort of jerky because the frame rate is low, okay. Here we can reconstruct this high frame rate output, and that’s why it’s smooth. Let me do a tone mapping so you can better see it. Okay, so this is one thing. Here is a close up comparison of the details. I guess the result is a little bit hard to see here, but our result is a little bit sharper than the denoised short exposure sampling. This is another result that shows the method is reasonably robust to complex motion.
>>: But what [indiscernible] long time exposure in the middle is not moving faster.
>> Li Zhang: Oh, here is the thing, so let me go back. You can see that, just for a fair comparison, if we consider one, two, four, eight, in total we have a length of fifteen frames, right. For these fifteen frames we only sample four times…
>>: Um.
>> Li Zhang: So essentially we’re reducing the sampling rate by a factor of four. To do a fair comparison, when we generate the regular videos we only sample every four frames, okay. That’s why you see that, for the bottom row, these short exposure and long exposure videos are jerkier, yeah. Any other questions?
Okay, so…
>>: I guess I have a question, just how physically realizable is it to have a different exposure in each
pixel?
>> Li Zhang: Actually if you think about it we don’t need to set it. We don’t need to set it. We just need
to control when to read it.
>>: Uh-huh.
>> Li Zhang: So we don’t have a, sorry…
>>: Right, you just have to read it, but how can you, you know, given current silicon and the row column architecture on sensors and so on…
>> Li Zhang: Uh-huh.
>>: How do you read different pixels at different times?
>> Li Zhang: Uh, that’s a good question. So I don’t do circuit design but I ask the UC Faculties…
>>: Uh-huh.
>> Li Zhang: They think this is certainly doable because it’s purely a VLSI problem. You can give a mask…
>>: Uh-huh.
>> Li Zhang: They can read out whatever pixel you want. So right now you can, we can already sort of
do a [indiscernible] reading…
>>: That’s right, yeah.
>> Li Zhang: We cannot do this per pixel reading yet…
>>: Uh-huh.
>> Li Zhang: But someone told me at CVPR that this per pixel reading is also possible. Potentially I think it’s possible that in the next [indiscernible] generations of CMOS sensors this can be achieved.
>>: Okay.
>> Li Zhang: Yeah.
>>: So if it’s possible by row, could you right now build a sensor that exposes the first row by one and the second row by four, and the third row…
>> Li Zhang: Yeah, that’s possible. We can do that. In principle all the methods in our paper can work with that, but my gut feeling is the result might be a little bit worse. This is also confirmed, I think, by the Columbia group; they tried this. If you have per pixel control you tend to get a better result. Not for this project, they did some other project, but they got the same sort of empirical result, yeah. Yes?
>>: So I guess depending on like how sparse your readings are…
>> Li Zhang: Uh-huh.
>>: I guess there’s sort of a range where if you read like many, many pixels that it might be just as
efficient to just read everything and then ignore…
>> Li Zhang: Uh-huh.
>>: Right…
>> Li Zhang: Yeah, yeah, yeah.
>>: Versus sort of setting like locations and then reading those. Do you have a feeling for like where
that spot is because then you don’t really need that extra logic to be able to read from individual…
>> Li Zhang: That’s a good point. So I think the special case is you just read every pixel at this high frame rate, at every frame, right.
>>: Uh-huh.
>> Li Zhang: But we’re talking about three hundred frames per second, and if you operate in that mode it will probably consume a lot of energy. So I guess one argument, a compressive sensing sort of argument, is that if you reduce the sampling rate you will save power consumption. Yeah?
>>: Do you know if it is cheaper to read blocks of pixels versus single pixels?
>> Li Zhang: Uh-huh, it’s possible, yes.
>>: Okay.
>> Li Zhang: It’s possible.
>>: Using the same idea for masks. So in the mask if you have a two by two square…
>> Li Zhang: Uh-huh.
>>: Is it cheaper to read the whole square together versus single pixels? Because it seems like it is possible to get some more denoising by…
>> Li Zhang: Uh-huh.
>>: Considering blocks of pixels in this diagram.
>> Li Zhang: Uh-huh, right, right. Yeah that’s a possibility. We haven’t…
>>: Right.
>> Li Zhang: Explored that yet.
>>: It seems like sometimes the dynamic range from the scene is going to be limited. You don’t need
the one to eight X…
>> Li Zhang: Right, right.
>>: So say if you wanted to turn off the eight X and the…
>> Li Zhang: Uh-huh, right you only do…
>>: [inaudible]
>> Li Zhang: One, two, four.
>>: One, two four, right.
>> Li Zhang: Yeah, yeah, yeah.
>>: You just don’t have that much dynamic range in the scenes?
>> Li Zhang: Right, right, right.
>>: What happens then to the [indiscernible]?
>> Li Zhang: It doesn’t change the…
>>: It doesn’t speed up faster?
>> Li Zhang: Right, right right, yeah.
>>: And conversely sometimes maybe just the highlights are just at a very small range, small area…
>> Li Zhang: Uh-huh.
>>: Scene that requires…
>> Li Zhang: This very short exposure…
>>: [inaudible]
>> Li Zhang: Right, right, right.
>>: Very short exposure but then you’re sacrificing the rest of the scene to capture that one thing
[indiscernible].
>> Li Zhang: Uh-huh, right, right right.
>>: Yeah.
>> Li Zhang: Well…
>>: No more [indiscernible] right. You really haven’t talked about stage [indiscernible] and so...
>> Li Zhang: Right we haven’t talked about this…
>>: Temporal adaptation.
>> Li Zhang: Right, right, we haven’t done that yet. But I know that there’s one ICCV submission trying to do this type of thing, not from my group, from other groups, yeah. So, did I show this? I probably showed this, right. This is another example; it’s so dark I guess I’ll skip it. It’s just that you have a set of rolling dice. Those are all hard examples if you want to use a traditional optical flow based method, oops, sorry. You can see there are some artifacts here, but it doesn’t degrade drastically. So that’s something we feel might be promising: handling fast motion really robustly.
Underlying this we have an iterative method to do the reconstruction. We use this thing called the alternating direction method of multipliers. Right now the method is slow. It really requires quite some iterations to reconstruct the highlights, because in the highlights a lot of the pixel readings are invalid. We have to somehow fill in these values.
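For readers unfamiliar with ADMM, here is the generic structure of such a solver on a toy problem, minimizing a least squares data term plus an l1 penalty. In the actual reconstruction the regularizer comes from the matched space-time blocks rather than a plain l1 norm; the soft threshold below is only a stand-in so the x-update / z-update / dual-update pattern is visible.

```python
import numpy as np

def admm_l1(A, b, lam=0.1, rho=1.0, n_iters=100):
    """Generic ADMM solver for  min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)                      # scaled dual variable
    AtA = A.T @ A
    Atb = A.T @ b
    lhs = AtA + rho * np.eye(n)
    for _ in range(n_iters):
        x = np.linalg.solve(lhs, Atb + rho * (z - u))   # data-fit step
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)  # prox (soft threshold)
        u = u + x - z                                   # dual update
    return z
```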
Okay, so this example shows the limitation of this method. Let me play this again. I stop here, and when you compare these methods you can see that our method cannot completely remove the blur, right. It removes the blur at these locations which are closer to the center, where the motion is slower, and it’s better than the long exposure time. But for this very fast motion we still cannot completely remove the blur. That’s the limitation right now. One way to address this is to adaptively throw in more short exposure frames if you really want to get everything sampled correctly.
Okay, so I’ll skip this: some future work on optimal or adaptive exposure patterns, more efficient reconstruction, and real time preview. Right now we can capture this, but the preview doesn’t look so good, so maybe we can have some fast way of generating not the perfect video but a slightly blurrier, regular looking video so that we can preview the content, okay.
Okay, so that’s one part of my talk. I’ll continue on to the next thread, which is using multiple cameras. When we talk about camera arrays, maybe ten years ago all the arrays were big. But now, more recently, we have seen these much smaller arrays. You can buy this one from Point Grey. We have camera arrays appearing on cell phones. Also, this is a paper in Nature this year, so we can really make the camera very small. Each of these little balls is a lens, and behind the lens there’s a photoreceptor. As you can see, this is a dome shaped camera array. This is about two millimeters; the whole thing is like fingernail size.
So potentially, later in the future, we can have this type of array appearing on our mobile devices. One question is, what kind of image processing can we do for these types of exotic devices? One thing I did several years ago is this: you have multiple image measurements, and because the camera is very small the image quality won’t be very high, so it’s going to be noisy. Can we aggregate these images together to get a higher quality image? To give you a sense, if you work in 3D reconstruction you know that this is a synthetic example. This is one of the noisy images. In this work we assume that the noise is dependent on the image intensity, it has this Poisson distribution. Sorry, this is the ground truth, and this is our reconstruction. It does a pretty reasonable job except, you see, for the textured regions.
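To make the noise model concrete, a clean image can be corrupted with intensity-dependent Poisson noise as below. The photon count at full brightness is an illustrative parameter; smaller values give noisier simulations.

```python
import numpy as np

def add_poisson_noise(image, peak_photons=50.0, seed=0):
    """Simulate intensity-dependent (Poisson) sensor noise.

    `image` is a clean radiance map in [0, 1]; `peak_photons` is the assumed
    photon count at full brightness. The noise variance grows with intensity,
    which is the signal-dependent behaviour assumed in this work.
    """
    rng = np.random.default_rng(seed)
    photons = rng.poisson(np.clip(image, 0.0, 1.0) * peak_photons)
    return photons / peak_photons
```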
So I think I will skip the details and just show you a couple of results for this work. For example, in terms of noise reduction we compare with the best single image noise reduction. This BM3D method, many people use it as a benchmark. It does a pretty good job of removing noise, but it also has this blurry appearance. This is, I think from Marc Levoy’s group, the synthetic [indiscernible] denoising. Because it operates at each pixel sort of independently, in a sense, it doesn’t work as well. This is our method; we recover the details quite well compared to the ground truth. This is another synthetic example.
So this is a real example; all images are captured using Point Grey cameras, with the shortest exposure time and the highest gain. This is quite noisy, and this is single image denoising. For this one what we actually did is we captured twenty five images, treated these twenty five images as a video, and fed them into a video denoising method, so we have a video denoising comparison as well. Surprisingly, you just don’t get a better result even though you have more data. The reason is that video denoising still has to do some sort of registration across the frames. If you treat them as a video, then when you’re computing optical flow the degrees of freedom are much larger. But when we treat them as multi view denoising there’s the [indiscernible] constraint we can use, so we can get much better matching. That’s why our result is quite a bit better.
>>: What was your setup there? How many cameras?
>> Li Zhang: We have twenty five…
>>: [inaudible]
>> Li Zhang: In this particular…
>>: [inaudible] separate cameras?
>> Li Zhang: Yeah, well it’s the same camera, we just put…
>>: [inaudible]
>> Li Zhang: It at different locations.
>>: Is that in a grid, a grid, or in…
>> Li Zhang: In line.
>>: In the line.
>> Li Zhang: For this one I think it’s a grid.
>>: Yeah.
>> Li Zhang: This is one example that shows we can handle the signal dependent, intensity dependent noise as well. When we do image denoising we usually have to provide a parameter, which is the noise standard deviation. But if we have shot noise, then if you set this value too low you don’t remove the big noise, and if you set the [indiscernible] too large then you kill the details as well. If we provide the correct noise model, then you can remove the noise but also keep the details. So that’s one feature of this work.
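A minimal sketch of such a signal-dependent noise model, in the common Poisson-plus-Gaussian form where the variance grows linearly with intensity. The gain and read-noise constants are placeholders that would come from sensor calibration, not values from the talk.

```python
import numpy as np

def noise_std(intensity, gain=0.01, read_sigma=0.002):
    """Signal-dependent noise standard deviation:
    variance = gain * intensity + read_sigma**2.

    Feeding this per-pixel sigma to a denoiser (instead of one global value)
    avoids the trade-off described above: a single small sigma leaves the
    bright-region noise, a single large sigma kills dark-region detail.
    """
    return np.sqrt(gain * np.clip(intensity, 0.0, None) + read_sigma ** 2)
```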
Okay, so how many views do we need? We did some empirical evaluation. The horizontal axis is the number of views, the vertical axis is the signal to noise ratio. You can see that after a certain point, maybe twenty views, the performance doesn’t improve as rapidly as we hoped. I guess there are two possible explanations. One is that as you add more cameras, since all these cameras are front looking, the common area they can see becomes less and less, so that’s one thing. The second thing is that the noise reduction needs some redundant measurements, and the measurement redundancy does not completely come from the multiple views. Within a single image you can still get redundant measurements because of self similarity, right.
So if we have enough views to help us find accurate self similarities within a single image, then probably that’s enough. We don’t need an infinite number of views to get very high quality noise reduction. Okay, so that’s it for noise reduction. We also did some work on exploiting camera arrays for video stabilization. These are some professional solutions for video stabilization: you have these mechanical devices to damp out the motion, which is not very comfortable for consumers to use. And we have many algorithm based video stabilization methods; they tend to assume distant scenes and similarity based camera motion.
We also have some hardware based solutions. Say you have a floating sensor to compensate for the motion, or you have a floating lens to compensate for motion. All these works have limited degrees of freedom and the motion cannot be too big, so the baseline is limited. This is a paper from SIGGRAPH 2009, and this is the type of scene that we typically apply stabilization to now, okay. Usually we have these assumptions: the background needs to have a lot of features we can track, okay, and the dynamic targets need to be small, okay.
Even at this CVPR I saw a video stabilization method which works in a similar regime, okay. The type of thing that will break video stabilization is something like this: the motion is big, the background is solid white, it doesn’t have too many SIFT features for us to track, and you have a large depth variation, nearby objects, this type of thing. So these are the challenges. We’re going to show that if we have a camera array all these challenges can be addressed much, much more easily.
Assume you have a camera array. This array is somehow vibrating in 3D space, and we can view the stabilization as an image based rendering problem: you just want to render a video along a smooth trajectory. We are not the first to come up with this concept; it is mentioned in this paper. Okay, but they didn’t use an array. We are arguing that if you have an array we can make this idea work much better. So that’s the idea. The reason the array helps is that at each moment in time you can estimate the depth and do image based rendering to render the scene at the particular desired viewpoint, to form a smooth video.
Okay, this video will demonstrate the concept. These red pyramids are the five cameras of the physical array; we only use five cameras for this particular project. This is one of the five inputs, and it is vibrating a lot. Then this is the virtual camera. The virtual camera is moving around; it’s like a car suspension, right. The seat is stable but there’s motion between the seat and the frame. If you sit on the seat then you see everything as stable. So that’s the basic idea.
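The suspension picture can be sketched as follows: low-pass filter the shaky physical camera path to get the stable virtual path, and keep the per-frame offset the renderer has to compensate. Note this is the straightforward absolute-trajectory version; as discussed later in the talk, the actual method avoids estimating absolute trajectories and instead optimizes relative poses from salient features.

```python
import numpy as np

def virtual_camera_path(camera_positions, window=15):
    """Low-pass filter a shaky camera path to get a stable virtual path.

    camera_positions: (N, 3) array of per-frame camera positions.
    Returns the smoothed path and the per-frame offsets (the 'suspension'
    between seat and frame in the analogy). Rotations are omitted to keep
    the sketch short.
    """
    positions = np.asarray(camera_positions, dtype=np.float64)
    kernel = np.ones(window) / window
    smooth = np.column_stack([
        np.convolve(positions[:, d], kernel, mode='same') for d in range(3)
    ])
    relative_offsets = smooth - positions   # what the renderer must compensate
    return smooth, relative_offsets
```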
>>: Is this that camera you just showed? So what’s the…
>> Li Zhang: Yeah.
>>: Separation is?
>> Li Zhang: Um, forty millimeter.
>>: So it’s…
>> Li Zhang: Yeah.
>>: Pretty small compared to the [indiscernible]?
>> Li Zhang: Right. So this is another comparison. You can see that it’s very hard to make traditional video stabilization work on this type of scene; maybe it’s a little bit contrived, to try to illustrate the concept.
>>: In those two samples the amount that you’re cropping out seems to be somewhat minimal
compared to some of the other consumer grade stabilization techniques.
>> Li Zhang: By cropping out you mean the…
>>: Yeah the amount of video that gets lost.
>> Li Zhang: Right, right, right, we try to minimize that. You can somehow optimize the position so that the cropped out region is minimized.
>>: Yeah.
>> Li Zhang: Yeah?
>>: So you [indiscernible]…
>> Li Zhang: Uh-huh.
>>: The area was really small…
>> Li Zhang: Uh-huh.
>>: It would be like a centimeter or one centimeter…
>> Li Zhang: Uh-huh.
>>: This virtual view can go outside the grid, the square and then it becomes like an extrapolation
from…
>> Li Zhang: Uh-huh, right.
>>: So in reality do you think this [indiscernible] would work? Like which factors of that estimation…
>> Li Zhang: It becomes, yeah…
>>: [inaudible]
>> Li Zhang: If you need, okay if these red cameras are closer…
>>: Right.
>> Li Zhang: This blue one has to go outside way more.
>>: Yes.
>> Li Zhang: Yeah, so then it becomes extrapolation. If you do that it will certainly be much harder; the depth needs to be more accurate, yes. Yes?
>>: [indiscernible] from the [indiscernible] cameras.
>> Li Zhang: Right, right, right.
>>: Yeah.
>> Li Zhang: If you do interpolation that’s easier, yeah. Okay, so we compared the result with, oops, iMovie; you can see it has a hard time stabilizing this. This is another result. Okay, so there’s one key thing in this method. If we implement this idea in a more straightforward way, we would run some sort of structure from motion to get the 3D trajectory, and then from this red trajectory we smooth it out and get the blue one. That’s the more straightforward way of implementing this, but if you do that then we run into problems, because you can see that in all these scenes you don’t have that many features to track.
So the key idea is that when we do image based rendering we don’t really care about the absolute location of the red trajectory. We only need to know the relative pose between the virtual camera and the physical camera; that’s all we care about. We just need to know the relative pose. So then we can formulate this optimization problem: we want to find a sequence of relative poses such that if you generate this virtual video, all the salient features, and by salient features I mean the edge map, will move smoothly in the virtual camera. That’s why we can completely avoid structure from motion, and that’s why we can get results for all these very difficult scenes.
Okay, so if we can generate one virtual view we can also generate two virtual views. Then if we have goggles, you can imagine you see some 3D movies out of this. So you get the concept. Since underlying this technique we have the depth, we can augment the video with a virtual object, like this virtual ball rotating around the guy. We can easily model the occlusion relationship. Okay, so that’s, yes?
>>: How did you get the depth? Did you write your own stereo…
>> Li Zhang: Yeah, yeah, we have…
>>: Stereo [inaudible]…
>> Li Zhang: A few, maybe two or three depth map estimation CVPR papers, yeah, we use those, yeah. Other questions?
Okay, so then I will switch to another topic in terms of images. Maybe after this I’ll stop and cut off the other stuff. We know that if we can estimate the rigid 3D scene we can organize images; that’s what Photo Tourism / Photo Synth is doing. So I’ve been thinking, how about we extend this idea to non-rigid structure from motion, to non-rigid scenes? One thing I did earlier is, if we can capture a whole lot of 3D shapes then we can very easily navigate through the shapes and maybe come up with new shapes, so that the interface is like the standard graphics interface where you manipulate the object and you get new shapes.
Okay, so I’ve been thinking, can we use a similar interface for 2D images without getting into this 3D reconstruction? One scenario would be: this guy, Simpson, saw a person somewhere and he forgot the exact shape, but he can query faces. So you’ve got a whole bunch of face shapes. Then this person wants to narrow down, from these millions of faces, how can we get the exact shape in my mind? I don’t have the specific picture, but I have a semi-rough idea; maybe this person has a big mouth or a smiling face, something like that. In order to do this, the difficulty is how do we refine the search result, right?
Maybe there is a particular smiling style or a particular pose he wants, okay. So for example, one interface we may come up with is this: given a currently retrieved face, maybe you can just drag the nose to the left and you get a set of front looking faces, okay. Another example is you just do this pinch on your touch pad interface and you get a set of open mouth face images, okay. This type of thing explores the geometric attributes which are, you know, hard to put into words, but easy to manipulate on your cell phone or whatever touch pad interface. So it could be useful for criminal profiling or photo management.
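One way to think about such a gesture-driven refinement, as a hypothetical nearest-neighbour sketch over landmark shapes rather than the retrieval model actually used: apply the displacement implied by the drag or pinch to the current face’s landmarks and rank the database by distance to the edited shape.

```python
import numpy as np

def refine_by_gesture(current_landmarks, database_landmarks, edit_vector, k=10):
    """Refine a face search by a geometric gesture.

    current_landmarks: (L, 2) landmarks of the currently shown face.
    database_landmarks: (N, L, 2) landmarks of all database faces, assumed
    already aligned/normalized.
    edit_vector: (L, 2) displacement implied by the user's drag or pinch,
    e.g. moving the nose landmarks left for a more frontal pose.
    Returns indices of the k database faces whose shape is closest to the
    edited shape.
    """
    target = current_landmarks + edit_vector
    diffs = database_landmarks - target[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=(1, 2)))
    return np.argsort(dists)[:k]
```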
So again, I’ll probably skip the details and just show you a prototype we built.
[video]
Okay, I guess you get the idea. To really make it more useful we have to lock down the identity, so that, you know, the identity doesn’t just change. The problem in practice is we cannot find each actor or each public figure with all the different expressions and all the poses, you know, in our databases, so we sort of have to remove that constraint, okay.
Okay, so…
>>: How much was that dataset [inaudible]?
>> Li Zhang: We have about ten thousand images. We do a pre-processing step to get all the landmarks detected. That probably removed five thousand, because some alignments are not very reliable, so we remove them from, you know, the final search database.
>>: But that’s when you say you don’t have the actor in all expressions and, but that’s because you’re
going to Google image search or something like that. If you actually started analyzing the movies…
>> Li Zhang: Uh-huh.
>>: Actor over the course of a movie will probably pose in…
>> Li Zhang: That’s true. We didn’t do that we just used a, some database maybe Columbia or some
other, I think Columbia collected this dataset.
>>: Okay.
>> Li Zhang: So yeah, okay, I think I need to wrap up, although I have some other stuff. One thing is that geometric attributes are not enough. What you really want is to type in maybe red hair, or the guy with a mustache. So we need some appearance attributes to make this thing more useful. That’s one thing; the other thing is we really need more robust face alignment for this to be usable. Because this is a little bit different from photo [indiscernible], where you can take a million pictures of the Statue of Liberty; if somehow the feature matching doesn’t work you can throw away maybe half of them and you still have a lot. Here, for faces, maybe we’re interested in a certain interesting smile, and if you throw that smile away then you’re really missing the point, the pictures that people may want to find, okay.
So we really need this. We had a face alignment paper last year. When we worked on this problem we realized a very difficult problem in practice, which is all these different datasets. Each university has its own dataset, and each dataset has a different definition of landmarks. This dataset has like fifty points, but that dataset has twenty points. So it’s very difficult to merge these datasets together to get a more robust face alignment model.
So, how many more minutes do I have?
>> Michael Cohen: About five.
>> Li Zhang: Okay, so because of this, more recently I’ve been thinking maybe we want to have a different representation for faces. Instead of figuring out these landmarks, we can, for each pixel, give it a label: whether it’s skin, cheek or nose, or teeth, or hair. So we just use a soft segmentation as a representation for faces, which can probably handle hair in the future. It’s very hard to use contours and landmarks to model hair. So that’s one motivation, and we also want to make this method general so that we can use it for street view scenes as well.
So I will just show you, I guess, the current results we can get. This is one result we got in the last ECCV. This is frame by frame processing, an example based parsing of street view scenes. If you pay attention, usually we miss smaller objects, but for the big objects like sky, buildings, and road surfaces the method tends to work better. So that’s for this.
For faces, in this CVPR we’ll have a similar but different example based method which works quite well. We compare with other face segmentation methods. Just to give you a sense, let me show you a demonstration. One nice thing: right now we cannot handle hair yet, but with this segment based method it’s possible to model teeth. Sometimes the teeth are an important cue if you want to analyze the lip region.
>>: Do you get the same orientation with the labeling method versus the contours? So like before you
were able to drag points to set the orientation or the gaze of the face, can you still accomplish that with
this method?
>> Li Zhang: Uh, that’s a good point. With this method you can integrate the contours in, yes. In order to drag these points you have to have some anchor points to drag, right. With this segment based representation we can integrate the contours as well, yes, we can do that.
Okay, I guess that’s it.
[applause]