>> Michael Cohen: Good morning, it is my pleasure to welcome Li Zhang here. I’ve known him for almost 15 years, probably since he arrived at the University of Washington across the lake as a graduate student. He did a lot of great work in computational photography back then and continuing to this day. About five or six years ago we lost him to the other UW, the University of Wisconsin. He’s going to talk to us today about some new image sensors that are coming out and how to use those sensors both for understanding the world and for producing great images. >> Li Zhang: Okay, yeah… >> Michael Cohen: I’ll leave… >> Li Zhang: Thank you Michael for the introduction and thank you for having me here. Let me start the talk by showing you a very interesting picture. This is a picture taken in 2005 when the Pope was announced. That was eight years ago, and there’s a comparison picture like this. For researchers in this field, when we see this we get very excited, because mobile devices are really a part of our life. We’re using them to take pictures, so if we can do anything on the camera or to the pictures that are taken, we can very likely impact daily life. So as a researcher I’m interested in several questions in this area. One is that usually when we want to take good pictures we use a big camera, right? Now most likely we’re going to use small cameras. Can we improve the imaging performance of these small cameras to make it more comparable to a big one? That’s one type of question, for example dynamic range and low-light performance. The second question is, once we are able to very easily take pictures, we’re going to have a lot of pictures. How do we browse and navigate through these images? And there’s a third question, which is how do we extract knowledge from these pictures, like doing face recognition or more general object segmentation and recognition. I have done a little bit of work in these areas, so I will discuss these works with you, starting with the first one. Okay, so high dynamic range images. High dynamic range images are important; you can use iPhones these days to take HDR images. As we all know, the way it works is you take a succession of pictures with different exposure times. The longer, wider bar means longer exposure. We merge them and we get this nicer looking picture. The reason we want the longer exposure is that this part is so dark that with a short exposure we cannot see much. But if we use a long exposure, then when the scene is moving we’re going to have blur here. So that becomes a problem, and not many works have addressed the blur issue when reconstructing high dynamic range images. The issue with short exposures is that we have a lot of noise in this dark region. So maybe an alternative is this: we just take a succession of short-exposure pictures and remove the noise by somehow aggregating all the measurements. That’s another alternative. So then the question is, which is better? Shall we try to remove the blur here, or shall we remove the noise in this sequence? And because when you use a short exposure time, within a certain period of time you can take more pictures.
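As a rough illustration of the two capture schemes being compared, merging bracketed exposures versus aggregating a burst of equally short exposures, here is a minimal numpy sketch. The function names, the hat-shaped weighting, and the assumption of linear (RAW-like) frames with known exposure times are illustrative choices, not details from the talk.

```python
import numpy as np

def merge_bracketed(frames, exposures, sat=0.95):
    """Merge differently exposed linear (RAW-like) frames into one radiance map.

    frames: list of HxW float arrays in [0, 1]; exposures: matching exposure times.
    Each frame is divided by its exposure time and the results are averaged,
    down-weighting pixels that are nearly saturated or nearly black.
    """
    num = np.zeros_like(frames[0], dtype=np.float64)
    den = np.zeros_like(frames[0], dtype=np.float64)
    for f, t in zip(frames, exposures):
        w = np.clip(1.0 - np.abs(2.0 * f - 1.0), 1e-3, None)  # hat weighting
        w = np.where(f > sat, 1e-3, w)                         # distrust clipped pixels
        num += w * f / t
        den += w
    return num / den

def average_burst(frames, exposure):
    """Naive aggregation of a burst of equally short exposures: average, then scale
    by the exposure time. Real systems would align the frames first."""
    return np.mean(np.stack(frames), axis=0) / exposure
```

The burst version deliberately ignores alignment; the point made in the talk is precisely that inter-frame motion has to be estimated before the noisy short-exposure frames can be aggregated.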
So there’s this design question: which scheme should we use? Should we take a short sequence of noisy images, or should we take the other type of sequence? In 2010 we had a paper that did an analysis comparing these two schemes. Our conclusion is that this one tends to work better in practice: you want to estimate the motion between these frames and then remove the noise. There are some more general analyses from [indiscernible] group which also compare this type of image capture with other coded-capture imaging, for example flutter shutter, to see which is better in which scenario. So I will show you some examples of my work. For example, is there a way to dim the light? >> Michael Cohen: Uh, yeah. >> Li Zhang: No, no that’s fine, ha, ha, yeah. >> Michael Cohen: It’s a trade-off between the people there. >> Li Zhang: Oh I see, uh-huh. >> Michael Cohen: So you just speak and the… [laughter] >> Li Zhang: Oh, okay, ha, ha. So for example in this low-light condition we take a picture of this birthday cake, and if we just amplify the pixel intensities we tend to get this. The idea is, how about we take a very short sequence of noisy images; this is like a [indiscernible] mapping result. When you do this your hand is very likely shaking, but because you take a very short sequence you can still recover all these nice details. That’s one example. Let me show you another example. This is the noise level, and there’s a motion here; this motion is smoother. So this demonstrates the concept: we can capture a sequence of images and still reconstruct high dynamic range images. The question is, is the problem solved? When you think about it, not quite. What does it mean? If you want to take high dynamic range videos on a cell phone, then your cell phone has to work at this very high frame rate, and usually that means you’re going to consume a lot of power. So then we thought: we take a high frame rate because we want to avoid motion blur, but sometimes the scene is not moving. So maybe we want to do this image sampling in a more adaptive way, meaning that if there’s no motion we can slow down and use a longer exposure time. That’s another work we did: use the [indiscernible] estimation to control the exposure time. Usually the exposure time is controlled by the brightness: if it’s bright enough you use a short exposure, if it’s dark you use a long exposure. But we can also use motion to control the exposure time. I think maybe three years ago there were some cameras that could do this, but only for static shots; we haven’t found video cameras that can do this. So we built a prototype to experiment with this concept. I will show you how it works. This is like a regular video as we’re moving. You can see that once we move, with constant exposure things get blurred. But here, once it starts moving, things get noisier. You see the difference? Here, once it moves, we have an underlying motion detection, and you get a noisy capture because we shorten the exposure. Here the exposure time is constant, so you get blur.
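To make the motion-based exposure control loop concrete, here is a minimal sketch in Python/OpenCV. It uses Farneback dense flow as a stand-in for the image-based (Lucas-Kanade-style) motion estimator mentioned in the talk, and the blur budget, frame interval, and exposure bounds are made-up parameters, not values from the prototype.

```python
import cv2
import numpy as np

def next_exposure(prev_gray, curr_gray, max_blur_px=1.0, t_min=1e-3, t_max=1/30):
    """Pick the next frame's exposure time from image-based motion.

    Estimates dense optical flow between the last two frames, takes a robust
    speed estimate in pixels per frame, and chooses the longest exposure that
    keeps the expected motion blur under max_blur_px pixels.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    speed = np.percentile(np.linalg.norm(flow, axis=2), 90)  # pixels per frame interval
    frame_dt = 1 / 30.0                                      # assumed capture interval
    px_per_sec = max(speed / frame_dt, 1e-6)
    return float(np.clip(max_blur_px / px_per_sec, t_min, t_max))
```

This matches the qualitative behavior of the demo: a static scene gets a long exposure, and fast motion forces a short, noisy exposure.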
So the idea is, maybe from this type of measurement we can get a better image reconstruction. Yes? >>: Is the motion estimate from, you know, like the inertial… >> Li Zhang: No, it’s purely image based, like a Lucas-[indiscernible] type of motion estimation. >>: Right, but it has to be fast enough to drive the camera, and this is already a high-speed capture. >> Li Zhang: Right, so we didn’t do this on a phone. We have a laptop connected to a Point Grey camera, so the motion estimation is done on the laptop, yeah. Any other questions? Okay, so you can see that at a particular moment, if this thing is moving fast, it’s blurry here and in this capturing scheme it’s noisy. In order to evaluate this scheme we did a synthetic experiment. Suppose we have a panoramic image and a simulated camera moving this way. First, if you look at the dashed red line, that’s the signal-to-noise ratio of the input when we use a constant short exposure, meaning every frame is very noisy, so the signal-to-noise ratio is very low. Then we can apply some video denoising method, for example the Liu and Freeman method, from I think maybe three years ago. You can bump up the signal-to-noise ratio to this, so it works quite nicely. But if you use our adaptive sampling, originally the camera is here, static, so the signal-to-noise ratio is pretty high. Then you start to move; when moving fast you reduce the exposure time and suddenly the signal-to-noise ratio becomes low. Then once you come here the motion stops. So the dashed black line is the signal-to-noise ratio of this adaptive, motion-based exposure control capture. Then based on this you can do some video denoising and get a better signal-to-noise ratio; because these input frames are good we get a very high signal-to-noise ratio, and we can get an improvement like this. >>: So this is all a single-frame result, right? >> Li Zhang: Well, actually it’s aggregating multiple frames. >>: So the solid lines are aggregating? >> Li Zhang: Yeah, the key benefit is that we have very good frames here, so we can compute an optical flow, and if we give more weight to these good frames when we do the reconstruction we can get a higher improvement. We can get better… >>: Why does the… >> Li Zhang: Reconstruction… >>: Black dashed line not go all the way down to where the red dashed line is? >> Li Zhang: Actually it depends on the speed. Here it touches the red line, yeah. >>: So that means it’s still exposing at the minimum in that first state? >> Li Zhang: That means here the motion is really fast. This is a conservative way of setting exposure, to make it very short. >>: So you [indiscernible] motion in the scene… >> Li Zhang: Uh-huh. >>: That the objects move around… >> Li Zhang: Uh-huh. >>: [indiscernible]. >> Li Zhang: Yeah, if everything is moving then that’s the worst case; you probably end up with something like this, because we keep… >>: [inaudible] >> Li Zhang: Right, right, yeah. Okay, so here are some comparisons. Again, these are synthetic examples so that we can have ground truth. This is the constant exposure time, it’s blurry, and this is the noisy input. We can apply the CBM3D method on each frame to get the noise removed.
This is the newer Liu and Freeman method to remove the noise, and this is our method. You can see that in terms of sharpness our method is probably more comparable to this one; this one is almost the same, maybe slightly blurrier. This is the ground truth, and if we compare the signal-to-noise ratio, the red solid line is the input, the two other methods, CBM3D and Liu and Freeman, are somewhere here, and our method does a bit better. This is another example; in this case I think both the Liu and Freeman method and the CBM3D method over-smooth the result a little bit. In this result the difference is more clear: our result is more comparable to the ground truth. The key reason is that our method switches between spatial denoising and temporal denoising. If the motion estimation is unreliable we use a single frame, otherwise we use temporal information. So again, this is the comparison; the black line is our result. Okay, so the last question is, is this all? Not really. Can you think of any issues with this? There are some issues. First of all, when we use motion to control exposure we can only measure motion from the previous two frames and use it to predict the next frame, so the prediction is always delayed. If you are moving slowly and suddenly move fast, the exposure for the next frame is wrong. That’s the main problem: in the end we always have some blurry frames in the result. In order to enhance them we have to do some registration between a sharp frame and a blurry one. >>: But when you say there’s the delay… >> Li Zhang: Uh-huh. >>: I mean IMUs can run at two hundred hertz or something like that, right, so. >> Li Zhang: Uh-huh. >>: And most cell phones have some sort of inertial measurement built in… >> Li Zhang: Uh-huh. >>: Right. >> Li Zhang: Right, but even if the camera is absolutely static there could be scene motion. >>: That’s true. >> Li Zhang: That kind of motion needs to be estimated from the frames. >>: Uh-huh. >> Li Zhang: Yeah, so then you have to do some registration between the blurry and the sharp image. That becomes a tricky optical flow question because brightness constancy is violated. If you just use a regular flow, it’s going to move these pixels to match the blurred intensity patterns, and the flow will be wrong. We addressed this problem in a paper at last CVPR. The basic idea is we want to compute a flow such that if you use that flow to blur the sharp image, the blurred version will match up with this one. It works quite nicely in interior regions, but near the occlusion boundaries, the depth discontinuities, it’s very hard. So I think we could keep improving and writing papers, but in terms of having a really robust system I think this is probably not the way to go. So then we came up with this idea, a randomized exposure scheme. This is an illustration on a ten-pixel camera, so let me just say how it works. For example at time A you only read three pixels.
The first is pixel one; you read pixel one and it has been exposed for two frames, so this is its exposure time. Then you also read pixel six, which has been exposed for four frames; that’s its exposure time. You also read pixel nine, which has a short exposure time. The pixels that you’re not reading at this frame continue to expose. For example pixel two, you don’t read it, so it continues to expose. Then you come to the next time instant. At this frame we’re going to read out pixel one with this exposure time and pixel two with this exposure time. Does that make sense? And also pixel six with this exposure time. As another example, at this frame we’re going to read out pixel three with this exposure time and pixel seven with this long exposure time. So the basic idea is that at each time you select a subset of pixels and read them out, and yes? >>: Except the colors don’t mean anything… >> Li Zhang: Yeah, the color doesn’t mean anything. >>: To distinguish one segment from the next. >> Li Zhang: Right, right… >>: Okay. >> Li Zhang: Right, ha, ha. >>: Parts of… >> Li Zhang: Yeah. >>: So there are no gaps in any of the horizontal rows? >> Li Zhang: Right, right. >>: And then the black line indicates readout and then immediately you start exposing… >> Li Zhang: Right… >>: Everything. >> Li Zhang: Right, so… >>: But the colors indicate exposure time, no? >> Li Zhang: Sort of… >>: [inaudible] >> Li Zhang: Well… >>: [inaudible] >> Li Zhang: You can say that… >>: [inaudible] >> Li Zhang: Yellow means eight frames… >>: Okay. >> Li Zhang: And the cyan, or the cream, is four frames. >>: Okay. >> Li Zhang: Yeah. >>: Sorry, so for pixel three it seems like sometimes you expose it very long and sometimes very short. >> Li Zhang: Uh-huh, right. >>: Um… >>: [inaudible] >>: It’s just random. >> Li Zhang: Yeah, it’s random… >>: I see. >> Li Zhang: Right now it’s random. >>: Okay. >> Li Zhang: Any other questions? Okay, so from this, it’s like we measure the sum of photons in this interval and the sum of photons within this interval. From these sums we want to recover the number of photons in this cell, and this cell, and this cell. If we can do that, we can recover these high-speed frames. Also, because we have this [indiscernible] of long exposures and short exposures, we can probably recover the high dynamic range as well. So there are at least some potential advantages of doing this. One is that it has nearly 100% light throughput; we’re not wasting any light. Potentially we can simultaneously recover HDR and high-speed videos. We’re not constantly reading everything out, so it also reduces power consumption. Also, I think the very nice thing is that it can be implemented on a single chip, because it only requires this partial readout. I was talking to some people at CVPR last week; this sensor does not exist right now, so everything we do is simulation, but for the next generation of CMOS sensors we might be able to do this. Right now we can control per row, not per pixel yet. So this is the first experiment we did. We haven’t figured out what the optimal pattern is yet.
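A minimal sketch of how the staggered per-pixel readout just described could be simulated, assuming an ideal high-speed video as the ground-truth photon stream; the schedule format and function name are illustrative only.

```python
import numpy as np

def simulate_coded_readout(video, schedule):
    """Simulate a staggered per-pixel readout.

    video:    T x H x W array of ideal high-speed linear frames (photon counts).
    schedule: dict mapping pixel (y, x) -> sorted list of readout frame indices.
    Returns a list of (y, x, t_start, t_end, value) tuples, where value is the
    photon sum accumulated over [t_start, t_end). Reading a pixel resets it and
    exposure resumes immediately; unread pixels simply keep integrating.
    """
    measurements = []
    for (y, x), times in schedule.items():
        start = 0
        for t in times:
            measurements.append((y, x, start, t, float(video[start:t, y, x].sum())))
            start = t
    return measurements
```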
Right now we just use four exposure times, one, two, four, and eight frames, randomly permute them, and assign them to the pixels. Okay, so how do we do the reconstruction? We have these measurements, and those measurements are constraints, but the number of constraints is less than the number of pixels we want to recover, so we need some additional regularization. We do this with block matching within the space-time volume. Say we consider a four by four by four space-time patch; if we can say these two patches are very similar, that gives us a regularization. Typically when we work with regular videos this kind of space-time matching is not very robust because the motion is very fast; the temporal sampling cannot compare with the spatial one. But here we have a much higher temporal sampling rate, for example we could be talking about two hundred frames per second. That’s why the space-time volume makes more sense in this scenario. So then, if we want to compare this volume with this volume, it’s a little bit tricky because of the sampling: we don’t exactly know the pixel values here when we compare this with this. How can we say this block is similar to this block? Essentially the problem is like this. This yellow bar is this one, and here we have this magenta bar, the yellow bar, and the green one here. So these are, is that right? Yeah, I think this green, magenta, blue, yellow, so this is this row. We want to compare these four pixels with these four pixels. How are we going to do this? We cannot do it directly. Instead we create a virtual sample, meaning: what would the measurement here have been? Can we somehow approximate the sample at this location? When we construct it we need to be careful, because this sample covers the whole eight frames, so we have to consider what the virtual measurement would be for those whole eight frames. What you can do is a weighted blending: you take half of this pixel’s value, the total value here, and half of the value here, and do a weighted average to create a virtual sample for this location. Does that make sense? Then you can compare these two values, and compute a score to measure the difference between these two blocks. If you do that, you can say, for each block, what are the most similar blocks within this volume? That gives you some regularization for the whole process. I will skip the details and show you some results. This is our coded sampling input and this is our reconstruction result. This is a video, so let me go through each block. This is the ground truth, and this is the so-called four-frame exposure, meaning a regular video where the exposure time is four frames long. This is a regular video where the exposure time is only one frame long, but that one is going to be noisy, so we apply denoising to each frame to get the denoised one-frame result.
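Here is a small sketch of the "virtual sample" idea used to compare blocks whose pixels were read out on different schedules: each actual measurement contributes in proportion to how much of its exposure interval overlaps the target window. The data layout and names are assumptions for illustration, not the paper's implementation.

```python
def virtual_sample(measurements, t_start, t_end):
    """Approximate the measurement a pixel would have produced over [t_start, t_end),
    given its actual measurements as (start, end, value) photon sums.

    Each actual measurement contributes the fraction of its exposure interval that
    overlaps the target window, so a half-overlapping measurement contributes half
    of its value -- the weighted blending described in the talk.
    """
    total = 0.0
    for s, e, v in measurements:
        overlap = max(0, min(e, t_end) - max(s, t_start))
        if overlap > 0:
            total += v * overlap / (e - s)
    return total

def block_distance(block_a, block_b, t_start, t_end):
    """Sum of squared differences between two blocks' virtual samples over a common
    exposure window; used to find similar space-time blocks for regularization."""
    return sum((virtual_sample(a, t_start, t_end) - virtual_sample(b, t_start, t_end)) ** 2
               for a, b in zip(block_a, block_b))
```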
So let me play the video here. [video] You can see that these motions are jerky because the frame rate is low. Here we can reconstruct this high-frame-rate output, so that’s why it’s smooth. Let me do a tone mapping so you can see it better. So that’s one thing. There are some close-up comparisons of the details. I guess the result is a little bit hard to see here, but our result is a little sharper than the denoised short-exposure sampling. This is another result that shows the method is reasonably robust to complex motion. >>: But what [indiscernible] long exposure in the middle is not moving faster. >> Li Zhang: Oh, here is the thing, let me go back. For a fair comparison, if we consider one, two, four, eight, in total we have a length of fifteen frames, and for these fifteen frames we only sample four times… >>: Um. >> Li Zhang: So essentially we’re reducing the sampling rate by a factor of four. To do a fair comparison, when we generate the regular videos we only sample every four frames. That’s why for the bottom row, the short-exposure and long-exposure videos, they’re jerkier, yeah. Any other questions? Okay, so… >>: I guess I have a question, just how physically realizable is it to have a different exposure at each pixel? >> Li Zhang: Actually, if you think about it, we don’t need to set the exposure. We just need to control when to read each pixel. >>: Uh-huh. >> Li Zhang: So we don’t have a, sorry… >>: Right, you just have to read it, but how can you, given current silicon and the row-column architecture of sensors and so on… >> Li Zhang: Uh-huh. >>: How do you read different pixels at different times? >> Li Zhang: That’s a good question. I don’t do circuit design, but I asked the ECE faculty… >>: Uh-huh. >> Li Zhang: They think this is certainly doable because it’s purely a VLSI problem. You can give a mask… >>: Uh-huh. >> Li Zhang: They can read out whatever pixels you want. So right now we can already do [indiscernible] reading… >>: That’s right, yeah. >> Li Zhang: We cannot do this per-pixel reading yet… >>: Uh-huh. >> Li Zhang: But someone told me at CVPR that per-pixel reading is also possible. I think it’s possible that in the next generations of CMOS sensors this can be achieved. >>: Okay. >> Li Zhang: Yeah. >>: So if it’s possible by row, could you right now build a sensor that does the first row by one and the second row by four, and the third row… >> Li Zhang: Yeah, that’s possible, we can do that. In principle all the methods in our paper can work with that, but my gut feeling is the result might be a little bit worse. This is also, I think, confirmed by the Columbia group; they tried this. If you have per-pixel control you tend to get better results. Not for this project, but they got the same sort of empirical result on some other project, yeah. Yes? >>: So I guess depending on how sparse your readings are… >> Li Zhang: Uh-huh. >>: I guess there’s sort of a range where if you read many, many pixels it might be just as efficient to read everything and then ignore… >> Li Zhang: Uh-huh. >>: Right… >> Li Zhang: Yeah, yeah. >>: Versus setting locations and then reading those.
Do you have a feeling for where that spot is? Because then you don’t really need that extra logic to be able to read from individual… >> Li Zhang: That’s a good point. The special case is you just read every pixel at every frame, right. >>: Uh-huh. >> Li Zhang: But we’re talking about three hundred frames per second; if you operate in that mode it will probably consume a lot of energy. I guess one argument from compressive sensing is that if you reduce the sampling rate you will save power. Yeah? >>: Do you know if it is cheaper to read blocks of pixels versus single pixels? >> Li Zhang: Uh-huh, it’s possible, yes. >>: Okay. >> Li Zhang: It’s possible. >>: Using the same idea for masks. So in the mask, if you have a two-by-two square… >> Li Zhang: Uh-huh. >>: Is it cheaper to read the whole square together versus single pixels? Because it seems like it’s possible to get some more denoising by… >> Li Zhang: Uh-huh. >>: Considering blocks of pixels in this diagram. >> Li Zhang: Uh-huh, right. Yeah, that’s a possibility. We haven’t… >>: Right. >> Li Zhang: Explored that yet. >>: It seems like sometimes the dynamic range of the scene is going to be limited. You don’t need the one-to-eight X… >> Li Zhang: Right, right. >>: So say if you wanted to turn off the eight X and the… >> Li Zhang: Uh-huh, right, you only do… >>: [inaudible] >> Li Zhang: One, two, four. >>: One, two, four, right. >> Li Zhang: Yeah. >>: You just don’t have that much dynamic range in the scene? >> Li Zhang: Right, right. >>: What happens then to the [indiscernible]? >> Li Zhang: It doesn’t change the… >>: It doesn’t speed up faster? >> Li Zhang: Right, right, yeah. >>: And conversely sometimes maybe the highlights are just in a very small range, a small area of the… >> Li Zhang: Uh-huh. >>: Scene that requires… >> Li Zhang: This very short exposure… >>: [inaudible] >> Li Zhang: Right, right. >>: A very short exposure, but then you’re sacrificing the rest of the scene to capture that one thing [indiscernible]. >> Li Zhang: Uh-huh, right, right. >>: Yeah. >> Li Zhang: Well… >>: No more [indiscernible] right. You really haven’t talked about spatial [indiscernible] and so... >> Li Zhang: Right, we haven’t talked about this… >>: Temporal adaptation. >> Li Zhang: Right, we haven’t done that yet. But I know there’s one ICCV submission trying to do this type of thing, not from my group, from other groups, yeah. So, did I show this? I probably showed this, right. This is another example; it’s so dark I guess I’ll skip it. It’s just a set of rolling dice. Those are all hard examples if you want to use a traditional optical-flow-based method, oops, sorry. There are some artifacts here but it doesn’t degrade drastically. So we feel this approach might be promising for handling fast motion really robustly. Underlying this we have an iterative method to do the reconstruction; we use something called the alternating direction method of multipliers. Right now the method is slow; it requires quite a few iterations to reconstruct the highlights, because in the highlights a lot of the pixel readings are invalid and we have to somehow fill in those values. Okay, this example shows the limitation of the method.
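The reconstruction above is solved with ADMM. The paper's objective uses the block-matching prior rather than a plain l1 penalty, so the following is only a generic sketch of the alternating-direction iteration structure, applied to an l1-regularized least-squares problem.

```python
import numpy as np

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=100):
    """Generic ADMM for  min_x 0.5*||Ax - b||^2 + lam*||x||_1.

    Only meant to illustrate the alternating-direction structure mentioned in the
    talk; the actual reconstruction uses a block-matching prior, not this penalty.
    """
    m, n = A.shape
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)                       # scaled dual variable
    AtA = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(AtA, Atb + rho * (z - u))                    # data term
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)  # soft threshold
        u = u + x - z                                                    # dual update
    return z
```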
Okay, let me play this again. I’ll stop here; when you compare these methods you can see that our method cannot completely remove the blur. It removes the blur at these locations closer to the center, where the motion is slower, and it’s better than the long exposure time, but for this very fast motion we still cannot completely remove the blur. That’s the current limitation. One way to address this is to adaptively throw in more short-exposure samples if you really want everything sampled correctly. I’ll skip this: some future work on optimal or adaptive exposure patterns, more efficient reconstruction, and real-time preview. Right now we can capture this, but the raw capture doesn’t look so good, so maybe we can have a fast way of generating, not the perfect video, but a slightly blurrier yet regular-looking video so that we can preview the content. Okay, so that’s one part of my talk. I’ll continue to the next thread, which is using multiple cameras. When we talk about camera arrays, maybe ten years ago all the arrays were big, but more recently we have seen much smaller arrays. You can buy this one from Point Grey, and we have camera arrays appearing on cell phones. Also, this is a paper in Nature this year; we can really make the camera very small. Each of these little balls is a lens, and behind the lens there’s a photoreceptor. As you can see, this is a dome-shaped camera array, about two millimeters; the whole thing is fingernail size. So potentially, in the future, we can have this type of array appearing on our mobile devices. One question is, what kind of image processing can we do for these types of exotic devices? One thing I did several years ago is this: you have multiple image measurements, and because the cameras are very small the image quality won’t be very high, so the images are going to be noisy; can we aggregate these images together to get a higher quality image? To give you a sense, if you work in 3D reconstruction you know this is a synthetic example, and this is one noisy input image. In this work we assume that the noise depends on the image intensity; it has a Poisson distribution. We can compare, sorry, this is the ground truth and this is our reconstruction. It does a pretty reasonable job except, as you see, in the textured regions. I will skip the details and just show you a couple of results for this work. For example, in terms of noise reduction we compare with the best single-image noise reduction, this BM3D method that many people use as a benchmark. It does a pretty good job of removing noise but it also has this blurry appearance. This is, I think from Marc Levoy’s group, the synthetic [indiscernible] denoising; because it operates at each pixel independently, in a sense, it doesn’t work as well. This is our method; we recover the details quite well compared to the ground truth. This is another synthetic example, and this is a real example; all images are captured using Point Grey cameras, with the shortest exposure time and the highest gain.
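The denoising here assumes intensity-dependent (Poisson) noise. One standard way to handle that, which may or may not match the paper's exact formulation, is a variance-stabilizing transform so that a fixed-sigma Gaussian denoiser such as BM3D can be applied; a minimal Anscombe sketch:

```python
import numpy as np

def anscombe(x):
    """Variance-stabilizing transform for Poisson counts: afterwards the noise is
    approximately Gaussian with unit variance, so a fixed-sigma denoiser applies."""
    return 2.0 * np.sqrt(np.asarray(x, dtype=np.float64) + 3.0 / 8.0)

def inverse_anscombe(y):
    """Simple algebraic inverse (a slightly biased but common approximation)."""
    return (y / 2.0) ** 2 - 3.0 / 8.0

# Sketch of use: stabilize, run any Gaussian denoiser with sigma = 1, invert.
# denoised = inverse_anscombe(gaussian_denoiser(anscombe(noisy_counts), sigma=1.0))
```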
This is quite noisy, and this is single-image denoising. For this one, what we actually did is we captured twenty-five images, treated them as a video, and fed them into the video denoising method (they have a video denoising method as well). Surprisingly, you don’t just get a better result by having more data. The reason is that video denoising still has to do some sort of registration across the frames. If you treat them as a video, then when you compute optical flow the degrees of freedom are much larger. But when we treat them as multi-view denoising there’s the [indiscernible] constraint we can use, so we can get much better matching. That’s why our result is considerably better. >>: What was your setup there? How many cameras? >> Li Zhang: We have twenty-five… >>: [inaudible] >> Li Zhang: In this particular… >>: [inaudible] separate cameras? >> Li Zhang: Yeah, well, it’s the same camera, we just put… >>: [inaudible] >> Li Zhang: It at different locations. >>: Is that in a grid or in… >> Li Zhang: In a line. >>: In a line. >> Li Zhang: For this one I think it’s a grid. >>: Yeah. >> Li Zhang: This is one example that shows we can handle the signal, the intensity-dependent noise, as well. When we do image denoising we usually have to provide a parameter, the noise standard deviation. But with this sort of noise, if you set that value too low you don’t remove the big noise, and if you set the [indiscernible] too large you kill details as well. If we provide the correct model, then you can remove the noise but also keep the details. That’s one feature of this work. Okay, so how many views do we need? We did some empirical evaluation. The horizontal axis is the number of views, the vertical axis is the signal-to-noise ratio. You can see that after maybe twenty views the performance doesn’t improve as rapidly as we hoped. I guess there are two possible explanations. One is that as you add more cameras, since all these cameras are front-looking, the common area they can all see becomes less and less. The second is that noise reduction needs redundant measurements, and the measurement redundancy does not come entirely from the multiple views. Within a single image you can still get redundant measurements because of self-similarity. So if we have enough views to help us find accurate self-similarities within a single image, then probably that’s enough; we don’t need an infinite number of views to get very high quality noise reduction. Okay, so that’s noise reduction.
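To illustrate why the multi-view geometric constraint helps compared with unconstrained optical flow, here is a toy sketch that aggregates corresponding samples along a disparity sweep for a rectified, in-line array. The setup (rectified views, known baselines, variance-based disparity selection) is an assumption for illustration, not the paper's algorithm.

```python
import numpy as np

def multiview_denoise_pixel(images, baselines, y, x, disparities):
    """Denoise one pixel of the reference view (baseline 0) by averaging its
    correspondences in the other views.

    Assumes a rectified, in-line camera array: a scene point seen at column x in
    the reference with disparity d appears at column x - d * baseline_k in view k.
    For each candidate disparity we gather the corresponding samples, keep the
    disparity whose samples agree best (lowest variance), and return their mean.
    """
    best_var, best_mean = np.inf, images[0][y, x]
    for d in disparities:
        samples = []
        for img, b in zip(images, baselines):
            xk = int(round(x - d * b))
            if 0 <= xk < img.shape[1]:
                samples.append(img[y, xk])
        if len(samples) > 1:
            v, m = np.var(samples), np.mean(samples)
            if v < best_var:
                best_var, best_mean = v, m
    return best_mean
```

The point of the toy example is only that the search is one-dimensional (over depth) per pixel, whereas treating the views as a video requires an unconstrained two-dimensional flow.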
We also did some work on exploiting camera arrays for video stabilization. These are some professional solutions for video stabilization: mechanical devices that damp out the motion, which are not very convenient for consumers. We also have many algorithm-based video stabilization methods; they tend to assume distant scenes and similarity-based camera motion. And we have some hardware-based solutions, say a floating sensor to compensate the motion, or a floating lens. All of these have limited degrees of freedom and the motion cannot be too big, so the baseline is limited. This is a paper from SIGGRAPH two thousand nine, and this is the type of scene we typically apply stabilization to now. Usually we have these assumptions: the background needs to have a lot of features we can track, and the dynamic targets need to be small. Even at this CVPR I saw a video stabilization method which works in a similar regime. The type of thing that will break video stabilization is something like this: the motion is big, the background is solid white and doesn’t have many SIFT features for us to track, and you have large depth variation, nearby objects, this type of thing. These are the challenges. We’re going to show that if we have a camera array, all of these challenges can be addressed much more easily. Assume you have a camera array vibrating in 3D space; we can view stabilization as an image-based rendering problem: you just want to render a video along a smooth trajectory. We did not come up with this concept; it is mentioned in this paper, but they didn’t use an array. We argue that with an array we can make this idea work much better. The reason the array helps is that at each moment in time you can estimate the depth and do image-based rendering to render the scene at the particular desired viewpoint needed to form a smooth video. This video will demonstrate the concept. These red pyramids are the five cameras of the physical array; we only use five cameras for this particular project. This is one of the five inputs, and it’s vibrating a lot. Then this is the virtual camera. The virtual camera is moving around; it’s like a car suspension, right? The seat is stable but there’s motion between the seat and the frame, and if you sit on the seat then you see everything as stable. That’s the basic idea. >>: Is this that camera you just showed? So what’s the… >> Li Zhang: Yeah. >>: Separation? >> Li Zhang: Um, forty millimeters. >>: So it’s… >> Li Zhang: Yeah. >>: Pretty small compared to the [indiscernible]? >> Li Zhang: Right. So this is another comparison; it’s very hard to make traditional video stabilization work on this type of scene, which is maybe a little bit contrived to illustrate the concept. >>: In those two samples the amount that you’re cropping out seems to be somewhat minimal compared to some of the other consumer-grade stabilization techniques. >> Li Zhang: By cropping out you mean the… >>: Yeah, the amount of video that gets lost. >> Li Zhang: Right, we try to minimize that. You can optimize the position so that the cropped-out region is minimized. >>: Yeah. >> Li Zhang: Yeah? >>: So you [indiscernible]… >> Li Zhang: Uh-huh. >>: If the array was really small… >> Li Zhang: Uh-huh. >>: It would be like a centimeter or one centimeter… >> Li Zhang: Uh-huh. >>: This virtual view can go outside the grid, the square, and then it becomes like an extrapolation from… >> Li Zhang: Uh-huh, right. >>: So in reality do you think this [indiscernible] would work?
Like which factors of that estimation… >> Li Zhang: It becomes, yeah… >>: [inaudible] >> Li Zhang: Okay, if these red cameras are closer… >>: Right. >> Li Zhang: This blue one has to go outside much more. >>: Yes. >> Li Zhang: Then it becomes extrapolation, and if you do that it will certainly be much harder; the depth needs to be more accurate, yes. Yes? >>: [indiscernible] from the [indiscernible] cameras. >> Li Zhang: Right, right. >>: Yeah. >> Li Zhang: If you do interpolation that’s easier, yeah. Okay, so we compared the result with, oops, iMovie; see, it has a hard time stabilizing this. This is another result. Okay, so there’s one key thing in this method. If we implemented this idea in the more straightforward way, we would run some sort of structure from motion to get the 3D trajectory, then smooth this red trajectory to get the blue one. That’s the more straightforward way of implementing it, but if you do that you run into problems, because in all these scenes you don’t have that many features to track. The key idea is that when we do image-based rendering we don’t really care about the absolute location of the red trajectory; we only need to know the relative pose between the virtual camera and the physical camera. That’s all we care about. So we can formulate this optimization problem: we want to find a sequence of relative poses such that, if you generate the virtual video, all the salient features, and by salient features I mean the edge map, move smoothly in the virtual camera. That’s why we can completely avoid structure from motion, and that’s why we can get results on these very difficult scenes. Okay, so if we can generate one virtual view, we can also generate two virtual views; then if we have goggles you can imagine getting 3D movies out of this. You get the concept. And since underlying this technique we have the depth, we can augment the video with virtual objects, like this virtual ball rotating around the guy; we can easily model the occlusion relationships. Okay, so that’s, yes? >>: How did you get the depth? Did you write your own stereo… >> Li Zhang: Yeah, yeah, we have… >>: Stereo [inaudible]… >> Li Zhang: A few, maybe two or three, depth map estimation papers at CVPR; yeah, we use those. Other questions? Okay, so then I will switch to another topic, still in terms of images. Maybe after this I’ll stop; I’ll cut off the other stuff. We know that if we can estimate the rigid 3D scene we can organize images; that’s what Photo Tourism / Photosynth is doing. So I’ve been thinking, how about we extend this idea to non-rigid structure from motion, to non-rigid scenes? One thing I did earlier is that if we can capture a whole lot of 3D shapes, then we can very easily navigate through the shapes and maybe come up with new shapes, with an interface like the standard graphics interface where you manipulate the object and get new shapes. So I’ve been thinking, can we use a similar interface for 2D images without getting into 3D reconstruction?
One scenario would be: this guy, Simpson, saw a person somewhere, he forgot the exact shape, but he can query faces. You’ve got a whole bunch of face shapes, and this person wants to nail down, from these millions of faces, the exact shape in his mind. I don’t have the specific picture, but I have a rough idea: maybe this person has a big mouth or a smiling face, something like that. In order to do this, the difficulty is how to refine the search result, for a particular smiling style or a particular pose he wants. Right now, for example, one interface we came up with is this: given the currently retrieved face, you can just drag the nose to the left and you get a set of front-looking faces. Another example is you do a pinch on your touch-pad interface and you get a set of open-mouth face images. This type of thing exploits geometric attributes which are hard to put into words but easy to manipulate on your cell phone or whatever touch-pad interface. It could be useful for criminal profiling or photo management. Again, I’ll probably skip the details and just show you a prototype we built. [video] Okay, I guess you get the idea. To really make it more useful we would have to lock down the identity, so that the identity doesn’t just change. The problem in practice is that we cannot find every actor or every public figure with all the different expressions and all the poses in our database, so we had to remove that constraint. >>: How big was that dataset [inaudible]? >> Li Zhang: We have about ten thousand images. We do a pre-processing step to get all the landmarks detected; that probably removed five thousand, because some alignments were not very reliable, so we removed them from the final search database. >>: But when you say you don’t have the actor in all expressions, that’s because you’re going to Google image search or something like that. If you actually started analyzing the movies… >> Li Zhang: Uh-huh. >>: An actor over the course of a movie will probably pose in… >> Li Zhang: That’s true. We didn’t do that; we just used a database, I think Columbia collected this dataset. >>: Okay. >> Li Zhang: So yeah, okay, I think I need to wrap up, although I have some other stuff. One thing is that geometric attributes are not enough. What you may really want is to type in, say, red hair or the guy with a mustache, so we need some appearance attributes to make this more useful. The other thing is that we really need more robust face alignment for this to be usable. This is a little bit different from Photo [indiscernible], where you can take a million pictures of the Statue of Liberty, and if the feature matching doesn’t work you can throw away maybe half of them and still have a lot. Here, for faces, maybe we’re interested in a certain interesting smile, and if you throw that smile away then you’re really missing the pictures that people may want to find. So we really need this. We have a face alignment paper from last year. When we worked on this problem we realized a very difficult problem in practice, which is all these different datasets: each university has its own dataset, and each dataset has a different definition of landmarks.
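A toy sketch of the drag-to-refine idea described above: the user's drag edits the current face's landmark configuration, and the database is re-ranked by distance to the edited configuration, with the dragged landmark weighted more heavily. The array layout, weighting, and function name are hypothetical, not the prototype's implementation.

```python
import numpy as np

def refine_by_drag(landmarks_db, current, drag_point, drag_delta, k=10, weight=5.0):
    """Re-rank a face database after a drag edit on one landmark.

    landmarks_db: N x P x 2 array of aligned facial landmarks for the database.
    current:      P x 2 landmarks of the currently retrieved face.
    drag_point:   index of the landmark the user dragged (e.g. the nose tip).
    drag_delta:   2-vector giving how far it was dragged, in the same coordinates.
    Returns indices of the k faces nearest to the edited landmark configuration.
    """
    target = current.copy()
    target[drag_point] += drag_delta
    w = np.ones(len(target))
    w[drag_point] = weight                 # let the edited landmark dominate
    diffs = landmarks_db - target          # N x P x 2
    dists = (w[None, :, None] * diffs ** 2).sum(axis=(1, 2))
    return np.argsort(dists)[:k]
```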
Coming back to the datasets: one dataset has something like fifty points, but another has twenty points, so it’s very difficult to merge these datasets together to get a more robust face alignment model. So, how many more minutes do I have? >> Michael Cohen: About five. >> Li Zhang: Okay, so because of this, more recently I’ve been thinking that maybe we want a different representation for faces. Instead of figuring out these landmarks, we can give each pixel a label, whether it’s skin, cheek, nose, teeth, or hair. So we just use a soft segmentation as the representation for faces, which can probably handle hair in the future; it’s very hard to use contours and landmarks to model hair. That’s one motivation, and we also want to make the method general so that we can use it for street-view scenes as well. I will just show you the current results we can get. This is one result from the last ECCV: frame-by-frame, example-based parsing of street-view scenes. If you pay attention, we usually miss smaller objects, but for the big objects like sky, buildings, and road surfaces the method tends to work better. So that’s for that. For faces, at this CVPR we’ll have a similar but different example-based method which works quite well; we compare with other face segmentation methods. Just to give you a sense, let me show you a demonstration. One nice thing: right now we cannot handle hair yet, but with this segment-based method it’s possible to model teeth, and sometimes the teeth are an important cue if you want to analyze the lip region. >>: Do you get the same orientation with the labeling method versus the contours? Like before, you were able to drag points to set the orientation or the gaze of the face; can you still accomplish that with this method? >> Li Zhang: That’s a good point. With this method you can integrate the contours in, yes. In order to drag these points you have to have some anchor point to drag, right? With this segment-based representation we can integrate the contours as well, yes, we can do that. Okay, I guess that’s it. [applause]