>> Cha Zhang: Okay. Good morning, everyone. It's my great pleasure to have Professor Minh Do here. Professor Do is an associate professor in the Department of Electrical and Computer Engineering at UIUC. He has received numerous awards, including a CAREER award from NSF and an IEEE Young Author Best Paper Award in 2008. He was named Beckman Fellow at the Center for Advanced Study, UIUC, in '06, and received a Xerox Award for Faculty Research from the College of Engineering. He's also a cofounder and the CTO of Nuvixa Corporation, a spinoff from UIUC, to commercialize depth-based visual communication. So without further ado, let's welcome Professor Do. >> Minh N. Do: Thank you very much. It is my great pleasure and honor to be here. It's my first time visiting Microsoft Research. I've always admired a lot of the work coming out of this place, and it's a great pleasure to connect with some old friends as well as see some people I only know through papers and publications. And it resonates a lot with the recent activities coming out of Microsoft. I think Microsoft is now truly the center of the universe around depth sensing and how to use it for immersive visual communication. So, again, it is a great time to be here, and I hope we can start up some collaboration and some discussion afterward. So, a bit of background -- I've been working at the University of Illinois for almost ten years now, and my original background was in developing high-dimensional wavelets for image representation. So, you know, you can tackle problems like compression, de-noising, and so on. And about five years ago I got very interested in depth cameras, the technology that can provide, in addition to the typical color image that we as image processors would deal with, a very different type of visual information, namely, the depth information. And a very interesting question to me is how we can develop new techniques to exploit that new type of data, in particular for visual communication applications. So let's first start with -- this is something that we dreamed up in 2005, a vision that, you know, we can extend beyond television, the typical visual communication where we only have a single fixed camera recording the event and the persons, and the viewer just passively stays still and watches the content. But can we -- for example, it is a very nice pleasure to be here in person, and we move around, talk to people. So the freedom of not just having a fixed camera, but I can walk around and see information in, you know, 3D in space and [inaudible]. I think certainly that is a very big freedom that we can add. So what does it mean? So there is a picture actually drawn by somebody here in the audience [inaudible] my former Ph.D. student at UIUC and now working at Microsoft. The vision that we were very excited about at that time is, you know, we can have multiple cameras recording a dynamic environment, a scene, and then we would do a lot of very interesting advanced signal and image processing on that [inaudible] information, have a very efficient representation, and then transmit that across a network or storage so that later on we can view the information, individual persons or different, you know, sessions -- we can view that at a different viewing angle, different perspective, different position. So that was the dream. And there is a setup, but as you can see, it can manifest itself in many other applications.
For example, how can we generate stereoscopic streams, right? We can record the views, and now people are looking at autostereoscopic displays for which you have to synthesize multiple views, and we only have a limited set of recordings, so now I have to synthesize those views. So that's one. Also, you know, if you take the person out, you want to extract and change the viewpoint slightly, for example, do some eye gaze correction -- all of these problems, to me, are fundamentally a problem of view synthesis using live recorded data, and that is going to be an exciting problem. So the vision which, again, got us very excited is that existing audiovisual communication still only uses a single camera, very little processing, and the viewer just stays there passively. So as we look to the future, the cameras, the sensors, are getting very, very cheap, and we have a lot of computing power and bandwidth available to us. How can we provide a new viewing experience? And I think that's a very exciting research problem. Why? Because in the middle of it is how we can develop new signal processing theory and algorithms to capture this kind of new setting and deliver a new experience. So shortly after we decided on this problem -- and now it has become such an exciting time for us -- there's a new kid on the block that provides not just a regular camera that has color information, but a new type of sensing capability that can measure depth information in realtime at video rate. So that provides us additional information, and this device now, as, you know, with SoftKinetic, is going to reach the consumer space at a low cost. So, again, how can we exploit this to deliver this visual communication experience? And, again, I'm fully aware that this is also something that's very actively researched and done here at Microsoft Research. So I'm coming here to learn and look forward to new opportunities to collaborate. So let's first look at -- again, my background is signal processing, so first we look at the image, we look at the signal. So that's one example from a depth camera. We have one of these -- you know, when we started several years ago it was very expensive. It's like a $10,000 PMD camera. Now we're so excited that with the same amount of money we can get a hundred of those Kinect cameras and play around with them. But the fundamental problem there is that the depth information tends to be noisy, has a lower resolution, and there are some occlusion problems because of the way the data is measured. So zooming in, again, you see those very important problems there. So, again, as someone who has been working on image and video processing, dealing a lot with regular color images, now it's a very different type of data, and I'm daunted because, okay, what are the processing algorithms we can develop here? So now go back to the vision that we set out earlier. Can we generate multiple viewpoints in a free-viewpoint video based on some fixed recording? So let's make the thing concrete. I'm going to start, you know, studying and looking at the problem in the following setting. We want to have a fixed set of cameras. There are two color cameras. In the middle we put a depth camera, and we want to record a 3D environment, right, a dynamic 3D scene. And then based on this recording, we want to use a small number of cameras so that it's very efficiently transmitted over the internet or stored.
We would like to synthesize an arbitrary view and, you know, if we have time we can freely move around here, or you can synthesize them as a stereo pair. Then we can look at the environment in 3D at any viewing angle. And, you know, we want to do it fast. So that is a concrete setting, and when we look at this problem -- and, again, we got [inaudible] -- what are the key challenges? One is that the depth cameras we dealt with at that time tend to have very low resolution. So how can we couple that with the existing, valuable color image with high color quality, high resolution? Combining those is going to be one key issue. And, you know, we're interested in a setting where a camera can be dropped in very easily without very complicated calibration techniques, so the camera doesn't have to be in an integrated system, so that, you know, it can [inaudible]. And the next one is, at that time also, you know, parallel computing, which has now become mainstream: how can we deploy these kinds of computing algorithms on a parallel platform so that we can have, you know, realtime, high speed-up? Okay. So let me jump in and explain our algorithm, which borrowed ideas from many different techniques. It's just our attempt to develop and deliver some of this free-viewpoint experience and then set that up as a framework to analyze the quality of the possible reconstructed image. Okay. So, again, here is the setting. We have one depth camera in the middle and two color cameras. So the algorithm is, you know, very trivial. First, because we have the depth information, we can go back and synthesize the depth information from this middle one at these two color cameras. Now, after we do that -- again, the simple point of why we did that is because then at those color camera positions we have both color and depth, and because we have that, we can exploit the rich, high-resolution, high-quality color image to help solve the problems with occlusions and low resolution in these [inaudible] depth images. And then based on that we do processing and produce a very high quality depth image here. So, again, the key point is how to combine depth with color. And when we have this depth information per pixel at those color cameras, then we can, you know, do a reprojection here to the desired virtual viewpoint. Okay. So those are the three steps. And, again, we purposely thought about how to map the whole thing onto a GPU so that we can utilize all sorts of hardware acceleration that's provided with the GPU. Okay. So, again, my background is in signal processing, so I'm treating this first as a sequence of processing steps, so let me walk through this sequence of processing steps that combines the depth and color into a high-resolution enhanced depth. So the first step, if we remember, is we take the depth from a fixed point and we warp it to the color viewpoint, and the depth we use is very low resolution so you can hardly see it -- so what you have is a very sparse set of [inaudible]. And the color -- they are color coded here so that they show, you know -- I think the darker the color, the closer the object, and vice-versa.
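Before continuing with the cleanup steps, here is a minimal sketch of the propagation/warping step just described, in Python/NumPy; the function name, the pinhole-camera parameters `K_d`, `K_c`, `R`, `t`, and the zero-marks-a-hole convention are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def propagate_depth(depth_d, K_d, K_c, R, t, color_shape):
    """Warp a depth map from the depth camera into a color camera's view.

    depth_d     : (H_d, W_d) depth in meters from the depth sensor
    K_d, K_c    : 3x3 intrinsics of the depth and color cameras
    R, t        : rotation (3x3) and translation (3,) from depth to color frame
    color_shape : (H_c, W_c) resolution of the color camera
    Returns a sparse depth map at the color camera (0 = no sample), with a
    z-buffer so nearer points occlude farther ones.
    """
    H_d, W_d = depth_d.shape
    H_c, W_c = color_shape

    # Back-project every depth pixel to a 3D point in the depth-camera frame.
    u, v = np.meshgrid(np.arange(W_d), np.arange(H_d))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(u.size)])   # 3 x N homogeneous pixels
    z = depth_d.ravel()
    valid = z > 0
    rays = np.linalg.inv(K_d) @ pix[:, valid]                 # rays at unit depth
    X_d = rays * z[valid]                                     # 3D points, depth-camera frame

    # Transform into the color-camera frame and project.
    X_c = R @ X_d + t[:, None]
    z_c = X_c[2]
    p_c = K_c @ X_c
    u_c = np.round(p_c[0] / p_c[2]).astype(int)
    v_c = np.round(p_c[1] / p_c[2]).astype(int)

    # Keep points that land inside the color image and in front of the camera.
    inside = (u_c >= 0) & (u_c < W_c) & (v_c >= 0) & (v_c < H_c) & (z_c > 0)
    u_c, v_c, z_c = u_c[inside], v_c[inside], z_c[inside]

    # Z-buffer: if two points hit the same color pixel, keep the nearer one
    # (the simple occlusion-removal idea described in the talk).
    warped = np.full((H_c, W_c), np.inf)
    np.minimum.at(warped, (v_c, u_c), z_c)
    warped[np.isinf(warped)] = 0.0                            # 0 marks holes to fill later
    return warped
```

The zeros left behind are exactly the occluded or unsampled regions that the interpolation, background extrapolation, and edge cleanup steps discussed next have to fill.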
But then, you know, because of those points we know which object is closer than the other, so we can eliminate those -- you know, the white points here have to be occluded by those darker points, so we simply do this occlusion removal and now we have a more accurate set of [inaudible] representing the 3D scene in front of us. But, still, it's very poor quality. So now combining depth with color, that's where the power comes in, and one of the very popular techniques -- again, we picked this one because we can map it to the GPU very fast, and we'll show some numbers later -- we quickly get an interpolated depth, you know, filled in in terms of holes, and higher resolution. But, again, you see that a lot of the problems are still here. Again, you know, I was trained as an image processor, so to me the best way to explain the algorithm is to go through the sequence of intermediate steps -- the details of those steps, you know, later on I can show the references and you can look into the details, but let me just go through these images to try to deliver the point. So the next step is, again, we still have those big holes in the middle because of occlusion, so obviously when we warp them like this [inaudible] we know that this part has to belong to the background. So we know that that is background, and using just simple 1D interpolations -- sorry, extrapolations -- we can look at the [inaudible], extend them out, and fill them in. So that's another step. And now we're going to fill them in. Without doing that -- sorry. Simply filling this hole by interpolation would just smear things out, right? But with a little bit of understanding about what the underlying 3D object in the scene is, because now we know the depth information, we can do a better job in filling these occluded areas. Okay. Now, after that, one thing we learned is that for the depth image at the frontal view, we have a very shallow boundary, but as soon as we move away from that view, there's a term people now call flying pixels, which are pixels that, you know, lie between the foreground and the background, and when they measure depth they get averaged -- and this is along this line here, and they are very poor quality. So when you warp them over, these ones show up, so we have to develop something that can clean this up. And if you look at this -- first, you know, my students showed me this and, you know, how can we clean this up? It's a really hard problem. But then we realized that if we have color information next to us, because we have some very nice color image here, then we know exactly at that viewpoint how we can find the exact edge so that we can delimit the background, the foreground, and the in-between pixels in our processing. So with that, we can do some, you know, very simple tricks, some very simple and efficient techniques, that can give us this clean image. All right. So the example result, again, is not perfect because you can still see some of these artifacts here, but what it shows is, you know, we take two images and the depth in the middle that has poor resolution, we can synthesize a high quality depth map and then, based on that, we can render an image at a novel viewpoint. Here is another perspective. Okay. And here is a video where we can fly through and see the object. Again, we can synthesize any view, so you can actually see them as -- you can [inaudible] a stereo pair and you can see things in 3D.
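To make the disocclusion-filling step concrete, here is a rough 1D sketch of the background-extrapolation idea just described; the scan-line formulation, the function name, and the zero-as-hole convention are my own simplifications, not the talk's exact method (the flying-pixel cleanup would be a similar pass, but using the color image's edges to decide which side of a boundary a suspect depth sample belongs to).

```python
import numpy as np

def fill_occlusion_holes(depth, hole_val=0.0):
    """Fill occlusion holes in a warped depth map by extending the background.

    For each scan line, a run of missing pixels is filled with the *farther*
    of its two neighboring valid depths, since regions exposed by the viewpoint
    change must belong to the background, not the foreground.
    """
    out = depth.copy()
    H, W = out.shape
    for y in range(H):
        row = out[y]
        x = 0
        while x < W:
            if row[x] == hole_val:
                start = x
                while x < W and row[x] == hole_val:     # find the end of the hole
                    x += 1
                left = row[start - 1] if start > 0 else None
                right = row[x] if x < W else None
                candidates = [d for d in (left, right) if d is not None and d != hole_val]
                if candidates:
                    row[start:x] = max(candidates)      # larger depth = farther = background
            else:
                x += 1
    return out
```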
And all of this can be done in realtime -- I will show some numbers; we mapped it efficiently to the GPU. Some of my colleagues and students worked on exploiting the color, and they have those steps mapped onto a GPU. Okay. As you can see, it's a [inaudible] of algorithms, and one thing that we've been looking into is, well, can we quantify accurately the error that can result from this set of, you know, processing steps in synthesizing the final image, and use that to maybe give a prediction of, you know, what the best configuration of the scene is, or -- you know, in compression, I have depth and color; how to best allocate bits between depth and color to deliver the final high quality synthesized viewpoint. So that is a problem that we'd like to analyze. Of course now we switch gears to try to analyze it, and, you know, we do a lot of simplification here so that, you know, we still keep the problem tractable. And we use as the framework the propagation algorithm I just described and look at a simplified version of that. So let me look at a very simple setting here first to try to deliver the point. So imagine that we have some, you know, surface, some object in 3D, and at this viewpoint here, through propagation, we have a set of depths corresponding to this camera and the color pixels. So we have that set. And now we want to synthesize the new viewpoint based on that set of per-pixel depths. And we can set the -- sorry. We can, you know, develop the setting in which, you know, [inaudible] surface. We throw in that the cameras have certain resolutions, and the depth and the texture of the color image have certain accuracies. You know, this, again, based on my training is, you know, let's say we do coding with a certain bit rate, and here's a certain distortion, so we can, you know, quantify that. And now, given that, what is the best way to deliver a high quality image? A classical problem of, you know, rate allocation. So given the setting here, let me give you the gist of this approach. So here is our actual image, one single actual image. And here's our virtual viewpoint where we need to synthesize the color image. So what we do is we take this pixel, we know the depth, we know that, you know, along this [inaudible] it varies, and then we hit the surface and then we [inaudible] them back here. So now we have a color pixel over here. Now, what happens when this depth is noisy? So, you know, we would go [inaudible] here. That noise is either due to noise from the measurement or noise from the quantization due to coding. It goes up somewhere here, and now we warp, so the pixel now gets moved over here. So we can quantify what that [inaudible] is, right? And it's a simple geometrical argument. You can [inaudible] that. The next thing is now we look at the virtual viewpoint, and at the virtual viewpoint we have a set of pixels here, and the ideal function we want to reconstruct is here. But instead of having the pixels here, due to the [inaudible] we have those pixels, and then we have the measurements here. So the noise comes in two parts. One is the [inaudible] now gives you the wrong location due to the depth measurement, and the other one is due to the color, so then you have this. And given this new set of [inaudible], we just do a simple interpolation, right? Do simple interpolation.
Here we show the linear interpolation, but for the technique, you know -- let's say you can pick your favorite [inaudible] interpolation, for example -- we can bound that. And then we can bound the error of this reconstructed image against the underlying original image. So we can bound that. The next thing is, remember that we don't just have one actual image, we have a number of them, and they go through the 3D scene and we propagate back to the virtual viewpoint. So we have a collection of those, and we can use a randomized argument to characterize the density of those functions at the sample points. It turns out that we can, you know, write that as a closed-form formula. And putting all of those things together, we have a final bound on -- we can characterize, with a very classical technique for randomized, random sampling, the moments of those differences in sampling interval after we warp those points over here. Okay. Even though those look very, you know, heavy, actually we can compute those numbers exactly. So given that, we can put it together and find exactly what the error is when we do the reconstruction. There are a number of terms here. So let me explain -- so, you know, first this one here is telling us how smooth the texture of the scene is, right? So if the texture is very smooth, then the error will be small. The other one here -- this one here is fully dependent on the configuration of the cameras in the scene. And then there's an error due to the measurement in texture and the measurement in depth, and they all [inaudible] into this. And all of these terms here we can work out in closed form. Okay. Now, of course, the question is -- okay. Actually I just explained exactly what those terms have to be. So, you know, again to recap: this term, you know, encodes the geometry of the cameras, you know, what the scene is and how we place the cameras. This term here is telling us about the sampling densities of either the color or the depth pixels. And, you know, the other terms encode the accuracy of the depth and color measurements. So, again, this gives us some particular analysis, and we can have a sense of, well, as the technology improves, how the reconstruction error is going to [inaudible] as we get better and better accuracy. So even though that is a very messy formula, actually, you know, in certain cases we can find the exact closed form and plot it out. And, again, our goal here, trying to make the problem tractable, is to look at the behavior of this setting. So here is a scenario in which, you know, we have several cameras placed along a line, and we want to synthesize the intermediate viewpoints. Then, for example, our theoretical prediction is how the error is going to behave as we have more and more samples, right? And we can synthesize the scenes -- we have everything exactly set up -- then we can measure what the actual error tends to be, and we can see that it follows very nicely this theoretical bound and, again, accurately reflects what we predict. For example, this guy, the error decreases linearly as the number of samples increases. Of course, we can extend that to 3D, you know, and we can also extend it to when there are certain occluded areas: we can note those boundaries and add in some additional terms. Okay.
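A worked version of the geometric argument sketched above, in notation that is not from the talk: with focal length f (in pixels), baseline b between the actual and the virtual camera, and a scene point at depth z, a depth error of \delta z displaces the warped pixel by approximately

\[
\delta x \;\approx\; \frac{f\,b}{z^{2}}\,\delta z .
\]

The final bound then collects terms of exactly the kinds listed above: a geometry factor, a texture-smoothness term scaled by the sampling interval, a texture-noise term, and a depth-noise term amplified by the same f b / z^2 factor. Schematically (this is an illustrative form consistent with the description, not the exact closed-form bound from the papers):

\[
\mathrm{MSE} \;\lesssim\; C_{\mathrm{geom}}\left( \|I'\|_{\infty}^{2}\,\Delta^{2} \;+\; \sigma_{t}^{2} \;+\; \left(\frac{f\,b}{z^{2}}\right)^{2}\sigma_{d}^{2} \right),
\]

where \Delta is the expected spacing between warped samples at the virtual view, \sigma_t is the color (texture) measurement error, and \sigma_d is the depth error.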
So the main idea here, I guess, is, you know, we couple a particular technique, a method, to synthesize the viewpoint, and we want to quantify the accuracy of the reconstructed image based on the scene configuration and the characteristics of the cameras as provided to us. Okay. So the next part -- oh, and as I promised, this slide is slightly out of order. The timing we deal with is, you know, for example, we picked this particular algorithm that, you know, if we ran it on the CPU it would take a lot of time, but, you know, it maps very efficiently to the GPU because of the [inaudible] architecture, so we get the saving, but certain algorithms, you know, would not give us that big a saving. Overall, with the several components we put together, we can achieve realtime rendering. Okay. So that was, you know, 2007, 2009, some of the initial algorithms, and if you noticed carefully, the data that I showed earlier on was completely synthetic data. Right? We have a depth map, we synthesize -- well, we have a 3D model. We synthesize what the depth map has to be, and then we do the reconstruction. So now comes the real thing -- you know, the real thing came later, when these cameras became available, you know, and when we could afford to buy some. So we go ahead and have this simple setup here in our lab that has a depth camera here and a number of color cameras here. Somebody in the room knows that -- you know, you constructed part of that setup. There's some footprint here. And then, you know, when we have these images, what we try to do here is run the algorithm that I described earlier. So, namely, I use as inputs I, D, and B, and I want to synthesize what it would be at C, okay? So everything is real, and we try to set up and run the algorithm. So, again, here's one of my students. There are three inputs, two color views left and right and depth in the middle here. So just to show the result quickly here: left and right, and here's the rendered view, and here's the ground truth. Now, why is this an important problem? As you can see, the goal here is to be able to track where the eye gaze could be, and then we want to synthesize the view so that the person looks directly into the camera. So you can see both of these original viewpoints are a little bit off, but then we can synthesize the person [inaudible] on the camera. So you recognize certain artifacts here along those boundaries, and I will later show where they come from and how we can deal with them. >>: [inaudible] >> Minh N. Do: Yes. >>: [inaudible] >> Minh N. Do: Okay. So everyone recognizes that the depth camera we used at that time is very bad with [inaudible] reflections. Human hair tends to absorb a lot of [inaudible], and I know with the Kinect camera now there's some [inaudible], but, you know, for that camera, for example, with dense hair a lot of pixels go missing, so that's the way we [inaudible]. All right. So, again, just the sequence of steps and how we process them. So, you know, we propagate the depth, remove the occluded points, and -- you know, after having this very coarse point cloud we can synthesize a higher quality depth, and we can fill in the holes by doing this occlusion filling and edge enhancement. So we end up with a much higher per-pixel depth quality here, and then we synthesize. Now, the nice thing about this is, because we have this, the parts along the edge here, we can easily remove, and now we extract, you know, the person from the background.
Some of those artifacts are due to the compression. Okay. So that is how we process a real image, and understand that we map the actual depth maps from an actual depth camera into, you know, the other color viewpoints, and then each of those color viewpoints now has full per-pixel depth plus color. And we strongly believe that that is a very efficient representation of the data -- with this data we are now ready to transmit and send it or store it somewhere so that viewers, either remotely or later in time, can quickly, you know, view the scene from a different viewing angle. Okay. So now comes the next problem, which is how can we -- yes? >>: [inaudible] >> Minh N. Do: No. We assume that it's only, you know, one depth per pixel. >>: [inaudible] >> Minh N. Do: Yes. So the hole filling -- we realize -- let's say we don't have one view here, but we have multiple views. And each view has, you know, per-pixel color and depth. So we -- >>: [inaudible] >> Minh N. Do: We act like we have, you know, one single depth, multiple colors, but then we propagate them in the processing, and now from one depth camera we can synthesize multiple depth viewpoints. And if there's occlusion, then we have color, so we can fill in those occluded areas. >>: [inaudible] >> Minh N. Do: So, okay, maybe let me try to understand the point here. So when we first propagate and [inaudible] them over, we have a lot of those occluded areas, but the -- let me clarify the setting here. So the setting we have is one single depth camera, and we have two color cameras at two viewpoints. So first we use propagation and get the two depth maps. And, of course, each of those has a lot of those occluded areas. But the key thing we have is that at this viewpoint here we have color information, and we use that information to guide how to fill those occluded areas, and then, per color viewpoint here, we have its full depth, you know. >>: [inaudible] >> Minh N. Do: Each color, yes. >>: [inaudible] >> Minh N. Do: Oh, the view in between, it just -- yeah. >>: [inaudible] >> Minh N. Do: So we limit the view in between so that, you know, any pixel here has to be seen by one of these two cameras. So we limit what the freedom of this virtual view can be. >>: [inaudible] >> Minh N. Do: Oh, yes. Certain kinds, let's say, have some, like, you know, self-occluded areas, yes. >>: [inaudible] >> Minh N. Do: Oh, I see. I see. Yeah. Yeah. >>: [inaudible] >> Minh N. Do: Right. Right. >>: [inaudible] >> Minh N. Do: I see. I see. >>: [inaudible] >> Minh N. Do: I see. Yeah. We were thinking about, you know, we have, like, you know, a simple concave surface here, so let's say we look from this angle, a certain part here gets occluded and we have here now the -- >>: You would really need to know, oh, this is a sphere >> Minh N. Do: Sphere, yeah, yeah. >>: [inaudible] >> Minh N. Do: Or, you know, let's say, I think maybe you have, like, an object that is a convex surface, for example. Yeah. Yeah. Good point, yes. So if it is some, like, you know, very convoluted concave surface, then, yeah, we have those problems. So, okay, I think that's a good point to recap here, because now we assume that we have multiple views, and for each of them we have full color plus per-pixel depth after this kind of, you know, preprocessing. And now with this data we transmit them and then, you know, we let the user, the viewing side, simply, you know, do this viewpoint synthesis. Okay.
And that raises a very challenging and interesting question, which was tackled by Matthieu here, who did his Ph.D. at UIUC and is now working at Microsoft. So the key observation that Matthieu had is: if we have a depth and a color image, and assume that we do a post-processing -- sorry, pre-processing -- so that we fill all the holes, so we have a, you know, nice depth map here and color here, how can we best jointly compress these two images? So of course you can compress the depth and color separately, but how can we do it jointly? And the observation is, you know -- okay, for someone trained in wavelets and [inaudible] here, most of the bits are spent on encoding the edges, the locations, right, the significant coefficients. And what the two things share in common here is really along these -- they have the same edges, right? And, again, when we do this synthesis we really use the edges of the color to guide, you know, how to refine the edges of the depth. So, you know, they are sharing the same, you know, locations. So we want to use that. And I also want to show later that having explicit edge locations here also helps the synthesized reconstruction, the synthesized viewpoint, the view synthesis. Okay. So given these two images -- or, you know, they can be a video sequence, but let's say I consider coding them as an I frame -- then first we, you know, detect the edge points, and once we have these edge points we only save them once. So we code them, like, using chain coding. And then [inaudible] we can efficiently encode the color and the depth given that edge information. So the idea was very cleverly constructed by Matt. We look at the lifting scheme, and lifting is an efficient way that, you know, people use in [inaudible] 2000. You have a sequence of samples here -- so think about one scan line of the image. And you have a set of pixels here. You partition them into odd and even, and the odd samples can be predicted from the even samples, you know, from the two nearby ones, and through that prediction we subtract them, so those coefficients become -- tend to be very small, and then we can, you know, iterate that for several levels so that over here more of the coefficients will be, you know, zero or small and very few are going to be significant. So that's how we get the energy compaction. Now, the key trouble here is that typically we do this over the whole line of the image, but now we only, you know, look at the scan line. So we go along here and here's a break point, and then another point here, and along this segment here it is very nicely approximated by a, you know, low-order polynomial, so that with the lifting scheme a lot of the coefficients go to zero. But when there is an edge point here, that is going to give a significantly high-magnitude coefficient. So how can we eliminate that, given the knowledge that we know exactly where the edge location is? So the idea here is, when we do the lifting, we want to locally extend that area, and these are the pixels that we have to insert. And the way that [inaudible] simple -- they come up with, you know, a model, similar [inaudible] polynomial, we can easily extrapolate those pixels and come up with this very nice closed-form formula for that, and we can just insert them here, and those inserted pixels are computed from these existing ones here. And based on that, a lot of the other coefficients go to zero. No significant ones. So that was the key idea.
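A minimal sketch of the edge-aware prediction idea just described, for one lifting step on a single scan line; the exact transform and extrapolation formula are in the papers, so treat this as an illustration of the principle (never predict across a known edge; use one-sided values instead), not the actual implementation. Function and variable names are mine.

```python
import numpy as np

def predict_step_with_edges(signal, edge_after):
    """One lifting 'predict' step on a scan line with known edge locations.

    signal     : 1D array of samples (e.g., depth values along a scan line)
    edge_after : boolean array; edge_after[i] is True if an edge lies between
                 sample i and sample i+1 (known from the shared edge map)
    Returns (even, detail): the even samples, and the odd samples minus their
    prediction. A normal predict averages the two even neighbors; if an edge
    separates the odd sample from one neighbor, we predict from the other side
    only, so the detail coefficient stays small instead of blowing up at edges.
    """
    even = signal[0::2].astype(float)
    odd = signal[1::2].astype(float)
    detail = np.empty_like(odd)
    for k in range(len(odd)):
        i = 2 * k                                        # index of the left even neighbor
        left_ok = not edge_after[i]                      # no edge between i and i+1
        right_ok = (i + 2 < len(signal)) and not edge_after[i + 1]
        if left_ok and right_ok:
            pred = 0.5 * (even[k] + even[k + 1])         # standard linear prediction
        elif left_ok:
            pred = even[k]                               # one-sided: use the left neighbor
        elif right_ok:
            pred = even[k + 1]                           # one-sided: use the right neighbor
        else:
            pred = odd[k]                                # isolated sample: nothing to predict from
        detail[k] = odd[k] - pred
    return even, detail
```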
And, you know, zooming in on this image -- so this, you know, for [inaudible], would be an ideal image model to compress. Even with that, you know, at a very, very low bit rate we see a lot of these, you know, compression artifacts. Now, if we spend some bits as the overhead of coding the boundary here, then with the remaining bits we spend on coding the coefficients, because of these exact edge locations, we know those go to zero, so we have a lot of coefficients going to zero. Now, with the same bit rate we have a much, much better decoded image. Okay. Now, what does it mean? Well, taking this -- so here's the zoomed-in version here -- what it means is we take this per-pixel depth at a different viewpoint and now we synthesize a novel view. Then, because of all of these nasty edge points here, when we synthesize, it creates those visual artifacts. Now, with the nice edges that we can encode efficiently, the synthesized viewpoint has a much better quality. So, again, the coding gain now turns into a gain in the visual quality. Okay. So let me come to the final, you know, project that we've been working on. We recognize that certain depth cameras -- this is one of these Mesa SwissRanger cameras [inaudible], this is prior to the Kinect era -- have very poor quality, poor resolution, but then, you know, the color camera is cheap; we can get an HD color camera. We stick them together and the question is how can we improve the depth measurement from this. So you can see that, you know, if we have a Kinect, that would be exactly the same setup: a color camera next to a depth camera, fully registered -- how can we now provide an enhanced depth image? So let me just go to the algorithm. The algorithm is very simple. We tried many different approaches and, you know, this problem has also been looked at by many, many other research groups. The one that we found works best is what we call a joint global mode filtering. So the idea is the following. Assume that we have pixel p here, we have a color value, we have a depth value at it, and we want to enhance it now. So there is a guide signal, the depth and the color value here, and this function g here is just a Gaussian, so it's localized, you know -- so if this difference here is close to zero, it has a high value, and if, you know, they differ a lot, then it's a small value. So it gives us a sense of, like -- if you work on bilateral filtering, you know, this is the thing that you see pop up everywhere. And what we [inaudible] on is, you know, how similar these two guide signals are, how similar in location these pixels are, and then look at this value. And then we build this histogram, and then we just find the max value. All right. So maybe I'll explain by figure here, just to give a sense that the algorithm is actually very simple. So imagine that I have a noisy or low resolution depth map, and I want to know, at this pixel p here, what the depth has to be. That pixel p here could be one of the existing locations or one that we don't know -- we want to upsample. And we simply look through a window around it, and for those pixels we know, you know -- they have a known depth and color, and then we, you know, move them over. So if this guy has a very similar color or is very close by, then, you know, collectively they have some high weight, and so we have that weight.
And then, you know, we do the same thing with the other pixels, and then, you know -- collectively we build the histogram for that, we sum them up, and we have a function, you know, using those weights, and then we just pick out the max. So that's the simple idea. It turns out -- what we show with some analysis is that joint bilateral upsampling, which is, you know, the bilateral-filter way of combining color and depth to upsample that was proposed before, is simply an L2 minimization of this function, whereas this one here is the L1 minimization, and we found that it's much more robust and more accurate. So example results are shown here. So here is [inaudible], you know. So we're holding some image here, so we have color, the depth is poor quality, and with nearest-neighbor interpolation you can see a lot of these artifacts. And joint bilateral upsampling can do this. When you zoom in, there are those -- you know, because of this L2 averaging, L2 minimization, you can see that it's -- you know, it gets blurry. L1, you know, as we know now, gives a very robust estimation and it gives us sharp edges. People have also proposed, you know, very expensive approaches using 3D, which combine different [inaudible], but, you know, much more expensive than what we have here, and still cannot compete with this visual quality. So that's what, you know -- but then one more thing we realized -- is there any question? Okay. So one more thing we realized is, well, we can do that well per frame, but then when we play the video -- one thing we learned, which has now become another major topic in my research group, is how to exploit the temporal dimension, right? Because, remember, we're dealing not just with, you know, separate image frames, we're dealing with videos now. And [inaudible] the depth camera sensors now provide us video-rate information. So the question of how to incorporate temporal dependency and consistency has become a very interesting problem now. So just, you know, to give a sense of a simple fix we do -- and I will show the video shortly -- we look at the pixel here, and then, you know, let's say we can assume that we can reconstruct this one here. And now, you know, we look at the next one here. It might also have a reconstruction. And we first, you know, use a very simple, like, you know, candidate optical flow -- find out what the optical flow is -- and based on the patch similarity, which is a technique people now know from non-local means, for example, very popular in de-noising and recovering images, we look at that patch similarity here, give that a weight, and then from there we can infer, you know, what this pixel has to be. So a very simple technique, but to us it works amazingly well. And, you know, I admit this is not the final result yet, but at least it gives us some, you know, insight about how to exploit this, you know, temporal consistency. So let me show now the -- so this is the original video, taking the depth map from the low resolution camera and simply doing, you know, nearest-neighbor interpolation. Okay? And if we do it per frame, then -- per frame we can see that we have, you know, a high quality image, but then it's -- you know, the temporal consistency here, you see that per frame? It still gives that -- let me clear this to get to that video.
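A small sketch of the per-frame joint weighted-mode filtering just described; the parameter names, the Gaussian widths, and the discretization of the depth axis into `depth_bins` are illustrative assumptions, not values from the talk. The temporal extension mentioned above would additionally mix in weights from patch-matched pixels in the previous frame.

```python
import numpy as np

def joint_mode_filter_pixel(p, color, depth, depth_bins,
                            radius=5, sigma_s=3.0, sigma_c=10.0, sigma_d=0.05):
    """Estimate the depth at pixel p as a joint weighted mode (the argmax of a
    weighted histogram), guided by the high-resolution color image.

    color      : (H, W, 3) color image (the guide signal)
    depth      : (H, W) noisy / low-resolution depth, <= 0 where unknown
    depth_bins : 1D array of candidate depth values for the histogram
    Each known neighbor votes for its depth with a weight that is high when it
    is spatially close and similar in color to p; the answer is the bin with
    the largest total weight. Picking the mode corresponds to an L1-type cost,
    whereas joint bilateral upsampling (an L2 cost) returns the weighted mean
    and blurs across depth edges.
    """
    H, W = depth.shape
    y0, x0 = p
    hist = np.zeros(len(depth_bins))
    for y in range(max(0, y0 - radius), min(H, y0 + radius + 1)):
        for x in range(max(0, x0 - radius), min(W, x0 + radius + 1)):
            if depth[y, x] <= 0:
                continue                                  # skip pixels with unknown depth
            w_s = np.exp(-((y - y0) ** 2 + (x - x0) ** 2) / (2 * sigma_s ** 2))
            dc = color[y, x].astype(float) - color[y0, x0].astype(float)
            w_c = np.exp(-np.dot(dc, dc) / (2 * sigma_c ** 2))
            # Spread this neighbor's vote over nearby depth bins with a Gaussian.
            hist += w_s * w_c * np.exp(-((depth_bins - depth[y, x]) ** 2) / (2 * sigma_d ** 2))
    return depth_bins[np.argmax(hist)]
```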
So you can see those parts here, and, you know, I think if you view the whole video, you can see there's a problem that comes up. Now, if we just apply this, you know, very simple technique of enforcing temporal consistency, then the edges are much more stable now. But then we also realized that we run into the problem of how to synchronize color and depth, because with these cameras that we set up, they are triggered over two different, you know, USB connections, so when the person is moving, for example, there's this [inaudible]. So that creates another issue. Okay. So I hope -- you know, it's not yet a coherent story, but, you know, I think we're just really exploring, and as we deal with the real data we realize that, you know, there are more and more interesting open problems we have to deal with for this particular type of data and sensing device. So let me just conclude by saying that, you know, to me as someone working in signal processing, it is really a paradise, because we have multiple sensors, we have different modalities, you know, depth, color, and so we -- I collaborate with my colleagues on the audio part, for example, detecting where a person is, extracting the person out, and then we can [inaudible] audio at that location, for example. So this has multiple modalities. Very exciting that we can exploit it. And the [inaudible], so a lot of data, a lot of computation. Some of the computation has to be done [inaudible], right, because of different nodes, different cameras. And coupling with communication, how can we compress the data. So I think it opens a lot of very interesting problems for signal processing. And the applications are really exciting. So that really makes us excited about this. There are a number of papers here if you want to look at the details of some of the work I presented earlier on. So I want to -- you know, as Cha mentioned earlier on, the depth camera is very exciting, you know, [inaudible] exercise. We play with them, we run them, but then we also realized that, you know, it's also going to change the way people run and deliver visual experiences, communications. So two years ago, you know, my collaborator and I got together and [inaudible] up a company, and let me just, you know, in the next minute or two quickly show you some of the results we developed through that, you know, using the depth camera. We want to provide a visual experience that is, you know, much more engaging, much more present for people, you know, either remotely giving a presentation or in remote collaboration. So you can -- you know, there's software that we developed that can be [inaudible] with a Kinect camera. So let me very briefly show that video and then I will conclude with questions. [Video played] >> Narrator: In this tutorial you will learn to get started using StagePresence, so let's go ahead and give it a try. Make sure that your Kinect or Nuvixa camera -- >> Minh N. Do: So we developed software that runs on a Kinect now, something that we would not have dreamed about two years ago -- you know, everyone can buy a camera, a Kinect. Ten-plus million units out there. So the person here is -- so that's a typical scene here, and we can, you know, cut out the background. Of course, you know, the depth quality from the Kinect has all of these artifacts along the object boundary, so we also know how to fix those artifacts.
And just to show you an example application, we can extract the person out -- like, you know, they can be anywhere in the office, we can take the person out with a very nice, clean cut. And then when we have that, we can drop the person into a presentation or -- you know, if we think about that, it's a [inaudible] chart that I can walk through, some of my desktop sharing, for example. >>: [inaudible] >> Minh N. Do: Yes. So that's the next thing, because we can now change the eye gaze -- so, talking about that, we can correct the viewpoint so that the person always looks directly -- yes, yes. So that's our next step. And, of course, you know, we can leverage the depth to do a lot of tracking. So the person can just stand up there, give a talk, you know, and the viewer will see the person be present, interacting with the content. Yeah. So, you know, what I'm really excited about is that this new sensing device gives us not just, you know, the opportunity to do -- you know, research on new problems, but also the opportunity to develop, you know, real new applications. We released this as a sneak peek on our website, and within a month we had more than a thousand downloads. People try it, you know, people do, you know, the typical applications. We get [inaudible] people trying to do presentations. People also upload videos on YouTube. We have people who try to, you know, develop software for people in rehabilitation -- like, they want to do their exercises and they see themselves, you know, like riding bicycles, or they see themselves, you know, in the mountains. So a lot of these very interesting, you know, potential applications could be done with these depth-enabled devices. Okay. So with that -- yeah, thank you very much for your attention. [applause] >> Minh N. Do: Any questions? Yes, please. >>: Can you give some detail about how you synthesize the independent view? So is it per pixel? >> Minh N. Do: Yes. We synthesize per pixel, yes. >>: So you talked about there's a surface somewhere. >> Minh N. Do: Right. >>: How do you get there and how do you get from there back to the image? >> Minh N. Do: Oh, yeah. Sorry. So we assume that the scene, the surface, is a [inaudible]. So at this pixel we have color and depth. So the depth tells us, you know, how far that pixel is in the scene, so we go there, and then with that depth information, and knowing where the other view is, we can do what we call propagating that pixel. The color can now be copied over to the new view. So that's the operation that allows us to -- with color alone, you know, we've lost that information, and that's why stereo reconstruction is a very difficult challenge. But now if I have that per-pixel depth information, then I can easily go to the scene and [inaudible] it over here. But after [inaudible] that, then, you know, there's a lot of occlusion, a lot of holes, and the subsequent steps try to correct for that. Yes? >>: So you do not try to zoom in or change the position so much that you get gaps between the [inaudible]? >> Minh N. Do: We can zoom in, yeah. So, of course, you cannot provide a lot of, you know, zoom-in because then, yeah, as you said, the density of the new image is now much coarser. Then we have to do more interpolation to -- >>: [inaudible] >> Minh N. Do: Yes. So it's simple, just like a zoom. So you see when we do that video, it's not just simply, you know, turning sideways, but we can zoom in, zoom out. You see that fly-through.
Of course, you know, if you zoom in too much, like a digital zoom, then you start having pixelated -- >>: You are trying to build a surface out of your depth points? >> Minh N. Do: No, no, yeah. We want to avoid that, because, you know, surface reconstruction is a very, you know, expensive operation, and then it also makes the result become, you know, like some kind of computer-generated surface. Here, we want to capture the raw data measurement with only very simple processing that can be done in realtime. Yes? >>: [inaudible] >> Minh N. Do: Yes. >>: [inaudible] >> Minh N. Do: Yeah. Great. Yeah. It's really the mis-synchronization between the color and depth. So we have these two cameras, and each of them, you know, just gives us 30-frame-per-second video, and then we just take the -- you know, second [inaudible], you know, there's a [inaudible] or a skew between those two, and then we just pick, you know, one color and one depth. So when the object stays still, there's no problem. But when there's movement here -- so for the depth, the hand is here, but in the color it ends up over here. Then we have that problem. We know that in the Kinect there's no hardware synchronization, but for the camera, the [inaudible] we tried earlier on, it's very simple: if the two cameras are on the same board, they can be run by the same clock and trigger the capturing together. Then we can have perfect hardware synchronization, and we have fewer of those problems. >>: Sorry. Now that I know that, can you play the temporally stabilized video again? >> Minh N. Do: Sure. >>: Thank you. >> Minh N. Do: Yes. So let's first see the one without. So -- yeah, so, you know, really the artifact we'd like to remove is, you know, something like, you know, that [inaudible] coming in and out, but, you know, this is when we have a fast motion here, for example, that would be trouble. Okay. So this one here is the nearest-neighbor interpolation, okay? So you can see that it is very poor resolution especially. And the final one which -- okay. How can we -- the interpolation. So, yeah, if there's slow motion, then it is really good, but when there's fast motion, that's where we have a lot of trouble. But, you know, you see with this slow motion here -- oh, no, sorry, this is without. Okay. And without, now you can see that it -- it gets even worse. So we correct a little bit, but, you know, we run into the limits of the hardware. >>: [inaudible] >> Minh N. Do: Yes. So that temporal consistency that we do, you know, which we use, yeah, is one attempt. Maybe we have to -- thinking about this, maybe we have to not just use one previous frame but multiple frames, something that people who do encoding know: you know, multiple references can give a better prediction, for example. So, of course, it could improve the quality, but with some complexity, you know, to pay. But certainly -- we were very happy when we saw a single image, but then when we played the video we realized that, yes, that has now become a major issue. A lot of algorithms, if you read the literature, [inaudible] go on a pair of images in color and depth. So I think the depth information now, the ability to capture and record in realtime, so we've got this video -- I think it creates this interesting new problem, you know, how to enforce consistency across time for the depth video. >>: I have another question.
When you warp the depth to the texture point of view, there are certain areas where the depth does not have information, and you mentioned using interpolation without having any underlying surface assumptions -- >> Minh N. Do: Right. >>: You do propagation, but what -- what is the result, eventually, if you have a gap in the occluded area? What does the result look like? Is it more curved or is it more like [inaudible] type of thing given your current [inaudible]? >> Minh N. Do: Yeah. So the -- again, remember, there's a picture that we -- >>: [inaudible] that curvature continues >> Minh N. Do: Oh, I see. I see. Yes. Yes. >>: It could also be a plane. So what kind of result -- eventually if you get the interpolation you know [inaudible] turn around to see what it looks like. >> Minh N. Do: Yeah. Yeah. Yeah, I think that would be a big challenge. We only -- you know, let's say, you know, a small occluded area -- if we don't have color information, that is very challenging, but now we have color, and we use a little bit of the [inaudible] from the color, the sharp edges the color image provides, and we can fill in. But, yeah, certainly we -- I think if we have a large occluded area, you know -- some artifacts show up. And when you see the video, when, for example, in the fly-through the viewpoint changes slightly, then we don't see that. But as it moves around, those artifacts start to pop up. So I think we -- maybe we -- after that understanding, of course, the challenge is to go ahead and fix those problems, you know, so that we can have more flexibility. But also we realized that, you know, if we have small view corrections, then the thing is [inaudible] quite robust, and maybe the application for that -- that's why I showed with the real data, if the application is just really video communication, like the typical camera here, and we just slightly change the eye gaze so that the virtual camera behind the screen looks directly into the eyes, then that can be done, you know, effectively. Yeah. So not completely free viewpoint, but slightly changed -- I think that could be doable, yeah. Yes? >>: I have a question about how you synthesize these novel views. When you're rendering a new view, do you warp the depth and color information from a single camera or do you warp from two or three nearby cameras? >> Minh N. Do: Yeah. Great question. Yeah, we warp from multiple viewpoints, yeah. And, you know, there are already a lot of techniques in the literature that people have proposed for when you warp multiple viewpoints and then how you resolve the conflict, you know. So, again, there's some information from the depth that can show which pixels we should use and which to discard after they are warped to the virtual viewpoint, and then some techniques for how to fill in some of the, you know, missing pixels, some of the holes. The reason we use more than one view is exactly the question that, you know, was raised: with a single viewpoint there's occlusion and, you know, we cannot fill it, but now there's another one, so hopefully that occluded area will be visible from another viewpoint and then it gets filled in. When a pixel is seen by both, you know, we use some kind of robust, you know, interpolation so it doesn't smear out. Yeah. But for that technique, there's already a well-established literature. Actually there's a standard [inaudible] on how to do this view synthesis given a common depth, either from a single viewpoint or multiple viewpoints. Yes.
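As a rough illustration of the multi-view merge referred to in that answer -- my own simplification, not the standard reference method -- each source view is first warped into the virtual camera with its own depth, and then, per pixel, the candidates are blended when their depths agree or resolved by a z-test when they do not. The function name and the hole conventions below are assumptions.

```python
import numpy as np

def merge_warped_views(colors, depths, depth_tol=0.02):
    """Merge several views already warped into the same virtual camera.

    colors : list of (H, W, 3) warped color images (NaN where a view has a hole)
    depths : list of (H, W) warped depth maps (inf where a view has a hole)
    Per pixel: take the nearest candidate depth (z-test), then average the
    colors of all candidates whose depth agrees with it within depth_tol, so
    one view fills the disocclusions of the other instead of smearing them.
    """
    colors = np.stack(colors).astype(float)      # V x H x W x 3
    depths = np.stack(depths).astype(float)      # V x H x W
    z_near = depths.min(axis=0)                  # nearest surface wins the conflict
    agree = np.abs(depths - z_near[None]) <= depth_tol
    weights = agree.astype(float)
    weights[np.isinf(depths)] = 0.0              # holes never contribute
    wsum = weights.sum(axis=0)
    blended = (weights[..., None] * np.nan_to_num(colors)).sum(axis=0)
    blended /= np.maximum(wsum, 1e-9)[..., None]
    out = np.where(wsum[..., None] > 0, blended, np.nan)   # unseen pixels stay holes
    return out, z_near
```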
>>: [inaudible] >> Minh N. Do: Right. >>: [inaudible] >> Minh N. Do: Right, right, right. >>: [inaudible] >> Minh N. Do: Right, right. >>: [inaudible] >> Minh N. Do: Yes, yes. >>: [inaudible] >> Minh N. Do: Right, right. >>: [inaudible] >> Minh N. Do: Yes. >>: [inaudible] >> Minh N. Do: Great question. Yes. We didn't actually test with the Kinect, but, you know, we -- that's why we did the other problem that I showed last, which tries to attack exactly the problem you mentioned. The depth is poor quality, you know, so when you encode it, you encode a lot of that noise and you pay dearly for that. So now if the [inaudible] has a Kinect, the Kinect not only has depth, there's a color camera next to it, and we can use the color information to fix those, you know, holes or occluded areas. And then we have a pair of, you know, post-processed -- I'm sorry, pre-processed images. So, you know, the proposal we think of is: take those raw images, let's say a pair of color and depth from the Kinect, use the color image to fix and fill in those pixels, get a clean edge map, and then use that processed image for encoding. So then we can get about 70 percent of the bit rate as well as, you know, enhanced image quality. So, yeah, it doesn't mean we just throw the raw data [inaudible] into the encoder. That would not work. >>: I'm curious just to see [inaudible] >> Minh N. Do: Right. >>: [inaudible] >> Minh N. Do: So, for example, in that application here, what we showed is -- let's say one of the objects -- you can see the way you can cut out the object -- >>: [inaudible] >> Minh N. Do: Yes. >>: [inaudible] >> Minh N. Do: Yes. You can see that we cut out the object boundary very accurately now, and we do that with a Kinect camera. Why? Because of the depth. But you know, with the depth, if you just do that cutout, it is very noisy. But we have color, so we have a technique that does that in realtime and cuts it out very nicely. And we have that now. That information is valuable -- both depth and color -- and then we can use it for future encoding. >>: [inaudible] >> Minh N. Do: Right, right, yes. >>: [inaudible] >> Minh N. Do: [inaudible]. >>: You're going to have lots of holes everywhere. That's going to [inaudible] >> Minh N. Do: So -- so Matt here can elaborate more, but the key is we don't have to find all those edges. We only find some of the key edges -- that is expensive -- and then we just use that information to guide the encoder. The more we have, the better, but, you know, even just getting a few of them already helps to reduce the bit rate. For the rest we have to, you know, spend more bits on coding the residual. So, you know, like in encoding, if you don't do a good job on motion estimation, the residual will catch up, or, you know, you have poor quality, but, yeah, it is not going to fail miserably; it just degrades gracefully. >>: So from a coding point of view, you want to duplicate the data on the other side of this [inaudible] >> Minh N. Do: Yes. >>: It seems you have already put some processing on the image. So you save some bits but then [inaudible] >> Minh N. Do: Right, right. >>: [inaudible] >> Minh N. Do: Great point. I'll explain. We realize that, with up-sampling -- remember, when you up-sample, the whole thing is still a low-order polynomial, and even though, you know, you can have a large signal, if it's piecewise [inaudible] and we know where the boundaries are, the wavelet will just eat them up very easily, you know.
So in wavelets what happens is it does those transforms, so, you know, it goes back to a very low, low, low resolution, even much lower than the original, and then the rest remains in the high-frequency or high subband coefficients [inaudible]. So, yeah, it is [inaudible] you have cleaned them up, but when you encode them, [inaudible] takes care of that. So the bit stream, you know, in the end, is very small. So people have actually learned that, you know, you take an image, you clean it up, and it turns out to be much more efficient to encode the cleaned-up image than the original. Yeah. >> Cha Zhang: Let's thank the speaker. >> Minh N. Do: Thanks. [applause]