>> Eyal Ofek: Hi. It's my pleasure to invite Gaurav Chaurasia for a talk here. Gaurav is finishing his Ph.D. under the supervision of George Drettakis at INRIA in the area of image-based rendering. Gaurav. >> Gaurav Chaurasia: Thanks a lot for the introduction and a very good afternoon to everybody. I will be talking about my work on image-based rendering. So let's start from the basics. The basic pipeline in all of graphics is: first you do scene modeling, where you create a model of the scene; then you edit the materials, the BRDFs, and design the illumination of the scene; and that's when you get the final result. But all of these steps require the intervention of trained artists, depending upon how good an image you want to get. That brings image-based rendering into the picture, because anybody can take images, and if you could render the scene just from images, that would be really cool. The problem with image-based rendering is that you can create novel views from input images only if you can reconstruct the whole scene exactly from your input images. That turns out to be a very hard problem; reconstructing the geometry, materials and lighting of a scene is still active research, so you can only do approximate novel views. So the challenge with IBR is basically: how do we create an approximate novel view, and what is an approximate novel view? I'll try to answer all of these questions in my thesis; my work is mostly focused on the first part, which is how to create approximate novel views, and I was lucky to be part of a project and a team that worked on perceptual analysis of image-based rendering. So let me start by explaining the problem statement in more detail. We are trying to do image-based rendering of urban scenarios. We're looking at datasets, at images, which have a lot of architecture, trees, cars, complex geometry. We're trying to use images captured with hand-held cameras only. We don't want to depend upon laser scans. If we had this data, it would be great, but we don't want to rely on dense data. We would like to do this with as few images as possible. The example you see here is an example of a dense capture, lots of images captured of the scene. We'd like to do this with sparser and sparser captures, because that allows us to do very large scale image-based rendering of really huge datasets. And lastly we'll try to approach the problem of free viewpoint navigation. Most existing work does view interpolation, where you jump from one input view to other input views, but doing free viewpoint navigation is a much harder problem and we'd like to start working in this direction. So, a brief overview: image-based rendering started with light fields and image-based modeling way back in '96. There's been a lot of work, as I said, on view interpolation. This field has spawned a number of other applications like upsampling of videos and camera stabilization. As I said, all of my work is focused on IBR of urban scenes. And we have a lot of commercial products in this field already. All of these products use a very basic form of image-based rendering where you have one plane, you jump from image one to image two, do basic blending, and approach the problem as well as possible. A standard approach is to reconstruct the geometry of the scene; then any pixel in the target view can be recreated by back-projecting onto the geometry and reprojecting into the input views. 
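As a minimal sketch of the geometric operation just described (an illustration with assumed pinhole conventions, not the speaker's code): given a depth value for a target-view pixel and calibrated cameras, the pixel is back-projected to 3D and reprojected into an input image to fetch its color.

```python
import numpy as np

def backproject_reproject(u, v, depth, K_tgt, R_tgt, t_tgt, K_src, R_src, t_src):
    """Map a target-view pixel (u, v) with known depth onto an input (source) image.

    Cameras are assumed to follow the pinhole model x ~ K (R X + t); this is an
    illustrative sketch, not the exact convention used in the talk.
    """
    # Back-project the target pixel to a 3D point in the target camera's coordinates.
    ray = np.linalg.inv(K_tgt) @ np.array([u, v, 1.0])
    X_cam = depth * ray
    # Move the point to world coordinates.
    X_world = R_tgt.T @ (X_cam - t_tgt)
    # Reproject the world point into the source (input) camera.
    x_src = K_src @ (R_src @ X_world + t_src)
    return x_src[:2] / x_src[2]   # pixel coordinates in the input image
```

If the reconstructed geometry is accurate, fetching colors this way gives a clean novel view; the talk's point is that inaccurate geometry makes this naive lookup produce artifacts.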
And the earliest work was indeed based on this approach, and it actually works really well if your geometry is good. As it turns out, the geometry of urban scenes, and for that matter of any dataset with enough complexity, is not completely accurate. And that's why most of the recent work has focused on alleviating the artifacts caused by slightly inaccurate geometry. For example, in this work optical flow is used to fix ghosting artifacts. You project the input images onto the proxy; if there is an alignment issue, optical flow can realign them. This works well if your alignment is off by a few pixels. More recently, in this work done partly at MSR, ambient point clouds are used for unreconstructed areas of the image. The main object of interest is very well reconstructed, as you can see here, and the background can fade in and out. We tried multi-view stereo on our scenario. We know these approaches can give you sparse and noisy data if you don't have enough texture in your input images, if the texture is too random, too stochastic, or if the geometry is too complex. And all of these issues are aggravated if your input images are wide baseline, if the spacing between input images is large and the overlap is quite little. So from the point of view of IBR, this is an input image and a depth map which we got from existing multi-view stereo techniques. The problem is that it is not dense in some regions; for example, the tree is almost not there. There are few depth samples, and occlusion boundaries are simply not there in the depth maps. With this we now know that geometry estimation does not give us dense data everywhere. For such areas we will try to use image-based approximation. We'll try to use image warping, and the two main issues are of course the occlusion boundaries and the depth samples which we're missing. Why image warping? Image warping because it has been shown to work well with a small number of points. If you have a few pixels which can be reprojected into the novel view, you can warp your whole image with them in a least-squares sense. All of this was inspired by the camera stabilization paper of Liu et al., and we improved that to use a shape-preserving warp, as I'll explain afterwards. And it's a more intuitive way of controlling how the input image is converted to a novel view. You can use depth to do that, but the effect of depth is more indirect. You don't know exactly -- you can't understand the artifact you'll get because of noisy depth. But if you design your algorithm on the basis of image warps, it is a more intuitive way of controlling how the image will be projected into the target view. So I'll first go over my first project more briefly. This project was at EGSR a couple of years back, and it really is a proof of concept that image warping indeed is useful. So we start with a bunch of input images. We reconstruct the depth maps for each of the input images using standard off-the-shelf multi-view stereo approaches, and we mark all occlusion boundaries approximately by -- we ask the user to mark the occlusion boundaries. I'll explain why we do this afterwards. So with this data we have depth samples, and as I said, we're going to use the depth samples as constraints for image warping. Each pixel shown here in gray can be projected into the target view. All the white areas cannot be projected because they have no depth. 
At this moment, if you overlay a warp grid on top of this and use each of the gray pixels to reproject the whole mesh, then you can probably handle the empty regions, and that's what the shape-preserving warp is all about. We have the reprojection constraint which comes from the depth samples. We overlay a big sort of [inaudible] over the top of the 2D image. Some pixels, as shown here, have depth. We can reproject these into any target view and we can warp the whole mesh along with them. To hold the whole mesh together we need some type of 2D constraint; that is the shape-preserving constraint, which enforces that each triangle of the warp mesh is only allowed to undergo a translation, an image-plane rotation and an isotropic scale. The end effect is that each triangle does not undergo arbitrary deformations, and that preserves the local shape of the warp mesh. But this, as you would notice, is a smooth image warp, and this will be a problem at occlusion boundaries, because the background will move at a different speed compared to the foreground. And this indeed is a problem, as you can see in this not well done illustration. If you have some points in the foreground in green and some in the background in red and you move the novel camera in one direction, the points will try to fold over each other, which is the case of occlusion, when the foreground occludes the background. The same thing happens if you move the camera in the other direction: in this case, the warp mesh will stretch, because the background will move in a different direction than the foreground. And this will lead to artifacts as shown here. This is an input image; moving the camera in opposite directions will lead to these kinds of grid distortions. To handle this we embed a discontinuity in the warp mesh. This is done in the form of an elastic band: the occlusion boundaries are marked approximately by hand, so we have these very nice edges around each foreground object. We insert a narrow band of triangles around the foreground objects, and inside these triangles we do not apply the shape-preserving constraint. This makes sure that this area is allowed to deform arbitrarily, but the rest of the mesh does have the shape-preserving constraint and behaves the way it should. The end effect is that the part above this elastic band shown here does not affect the part below, and these two parts can move in opposite directions or they can even fold over. This constraint was the main contribution of the paper. It showed us that you can use even smooth image warps to produce a discontinuous sort of effect. At the same time, you want to maintain the shape of the foreground object. Imagine an object which has a right angle, a horizontal and a vertical edge, and you want to make sure that this right angle stays the same while you're warping your image to the novel view. To enforce that we constrain the angle between any two neighboring edges of the foreground object. With these four constraints we get the full image warp energy function and we solve it in a least-squares sense. The effect of all of this is that the earlier distortions which you saw are now contained inside the elastic band which we marked earlier. So the red region shown here was in the beginning just one pixel wide, but warping the image to different places allows all the distortion to be absorbed in that narrow band, and since we know where the distortion is, we have other images to fill content into these areas. So now I'll go over the rendering pipeline. 
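To make the energy just described concrete, here is a minimal sketch, under my own assumptions about the discretization, of a least-squares warp with two terms: a reprojection term that pins mesh vertices carrying depth samples to their reprojected positions, and a per-triangle similarity (shape-preserving) term in the style of Liu et al.'s content-preserving warps. The elastic band would simply skip the similarity term for its triangles.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

R90 = np.array([[0.0, 1.0], [-1.0, 0.0]])  # 90-degree rotation in the image plane

def solve_shape_preserving_warp(rest_verts, triangles, anchors, targets, w_shape=1.0):
    """Least-squares image warp with reprojection and similarity terms (illustrative).

    rest_verts : (N, 2) warp-mesh vertex positions in the input image
    triangles  : list of (i0, i1, i2) vertex indices of mesh triangles
    anchors    : indices of vertices that carry a reprojected depth sample
    targets    : (len(anchors), 2) positions of those vertices in the novel view
    The exact weights and discretization are assumptions, not the paper's code.
    """
    n = len(rest_verts)
    n_rows = 2 * len(anchors) + 2 * len(triangles)
    A = lil_matrix((n_rows, 2 * n))
    b = np.zeros(n_rows)
    r = 0
    # Reprojection (data) term: anchor vertices should land on their reprojected positions.
    for a, tgt in zip(anchors, targets):
        for d in range(2):
            A[r, 2 * a + d] = 1.0
            b[r] = tgt[d]
            r += 1
    # Shape-preserving (similarity) term: each triangle may only translate, rotate
    # and scale isotropically. Skipping this term for the "elastic band" triangles
    # lets the warp become discontinuous at occlusion boundaries.
    for i0, i1, i2 in triangles:
        p0, p1, p2 = rest_verts[i0], rest_verts[i1], rest_verts[i2]
        e = p2 - p1
        d_vec = p0 - p1
        u = np.dot(d_vec, e) / np.dot(e, e)
        v = np.dot(d_vec, R90 @ e) / np.dot(e, e)
        # Residual: V0 - V1 - u (V2 - V1) - v R90 (V2 - V1) should stay near zero.
        coeffs = {i0: np.eye(2),
                  i1: (u - 1.0) * np.eye(2) + v * R90,
                  i2: -u * np.eye(2) - v * R90}
        for d in range(2):
            for idx, M in coeffs.items():
                A[r, 2 * idx] += w_shape * M[d, 0]
                A[r, 2 * idx + 1] += w_shape * M[d, 1]
            r += 1
    sol = lsqr(A.tocsr(), b)[0]
    return sol.reshape(n, 2)
```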
I hope you can see the input cameras. Okay. So the input cameras are all shown in green. For each target view we pick a few input images, warp them and blend them. I won't go into the details of blending. So here are some results. On that side you'll see results of earlier approaches. That's unstructured lumigraph rendering with the best quality model we could obtain, and you will notice that we get ghosting artifacts there because the model was not perfect. And here we don't have the same ghosting artifacts. The results aren't perfect, but we manage to improve significantly on occlusion boundaries. In this case you'll see the whole tree moves at least as one block, and it's not broken up completely. All the novel camera paths here are not quite view interpolation, but almost, very close. They're not really free viewpoint yet, and the reason they're not -- the reason we could not do completely free viewpoint rendering -- is because we used a global image warp, and this led to distortions when you move too far in or too far out. I'll show some comparisons of these distortions afterwards. The major limitations were, first, that it was a global image warp, and this led to distortions which I'll explain later; and second, that we had to ask the user to mark all the edges manually, and this led to a problem. I'm sorry? >>: In addition to marking just manually, do you also have to somehow specify the Z order, so the tree moves in front of the building, or is that automatic? >> Gaurav Chaurasia: No, that's automatic, because you do have a reconstruction. So you can use that to infer the ordering of things. And as you can see, this would be a painful process. If you have to do the same thing for every single image of every single dataset, that makes this approach not very practical. And even though it was a global image warp it was real time, but the system was quite big. If you're trying to solve a warp mesh for the whole image, the system becomes quite big, mostly because we had to embed these elastic bands. It required a conforming Delaunay triangulation and that made it harder; it had numerical issues and other problems. And lastly, as I said, we had distortions. So to address these issues, we did the paper which I'll present at SIGGRAPH in a few days. The main targets were to do something completely automatic, not depend on manual intervention, be real time, and make it easy. For this we'll again use image-based approximations. To get occlusion boundaries we use image over-segmentation: divide the image into lots of superpixels, and these superpixels always capture all the occlusion boundaries. They have lots of other edges too, but we can handle that, because we're using a shape-preserving warp. I'll explain that in a bit. To handle areas with no depth, we add some fake depth, approximate depth, in the empty areas. It's important to note that the depth we add is not truly consistent; it's only useful for doing these approximate image warps. You cannot use this depth to help you in surface reconstruction. The third step is largely the same image warp, but this time it will be a local warp applied to each superpixel individually, and this reduces the size of the warp and makes it much easier to actually solve. It becomes much, much more lightweight. Last is adaptive blending, where we try to alleviate artifacts which arise because of ghosting and popping. So the preprocess is largely the same: we again use multi-view stereo to reconstruct a depth map for each input image. 
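The talk does not spell out how the few input images per target view are chosen, so the following is only a plausible sketch of that selection step: score each input camera by a weighted combination of viewing-direction angle and distance to the target camera, and keep the best k. The scoring rule and weights here are my assumptions.

```python
import numpy as np

def select_input_views(target_pos, target_dir, cam_positions, cam_dirs, k=4,
                       w_angle=1.0, w_dist=0.1):
    """Pick the k input cameras to warp for a given target view (hypothetical rule).

    target_pos, target_dir : target camera center and unit viewing direction
    cam_positions, cam_dirs: (M, 3) arrays of input camera centers and unit directions
    Lower score is better; the combination of angle and distance is an assumption,
    not the exact criterion used in the paper.
    """
    cos_angles = np.clip(cam_dirs @ target_dir, -1.0, 1.0)
    angles = np.arccos(cos_angles)                         # angular deviation
    dists = np.linalg.norm(cam_positions - target_pos, axis=1)
    scores = w_angle * angles + w_dist * dists
    return np.argsort(scores)[:k]                          # indices of chosen views
```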
The second step has changed: now, instead of asking the user to mark the boundaries, we use image over-segmentation, and from this point onward we upgrade to superpixels as the basic building block. The next step is to add depth into empty regions. Here's a depth map and a superpixel segmentation. If you overlay these two things -- I hope you can see it over there -- you have lots of areas which have lots of depth. That's all grand, that's all fine. There are also superpixels that have no depth, which appear in white; we cannot reproject them in any way into a novel view if you warp superpixels individually. To handle such cases we can find depth from other superpixels which have the same image content and are spatially close. So let's try to mark all of these problematic areas in green and look at any one of them, for example. We have these two ideas to find approximate depth: spatial proximity and same visual content. First let's try to find superpixels that have the same visual content. These could be anywhere. If we rank superpixels by similarity of visual content, we found that ranks one, two, three, four, five were often in very different parts of the image. You cannot completely rely on this; you have to incorporate spatial proximity. The problem is that if you put spatial proximity across the image plane and RGB difference into one weighted distance metric, you'll have to adjust the weights every time, for each image and for each dataset. So to solve this we use a superpixel graph. We have superpixels like this. We consider each superpixel as a node in the graph, and the edge weight between two superpixels is the distance between their RGB histograms. What this basically means is that an edge between two superpixels on the wall will be quite small, because they have the same visual content. An edge between two superpixels on the tree will also be low. But anything between the wall and the tree will be quite high. And histograms give you a nice way of evaluating this metric, because the shapes and sizes of superpixels are quite arbitrary, but their content is supposed to be very homogeneous. So with this you now have a graph of your whole image, and you're trying to find the best neighbors of the target superpixel, shown in red over here. You want to choose a few of these yellow guys to decide where to take depth from. Let's zoom into this area and try to find any path from one of the red guys to one of the yellow guys; the path can go anywhere in the image. And you can imagine the shortest path will be something that jumps over superpixels which always have the same visual content. If you stayed within the tree always, that path cost would be small. If you ever jumped outside the tree, you'd immediately incur a big edge weight over there. So we compute the shortest path for all the yellow guys, rank them by their path cost, and retain those with the smallest cost. And these have really the same visual content as what we started with; they're also close in the image, and they're more or less on the same scene object. Now, it's important to understand that the reason we are choosing these three is not because they're close in a Euclidean sense; they're close on this graph. 
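A minimal sketch of the graph traversal just described, under my own assumptions about the details: nodes are superpixels, edge weights are distances between RGB histograms of adjacent superpixels, and the candidates for a depth-less superpixel are the depth-bearing superpixels with the smallest shortest-path cost. Dijkstra's algorithm with a heap is sufficient; the helper names and the histogram distance are placeholders.

```python
import heapq
from collections import defaultdict

def build_superpixel_graph(adjacency, histograms, hist_dist):
    """adjacency: iterable of (i, j) pairs of adjacent superpixel ids.
    histograms: dict superpixel id -> RGB histogram (any array-like).
    hist_dist: function comparing two histograms, e.g. chi-squared distance."""
    graph = defaultdict(list)
    for i, j in adjacency:
        w = hist_dist(histograms[i], histograms[j])
        graph[i].append((j, w))
        graph[j].append((i, w))
    return graph

def depth_candidates(graph, source, has_depth, k=3):
    """Return the k superpixels with depth that are cheapest to reach from `source`
    on the superpixel graph (shortest path by accumulated histogram distance)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    found = []
    while heap and len(found) < k:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                       # stale heap entry
        if node != source and has_depth(node):
            found.append((node, d))        # a candidate to borrow depth from
        for nbr, w in graph[node]:
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return found
```

Because Dijkstra pops nodes in order of increasing cost, the first k depth-bearing superpixels popped are exactly the cheapest ones on the graph, which is the "close on the graph, not close in a Euclidean sense" behavior described above.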
This algorithm converges to the Euclidean sense if you had depth in neighboring superpixels, but if you did not, it can go and find depth from a relatively far-off place rather than giving the red guy depth from the wall, which is actually much closer. So at this point we have a superpixel where we wanted to add some depth, we found some candidates for it which have depth samples, and we can interpolate these depth samples to add a pixel with depth into the target superpixel. I won't go into the details of this interpolation. And we can add a few depth samples. You could go on and add depth everywhere, but that's not required, because we're about to use a shape-preserving warp, and I'll explain why we need that. We started with something like this, with sparse depth, and we added some of these samples artificially. Now we have something in every superpixel and we're good to go to the next step, which is image warping. So now that you have depth samples in every superpixel, you could triangulate and project each of the vertices to get a target view. This will lead to problems, because we added these approximate samples, and even if we look at the original samples, there can be a bit of noise, a bit of inaccuracy. The effect of this tiny bit of noise can be seen on the balcony part, which was well reconstructed, but the depth samples do carry a little bit of numerical error. And in the top part, you'll see problems because we added approximate depth samples on the tree and the bush. The result of our shape-preserving warp is that it does not leave as many cracks, simply because we regularize the effect of the depth samples. It's a least-squares smoothing, and as I said before, the shape-preserving warp gives a straightforward, intuitive way of controlling the effect of these depth samples. So let's jump into the warp itself. As I said, these are the areas where we actually added the depth samples in the previous step, and that's where we can see the best effect of the shape-preserving warp. So the warp is exactly the same, with the same reprojection constraints, but within each superpixel: the depth samples are projected into the target view and you can warp the superpixel along with them. The shape-preserving term is also the same. We try to make sure that each face of the warp mesh does not deform too much, and with this we can now choose four input images to warp and blend for each target view, which you can see on one side, with the final blended result on the other side. And let's now just take a brief look at the adaptive blending step. The basic weights for blending images in image-based rendering come from unstructured lumigraph rendering, a fairly old approach which still holds good; these weights are based on the orientation of the chosen cameras. The problem is they blend too aggressively, and in such situations you always get ghosting artifacts if you blend too much; this is something that we found in another work which I'll present in a little bit. So to handle this notion of adaptive blending, which means we want to blend only when absolutely necessary, we use this idea of superpixel neighborhoods across images. Let's take these two images, zoomed-in areas of two images, and two superpixels just marked artificially over there. These two superpixels will be considered neighbors if the depth samples of one reproject into the other. This means that there is some sort of correspondence between these two superpixels. 
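A sketch of the neighbor test just described, with assumed data structures: two superpixels from different input images are treated as neighbors if enough of the depth samples of one, reprojected into the other image, land inside the other superpixel. The data layout, threshold, and the weight-boosting rule are illustrative assumptions; `reproject_a_to_b` would be the same kind of camera reprojection shown earlier.

```python
def are_neighbors(sp_a, sp_b, reproject_a_to_b, min_fraction=0.5):
    """Superpixel neighborhood test across two input images (illustrative sketch).

    sp_a.depth_samples : list of (u, v, depth) samples in image A
    sp_b.mask          : set of (u, v) integer pixel coordinates of superpixel B
    reproject_a_to_b   : maps a sample of image A into pixel coordinates of image B
    """
    if not sp_a.depth_samples:
        return False
    hits = 0
    for u, v, depth in sp_a.depth_samples:
        ub, vb = reproject_a_to_b(u, v, depth)
        if (int(round(ub)), int(round(vb))) in sp_b.mask:
            hits += 1
    return hits / len(sp_a.depth_samples) >= min_fraction

def adjust_blend_weights(w_a, w_b, neighbors, boost=10.0):
    """If the two candidates are not neighbors, heavily favor one of them instead
    of blending them evenly (here the currently stronger candidate; the paper's
    exact rule may differ, e.g. favoring the background)."""
    if neighbors:
        return w_a, w_b
    if w_a >= w_b:
        return w_a * boost, w_b
    return w_a, w_b * boost
```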
Superpixels that are neighbors in this sense belong to the same part of the same scene object, and if that is the case, it's probably okay to blend them. If at any target pixel we are trying to blend two superpixels which are not neighbors, and hence do not belong to the same region of the same scene object, it's not a good idea to blend them, because you will end up with ghosting artifacts. In such cases we favor one candidate quite heavily, increase its blending weight artificially, and that reduces the amount of ghosting quite a lot. And lastly, for any small cracks left we just use basic hole filling to get the final result. Now I'll show some free viewpoint results. In these cases you can see that we're actually moving the novel camera, shown in red, quite far from the input images. And in this case really far from the input images: this is taken from across the road, from the sidewalk in between. The images which are warped are shown in blue. And you can see that we do get parallax effects even though we move quite far from the input cameras. That's the main goal of this project. We experimented with our approach on a number of input scenes. You can find more details in the videos and the paper, and I would ask you to look at those. So now we compare to existing approaches. This is a comparison with floating textures, which uses optical flow to alleviate ghosting artifacts. In this case you can see it was not designed for cases where the artifacts are really big, so it does not converge as well as it should have. And in this one we compare with ambient point clouds, which uses this hazy point cloud for unreconstructed areas. Again, this was not designed for cases where your unreconstructed areas are actually on the main scene object, and then it looks like an artifact. It's also important to note that it is a view interpolation technique and doesn't handle free viewpoint navigation. Lastly we compared to our own earlier approach, the one I explained before, and as I said, we get distortions when we move very far away from the input cameras, as you can see in these cases. It's also important to note that the new approach is completely automatic, while the old one required a lot of manual intervention, aside from being a bit slower and heavier for the machine. So the main limitation of our approach is that we do not handle sharp depth gradients within superpixels. We assume that each superpixel does not have a big depth gradient. If you're looking at any surface of the scene at a grazing angle, then any superpixel on that surface will have a sharp depth gradient. If you try to warp this, it will warp more in one direction than in the other, and the isotropic scale of the shape-preserving warp breaks down; that's when we get artifacts. We'd like to solve this issue by redesigning the shape-preserving warp itself, and that's not completely inconceivable. So with this I'm done with one section, which was the main work -- the main two projects -- of the Ph.D. I'll jump into two other projects where I was a co-author. These were to understand perceptual issues in image-based rendering. The first paper was to understand the age-old question of blending versus popping. You'll notice that in both of these papers we use a very simple IBR scenario. We use simple scenes which have just a facade and a balcony in front. The proxy for the scene was just one big plane, as it is in applications like Street View. And lastly, the IBR approach used is probably the oldest one ever: just apply a texture map. 
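Geometrically, projecting a single image onto a plane proxy and viewing it from elsewhere is a plane-induced homography. The studies themselves were rendered with texture mapping; the formulation below is a standard sketch with my own frame and sign conventions, shown only to make the "one image, one plane" setting concrete.

```python
import numpy as np

def plane_homography(K_in, K_out, R_rel, t_rel, n_plane, d_plane):
    """Homography mapping input-image pixels to novel-view pixels when the scene
    proxy is a single plane satisfying n_plane . X = d_plane in the input
    camera's frame, with the novel camera given by X_out = R_rel X_in + t_rel.
    (Conventions here are assumptions for illustration.)
    """
    H = K_out @ (R_rel + np.outer(t_rel, n_plane) / d_plane) @ np.linalg.inv(K_in)
    return H

def warp_pixel(H, u, v):
    """Apply the homography to one pixel, with homogeneous normalization."""
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]
```

Any scene geometry that is not actually on the plane ends up with the perspective distortions studied in the second perceptual paper discussed below.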
And the reason we use this very simplified setting was to make sure we have complete control over the studies; we have no other variables. Obviously this has to be improved and [indiscernible] to arbitrary scenes, but we had to start somewhere. So in this paper we'll try to understand this issue of blending versus popping. That's basically what Street View does: you have panoramas captured at discrete places, and novel views are constructed by choosing contributions from the two images. The scene geometry could be anything, but the proxy is always a flat plane. So in this study we showed people image-based rendering results and a reference video along the same path, and they were asked to rate the level of artifacts in these videos. So, for example, in this case there's no blending, so you'll notice that the edge of the balcony really pops. And the popping covers a big distance because it's a sparse capture: there are few input images, so it jumps over large distances. In the second one the popping is smaller but more frequent, because you have more input images. In the other case, in terms of blending, we again have a sparse capture, where you can notice lots of hazy effects, especially on the banners over there and lastly on this door. And part of the window, if you can see over there, has this big ghost which is really in two different places, simply because of the sparse capture. The dense capture brings the ghosts really close and merges them, as you can see in this case. So we showed all of these one by one to the user, along with the reference view, and asked them to rate all of this. We got a strong guideline that people like seeing a sharp image which pops a little bit, compared to a ghosted image which always has low frequency, which is missing high-frequency details. For example, in this case you are now using image one, and then it pops to image two at some point, and then it stays on image two. People like this more than something like this, which is a little bit hazy: you don't see the big artifact, the big jump, but it's always a little bit blurry. In the next experiment, we tried to understand what could be a compromise between blending and popping. We used what is known as a cross-fade, where you use image one up to some point, fade between image one and two, and then use image two. This turns out to be a good compromise between the other cases; popping and blending are extreme forms of cross-fade. A pop is a cross-fade of length zero, and a blend is a cross-fade over the complete path. So we again showed all of these with the reference views, and we got a strong guideline that cross-fades over long durations are less acceptable than cross-fades over shorter durations. Again for the same reason: people like looking at a crisp image, with a small sort of change, and again a crisp image. But this also shows that people are not good at noticing perspective distortion. In this case you can see it's distorted, it changes, and it's distorted again. People don't seem to mind that, or they don't notice it. That's the reason we did the next project, which was to understand this issue of perspective distortion. Now there's no blending, no popping, just a single input image and a single proxy, and we project this onto the proxy, view it from different places, and we see perspective distortions. This paper will also be at SIGGRAPH this year, in a few days. So, for example, in this case that does not look like a right angle. This is a single image mapped onto a flat plane. 
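The popping / blending / cross-fade trade-off described above can be expressed as a single weighting function of the position along the camera path, where a pop is a cross-fade of length zero and a full blend is a cross-fade over the whole path. A minimal sketch, with my own parameterization:

```python
def crossfade_weight(s, s_switch=0.5, fade_length=0.2):
    """Blending weight of image 2 at normalized path position s in [0, 1].

    s_switch    : where along the path the transition is centered
    fade_length : 0 gives a hard pop, 1 gives a blend over the entire path,
                  intermediate values give the cross-fades tested in the study.
    Returns w2 in [0, 1]; image 1 gets weight 1 - w2.
    """
    if fade_length <= 0.0:
        return 0.0 if s < s_switch else 1.0           # pure pop
    t = (s - (s_switch - fade_length / 2.0)) / fade_length
    return min(1.0, max(0.0, t))                       # linear ramp, clamped
```

The study's guideline maps onto this parameter directly: smaller `fade_length` (short cross-fades, crisp images most of the time) was preferred over large `fade_length` (long, hazy transitions).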
And this other example looks slightly more acceptable. So what's the limit? At what point can we say that the user will object to it, at what point will the user say, yeah, that does not look like a right angle? So the two experiments were to understand this idea of distortions. In experiment one we showed still images of facades with balconies, and we asked the user: what do you think that angle is? The way they answered this was using a hinge device. It took us a long time to figure out how to ask this question, because if you put sliders, people will make mistakes all the time. So this turned out to be a nice way: people would see it and actually replicate the angle in front of them. So we recorded all of this data, and in experiment two we asked a simple question: does that look like a right angle to you? People were asked to rate it from 1 to 5. Both experiments should give us data that matches, and that's exactly what we got. I won't go into how we analyzed all of this data, but the end result is that we now have a model for predicting the level of distortion for any novel view position. For example, in this case on the far right you'll see a copy of the scene with the capture camera and the proxy we used. The two dots over there are these two edges of the occlusion. And you will notice now that the user is trying to sketch an IBR path over there; red means bad distortions and blue means acceptable. And we can predict the rating of this path using the data we already have. This is one of the first ways of predicting how bad the distortion will be without actually running the application itself. So I'll just let it play: again, in the beginning you can see it's quite bad and then it gets acceptable. It's a very long path, and then it gets to be okay. So what's the use of this? We can use this prior knowledge to restrict the navigation zone of users. For example, in this case we only allow the user to translate within the acceptable zone -- you'll get a blink effect where we stop the navigation at that point and don't allow the user to go any further. And we can do the same thing for turning. This allows the application to restrict navigation to acceptable zones. This is something we do in games a lot: we restrict the movement of people within a fair zone. So the conclusions of all of this work: first, we should focus on free viewpoint navigation. We've done view interpolation for a long time; that's something we can handle quite well. It's about time we jump to the next level. The second important idea that we want to push forward is that we are trying to do image-based rendering, which is approximate by definition. So we can make approximations over the data we get from reconstruction to do something more than that. Thirdly, because we are using approximate data such as depth, we should be using some form of regularization, maybe in the form of image warps or something else, rather than just directly reprojecting this data into novel views. Lastly, IBR choices should be based on perceptual analysis and not on what the first author thinks is good looking. This work is actually being used in an immersive space with stereo in a European project. This is work in progress; we are porting everything to the CAVE. In the future this will be the starting application for a European project which will work on using IBR for backdrops of games. 
So all of this code is already there, and the games will work completely out of it. Open questions in image-based rendering: how do we use machine learning, how do we exploit all the information we can get from machine learning? There's a lot of work on semantic labeling of images and we are simply ignoring all of it. Can we use this data to improve results? Secondly, we have all these nice ideas for image-based rendering which have been around for quite a while: the view interpolation work by Zitnick [phonetic], the work on reflections last year, wide-baseline work and so on. Is there an approach that can bring these nice ideas together? I'm not talking about a switch-case statement where you do this in one case and something else in a different case, but is there a nice way to combine the advantages of all of these approaches? Thirdly, IBR has been studied in so many different contexts that it's hard to compare one context with another. Light fields, for example, are at one end, where you have hundreds of images of one object, a very structured capture, and you can do amazing things with this. Street-level IBR has two images for a whole building; there's not much you can do with it. Is there a way to go smoothly from one end of this horizon to the other end? And what about artistic effects? We're trying to make things look good. We can use artistic effects to hide artifacts and give nice, immersive experiences, and some work is already there in the form of ambient point clouds, which use nice effects to haze out the whole background. And every time we write an image-based rendering paper we always say it's real time, it's all very good. But the real question to ask is: does this actually work on a mobile device? If the answer is no, we have room for improving the efficiency of these approaches, to make them very light; that's the place where this can be most useful. And lastly, if all this works, we can use image-based rendering as a nice tool for virtual reality or for games where you have to make the backdrops in very quick time. After this -- so all this was stuff I've been doing -- right now I'm working on a project called Halide, a programming language which separates the schedule from the algorithm in image processing pipelines, so you can optimize these in different ways. The code does not get entangled: you don't have to mix your algorithm into the for loops and vectorization code, so it's clean. It's being designed to make life easier for people who write code. This is in collaboration with a group at MIT. In terms of future research, I would like to exploit multi-view imagery in detail. People do take lots and lots of pictures, but right now we're giving them tools to edit only one picture at a time, which means that if you edit only one picture, you only want to keep one of all the pictures you took. At the same time, we are saying let's do image-based rendering, which is a nice, innovative way of visualizing lots of pictures at the same time. So we have to give people tools to edit whole datasets at the same time, not just one photograph. And the edits we're talking about can be appearance manipulations. These are some pictures I took, and if I want to change how the statue looks, I would like to do it in one image and get it propagated to all the other images automatically. 
This sounds a bit easy for static scenes, but once your scenes become more dynamic, if the scene geometry changes, if I have a friend in the picture and I want to change my edits on my friend, it becomes a harder problem. Interesting nonetheless. And at the same time we can try to change the scene itself. For example, this is where we emptied a metro station in Paris to take pictures in that station. The problem was they didn't like the advertisements; they wanted advertisements of the 1950s. How do we remove advertisements in one image and get that to propagate over all 50 images that we took over there? And this has to respect all the parallax effects and everything. So that's a challenging problem. And what happens if I took very few pictures and I don't have enough data to propagate from one to the other? Can I use pictures which other people took to do the same thing? That's the next level of all of this research. And finally, can we think of these very large photo collections as unstructured light fields? A light field is a very structured capture of a small object with hundreds of images, and you can do amazing things after capture. Can we develop similar functionality for these very large groups of images, can we think of them as unstructured light fields? Lastly, I would like to keep learning. I would like to explore machine learning and vision in more detail, and work on image propagation and manipulation -- this is something I've wanted to do in my Ph.D. itself and I never had enough time to actually work on it properly. And I like being part of groups which do lots of other things; a bit of work in systems and stereo is something which I would love to do, and my last project is sort of an example, where I'm working with a compilers group to learn more and more things. With this I'd like to thank you all for your attention. I'd be happy to answer any questions. [applause]. >>: So all this is right now running in real time? >> Gaurav Chaurasia: Yes, yes, yes. This is completely real time. The preprocessing stages, of course, the whole reconstruction pipeline, that's all offline. But the actual rendering runs at 50 frames per second without any problems, because the least-squares problems are very small, easy to solve. We solve them in parallel and all kinds of stuff. It's all real time. But as I said, the overhead is still there. You still have to store like 15 images to do rendering of a dataset. How do you optimize this, how do you optimize the amount of memory and some of these other things? That's what I mean by efficiency. >>: You did nice work on the analysis of how far I can go from the images we have. Why not -- have you tried to do the opposite and guide the people, when they capture the images, where they should be? >> Gaurav Chaurasia: Yes, this is actually something we've thought about, but we never got strong ideas. There's actually work on this -- there's a paper by MIT where they're trying to capture a light field and the system guides the person where to move the camera around. And that's very cool. Ideally you should be able to extend this into a multi-view environment. In this work we get guidelines on how many images to capture. It doesn't tell us where to capture, but if you see the heat map over there, if you have one image, that gives you this blue zone. If you have another image, that can give you another blue zone. 
And then depending on how many images you want to use, you could make the whole thing completely blue if you use lots and lots of images. That gives us a guideline on how many images to use. This is assuming you do a street capture where there's a car and there's not much you can do; you can't move forward or backward. But with hand-held images it would be good to explore this. >>: I had another question. When you moved from using a single global warp to using a single warp per superpixel, doesn't that mean that the superpixels, being warped independently, can introduce lots of cracks? >> Gaurav Chaurasia: Yes, it does. That's exactly -- >>: [indiscernible]. >> Gaurav Chaurasia: Exactly. You do get this -- so we expected this problem to show up. Thankfully, as a consequence of some accident and some design, this hasn't happened. First of all, because we're using a shape-preserving warp, the superpixels themselves do not deform that badly. The way superpixel A deforms is much the same way B deforms; if they were neighboring in the image at rest, they sort of maintain their shape in the next image. >>: Same depth. >> Gaurav Chaurasia: Yes, but you also get superpixels from other images which don't deform the same way, and those can be used to do some hole filling in the cracks left in between. That's exactly why you use multiple images: to fill these small areas and give the parallax effect as you go along. >>: It's interesting that you don't notice a lot of either popping or ghosting in those cracks, I guess because of what you said, the warp is well behaved enough that -- >> Gaurav Chaurasia: Right. It's consistent enough on neighboring views. >>: Similar warps. >> Gaurav Chaurasia: Yes. >>: The same thing you would expect if they had similar depth. >> Gaurav Chaurasia: If your depth was accurate, then it would always be pristine, but that's what the warp is supposed to alleviate when you don't have exactly the same depth. And some of it comes from the blending heuristics, because in some cases, as I said, if you do notice situations where you're not supposed to blend these two guys, which one do you favor? That's where we say, okay, choose the background. A parallax error in the background is less noticeable than a parallax error in the foreground. It makes all the difference. A very simple idea, but it improves your experience by a fair margin. >> Eyal Ofek: Okay. Very nice work. Thanks. [applause]