>>: Hello. Good afternoon. Welcome to the first afternoon session of the first day. This is going to be the first session where we have our RFP winners present some of their work, and this session is focusing on photographs, a lot of photos, especially laid out in various contexts for geography. The first two talks will be about matching photos with either existing databases of photos or with a 3-D environment. The first talk will be by Kristen, who is currently at the University of Texas at Austin. Even though MIT doesn't appear on their intro slide, a lot of the work was initially conceived while Trevor was still at MIT. I'll let Kristen go ahead.

>> Kristen Grauman: Can you hear me through here? Okay. Good afternoon. I'm really excited to be here to give this talk and also to meet a lot of you and talk about the intersecting areas we share; it seems there are a lot of them. This work was done by myself and two students at UT Austin, Prateek Jain and Brian Kulis. I want you to consider this general problem of wanting to do location recognition with an image-based search. This is the kind of scenario to imagine for why the techniques we're developing will be useful: you take a picture with your mobile device and send it to a system that's going to tell you where you are based on what you're looking at. So what are the technical challenges here? Well, we're going to have to do this indexing on a very large scale database. We'll come in with this image, look at some large repository like the Virtual Earth collection, and try to find out what's relevant so we can return it. There's a complexity issue and a robustness issue, because the person taking this photo could have a completely variable viewpoint relative to the scene they're looking at, and this gets down to the real core challenges in the recognition task itself, trying to deal with this real variability to provide the most robustness. So what I want to talk about today is some of our work on building sublinear time indexing techniques for what I'm calling good image metrics: not necessarily the ones that are easy to define, but the ones that we know are going to be very valuable for comparing images. The basic idea of these sublinear time search methods is that we want to come right in and immediately isolate, without doing any computation besides an initial hash, those examples that are likely to be relevant for the query. I'm going to talk today about two different directions in this project. One is how to do a fast search within such a large set of images when you care about the correspondences between a bunch of local features in the images. Second, I'll talk about how, if you actually learn a metric that you want to apply to do your search, you can still guarantee a fast index on top of that. In terms of robustness, we know of some good and reliable features that are local but will give us robustness even in the face of a lot of the variations you see here, extreme examples where object pose or occlusions are really changing the images we capture. Certainly in the global sense we know these images are so different that we won't see a strong similarity. But if we look at local portions within them, there are going to be some repeated parts, and that's the basic idea of a lot of techniques today that are using local feature representations for recognition purposes. I'm just showing a collection of options you have for these local features.
Some are specifically designed to be invariant to things like scale, rotation, translation. Some will capture shape, appearance, and so on. But in general they give you ways to decompose the entire image into parts that can be useful and that you can describe in invariant ways. So what would be a useful thing to do when you have an image that you're describing with all these local pieces? One good thing to do is to try to see how well each one of those pieces matches up with some piece in another image, because if I have a good one-to-one correspondence between all these local parts, that suggests a good agreement overall between the objects within the images. So that's a point set matching problem: we want to know the least cost matching between two sets of possibly high dimensional points. To compute that optimally, really minimizing the cost between the things you pair up, costs cubic time in the number of points within one of the images. So in this work we showed a linear time approximation for that matching, which means you can have very large sets of these features and still efficiently compute such a correspondence distance measure. I'm going to give an intuition about the general approach behind that approximation, because then I'm going to tell you how to go a step further and not just compare two images under the match, but actually take a large database and search it when what you care about is that correspondence matching.

So here's the core of this idea, which is to quantize your feature space, and I'll show you it's going to be at different resolutions, but to quantize it so you can easily read off the number of possible matches at a given resolution. To do this we're going to look at the intersection between two histograms, and the intersection is the sum of the minimum counts within every bin of the two histograms. So what we have are these point sets, where every point comes from some patch feature. This could be a SIFT descriptor or some other kind of local feature in the image, and we want to get that matching between the two. If I look at some partitioning of the feature space and then I intersect the histograms, so I'm looking at the minimum counts in each bin, that count actually reveals implicitly how many matches are possible at that resolution. So here we have an overlap of three, or an intersection value of three, and implicitly that means there are possibly three matches that could be formed. The pyramid match, which is our approximate algorithm, then does this at multiple scales, and this is scale within whatever feature space you're using. Starting from the fine resolution we'll have some number of intersections. There will be few, because these are very small bins with which we can match. So we'll record how many points are matching based on the intersection, and then start making these bins larger and larger. More and more points are now going to be intersecting, and we'll keep counting them, and in fact we'll subtract off counts that we've seen before. So the basic idea is that by using this multi-resolution partition of the feature space you can efficiently, in time linear in the number of points, read off how many matches are possible as you increase the distance. So it gives us this approximate measure. The optimal matching itself could pair points up slightly differently in terms of the one-to-one correspondence, but the cost we get is approximately what you would have gotten with the optimal measure, and we have bounds for this.
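As an illustrative aside for readers of this transcript, here is a minimal sketch of the intersection-counting idea just described, written for one-dimensional feature values to keep it short; the real pyramid match operates on high-dimensional descriptors, and the function names, number of levels, and weights are assumptions made only for this sketch.

```python
import numpy as np

def intersect_count(X, Y, bin_size, max_val):
    """Histogram intersection: implicitly counts how many matches are possible
    at this bin resolution (sum of per-bin minimum counts)."""
    num_bins = int(np.ceil(max_val / bin_size))
    hx = np.bincount(np.minimum((X / bin_size).astype(int), num_bins - 1), minlength=num_bins)
    hy = np.bincount(np.minimum((Y / bin_size).astype(int), num_bins - 1), minlength=num_bins)
    return np.minimum(hx, hy).sum()

def pyramid_match(X, Y, num_levels=5, max_val=1.0):
    """Approximate correspondence score between two 1-D point sets.

    Bins double in width at each level; only matches that are *new* at a level
    are counted, weighted so that finer-resolution matches contribute more."""
    score, prev = 0.0, 0
    for level in range(num_levels):
        bin_size = max_val * 2 ** (level - num_levels + 1)   # finest bins first
        inter = intersect_count(X, Y, bin_size, max_val)
        score += (1.0 / 2 ** level) * (inter - prev)         # subtract counts already seen
        prev = inter
    return score

# toy usage: two sets of scalar "descriptors" in [0, 1)
rng = np.random.default_rng(0)
A, B = rng.random(40), rng.random(60)
print(pyramid_match(A, B))
```

The key point mirrors the talk: only the new intersections at each coarser level are counted, and the total runs in time linear in the number of points rather than cubic.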
So now the new problem is: if I have this image that I've described with local patches of some kind, and I have a huge database of images described the same way, I don't want to search through every single one, even with my efficient matching process. I want to find, in sublinear time, those examples that have the highest matching similarity. In other words it's the indexing task, but now the metric that we care about is this one-to-one partial correspondence between points. So how can we do this in sublinear time, so that I only touch a few of these examples? The basic idea we have is to take these point sets and map each point set to some vector embedding, where in that embedded space an easier-to-compute distance, the dot product, will give me an approximation of the original matching score. Every input that we have is a list of vectors, those point descriptors, and we index with them all together to find the match. So we're going to develop an embedding function that will take such a list of vectors and map it to a single high dimensional point, where the distance between that single high dimensional point and other such embedded points reflects the correspondence match that we care about.

Once we are able to do this we may be able to leverage techniques such as locality sensitive hashing. This is an approximate nearest neighbor technique, based on the idea that if I have a hash function h that guarantees that the probability that two examples x and y fall into the same hash bin is equal to the similarity measure that I care about, then I can do a guaranteed sublinear time search, with a trade-off between accuracy and the time that I'm willing to spend. So if I have a hash function like this and apply it to all these examples, putting them into this hash table, we can guarantee that things that are similar are likely to fall together in the table, and then when we come in with a new query, we go directly to those examples that are similar to it. Doing that, I only have to search a very small number of points, compare them to the query, and sort them to find the nearest, the approximate nearest, neighbors. So this is a very general technique and very appealing, because we'd like to be able to use this general format to guarantee a good search for any interesting measure. However, it's so far only been defined for metrics like L2, L1, and the dot product, and right now what we're trying to do is set our similarity measure to be that correspondence matching. So the important thing to know is that we need to be able to design a hash function that maintains this kind of guarantee, and that's what we're providing for the correspondence measure.

In order to do this, we look at a property of random hyperplanes. The idea is that if you have two vectors with a small angle between them, and you take some random hyperplane whose orientation you select uniformly at random, it's likely not to fall between them. If you do the same with vectors that are very distant, or that have a low dot product, a low inner product, then it's likely that they'll be split. So specifically, if you have these two vectors v_i and v_j, the red and blue vectors, then the probability that the signs of their dot products with this random vector, which is the dotted line here, are different is equal to this expression, the angle between them divided by pi.
So the probability of having the same sign on that dot product depends on how distant these initial red and blue vectors are. Why is this going to be useful for doing this sublinear time search? This property is going to let you design a hash function where you can independently come in with one of your vectors, compute a dot product with some random vector, and store it according to the sign, so a zero/one bit. And next time you come in with a new vector, you do the same thing, and the probability that they fall together, that they have the same hash bit, depends on what their dot product with one another would have been. And that's this v_i dot v_j. So that defines a hash function that will work for a dot product. But we're not at a dot product yet; we still have that correspondence match that we're trying to handle. So now I'll show you how we can map the correspondence match problem into an inner product, and that's going to let me do that embedding so then I can hash under the matching.

To do this we need this property of intersection. Remember that the kernel I briefly defined is based on the idea of intersecting histograms. If I intersect the histograms, taking the minimum value in each bin, I've got one, zero, three, and we can expand that out in a unary kind of coding. If I say I've got a one here and pad it with some zeros, and this one has three, I've got three ones, pad that, and do that all the way out, that's a vector representing this entire histogram. And the same thing over here: two, three, and zero in the middle. Now, notice that if I then take the inner product of these two unary encodings, I actually get the intersection value itself, which is the sum of those minimums, which is four in this case. So this is what's going to allow us to make an implicit mapping from our point sets into these very large vectors, which we could then match with this dot product hash function.

This is the actual definition of the pyramid match kernel, which is that approximate matching function. It goes over all levels of a multi-level histogram and counts the number of matches based on differences of intersection values. Then I can rewrite this so it's simply a weighted sum of intersection values. That means I can take a point set, map it to that multi-resolution histogram in the feature space, then stack up the items that were in each level of that histogram into a single vector and apply the right weights. On the slide before I had these subtracted weight terms multiplied by an intersection value. Now, if I scale these concatenated histograms by the right weights in each place, I'm at the point where a dot product between these implicit unary encodings is going to give me exactly that original pyramid match kernel value. So we started with point set X, we got a histogram, we stacked it up, and we weighted it in the right way so that now, when I take the inner product between two such encodings, I'm getting the original pyramid match kernel. I'm doing all this so I can go use that hash function that's defined for the inner product to do a guaranteed sublinear time search. So for the embeddings of those two point sets X and Y, the inner product is the pyramid match kernel. Here's our hash function, then. It's parameterized by some random vector r, and given one of our embeddings for a point set, f(X), it will return zero or one depending on the sign of the inner product of our embedding with that random vector. This function is independent of any other input; you just take a single input and map it to a hash bit.
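Again as an illustrative aside, here is a small sketch of that embedding and the sign-based hash bit, using explicit unary encodings (the published method keeps them implicit) and 1-D features; the helper names, the cap on bin counts, and the 1/2^level weights are assumptions made for the sketch.

```python
import numpy as np

def unary(count, max_count):
    """Encode an integer bin count as max_count slots: 'count' ones, then zeros."""
    v = np.zeros(max_count)
    v[:count] = 1.0
    return v

def pyramid_hash_embedding(points, num_levels=5, max_val=1.0, max_count=64):
    """Explicit f(X): weighted, concatenated unary encodings of each level's
    histogram, so f(X).dot(f(Y)) reproduces the pyramid match value
    (assuming no bin ever holds more than max_count points)."""
    pieces = []
    for level in range(num_levels):
        bin_size = max_val * 2 ** (level - num_levels + 1)
        num_bins = int(np.ceil(max_val / bin_size))
        hist = np.bincount(np.minimum((points / bin_size).astype(int), num_bins - 1),
                           minlength=num_bins)
        w_here = 1.0 / 2 ** level                              # this level's weight
        w_next = 1.0 / 2 ** (level + 1) if level + 1 < num_levels else 0.0
        scale = np.sqrt(w_here - w_next)                       # the "subtracted weight terms"
        pieces.append(scale * np.concatenate([unary(c, max_count) for c in hist]))
    return np.concatenate(pieces)

def hash_bit(f_x, r):
    """One random-hyperplane hash bit; similar embeddings tend to agree on it."""
    return int(np.dot(r, f_x) >= 0)

# toy usage
rng = np.random.default_rng(1)
fX = pyramid_hash_embedding(rng.random(40))
fY = pyramid_hash_embedding(rng.random(60))
r = rng.standard_normal(fX.size)                               # one random hyperplane
print(np.dot(fX, fY), hash_bit(fX, r), hash_bit(fY, r))
```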
And now we're guaranteed that the probability that the hash bits are equal for any two sets of points is equal to this expression, which basically is proportional to the pyramid match kernel between the original inputs. So completing that picture, now we're able to go and do sublinear time search with a guarantee, because we've been able to come up with a function for which the probability of collision in the hash table reflects the similarity measure that we wanted. And so we'll hash all these examples into this table, come in with a new query, apply our same hash function, come up with a small set, and get our approximate nearest neighbors.

So I'll show you one result with this part of the method. It's on an object retrieval problem: the goal is to be able to take an image of an object and find out what it is based on what it matches in some collection of images. This result uses a database of 101 different categories that was developed at Caltech. It's like the images you see here; I'm showing examples from three different categories. What we do is extract local features from all the images. We use SIFT descriptors and densely sample the entire image. What we're doing is searching under the matching according to this pyramid match hashing that I just described. So here's a result. We are looking at the error in the classification according to those nearest neighbors that we indexed to, relative to this parameter that we're allowed to tune, which says how much accuracy we're willing to sacrifice for speed. This is the epsilon parameter of locality sensitive hashing: basically, the more we go to the right, we'll search things faster but we'll have more error; the more we go to the left, we'll search more things, it will be slower, but we're guaranteed to have lower error. The error in the straight line is the best one could possibly do. That's with a linear scan, which actually took every input, compared it against every other input in the database, and ranked them, and the classification error would sit here. Whereas when we do the hashing, we search a much smaller fraction of the database and our error increases with that parameter I just described, but essentially we'll be searching maybe two percent or less of the entire database and still getting accuracy close to that linear scan.

>> Question: What's the scale on the axis, the error rate, what is that?

>> Kristen Grauman: This is percent error in the nearest neighbor classification. This is 101-way classification, so chance performance is about one percent. So right here it's at 70 percent error, which is 30 percent accuracy on the Caltech 101, and this is with nearest neighbor search using that correspondence. If we were actually to learn from this kernel, this would be more like 50, 55 percent accuracy.

So I just showed you how we can do this indexing when we care about a set-to-set correspondence. Now I'll talk about some more recent work where we actually want to learn a metric based on some constraints someone has provided and use that learned metric while still guaranteeing fast search. For example, we know that the appearance in an image will not always reveal what's truly relevant or what the semantic relationship across examples is. I could have a very good correspondence here between these points and some here, but actually have a really good correspondence here between these, let's say.
So which of these similarities are actually relevant? That's something that we might get constraints about externally. What if someone says, well, all these in your training examples are relevant to one another, but this one's not? We want to be able to exploit that information and use it to affect the distance function we use to do the indexing. The idea is that we can turn to metric learning techniques that will take those constraints that we're given and remap the feature space, essentially, so that similar examples fall close together under your distance function and dissimilar ones end up more distant, even though what you actually measured might have made them look as similar as these others. There are a number of techniques to do metric learning and kernel learning, and many of them want you to provide these pairwise constraints where you say these two are similar, these two are not, or A is closer to B than to C. I've listed a number of techniques that can be deployed to actually do the learning of the metric. But what we care about now is taking that metric you've learned and doing a fast search on top of it. Well, when you have a specialized distance function, you can't immediately plug into these locality sensitive hashing methods. We know that in high dimensional spaces exact nearest neighbor search techniques break down and cannot provide better than linear scan time in the worst case. And the hash functions that have already been defined, like the inner product one we went through, or others, do not work for these specialized functions. Our goal, though, is to be able to guarantee you the fast search when you provide some parameterization of a learned metric; specifically we'll focus on a Mahalanobis parameterization. The way we accomplish this is to let the parameters of the distance you learned affect how you make the selection of the randomized hash function. So we're going to bias our hash functions according to whatever biases are in the constraints about similarity and dissimilarity that someone gave you ahead of time.

Let's look at this in a visual way. Let's say we have these images, I think of hedgehogs, I don't know, a beaver and a hedgehog, yeah. So I'm a little unsure, because these animals in these images look pretty similar. It would be quite likely that an image-based metric is going to map them close together. It would be useful if someone came and told us these two are similar, and these two are different classes, so these are dissimilar. Now, the generic hash functions that I mentioned before select these hyperplanes uniformly at random. This red circle means that we're equally likely to select any rotation of that hyperplane. Instead, if I have this information, I want to bias that selection. So what we'll do is make a distribution that's going to guide us to be more likely to choose a hyperplane that does not separate those examples that are similar, and other examples that are like them. So we're looking at Mahalanobis distances, parameterized metrics where basically you provide some parameters A, a d-by-d matrix for d-dimensional points, to give a proper scaling on the points; essentially we can learn this metric.
Classically you might put an inverse covariance matrix here, but there are a number of techniques for learning the matrix parameters so that you can better map similar points closer to similar points. This is defined in terms of the distance between vectors x_i and x_j. We can work equivalently with the similarity if we look at the product x_i transpose A x_j. I just want to give a taste of the main idea of how we'll accomplish this biasing of the hash functions. We have this learned metric A, and it's a positive definite matrix. We factorize it, essentially looking at a square root of this matrix, G. Now we'll parameterize the hash function not just by some random vector r but also according to the matrix A, and we're going to put that square root matrix, which we're calling G, into the hash function. So essentially we have that random vector, also those parameters, and x, and you can think of the hash function as now carrying information about those constraints that you learned. Because if I embed or hash my database points using half, the square root, of this metric, then when I come in with another input that also carries the square root of the matrix and we take the dot product, we get the full expression we want, which is x_i transpose A x_j. Now that I've fit it into a dot product, I get to go back to this relationship between the inner product and the angle of the vectors, and we'll have the guarantee that the collision probability is proportional to the similarity under the learned metric. This is the main idea.

The problem is that this A matrix could be extremely high dimensional. If it's manageable, this is exactly what we'll do, and we'll compute it explicitly. But if A is 100,000 or millions of dimensions, we can't manage it explicitly. What we've been able to do is develop an implicit update for the hash functions that allows you to never explicitly represent this A matrix, which likewise means that we never explicitly represent the G matrix. Without having this matrix we'll still be able to accomplish the function of what that G matrix is doing in the explicit case, but we'll do it by looking only at comparisons between our input and some existing samples that were constrained during the metric learning. This equation is basically showing the end result of what we can do implicitly. Over here we have that r transpose G that we had before in the explicit case; this is representing our possibly very high dimensional input in the feature space. Over here we're computing this without ever touching G. We do it in terms of this matrix of coefficients that's indexed over all our c constrained points, and also comparisons between our new input x and existing inputs x_i that are among the constrained points. So the main message here is that we can do this hashing when you provide a parameterization of a learned metric, and we can do it whether or not that parameterization is explicitly representable.
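For readers following along, here is a toy sketch of the explicit case just described, where A is small enough to factor and fold into the hash function; the helper names and the tiny diagonal A are invented for illustration.

```python
import numpy as np

def make_metric_hash(A, num_bits, seed=0):
    """Explicit learned-metric hash: factor A = G^T G and fold G into the hash,
    so the collision probability tracks similarity under x_i^T A x_j."""
    G = np.linalg.cholesky(A).T          # any square root of the positive definite A works
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((num_bits, A.shape[0]))
    def hash_key(x):
        # sign(r^T G x) for each random vector r, packed into a tuple key
        return tuple((R @ (G @ x) >= 0).astype(int))
    return hash_key

# toy usage: a hand-made "learned" metric that stretches the first coordinate
A = np.diag([4.0, 1.0, 1.0])
h = make_metric_hash(A, num_bits=8)
x, y = np.array([1.0, 0.2, 0.1]), np.array([0.9, 0.1, 0.3])
print(h(x), h(y))
```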
So this is the plot from before; now I'll expand it. This was our accuracy using that matching kernel, whether we did a linear scan or we hashed. Now watch how these numbers move. I'm showing you the linear scan, actually the error rate. If we learn from that kernel, so if someone gives us not only examples that we match but also constraints, we improve our kernel to include those constraints, and now our error rate goes from 70 to under 50 percent on the 101-class problem. Similarly, we can provide the hashing, where we have control over error versus speed, and we're actually coming quite close to the linear scan. In fact, this kind of learning on top of kernels is not limited to the matching kernel I described; you can apply this metric learning on top of any existing kernel. We've done it with the pyramid match kernel, but also with another correspondence kernel defined by a group at Berkeley. What we're able to show is that we're improving accuracy on the Caltech 101 relative to all state-of-the-art single kernel methods. This plot shows the number of examples you use per category versus accuracy, and accuracy goes up as you train with more data. These are lots and lots of previous techniques, and here, as we learn on top of this kernel, we improve the accuracy, and similarly for the pyramid match. Right now this is the best result for a single kernel method.

Finally, I wanted to show you some results using our fast indexing on some Photo Tourism data. We got a nice introduction and exposure to this project. There's certainly a very large scale search problem involved, where you have these images with some interesting patches and you want to go into all your previously seen patches and try to make some kind of match so you can do that 3-D reconstruction. What our technique allows you to do is take the descriptors you have for the patches and immediately isolate a small portion of the database that you need to search to find those that are relevant. Here's a result using our technique to learn a function for comparing patches and also to do the fast search on top of that learned metric. What these curves are showing you is recall rate. I want a high recall rate, meaning if I query with a patch of the statue's head at the Trevi Fountain, I get back all the other patches of that head at the Trevi Fountain. Our learned metric does improve the recall rate, which you can see between this initial curve, which is a non-learned metric, and a metric that's learned on top of it. In addition, what I'm showing in these very nearby curves is accuracy with, again, linear scan, which is pink, and with hashing, which is very near it, which is what we want: we don't want to sacrifice too much accuracy, but we want to guarantee that you're going to search a lot less of the data.

What I've overviewed today are some of our techniques for fast indexing. This builds on some initial work on computing matches between local features, but then I showed you how we can do large scale search in sublinear time on top of the matching, and also how you can learn a metric and do the search on top of that. I have some pointers to the relevant papers, but I hope I'll get a chance to talk to many of you if you have questions about this work, and I could hear about your interesting problems where it may be applicable. Thank you for your attention. [APPLAUSE]

>>: Great. We have time for one question. Anybody? Frank.

>> Question: So the LSH work, locality sensitive hashing, normally uses K hashes and L hash tables. Is that also what you do? Or are you using a single hash?

>> Kristen Grauman: We're using a single hash table, but we're using K bits. So we'll draw K random vectors to establish K hash functions and then concatenate them into a single hash key. So far in our experiments we've used a single table, but you could certainly try to improve that by duplicating, just by having more tables.

>> Question: The paper seemed to indicate that at least if you use many more tables performance goes up in terms of [INAUDIBLE].

>> Kristen Grauman: Definitely right. Just like with the epsilon parameter, which is giving you speed versus accuracy, the more searching you're willing to do, the more guarantee you have of hitting what you need to hit when you do the hash.
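To make the answer about K bits and a single table concrete, here is a toy sketch of concatenating K sign bits into one key and re-ranking only the colliding items; the class name, parameters, and random stand-in vectors are assumptions for the example.

```python
import numpy as np
from collections import defaultdict

class SingleTableLSH:
    """One hash table keyed by K concatenated sign bits, as described above.
    More tables (L > 1) would raise recall at the cost of memory and lookups."""

    def __init__(self, dim, num_bits=8, seed=0):
        self.R = np.random.default_rng(seed).standard_normal((num_bits, dim))
        self.table = defaultdict(list)

    def _key(self, v):
        return tuple((self.R @ v >= 0).astype(int))     # K sign bits -> one key

    def add(self, item_id, v):
        self.table[self._key(v)].append((item_id, v))

    def query(self, q, top_k=5):
        # only score items that collide with the query, then sort by similarity
        bucket = self.table[self._key(q)]
        scored = sorted(bucket, key=lambda iv: -float(q @ iv[1]))
        return [item_id for item_id, _ in scored[:top_k]]

# toy usage with random vectors standing in for the pyramid-match embeddings
rng = np.random.default_rng(2)
index = SingleTableLSH(dim=128, num_bits=8)
for i in range(1000):
    index.add(i, rng.standard_normal(128))
print(index.query(rng.standard_normal(128)))
```

Fewer bits means bigger buckets (more candidates to re-rank), more bits means smaller buckets but a higher chance of missing true neighbors, which is the speed-versus-accuracy trade-off discussed in the talk.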
>>: Thank you, Kristen. If anybody has any other questions, please try to catch her during the break. We'll go on to the next talk. This is given by Johannes Kopf; it's work with Oliver Deussen at the University of Konstanz. They're going to talk about what you can do when you match an image with a terrain model.

>> Johannes Kopf: All right. Thank you. Yes, so this is about our cooperation with Virtual Earth, called Deep Photo: combining photographs with digital terrain and building models. I actually worked on this together with a whole bunch of people, that is, Boris and Oliver from my university, Billy Chen from Virtual Earth, Matt and Michael from Microsoft Research, and Dani from the Hebrew University. So when you're taking a photograph of a scene, you're essentially capturing all those rays from the outside world that run into the optical center of the camera and projecting them onto a flat piece of film. You end up with a bunch of pixels where you store the color for each of these rays, but unfortunately you lose the depth. This is kind of a sad thing, because if you had the depth along with your photo, you could do really nice things to your photo. For example, you could remove haze from the photo, since haze is essentially a function of the depth. Since having depth means you know about the geometry in your photo, you could relight the photo, approximating what it would look like with different illumination or from a different position. So applications such as these ones here are what this work is about.

Now, unfortunately, inferring depth from a single photo is a challenging and largely unsolved computer vision problem. That's why many researchers have worked on manual painting interfaces to augment your photo with depth. The classical Tour Into the Picture paper, for example, gives the user a method to put a simple spidery mesh on top of the photo to give a hint of the depth. And, interestingly, even with these simple depth maps, you can sometimes already create a very compelling scene navigation experience. Later on, there have been a couple of other papers that provided more sophisticated means to enhance your photo with depth. There have even been some automatic methods, like the ones by Hoiem et al., which are able, in some scenes, to completely automatically infer depth maps for single photos. But in this work we chose a completely different approach. Rather than using computer vision methods or manual editing interfaces, we turn this into a registration problem. By doing so we make use of two big trends that exist nowadays. The first trend is geo-tagging of photos. There are already a couple of cameras out there, like this one here, that have built-in GPS devices. When you take a photograph with one of these cameras, the photo will be augmented with a tag that stores the location and the orientation of the camera at the point in time you shot the photo. The second trend is the availability of very high resolution and very accurate terrain and building models, such as Virtual Earth, which we used here.
Now, the idea is to use the location of the photo to precisely register the photo to the models. When we do this, we get a very accurate depth map. But it's even more than a depth map, because we have a geometric model, so we can look behind the things that are visible in the photo. We also get things like textures from Virtual Earth, which are based on satellite images. So this work is about a whole bunch of interesting, exciting applications that get enabled by the geo-registration of the photographs. I'm going to show a number of applications ranging from image enhancement to novel view synthesis and information visualization in this talk. Since the emphasis is clearly on the applications here, I will only talk briefly about the registration method, which right now is manual. We use a tool for this developed by the Virtual Earth team. We assume we know the rough location of our photo, and then we ask the user to specify four or more point-to-point correspondences, as you can see here, for example. This is enough to solve a system of equations which finally gives us the position, the pose and the focal length of our photograph. So this is, for example, the result in this case. Now the photo is geo-registered, and this allows us to start on the applications.

I'll talk first about image enhancement. I guess you know the situation when you take, say, outdoor photographs of a well-known landscape or cityscape, but unfortunately on that particular day you had bad weather conditions: you see a lot of haze here, and the lighting is boring, there are no shadows in these photographs. With our system, as I'm going to show you in a couple of seconds, you can take these photographs and enhance them, for example by removing the haze or changing the lighting in the photographs. It's a bit dramatized here. Let's start with haze, which is one of the most common defects in outdoor photography. Haze is due to two effects, which are called outscattering and inscattering, and you can nicely capture them with an analytic model that I'm going to explain. The first effect, outscattering, means that part of the light that gets emitted or reflected from our object gets lost on its way to the camera. This is because some of the photons hit air molecules and aerosols and get scattered out of the line of sight. This reduces the contrast of our scene. We can model this by simply multiplying our original intensity by an attenuation function F, which depends on the distance of the object. The second effect is called inscattering, and this is caused by ambient skylight which gets scattered into the line of sight. This causes distant objects to get brighter. Again, we can simply model this with an airlight coefficient, called A, which we multiply by one minus the attenuation function. It's pretty clear that if we have a good estimate of F and A, we could easily invert this equation to get the dehazed intensity of the photo. One common approach is to assume that we live in a constant atmosphere. In this case we can analytically solve the equation, because the attenuation function reduces to an exponential function with a single parameter. But you see we have some problems with this approach. First of all, the exponential function goes quickly down to zero, and that means that by inverting it, we get some numerical problems and some pixels that blow up in the background, which you see right here.
Another problem is that the assumption of a constant atmosphere is not true, because in the real world we have spatially varying mixtures of aerosols, and you see that no matter how I choose this parameter, there's no setting that works for the whole image at once. Of course, you can do things to try to avoid this problem. For example, you could use regularization terms to fight the noise amplification, or you can try to estimate spatially varying parameters, for example from multiple images. But, luckily, in our scenario there's a much simpler solution. Let's see how this works. Here's, again, our haze model, and now we can rewrite it to put F on the left-hand side. If we ignore A for now, the only unknown we have here is the dehazed intensity that we want to recover. We can't do much here; we don't know this. But the trick in our system is that, instead of the dehazed intensity, we can use the model textures that are associated with the Virtual Earth models to get a good estimate of F. These textures are based on satellite data and aerial images taken from planes, and they have all kinds of errors: there are color misalignments, they partially have low resolution, and there are artifacts such as the shadows in them. But still they're good enough to get a good estimate of F, and the trick to avoid suffering from all these artifacts is to average the values of the hazy image and the model texture image over a large depth range. By doing that, and if we simply set A to one for now, we can recover these haze curves for this image; this is our attenuation function. If we apply these curves to this image I get this result here, and you see this is an almost perfectly dehazed image. We have almost glowing colors here and there's no haze anymore in this image, and it's done completely automatically. Here's again the original image, and the dehazed version. Here's another example of an image taken in Yosemite Valley of Half Dome; if I apply the automatic dehazing method I get this result here, and here again the original image and the dehazed image. Sometimes when you apply this automatic method, we experience a slight color shift in the background, and this is because we haven't set the correct airlight color; we simply set it to one. But it's not a big deal, actually, because in this case we simply offer the user a standard color picking dialog, and it's only a matter of seconds to find the correct setting of the airlight, because you get interactive feedback. This is the final result for this image, and again, this was the original image.

Okay. So another cool thing that you can do with these haze curves is that you can not only use them to remove haze from your photos, you can also use them to add haze to new objects to be put into your photo. Here's a simple interface that I prototyped where I can duplicate a building and move it around. You can see that on the left-hand side I put haze on this object, and you see it's perfectly immersed into the scene; on the right-hand side, where I don't change the colors of the object, you see it totally sticks out.
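To make the haze model concrete for readers, here is a toy sketch of the equation described above, I = F * I0 + A * (1 - F), with F estimated by comparing the hazy photo against a registered model texture averaged over depth ranges; the array names, the synthetic data, and the choice A = 1 are stand-ins, not the actual Deep Photo implementation.

```python
import numpy as np

def estimate_attenuation(hazy, texture, depth, num_bins=32, A=1.0):
    """Estimate the attenuation curve F(depth) by averaging the hazy photo and
    the registered model texture over large depth ranges (robust to texture
    artifacts), with the airlight A provisionally set to one as in the talk."""
    edges = np.linspace(depth.min(), depth.max(), num_bins + 1)
    F = np.ones(num_bins)
    for b in range(num_bins):
        mask = (depth >= edges[b]) & (depth < edges[b + 1])
        if mask.any():
            i_hazy, i_clear = hazy[mask].mean(), texture[mask].mean()
            # solve I = F*I0 + A*(1-F)  =>  F = (I - A) / (I0 - A); clip to avoid blow-ups
            F[b] = np.clip((i_hazy - A) / (i_clear - A + 1e-6), 0.05, 1.0)
    return edges, F

def dehaze(hazy, depth, edges, F, A=1.0):
    """Invert the haze model per pixel using the F estimated for its depth bin."""
    bins = np.clip(np.digitize(depth, edges) - 1, 0, len(F) - 1)
    Fp = F[bins]
    return (hazy - A * (1.0 - Fp)) / Fp

# toy usage with random stand-ins for the photo, the model texture, and the depth map
rng = np.random.default_rng(3)
depth = rng.uniform(10, 5000, (64, 64))
clear = rng.uniform(0.2, 0.8, (64, 64))
hazy = clear * np.exp(-0.0004 * depth) + 1.0 * (1 - np.exp(-0.0004 * depth))
edges, F = estimate_attenuation(hazy, clear, depth)
print(float(np.abs(dehaze(hazy, depth, edges, F) - clear).mean()))   # small residual
```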
All right. So another very important aspect of taking interesting pictures is the lighting, and I have to admit that when I go out and take pictures, most of them look like these ones here. They were taken around noon, so you have pretty dull lighting, no shadows; it just looks boring. In contrast, most professional photographers take their photos in the so-called golden hour, shortly after sunrise or before sunset, and they get these much more dramatic photos. With our system we can kind of fake this effect, or approximate what our photos would look like at a different time of day. For example, here I took this photo, again of Yosemite, and approximated what it would look like at this romantic, well, sunset here. And you can even animate this to get an animation like this one here. I won't go into all the details here; I will just show you a quick overview of the pipeline for this effect. We start with our input photo. The first step is to apply the automatic dehazing method to remove all the haze from the photo. Then we modulate the colors with a light map that we have computed for the new sun position. We change the global color of the photo, change the sky to one that fits the time of day, and finally we add haze back into the photo. And so this is the final result here, and this was the original photo. I'm not necessarily saying that this is a better photo; it's just a different one. But maybe this is the photo that I really wanted to capture on this day. With this system you can get creative and create all kinds of effects. Here's another version of this photo. Here's another example: this is a photo of lower Manhattan, and here's a relighted version that I created for this photo. And here's another one; here I added (inaudible) to the scene, and another one. And, of course, you can also create animations where you move the sun around and circle around and keep going. Here's another photo. I like this one, because it is a perfect example of a totally boring, dull lighting scene, and with our system you can turn it into a sunny day or a more dramatic day. Or here's another animation, a Miami-style waterfront animation. Let me quickly browse through these ones here. One last thing I wanted to say about this: there's a big difference between relighting a photo and simply putting the Virtual Earth models under the new illumination. Here you can see a comparison of the original photo, the relighted version, and the Virtual Earth models with the Virtual Earth texture put under the same illumination. Hopefully you agree with me that this one here actually looks much better than this one with its artifacts.
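As a schematic illustration of the five-step pipeline just listed (dehaze, modulate with a light map for the new sun position, shift the global color, replace the sky, re-add haze), here is a toy sketch; the Lambertian light map, the flat sky color, and the warmth tint are stand-ins invented for the example, not the actual system.

```python
import numpy as np

def relight(photo, normals, sky_mask, sun_dir, F, A=1.0, warmth=(1.1, 1.0, 0.85)):
    """Toy relighting pipeline in the order described in the talk."""
    dehazed = (photo - A * (1.0 - F[..., None])) / F[..., None]        # 1) dehaze
    light = np.clip(normals @ sun_dir, 0.15, 1.0)[..., None]           # 2) crude light map
    lit = dehazed * light
    lit = lit * np.asarray(warmth)                                      # 3) golden-hour tint
    lit[sky_mask] = np.array([0.95, 0.65, 0.45])                        # 4) new sky color
    return F[..., None] * lit + A * (1.0 - F[..., None])                # 5) add haze back

# toy usage with synthetic inputs
h, w = 32, 32
rng = np.random.default_rng(4)
photo = rng.uniform(0.3, 0.7, (h, w, 3))
normals = np.dstack([np.zeros((h, w)), np.zeros((h, w)), np.ones((h, w))])  # flat ground
F = np.full((h, w), 0.8)
sky = np.zeros((h, w), bool); sky[:8] = True
out = relight(photo, normals, sky, np.array([0.3, 0.2, 0.93]), F)
print(out.shape)
```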
Okay. So that was the image enhancement application. Now I want to quickly talk about two other applications. The first one is novel view synthesis, and this is actually pretty much the standard application that people show you when they're talking about photos with depth. The cool thing is that with our system, because we have such accurate depth maps, we get almost perfect results here. But with our system you can do even more interesting things. One thing we can do is use our system to extend the frame of a photograph. For example, you might have this photo here of Yosemite, and with our system we can extend its frame and kind of synthesize, or approximate, what the photo would look like if we had a bigger view. And, of course, the cool thing here is that the stuff is not arbitrary but is based on a real world model that we get from Virtual Earth. What we do here is use the Virtual Earth model to drive a guided synthesis process, where we use the colors from the Virtual Earth texture to steer the process. We do it on a cylindrical domain, which means we can turn our head in every direction, and we also synthesize a depth map for it. This is the data that allows us to represent both the visible and invisible things in our photograph; it means we also have texture for, say, a mountain that is obscured by another mountain in the photo. And so here's a video of the system. Only the center part of this is the original photo; everything else is synthetic. Now I can move around, look in all directions, and I can also change the viewpoint. And you see that whenever the texture here on the ground would get too stretched, I'm automatically fading over to the Virtual Earth textures. The same thing happens when I move the camera up. This kind of allows me to get a better feeling of the geometry I see in our photo. I can get in and look around and move back out.

So let me use the remaining time to quickly talk about the last application, which is actually a whole bunch of applications: information visualization. Because the geo-registration of the photo gives us the exact geo location of every pixel in our photo, we can actually fuse the photo with all kinds of GIS databases we get off the Internet. For example, there are huge databases that have the lat/long coordinates of all kinds of things, famous buildings, mountains and so on. There are databases with street networks, and even some Wikipedia articles are tagged with lat/long coordinates. We can fuse all this information with our photograph, and here's a prototype interface that we built for this. It shows, next to the photo, a map where we highlight the location of the camera, the location we clicked on, and the view frustum of the photo. In another view mode we show the depth profile for a horizontal line; see this line I'm moving around with the mouse? Here's your depth profile. I have to admit this is a bit more useful for landscape photography; it's a bit hard to understand what's going on in this city scene here. Here's another interesting thing: here we overlay the street network over our photo, and we both show it in the top-down view and overlay it on the actual photo. You see that whenever a street is obscured by a building, we render it semi-transparently, and we also highlight the names of the streets under the mouse as we move around. So this is another cool thing that my colleague Boris implemented: all these bars are Wikipedia articles that have associated lat/long coordinates. I can display them in the photo, and I can actually browse the articles that are visible in this photo and see where, for example, the building in this article is in the photo; it's here. Here's another one: these are labels of buildings in our photograph, and as I move the mouse around, it will always show me the ten or so closest labels that are visible in the photograph. All right. So this is the last application, an object picking tool. This is based on the building models that we have from Virtual Earth; it always highlights the full building that is under the mouse position. You can use this, for example, to select a single building and apply further image processing to it.
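To illustrate what fusing GIS data with a geo-registered photo involves, here is a toy sketch of projecting a lat/long point into pixel coordinates given the camera's position, orientation and focal length; the local East-North-Up approximation, the rotation matrix, and all the numbers are assumptions made for the example.

```python
import numpy as np

EARTH_R = 6378137.0  # meters, spherical approximation

def geo_to_enu(lat, lon, alt, lat0, lon0, alt0):
    """Approximate local East-North-Up offset of a lat/lon point from the camera."""
    d_lat, d_lon = np.radians(lat - lat0), np.radians(lon - lon0)
    east = d_lon * EARTH_R * np.cos(np.radians(lat0))
    north = d_lat * EARTH_R
    return np.array([east, north, alt - alt0])

def project(point_enu, R, focal_px, cx, cy):
    """Pinhole projection: world (ENU) -> camera -> pixel.  R maps ENU into a
    camera frame with z pointing forward; returns None if the point is behind."""
    pc = R @ point_enu
    if pc[2] <= 0:
        return None
    return (cx + focal_px * pc[0] / pc[2], cy - focal_px * pc[1] / pc[2])

# toy usage: a landmark a few hundred meters from a camera looking due north
cam = dict(lat0=33.7490, lon0=-84.3880, alt0=300.0)          # hypothetical camera geo-tag
landmark = geo_to_enu(33.7525, -84.3870, 320.0, **cam)
R = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0]], float)       # x=east, y=up, z=north
print(project(landmark, R, focal_px=1500.0, cx=960.0, cy=540.0))
```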
Okay. So let me sum up what I showed you today. I showed you a system for combining photographs with digital terrain models by using geo-registration, and I showed you that this enables a whole bunch of interesting applications. For example, I showed you how to remove haze or add haze to new objects, how to relight the photo, how to expand the field of view or change the viewpoint, and how to fuse the photo with GIS data. Actually, all the demos I showed you here are videos rendered on my machine, and tomorrow in the demo session I'd be happy to show this to you live on my computer.

So let me finish by talking a bit about future work. We believe that the applications I showed you here represent only the tip of the iceberg, and there are many more interesting things that we could do. For example, you could use these 3-D models that we have from Virtual Earth for all kinds of computer vision and imaging tasks: potentially to reduce noise in your photo, to refocus it after capturing it, or to recover under- or overexposed areas of the photo. We also only use a small amount of the available GIS data off the Internet. For example, there are databases that contain different ground materials, so we could potentially label things like water, grass and pavement in our photograph and then in turn use this information to improve other applications, for example tone manipulation. Another important thing is to think about what we can do with multiple images. All the applications I showed you here use a single image, but we believe that by using multiple images all these applications could gain a lot. For example, with image enhancement we could use multiple images to learn the illumination of a scene and then transfer this illumination to another photograph showing the same scene. Or, for example, in the novel view synthesis application, we can think of a Photo Tourism-like setting where we use the technique to provide better transitions between the images. And, finally, with the information visualization application, we could use multiple photographs to transfer user-provided labels between images. One thing that I would like to do in the more immediate future is to improve the registration. Right now it's done manually, but I think it will be possible to do this automatically: you would start from the GPS data and then use a feature-based approach to automatically snap the location of the virtual camera to the right position. Also, right now we're using a rigid registration, which means we sometimes get slight misalignments between our photograph and the models from Virtual Earth, and we would like to try to snap edges from our model to high gradients in our photograph to provide a better registration. With that I would like to thank you all for listening, and I'm happy to take any questions. Thanks. (Applause)

>> Question: How important is it that you don't have strong cast shadows in the original image? Can you simply remove them if you do, or are you mostly using kind of overcast skies that diminish that?

>> Johannes Kopf: Right now we don't do anything to remove shadows from our photograph. You could use existing techniques to do that, but we didn't do it.

>> Question: Hi, I wonder if you thought about adding dynamic elements that you could get off databases, like waves or wind or traffic?

>> Johannes Kopf: Traffic is one I thought of. I think it would be useful to show traffic patterns in photographs. Maybe waves also, yeah.

>>: Looks like that's it. Let's thank the speaker again. (Applause) The last talk in the session will be by Frank Dellaert from Georgia Tech; he's going to talk about ongoing work on their 4D Cities project.

>> Frank Dellaert: All right.
So I'll be talking about City Capture in general, which is capturing information about cities from large collections of images. This is work primarily with Grant Schindler, who is my student at Georgia Tech, along with a consultant on the project. Let me see if this works. The community's goal is really about capturing reality: capturing from images, aerial images, Flickr, the whole of Flickr, whatever the Microsoft equivalent of Flickr is, what is it, MSN Photo Database, I don't know; integrating all this data into a single consistent geometric model and then interacting with that model. Prime examples are Virtual Earth, of course, and Google Earth, which have really put this community on the map, literally. I'll talk a little bit more about incorporating time into this whole endeavor. I'm going to focus on urban capture, because urban capture is really the area that speaks most to the imagination of people. Many people live in urban environments, they are interested in the history of their cities, and so there are lots of applications in urban environments. That's also why a lot of the effort of Google and Microsoft is actually focused on these urban markets. So there are lots of applications: radically new interfaces like Photosynth, like on the previous slide, urban planning, historic preservation, renovating in a historically relevant way, virtual tours, and lots more.

Just to focus our attention: if you just search for the Atlanta downtown tags on Flickr, these are just the first 1600 images you get, and then the question becomes how can we stitch this into a consistent model. Things like Photo Tourism try to do this automatically, and it's an extremely hard problem. So we've tried a couple of things to at least localize these images in a model. For example, suppose you have a model like Google Earth or Virtual Earth and you have a picture which has some tags associated with it; these are pictures that were taken straight from Flickr and were tagged by users with the names of these buildings. In this image there were three or four tags. Just using those tags, like the G.E. building or the Candler building, if you have a model of the city, you can actually figure out, using visibility reasoning, what possible viewpoints you could be at when somebody tags that image. There are only a couple of places where you can see, for example, the G.E. building and the Candler building at the same time. So you could at least say, well, it must be taken from around this part of the city, or it must be taken from around that part of the city. It's a very rough localization that can at least help you, in the same way as in Kristen's work, to get the nearest neighbors as to where you actually could be. So it's a little bit like a hash function, in the sense that you can use those tags to at least put you in the right location. That's work which doesn't yield a lot of accuracy. This picture had a number of tags, and we simply sampled from the possible locations in the city where the picture could have been taken and then resynthesized the image using a very rough model of the city. If the system works well, the synthesized images should look a little bit like the picture, and you can see that in the top picture it roughly matches and in the bottom picture it roughly matches as well. You can also do things like saying, well, pictures are mostly taken from street level or from the tops of buildings; very rarely are they taken from an airplane, and very rarely are they taken somewhere halfway between building level and street level, because physically it's very hard to take pictures there.
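As an illustration of that tag-based "hash", here is a toy sketch that samples viewpoints on a 2-D grid and keeps only those from which every tagged building is simultaneously visible; the building coordinates, disc-shaped occluders, and grid are all invented for the example.

```python
import numpy as np

def visible(viewpoint, target, obstacles, radius=15.0):
    """Line-of-sight test on a 2-D map: the segment from viewpoint to target
    must not pass through any other building (modelled as a disc)."""
    v, t = np.asarray(viewpoint, float), np.asarray(target, float)
    d = t - v
    for c in obstacles.values():
        c = np.asarray(c, float)
        if np.allclose(c, t):
            continue
        # distance from the disc centre to the segment v -> t
        s = np.clip(np.dot(c - v, d) / np.dot(d, d), 0.0, 1.0)
        if np.linalg.norm(v + s * d - c) < radius:
            return False
    return True

def candidate_viewpoints(tagged, buildings, grid_step=20.0, extent=400.0):
    """Return grid cells from which every tagged building is visible at once."""
    xs = np.arange(0.0, extent, grid_step)
    return [(x, y) for x in xs for y in xs
            if all(visible((x, y), buildings[t], buildings) for t in tagged)]

# toy map: three hypothetical buildings; the photo was tagged with two of them
buildings = {"BuildingA": (150.0, 200.0), "BuildingB": (230.0, 210.0), "BuildingC": (190.0, 120.0)}
print(len(candidate_viewpoints(["BuildingA", "BuildingB"], buildings)))
```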
So when we looked at urban environments, we noticed something else, and we also wanted more accuracy than that. The tag-based localization only gives you a very rough estimate as to where you are; if you want to build 3-D models, you want to get closer to where you actually are. So we tried throwing wide baseline matching methods at this problem. In wide baseline matching you extract a lot of features from the image, you have a large database where you have extracted features from all those images, and you try to index into that database; and as Kristen talked about, you can do this efficiently to at least get the recall, so you can then do a more geometrically consistent match. But wide baseline matching methods do not like repeated features at all, and it turns out that if you take a typical image from a typical urban environment, most of the structure is repeating. So instead of having lots of unique features that you can use for indexing into the database, what you have is the same feature 100 times on this building and the same feature 200 times on that building, and you have to try and match them up. The typical way of dealing with that is in fact to throw them away: saying, oh, this feature occurs a lot in this image, it must be a repeating feature, hence I cannot use it, so I throw it away. Which is, of course, exactly the wrong thing to do if you're looking at this image, because most of it is repeating. So we asked whether we could use the repetition to match and index into that database. We submitted something to CVPR which used that idea: we simply extract the repeating patterns from the image, which you can do; there are several papers that tell you how to extract repeating lattices from images. And if you have a database of textures, just like what is now available in these interactive 3-D environments, you can do the same lattice extraction on your 3-D database, your 3-D textures, and then try to match up the lattice in the query image with the lattices in your 3-D database. Now, if you do that you get a match, but there's still a little bit of a problem. The match is between the lattice here and the lattice here, and you can match a lattice in many different ways, and each of these different matches yields a hypothesized camera location in space. Luckily, if you have two matches, so one building that you match gives you a family of camera locations and another building gives you another family, then you have two grids of possible camera locations, and you can intersect those to find camera locations that are good candidates for the actual location. And that's exactly what Grant did. This is also work in collaboration with Penn State, in fact; Yanxi Liu is an expert on extracting repeated patterns from images, and she helped us a lot on this project. So if you do that, here is a database of buildings that we have; it's a fairly small database. You can see that the blue here is the ground truth position and the red is our estimated position, and now, instead of being simply in the right block, we're at least on the right pavement. Now you can get more accurate than that. Just for illustration purposes, here's a query image with its extracted lattices, and here's a synthesized view from the location that we recovered using that technique.
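As a completely schematic aside, here is a toy sketch of the idea that each matched lattice is ambiguous up to integer shifts of the repeated pattern, giving a whole grid of hypothesized camera positions, and that two matched facades are reconciled by intersecting their hypothesis sets; the step vectors and tolerance are invented for the example.

```python
import numpy as np

def hypothesis_grid(base, step_vec, counts):
    """All camera positions consistent with one lattice match: the alignment is
    only defined up to integer shifts of the repeated pattern, hence a grid."""
    return np.array([base + i * step_vec for i in range(-counts, counts + 1)])

def intersect_hypotheses(grid_a, grid_b, tol=2.0):
    """Keep hypotheses from facade A that lie within tol meters of one from B."""
    keep = []
    for p in grid_a:
        if np.min(np.linalg.norm(grid_b - p, axis=1)) < tol:
            keep.append(p)
    return np.array(keep)

# toy usage: two facades whose ambiguity directions differ, so only one
# hypothesized camera position survives the intersection
a = hypothesis_grid(np.array([100.0, 50.0]), np.array([4.0, 0.0]), counts=10)
b = hypothesis_grid(np.array([100.0, 50.0]), np.array([0.0, 3.0]), counts=10)
print(intersect_hypotheses(a, b))
```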
So this is all 3-D, and this is about trying to use all these images and extract 3-D models from the data. There's a lot of competition in that world; the whole Virtual Earth group at Microsoft is, in a sense, competing with us, and everything's backwards, of course: we in academia are almost running behind industry. A lot of money is funneled into making 3-D reconstruction automatic, and it's inevitable that in academia we're going to be struggling to keep up with that effort. So we also try to focus on an aspect that has not traditionally been looked at in industry, although Blaise hinted at it this morning, which is time. All of what I talked about you can do with 3-D; you can also do it with historical imagery. In the 4D Cities project at Georgia Tech we've been focusing on historical imagery for four to five years now, with help from the National Science Foundation and also some help from Microsoft Research, where the input images are taken over time. So here's a bunch of images. My laser pointer doesn't work anymore. Some of them are 100 years old, so this image is 100 years old, and these images are just recent images. And some unique problems start popping up when you deal with historical images.

Okay. So Grant presented a paper last year where we took a bunch of images and assumed that we could do the 3-D reconstruction well, which is not actually a given, there's still a lot of work there, and then Grant's algorithm flips the images into their correct time order. So here's the idea: you take a bunch of images, you have a 3-D reconstruction, and you try to rearrange the images into the order in which they were taken. This is something that you cannot do exactly, right? Images that are taken 10 minutes apart, like the images in the back here, cannot be distinguished, because no building went up or was destroyed in the time between when the images were taken. So if you use the 3-D structure of the city to order your images, you can only get so far; you can only get a partial order. But you can actually get quite far. Here's an example of this with 20 images. This was presented at CVPR. I'm not going to go into detail as to what all these blue and red dots are; in essence, they are constraints about buildings, saying that a building goes up, exists for a while, and then comes down, and you make use of those constraints of the urban fabric to order your images in time. And so that's work that's very exciting. By the way, for every ordering that we recover here, there is also immediately a second ordering, which you get by flipping the time axis around; in fact, in Grant's algorithm he nowhere uses time as a constraint, so we always have at least two solutions to this problem.
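As a toy illustration of the kind of constraint just described, here is a sketch that scores an ordering of images by whether each building's observations form one contiguous block in time (it goes up, stands for a while, then comes down); the tiny observation matrix and brute-force search are stand-ins for the actual constraint-based algorithm.

```python
import numpy as np
from itertools import permutations

def contiguity_violations(order, observed):
    """observed[i, j] is True if image i sees building j.  Under a correct time
    ordering, each column should be a single contiguous block of True values."""
    obs = observed[np.asarray(order)]                 # reorder the images in time
    violations = 0
    for col in obs.T:
        idx = np.flatnonzero(col)
        if idx.size and (idx[-1] - idx[0] + 1) != idx.size:
            violations += 1                           # a gap means the building "flickers"
    return violations

def best_order_bruteforce(observed):
    """Tiny brute-force search over orderings (fine for a handful of images only);
    the real system reasons over constraints rather than enumerating orders."""
    n = observed.shape[0]
    return min(permutations(range(n)), key=lambda o: contiguity_violations(o, observed))

# toy usage: 4 images, 3 buildings, with the true chronological order scrambled
obs = np.array([[1, 0, 0],
                [1, 1, 0],
                [0, 1, 1],
                [0, 0, 1]], bool)
shuffled = obs[[2, 0, 3, 1]]
print(best_order_bruteforce(shuffled))   # a permutation restoring chronological order (up to reversal)
```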
And Atlanta: somebody asked me, wouldn't it be more interesting to do this with New York or San Francisco as the example city, instead of Atlanta, which I guess most people find not so interesting. But Atlanta is very cool in the sense that if you go back to 1864, there basically was no Atlanta. So if you know your history, Sherman was a Union general. Let me hide everything else. He came and conquered Atlanta from the south and proceeded to burn it down. Hence, Atlanta from a research point of view is interesting, because we can start from scratch and see the whole city pop up from 1864. So here's the oldest picture we have, which is by a photographer who came in with Sherman. And now we can go forward in time. So, again, this whole model is built using structure-from-motion techniques, (inaudible) Photosynth. If we go from one image to the next, we now actually see the historical photos in context; we can actually see where they were taken. So here are some other pictures. In a little bit, or here, you see the refugees fleeing Atlanta. In one of the next ones, this is the old station. More refugees. All right. Take a look at this building here, which is the station, basically, where they kept all the trains. Here are Sherman's troops back in the middle of the city. And then pretty soon they will start destroying things. So this is the same station building, but now destroyed. And you get a vivid sense of where all this was taken. If you see these images out of context, you actually don't know where they were taken, right? And I never had a visceral connection with these images because I never knew where they were taken, but now you can orbit around and know that this image was taken right here. That still doesn't tell you the whole story, but if I turn on the city as it is now, you can see exactly where that image sits in the current, modern fabric of the city. And so with the slider here, I can move through time and add more and more pictures that were taken over time. And then, if I stop this, I can go to any image here and zoom into it, right? >> Question: Hit A. >> Frank Dellaert: And just like Johannes, in fact, we can click on a building, and what happens is it shows all the images in which this building is visible. So, for example, a building that is no longer with us, I have to kind of find -- so Grant is the demo wizard here. So I have to try and find a building that was destroyed. So this one is still there. Let me see, go to an older picture here. And what about this one? Now you can see the range of images is much smaller than the 200 in the full database, and this building existed from here to here, where it was demolished. So this is 1978, and it was built around 1921. So we can go to the 1921 image, and this is -- the red focuses your attention as we're moving between images, and it shows you where the image is. And again, because we have the 3-D model, the red in this image is a nice -- it's not a perfect match, because not everything is in the model, but it's almost a perfect match to the pixels in this image. In a historical context, we know when each building was there and when it was not. Although it turns out that lots of the dates on these images, which we get from the Atlanta History Center, are actually wrong, and our model can now actually tell that the date on this picture must be wrong, because this building wasn't there yet, et cetera.
So it's interesting: we had a meeting with the Atlanta History Center people, and we told them, okay, this building and this building, or this image and this image, are wrongly labeled. So I'm not doing this demo justice. Tomorrow Grant, the demo wizard, will show this and allow you to interact with it. Let me go back here. So now, the part that we've actually been funded for under the RFP, which builds on another cool idea that came out of Microsoft Research labs: these gigapixel panoramas, where you stitch hundreds or thousands of images into a consistent large panorama. That enables a totally different way of interacting with a panorama: instead of simply panning and tilting, we can now go deep, and it's almost adding a different dimension to the images. And unfortunately I am running Mac OS on this laptop at the moment, so I can't show you their HD View. But I can show you that a whole community has actually sprung up around the idea of gigapixel panoramas, and people are motivated to go out and take panoramas and share them with the world. This is a Carnegie Mellon project called GigaPan, where you can actually interact with these -- I should be online. This is not a particularly good panorama, because there's a lot of dynamic structure in the scene, so the mosaicing went really crazy. You can see there are some privacy issues with these things as well. And in fact, the idea that we got funded for under the RFP was to put one of these gigapixel sensors in the city, with the idea of trying to capture the evolving city as it is changing. And the genesis of this idea is that, just like Grant, I always have a camera close to me when I'm driving around in the city. And every time I see something interesting, I try to take pictures so as not to let this moment in time, this sample in space, go away. And I realized it's completely impossible to keep up with the growth of a city like midtown Atlanta, where skyscrapers are constantly going up and coming down, as a matter of fact. Every time I see something new being built in Atlanta and I'm not there with my camera, I can't capture it. So the idea was: let's take 100 of these gigapixel sensors, put them all over the city, and have them consistently capture the city as it is changing. And so we've made small inroads into this, and this is very much a work in progress; we basically only have a solid semester of work in it. So this is going to be more what could be rather than what is at the moment. But in motivating this project to my students, I made an analogy between the gigapixel sensor and an eye, right? The sensor that we bought is just a commercial off-the-shelf pan-tilt camera, which acts as a server: you can put it on the net and we can grab images from it. And the total field of view is actually unlimited, but here's a 56-degree view on a scene where we stuck the camera. And at the highest zoom level, the field of view is one to two degrees, which is exactly, in fact, what your eye does. You have a very large field of view, almost 180 degrees. In fact, a little more than 180 degrees if you take both eyes, if you can believe that. But the fovea, the high-resolution part of your eye, only has a field of view of about two degrees. So you're constantly putting your fovea on different parts of the scene and updating your idea of what the scene is based on what moves, what changes, et cetera.
And we thought that we needed to do this as well, because if you put a gigapixel sensor somewhere in an urban fabric, almost 100 -- almost 98 percent of the scene is going to be static. What's going to change is the weather, the 24-hour light cycle. Shadows are a big problem there, in fact, when you think about it. What we want to do is focus our attention on the things that change, and simply record those. There's also an interesting problem that we have not tackled at all yet, which is that if you capture gigapixel images of the same scene over and over, and you do this over 10 years, you probably won't store them as a sequence of separate images; in the compression and storage of these things, you can probably make use of the fact that you're imaging a single scene, right? So eye movement makes perception comprehensive over the scene and intentionally focuses on the changing parts of the scene. The hardware is a Sony pan-tilt camera; I'm not going to go into this slide. And what we did to try this idea out is we put this camera there. I think we've been continuously recording two gigapixels a day for about a month. And we take a hierarchical swath through the scene. So we do this at multiple resolutions: we take two wide-angle shots, eight intermediate-angle shots, and, I think, about a thousand shots at the highest resolution. And here is a little movie of what one of them looks like, just going all the way across. This is probably level two here, or maybe it's already level three. So probably this is level three, where you are very highly focused into the scene, highly zoomed in. Here's a time lapse of the raw imagery; what you see is that the images are constantly shifting around. With off-the-shelf pan-tilt-zoom cameras, if you tell them to go to this pan, tilt, and zoom, they don't actually do it exactly; they add some noise to the process. So instead of doing the regular gigapixel stitch, where you take all the images and stitch them together, what we want is a stable base into which to register all the images. So here's a little time lapse of just taking the first image as a base and registering the other images into it. Let me show you an animation of what is happening. The first image becomes the base into which the subsequent images are registered, and this is done simply using homography matching. And this is what we continuously do, updating this mosaic. What we have to think about is a recursive gigapixel sensor. In a Bayesian framework, what you want is to constantly update your model. Your model is a gigapixel representation of the scene given the images at all times, which is all your pan-tilt-zoom images at every resolution across the whole of time, and you can simply do it by Bayes' law: the posterior on the gigapixel model is the likelihood of the current measurement times your predictive density on your model. And you realize that there's something bad about this, and the bad thing is the dimensionality: because it's a gigapixel image, the model has a dimension for every pixel, times three for color. So this small equation of Bayes' law turns into something horribly intractable if you try to do anything but the simplest probability model. So that's what we did: we used a model, for now, which simply models each pixel as independent. So we have a mean and we have a variance for each pixel.
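To make the simplest-possible model concrete, here is a minimal Python sketch of a per-pixel independent Gaussian model that is updated recursively and resets its confidence where a registered image disagrees strongly with the prediction. This is my own illustration of the idea, not the actual system; the thresholds, array sizes, and the PixelwiseModel name are assumptions.

```python
# Sketch of the "recursive gigapixel sensor" with independent pixels: a
# running mean and variance per pixel, with a confidence reset wherever a
# new, registered shot disagrees strongly with the model (the waving flag,
# a building going up).
import numpy as np

class PixelwiseModel:
    def __init__(self, shape, init_var=1e4):
        self.mean = np.zeros(shape)          # per-pixel mean of the scene
        self.var = np.full(shape, init_var)  # per-pixel variance (uncertainty)

    def update(self, image, observed_mask, meas_var=25.0, reset_sigma=3.0):
        """Fuse one registered image into the model (only where observed)."""
        m = observed_mask
        # Pixels whose residual is far outside the model: something changed,
        # so reset confidence there instead of averaging the change away.
        residual = np.abs(image - self.mean)
        changed = m & (residual > reset_sigma * np.sqrt(self.var + meas_var))
        self.var[changed] = 1e4
        self.mean[changed] = image[changed]

        # Standard per-pixel Gaussian fusion for the remaining observed pixels.
        fuse = m & ~changed
        k = self.var[fuse] / (self.var[fuse] + meas_var)   # Kalman-style gain
        self.mean[fuse] += k * (image[fuse] - self.mean[fuse])
        self.var[fuse] *= (1.0 - k)

# Usage: each pan-tilt-zoom shot is first registered (e.g. by a homography)
# into the base frame; 'mask' marks the pixels that shot covers.
model = PixelwiseModel((480, 640))
shot = np.random.rand(480, 640) * 255
mask = np.zeros((480, 640), dtype=bool)
mask[100:300, 200:500] = True
model.update(shot, mask)
```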
And as we feed more and more images into the model, you will see that we get more and more confident about some pixels in the scene, because we've seen them many, many times. But there are also areas, like where there is a flag in the image, that are constantly changing. And so, given our current model of the scene, we see that the flag is not fitting the image model, and we reset all the variances, meaning we have zero confidence in those pixels at that time. So where things are dark, we see something that is unexpected; here it is a building going up, in fact. And one of the biggest problems, and I saw some people nodding at the question about shadows: shadows, of course, always change in the scene, and so we get away with it by taking the panorama always at roughly the same time during the day. And then, these are my last two slides, because I think I'm going over time. The idea is, then, if you have an evolving representation, a density over the gigapixel model, then your next measurement should be the measurement that reduces your uncertainty the most. And that question, the argmax, is easily answered in theory: we simply want the measurement that is going to maximize the information in our (inaudible) at the next time step. And this is simply for the next time step; you can imagine larger schemes that do sensor planning over the long term. But if this density is anything but a very simple density, this becomes a very complicated question. So the student, this is Dan Woo, did something simple, which is to grab the image where there is the most black variance, meaning this flag that is constantly changing. So we constantly take images there, and once in a while we take a large wide-angle shot, because that has the highest information value. And here's actually a building going up. Here we're constantly resetting those variances, so the camera spends a lot of time sampling this part of the image and trying to follow the building as it goes up, right? And that's the animation of the large gigapixel fovea, the eye in the sky, which is actually a very, very simple thing we did, but it shows the promise of this. In future plans, we want to make the densities more informative than simply independent pixels, and there's some very interesting work in that area that we're trying to implement. But there's also some interesting research about interacting with these gigapixel images which have independently changing pieces, right? So this gigapixel panorama here has a little piece that is constantly changing. The system could figure that out, because it has registered all these images there, and give you a little player interface in that part of the scene. And because this building is going up in a different part, that player would be limited to just that piece of the scene, and you can independently play these little embedded movies in the gigapixel panorama, whereas the rest of the scene is static and is not considered interesting enough to put interaction on. And then, as I said, another interesting thing is that we only put in one sensor right now, but we'd like to put in at least three to see what we can do with multi-view capture over time. Okay. I'm going to stop and acknowledge especially Kevin, who did the tag localization, Grant, who did the cool demo and the two CVPR papers, and Dan, who implemented all the gigapixel stuff. So I'm going to leave it there. Thank you. (Applause) >> Question: I have one question.
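Here is a toy Python sketch of that next-view heuristic as I understand it from the description: divide the panorama into pan-tilt-zoom cells, point the camera at the cell with the highest accumulated variance, and periodically take a wide-angle shot. The cell grid, the period of wide shots, and the variance map are illustrative assumptions, not details of the actual system.

```python
# Sketch of variance-driven next-view selection for the "fovea" camera.
import numpy as np

def next_view(var_map, grid=(8, 10), step=0, wide_every=20):
    """Return ('wide', None) or ('zoom', (row, col)) for the next shot."""
    if step % wide_every == 0:
        return "wide", None                     # periodic high-information overview
    h, w = var_map.shape
    ch, cw = h // grid[0], w // grid[1]
    scores = np.zeros(grid)
    for r in range(grid[0]):
        for c in range(grid[1]):
            cell = var_map[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            scores[r, c] = cell.sum()           # total uncertainty in this cell
    return "zoom", np.unravel_index(scores.argmax(), scores.shape)

# Usage: var_map would come from the per-pixel model sketched earlier.
var_map = np.random.rand(480, 640)
var_map[200:300, 300:400] += 50.0               # e.g. a construction site or flag
print(next_view(var_map, step=3))
```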
In your big panoramas, you want to factor out things that are periodic with the time of day, I would guess, like shadows and maybe even traffic and what's happening in the sky. So I wonder if you've thought at all about how you could correlate, say, pixel brightness with just the time of day; you might find some periodic correlations. >> Frank Dellaert: Right. Absolutely. I mean, that seems to be the way to attack shadows, right? Because shadows are not perfectly periodic, right? They change with the time of year, so you're not going to totally get away from it. But you could add a time-of-day component, model it on the 24-hour cycle and on sun time, especially if you capture these over many, many years. So "I don't know how to do it" is the answer, but definitely I think that's the right direction to think in when trying to factor that out. >> Great. Thank you very much. So we have a 30-minute break. We'll start the next session at exactly 4:00 p.m. sharp. There should be snacks outside. [End of segment]