>>: Hello. Good afternoon. Welcome to the first afternoon session of the first day. This is going to be the
first session where we have our RFP winners present some of their work.
And this session is focusing on photographs, a lot of photos, especially laid out in various contexts for
geography. The first two talks will talk about matching photos with either existing databases of photos or
with a 3-D environment.
And the first talk will be by Kristen, who is currently at University of Texas at Austin. Even though MIT
doesn't appear on their intro slide, a lot of the work was initially conceived while Trevor was still at MIT.
I'll let Kristen go ahead.
>> Kristen Grauman: Can you hear me through here? Okay. Good afternoon. I'm real excited to be here
to give this talk and also to meet a lot of you and talk about any intersecting areas we share. Seems to be
there's a lot of them.
So this work is done by myself and two students at UT Austin, Prateek Jain and Brian Kulis, together with Trevor Darrell.
And I want you to consider this general problem of wanting to be able to do location recognition with an
image-based search. This can be kind of the scenario to imagine why the techniques we're developing will
be useful.
So you take a picture with your mobile device. You send it to a system that's going to tell you where you
are based on what you're looking at.
So what are the technical challenges here? Well, we're going to have to do this indexing on a very large
scale database. We'll come in with this image, look at some large repository like the Virtual Earth collection
and try to find out what's relevant so we can return it.
There's a complexity issue and robustness issue, because the person taking this photo, presumably, could
have a completely variable viewpoint relative to the scene they're looking at and this gets down to the real
core challenges in the recognition task itself, trying to deal with this real variability to provide the most
robustness.
So what I wanted to talk about today is some of our work on building sublinear time indexing techniques
for what I'm calling good image metrics.
So not necessarily the ones that are easy to define, but the ones that we know are going to be very
valuable for comparing images.
So with these sublinear time search methods, the basic idea is that we want to come right in and immediately
isolate, without doing any computation besides an initial hash, those examples that are likely to be relevant
for the query.
And I'm going to talk today about two different directions in this project. One is how to do a fast search
within such a large set of images when you care about the correspondences between a bunch of local
features in the image.
The second, I'll talk about how then, if you actually learn a metric that you want to apply to do your search,
how you can still guarantee a fast index on top of that.
So in terms of robustness, we know of some good and reliable features that are local but will give us
robustness even in the face of a lot of the variations you see here, extreme examples where object pose or
occlusions are really changing the images we capture.
Certainly in the global sense we know these images are so different we won't see a strong similarity. But if
we look at local portions within them, there's going to be some repeated parts, and that's the basic idea of a
lot of techniques today that are using local feature representations for recognition purposes.
I'm just showing a collection of options you have for these local features. Some are specifically designed to
be invariant to things like scale, rotation, translation. Some will capture shape, appearance, so on. But in
general they give you ways to decompose the entire image into parts that can be useful and that you can
describe in invariant ways.
So what would be a useful thing to do when you have an image that you're describing in all these local
pieces? One good thing to do is to try to see how well each one of those pieces matches up with some
piece in another image. Because if I have a good one-to-one correspondence between all these local
parts, then that suggests a good agreement overall in the objects within images.
So that's a point set matching problem. We want to know the least cost matching between two sets of
possibly high dimensional points. And computing that optimally, where you really minimize the cost
between the things that you pair up, costs cubic time in the number of points within one of the images.
So in this work we showed a linear time approximation for that matching which means you can have very
large sets of these features and still efficiently compute such a correspondence distance measure.
So I'm going to give an intuition about the general approach behind that approximation, because then I'm
going to tell you how to go a step further and not just compare two images under the match, but actually
take a large database and search it when what you care about is that correspondence matching.
So here's the core of this idea, which is to quantize your feature space, at multiple resolutions as I'll show
you, so that you can easily read off the number of possible matches at a given resolution.
So to do this we're going to look at the intersection between two histograms, and the intersection is the sum
of the minimum counts within every bin of two histograms.
So what we have are these point sets where every point comes from some patch feature. So this could be
a SIFT descriptor or some other kind of local point in the image, and we want to get that matching
between the two.
If I look at some partitioning of the feature space and then I intersect the histograms, so I'm looking at the
minimum counts in each, that count actually reveals implicitly how many matches are possible at that
resolution.
So here we have an overlap of three or an intersection value of three, and implicitly that means there's
possibly three matches that could be formed.
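As a minimal sketch of that counting step, with made-up bin counts rather than the slide's, the intersection of two histograms is the sum of the per-bin minimum counts, and that value is how many one-to-one matches are possible at this resolution.

import numpy as np

def histogram_intersection(h1, h2):
    """Sum of the minimum counts in every bin."""
    return int(np.minimum(h1, h2).sum())

# Toy counts: an intersection of three means at most three matches
# can be formed at this bin resolution.
h_x = np.array([2, 1, 1])
h_y = np.array([1, 2, 3])
print(histogram_intersection(h_x, h_y))   # 3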
So the pyramid match which is our approximate algorithm then does this at multiple scales. And this is
scale within whatever feature space you're using. So starting from the fine resolution we'll have some
number of intersections. There will be few because this is a very -- these are very small bins with which we
can match. So we'll record how many points are matching based on the intersection. And then start
making these bins larger and larger. So more and more points are now going to be intersecting.
And we'll keep counting them, and in fact we'll subtract off counts that we've seen before. So the basic
idea is that by using this multi-resolution partition of the feature space you can efficiently, in time linear in
the number of points, read off how many matches are possible as you increase the distance.
So it gives us this approximate measure. The optimal matching itself could be slightly different in terms
of the one-to-one correspondences, but the cost we're getting approximates, with provable bounds, what
you would have gotten with the optimal measure.
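Here is a compact sketch of that multi-scale counting for toy 1-D features; it is not the actual implementation, just the idea of keeping only the new matches at each coarser level and weighting them less as the bins grow.

import numpy as np

def pyramid_match(x, y, num_levels=4):
    """x, y: 1-D arrays of feature values in [0, 1). Returns an approximate match score."""
    score, prev_intersection = 0.0, 0.0
    for level in range(num_levels):
        bins = 2 ** (num_levels - level)          # finest quantization first, then coarser
        hx, _ = np.histogram(x, bins=bins, range=(0.0, 1.0))
        hy, _ = np.histogram(y, bins=bins, range=(0.0, 1.0))
        intersection = np.minimum(hx, hy).sum()
        new_matches = intersection - prev_intersection   # matches not already counted
        score += new_matches / 2 ** level                # finer-level matches weighted more
        prev_intersection = intersection
    return score

rng = np.random.default_rng(0)
x, y = rng.random(20), rng.random(20)
print(pyramid_match(x, y))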
So now the new problem is, if I have this image that I've described with local patches of some kind and I
have a huge database of images described the same way, then I don't want to search through every single
one, even with my efficient matching process.
I want to find in sublinear time those examples that have the highest matching similarity. So in other
words it's the indexing task, but now the metric that we care about is this one-to-one partial correspondence
between points.
So how can we do this in sublinear time, so that I only look at a few of these examples? The basic idea we
have is to take these point sets and map each point set to some vector embedding where, in that
embedded space, an easier-to-compute measure, a dot product, will give me an approximation of the
original matching score.
So every input that we have is a list of vectors, those point descriptors, and we want to index with them all
together to find the match. So we're going to develop an embedding function that will take such a list of
vectors and map it to a single high dimensional point, where the distance between that single high dimensional
point and other such embedded points is the correspondence match that we care about.
Once we are able to do this we may be able to leverage techniques such as locality sensitive hashing.
This is the approximate nearest neighbor technique based on the idea that if I have a hash function H that
guarantees that the probability that two examples X and Y fall into the same hash bin is equal to the
similarity measure that I care about, then I can do a guaranteed sublinear time search, with a tradeoff
between accuracy and the time that I'm willing to spend.
So if I have a hash function like this and apply it to all these examples and put them into this hash table,
we can guarantee that similar things are likely to fall together in the table, and then
when we come in with a new query, we go directly to those examples that are similar to it.
Doing that, then, I only have to search very small number of points and compare them to the query and
sort them to find the nearest, the approximate nearest neighbors.
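As a generic sketch of that retrieval loop, with the concrete hash left abstract: the per-bit hash functions and the crude scalar hash in the usage below are placeholders just to exercise the mechanics, not the real hash for the matching.

from collections import defaultdict

def build_table(database, hash_bits):
    """hash_bits: a list of functions, each mapping an item to 0 or 1."""
    table = defaultdict(list)
    for idx, item in enumerate(database):
        table[tuple(h(item) for h in hash_bits)].append(idx)
    return table

def query_candidates(table, q, hash_bits):
    """Only the items that collide with the query are then compared and ranked exactly."""
    return table.get(tuple(h(q) for h in hash_bits), [])

# Tiny usage with a crude scalar threshold hash.
db = [0.1, 0.12, 0.5, 0.52, 0.9]
bits = [lambda v, t=t: int(v > t) for t in (0.3, 0.6)]
table = build_table(db, bits)
print(query_candidates(table, 0.11, bits))   # indices of the nearby items: [0, 1]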
So this is a very general technique and very appealing because we'd like to be able to use this general
format to guarantee a good search from any interesting measure.
However, it's so far defined for metrics like L2, L1, or the dot product. And right now what we're trying to do is set
our similarity measure to be that correspondence matching.
So the important thing to know is we need to be able to design a hash function that maintains this kind of
guarantee. And that's what we're providing for, the correspondence measure.
So in order to do this, we look at a property of random hyperplanes, and the idea is that if you have two
vectors with a small angle between them, and you take some random hyperplane whose orientation is selected
uniformly at random, it's likely not to fall between them.
If you do the same with vectors that are very distant, or that have a low dot product, a low inner product, then
it's likely that they'll be split. So specifically, if you have these two vectors, v_i and v_j, the red and blue vectors,
then the probability that the signs of their dot products with this random vector, which is the dotted line here,
differ is equal to this expression.
So the probability of having the same sign on that dot product depends on how distant these initial red and
blue vectors are.
So why is this going to be useful for doing this sublinear time search? This property is going to let you
design a hash function where you can independently come in with one of your vectors, compute a dot
product with some random vector, and store it according to the sign, so a zero/one bit.
And the next time you come in with a new vector, you do the same thing, and the probability that they fall
together, or have the same hash bit, depends on what their dot product with one another was going to be.
And that's the v_i dot v_j.
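A small sketch of that hash bit, with an empirical check of the stated collision probability; the two vectors here are made up.

import numpy as np

def hyperplane_bit(x, r):
    """Hash bit: the sign of the dot product with a random direction r."""
    return int(np.dot(r, x) >= 0)

rng = np.random.default_rng(0)
vi = np.array([1.0, 0.2, 0.0])
vj = np.array([0.8, 0.6, 0.1])
theta = np.arccos(np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj)))

trials = 20000
collisions = sum(
    hyperplane_bit(vi, r) == hyperplane_bit(vj, r)
    for r in rng.normal(size=(trials, 3))        # Gaussian draws give uniformly random orientations
)
print(collisions / trials, 1 - theta / np.pi)    # the two values should be close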
So that's defining a hash function that will work for a dot product. But we're not at a dot product yet. We
still have that correspondence match that we're trying to handle. So now I'll show you how we can map the
correspondence match problem into an inner product. And that's going to let me do that embedding so
then I can hash under the matching.
So to do this we need this property of intersection. So remember that the kernel I briefly defined is based upon
the idea of intersecting histograms. Say one histogram has bin counts one, zero, three; we can expand that
out in a unary kind of coding. I've got a one here, and I pad it with some zeros; this one has three, so I've got
three ones, and I pad that; and so on all the way out. That's a vector representing this entire histogram.
And the same thing over here: two, then zero in the middle, then three. Now, notice that if I take the inner
product of these two unary encodings, I actually get the intersection value itself, which is the sum of those
minimums, which is four in this case.
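A toy check of that unary trick, using those counts of one, zero, three against two, zero, three.

import numpy as np

def unary_encode(hist, max_count):
    """Write each bin count as that many ones, padded with zeros to a fixed length."""
    out = []
    for c in hist:
        out.extend([1] * c + [0] * (max_count - c))
    return np.array(out)

h1, h2 = [1, 0, 3], [2, 0, 3]
m = max(max(h1), max(h2))
u1, u2 = unary_encode(h1, m), unary_encode(h2, m)
# Dot product of the unary encodings equals the histogram intersection.
print(int(np.dot(u1, u2)), sum(min(a, b) for a, b in zip(h1, h2)))  # 4, 4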
So this is what's going to allow us to make an implicit mapping from our point sets into these very
large vectors, which we can then match with that dot product hash function. This is the actual definition
of the pyramid match kernel, which is that approximate matching function.
It goes over all levels of a multi-level histogram and counts the number of new matches based on differences
of intersection values.
Then I can rewrite this so it's simply a weighted sum of intersection values.
So that means I can take a point set, map it to that multi-resolution histogram in the feature space, then
stack up the bins from each level of that histogram into a single vector and apply the right weights.
On the slide before I had these subtracted intersection terms, each multiplied by a weight.
Now, if I scale these concatenated histograms by the right weight in each place, I'm at the point where
a dot product between the implicit unary encodings is going to give me exactly that original pyramid
match kernel value.
So we started with point set X, we got a histogram, we stacked it up, and we weighted it in the right way so
that now, when I take the inner product between two such encodings, I'm getting the original pyramid match
kernel.
Now, I'm doing all this so I can go use that hash function that's defined for the inner product to do a
guaranteed sublinear time search. So for the embeddings of those two point sets, X and Y, the inner product is
the pyramid match kernel.
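A sketch of that embedding for the same toy 1-D setup as before: stack the per-level histograms, implicitly unary encoded, and scale them so that a plain dot product reproduces the weighted sum of intersections.

import numpy as np

def level_histograms(x, num_levels=4):
    return [np.histogram(x, bins=2 ** (num_levels - i), range=(0.0, 1.0))[0]
            for i in range(num_levels)]

def level_weights(num_levels=4):
    # Coefficient on the intersection value at each level after the
    # "subtract what was already matched" rewrite: w_i - w_{i+1}, with w_i = 2**-i.
    w = [2.0 ** -i for i in range(num_levels)]
    return [w[i] - (w[i + 1] if i + 1 < num_levels else 0.0) for i in range(num_levels)]

def embed(x, max_count, num_levels=4):
    """Unary-encode each level's histogram, scaled by sqrt of that level's weight."""
    parts = []
    for h, a in zip(level_histograms(x, num_levels), level_weights(num_levels)):
        for c in h:
            parts.extend([np.sqrt(a)] * c + [0.0] * (max_count - c))
    return np.array(parts)

rng = np.random.default_rng(1)
x, y = rng.random(15), rng.random(15)
n = 15  # a bin can hold at most all the points, so pad unary codes to length n
direct = sum(a * np.minimum(hx, hy).sum()
             for a, hx, hy in zip(level_weights(), level_histograms(x), level_histograms(y)))
print(np.dot(embed(x, n), embed(y, n)), direct)  # these agree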
So here's our hash function, then. It's parameterized by some random vector r, and given one of our
embeddings of a point set, F of X, it returns zero or one depending on the sign of the inner product of our
embedding with that random vector.
This function is independent of any other input. You just take a single input and map it to a hash bit. And
now we're guaranteed that the probability that the hash bits are equal for any two sets of points
is equal to this expression, which is basically proportional to the pyramid match kernel between the original
inputs.
So, completing that picture, now we're able to go and do sublinear time search with a guarantee, because
we've been able to come up with a function where the probability of collision in the hash table is equal to
the similarity measure that we wanted.
And so we'll hash all these examples into this table, come in with a new query, apply our same hash
function. Come up with a small set and get our approximate nearest neighbors.
So I'll show you one result with this part of the method. And it's on an object retrieval problem. So the goal
is to be able to take an image of an object, find out what it is based on what it's matching to some collection
of images.
This result uses the Caltech-101 database of 101 different object categories. It's like the
images you see here; I'm showing examples from three different categories.
What we do is extract local features from all the images. We use SIFT descriptors and densely sample the
entire image. What we're doing is searching under the matching, according to this pyramid hashing
that I just described.
So here's a result. We are looking at the error in the classification according to those nearest neighbors
that we indexed to, relative to this parameter that we're allowed to tune, which says how much accuracy
we're willing to sacrifice for speed. This is the epsilon parameter of locality sensitive hashing:
basically, the more we go to the right, the faster we search but the more error we have; the more we go
to the left, the more things we search, so it's slower, but we're guaranteed to have lower error.
The error on the straight line is the best one could possibly do. That's with a linear scan: we actually took
every input, compared it against every other input in the database and ranked them, and the classification
error would sit here.
Whereas, when we do the hashing, we search a much smaller fraction of the database and our error
increases with that parameter I just described, but essentially we're searching maybe two percent or less
of the entire database and still getting accuracy close to that linear scan.
>> Question: What's the scale on the axis? The error rate, what is that?
>> Kristen Grauman: This is percent error in the nearest neighbor classification. This is 101-way
classification, so chance performance is about one percent. Right now it's at 70 percent error, which is 30 percent
accuracy on the Caltech-101.
And this is nearest neighbor search using that correspondence. If we were actually to
learn on top of this kernel, this would be more like 50 to 55 percent accuracy.
So I just showed you how to do this. We can do this indexing when we care about a set-to-set
correspondence. And now I'll talk about some more recent work where we actually want to learn a metric
based on some constraints someone's provided and use that learned metric to still guarantee fast search.
So, for example, we know that the appearance in an image will not always reveal what's truly relevant
or what the semantic relationship across examples is. I could have a very good appearance correspondence
between these points here, but the semantically relevant correspondence might actually be between these
ones, let's say.
So which of these similarities are actually relevant? That's something that we might get constraints
externally about. So what if someone says, well, all these in your training examples, these are all relevant
to one another. But this one's not.
So we want to be able to exploit that information and use it to affect the distance function we use to do the
indexing.
So the idea is that we can turn to metric learning techniques that will take those constraints that we're given and
remap the feature space, essentially, so that similar examples fall close together under your distance
function, and the dissimilar ones are more distant, even despite the fact that what you actually measured
could have made them look similar.
There are a number of techniques to do metric learning and kernel learning, and many of them let you
provide pairwise constraints where you say these two are similar, these two are not, or A is closer
to B than to C.
I've listed a number of techniques that can be deployed to actually do the learning of the metric.
But what we care about now is to take that metric you've learned and do a fast search on top of it. Well,
when you have a specialized distance function, you can't immediately plug into these locality sensitive
hashing methods.
And we know that in high dimensional spaces exact nearest neighbor search techniques break
down and can't provide better than linear scan time in the worst case. And the hash functions that have
already been defined, like the inner product one we went through, do not work for these specialized
functions.
So our goal, though, is to be able to guarantee you the fast search when you provide some
parameterization of a learned metric; specifically, we'll focus on a Mahalanobis parameterization. The way
we accomplish this is to let the parameters of the distance you learned affect how you make the selection
of the randomized hash functions.
So we're going to bias our hash functions according to whatever biases are in the constraints about
similarity and dissimilarity that someone gave you ahead of time.
So let's look at this in a visual way. Let's say we have these images I think of hedgehogs, I don't know,
beaver and hedgehog, yeah. So I'm a little unsure because these animals and these images look pretty
similar. So it would be quite likely that an image-based metric is going to map them close together.
It would be useful if someone came and told us that these two are similar, and that these two are in different
classes, so they are dissimilar. Now, the generic hash functions that I mentioned before select
these hyperplanes uniformly at random; this red circle means that we're equally likely to select any
rotation of that hyperplane. Instead, if I have this information, I want to bias that
selection.
So what we'll do is define a distribution that's going to guide us to be more likely to choose a hyperplane that
does not separate those examples that are similar, and other examples that are like them.
So we're looking at Mahalanobis distances as the parameterized metrics, where basically you provide some
parameters A. This is a d-by-d matrix for d-dimensional points that gives a proper scaling on the points, and
essentially we can learn this metric. Generically you might put an inverse covariance matrix here,
but there are a number of techniques to learn the matrix parameters so you can better map similar
points closer to one another.
This is defined in terms of the distance between vectors x_i and x_j. We can map that equivalently to a
similarity if we look at the product x_i transpose A x_j.
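As a minimal sketch of that parameterization; the matrix A below is just a made-up positive definite example, not a learned one.

import numpy as np

def mahalanobis_distance(xi, xj, A):
    """d_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j)."""
    d = xi - xj
    return float(d @ A @ d)

def mahalanobis_similarity(xi, xj, A):
    """The corresponding similarity, s_A(x_i, x_j) = x_i^T A x_j."""
    return float(xi @ A @ xj)

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # positive definite stand-in for a learned metric
xi, xj = np.array([1.0, 0.0]), np.array([0.8, 0.4])
print(mahalanobis_distance(xi, xj, A), mahalanobis_similarity(xi, xj, A))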
So I just want to give a taste of the main idea of how we accomplish this biasing of the hash functions. We
have this learned metric A, which is a positive definite matrix. We factor it, essentially taking the
square root of this matrix, which we call G. And now we parameterize the hash function not just by some random
vector r but also according to the matrix A, and we put that square root matrix G
into the hash function.
So essentially we have that random vector, those parameters, and x. You can think of the hash function
now as carrying information about those constraints that you learned, because if I embed or
hash my database points using the square root of this matrix, then when I come in with another input
that also carries the square root of the matrix and take the dot product, we get the full expression we want,
which is x_i transpose A x_j.
So now that I have fit it into a dot product, I get to go back to that relationship between the inner product and the
angle of the vectors, and we have the guarantee that the collision probability is proportional to the similarity
under the learned metric.
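A sketch of the explicit version of that construction, using a made-up 2-by-2 metric: the hash bit is the sign of r transpose G x, and the empirical collision rate tracks the angle between G x_i and G x_j, i.e. the similarity under A.

import numpy as np

def metric_hash_bit(x, r, G):
    return int(np.dot(r, G @ x) >= 0)

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])              # stand-in for a learned metric
G = np.linalg.cholesky(A).T             # square root in the sense that G.T @ G == A
xi, xj = np.array([1.0, 0.1]), np.array([0.7, 0.6])

R = np.random.default_rng(0).normal(size=(20000, 2))
collision_rate = np.mean([metric_hash_bit(xi, r, G) == metric_hash_bit(xj, r, G) for r in R])

gi, gj = G @ xi, G @ xj
theta = np.arccos(gi @ gj / (np.linalg.norm(gi) * np.linalg.norm(gj)))
print(collision_rate, 1 - theta / np.pi)   # these should be close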
This is the main idea. The problem is that this A matrix could be extremely high dimensional. If it's
manageable, this is exactly what we'll do, and we'll compute it explicitly. But if A has 100,000 or millions of
dimensions, we can't manage it explicitly.
What we've been able to do is develop an implicit update for the hash functions that allows you to never
explicitly represent this A matrix, which likewise means that we never explicitly represent the G matrix.
Without having this matrix, we're still able to accomplish the function of what that G matrix is doing in the
explicit case, but we do it by looking only at comparisons between our input and some existing samples
that were constrained during the metric learning.
So this equation is basically showing the end result of what we can do implicitly. Over here we have that r
transpose G that we had before in the explicit case, and this is representing our possibly very high
dimensional input in the feature space.
Over here we're computing this without ever touching G. We do it in terms of this matrix of coefficients
that's indexed over all our c constrained points, and also comparisons between our new input x and existing
inputs x_i that are among the constrained points. So the main message here is that we can do this
hashing when you provide a parameterization of a learned metric, and we can do it whether or not that
parameterization is explicitly representable.
So this is the plot from before; now I'll expand it. This was our accuracy using that matching kernel, whether
we did a linear scan or we hashed. And now we see the change in these numbers. So now I'm showing
you the linear scan error rate if we learn on top of that kernel: if someone gives us examples
that we match but also constraints, we improve our kernel to include those constraints, and now our error
rate goes from 70 to under 50 percent on the 101-class problem. Similarly, we can apply the hashing,
where we have control over error versus speed, and we're actually coming quite close to the linear scan.
In fact, this kind of learning on top of kernels is not limited to the matching kernel I described. You
can apply this metric learning on top of any existing kernel. So we've done it with the pyramid match
kernel, but also with another correspondence kernel defined by a group at Berkeley.
What we're able to show is that we're improving accuracy on the Caltech-101 relative to all state-of-the-art
single kernel methods. This shows you the number of examples you use per category versus accuracy,
and accuracy goes up as you train with more data. These are lots and lots of previous techniques. And
here, as we learn on top of this kernel, we improve the accuracy, and similarly for the pyramid match. Right
now this is the best result for a single kernel method.
Finally, I wanted to show you some results using our fast indexing on some Photo Tourism data. We got a
nice introduction and exposure to this project earlier. And certainly there's a very large scale search problem
involved, where you have these images with some interesting patches and you want to go into all your
previously seen patches and try to make some kind of match, so you can do that 3-D reconstruction.
With our technique, what we allow you to do is take the descriptors you have for the patches and
immediately isolate a small portion of the database which you need to search to find those that are relevant.
Here's a result using our technique to learn a function of comparison of patches and also to do the fast
search on top of that learned metric.
What these curves are showing you is recall rate. I want a high recall rate, meaning that given a patch of, say,
the head of the figure at the Trevi Fountain, I retrieve all the other patches of that same point. Our learned
metric does improve the recall rate, which you can see between this initial curve, which is a non-learned
metric, and the metric that's learned on top of it.
In addition, what I'm showing in these very nearby curves is accuracy with, again, the linear scan, which is
pink, and with the hashing, which is very near it, which is what we want. We don't want to sacrifice too
much accuracy, but we want to guarantee that you search a lot less of the data.
What I've overviewed today are some of our techniques for fast indexing. This is based on some initial work
on computing matches between local features. But then I showed you how we can do large
scale search in sublinear time on top of a matching, and also how you can learn a metric and do the search
on top of that.
And I have some pointers to the relevant papers. But I hope I'll get a chance to talk to many of you if you
have questions about this work and I could hear about your interesting problems where it may be
applicable.
Thank you for your attention.
[APPLAUSE]
>>: Great. We have time for one question. Anybody? Frank.
>> Question: So the LSH work, locality sensitive hashing, normally uses K-bit hash keys and L hash tables. Is
that also what you do? Or are you using a single hash?
>> Kristen Grauman: We're using single hash table but we're using K bits. So we'll draw K random vectors
to establish K random functions and then concatenate them to a single hash key.
So far in our experience we used a single table. But you could certainly try to improve that by duplicating
and just by having more tables.
>> Question: The paper seemed to indicate at least if you use many more tables performance goes up in
terms of [INAUDIBLE].
>> Kristen Grauman: Definitely right. Just like with the epsilon parameter, which is giving you speed
versus accuracy, the more searching you're willing to do the more guarantee you have of hitting what you
need to hit when you do the hash.
>>: Thank you, Kristen.
If anybody has any other questions please try to catch her during the break. We'll go on to the next talk.
This is given by Johannes Kopf. It's work with Oliver Deussen at the University of Konstanz.
They're going to talk about what you can do when you match an image with a terrain model.
>> Johannes Kopf: All right. Thank you. Yes, so this is about our cooperation with Virtual Earth, called
Deep Photo, combining photographs with digital terrain and building models. I actually worked on this
together with a whole bunch of people: Boris and Oliver from my university, Billy Chen from
Virtual Earth, Matt and Michael from Microsoft Research, and Dani from the Hebrew University.
So when you're taking a photograph of a scene, you're essentially capturing all those rays from the outside
world that run into the optical center of the camera and projecting them onto a flat piece of film. You end up with a
bunch of pixels where you store a color for each of these rays, but unfortunately you lose the depth. This is
kind of a sad thing, because if you had the depth along with your photo, you could do really nice
things with your photo.
For example, you could remove haze from the photo, since haze is essentially a function of depth.
And since having depth means you know about the geometry in your photo, you can relight the photo,
approximating what it would look like under different illumination or from a different position.
So applications such as these are what this work is about.
Now, unfortunately, inferring depth from a single photo is a challenging and mostly unsolved computer vision
problem. That's why many researchers have worked on manual painting interfaces to augment your
photo with depth.
The classical Tour into the Picture paper, for example, gives the user a method to put a simple spidery
mesh on top of the photo to give a hint of the depth. And, interestingly, even with these simple depth
maps, you can sometimes already create a very compelling navigation experience. And later on, there
have been a couple of other papers that provide more sophisticated means to enhance your photo
with depth.
There have even been some automatic methods, like the ones by Hoiem et al., which are able, in some scenes,
to completely automatically infer depth maps for single photos. But in this work we chose a completely
different approach: rather than using computer vision methods or manual editing interfaces, we turned this
into a registration problem. By doing so we make use of two big trends that exist nowadays.
The first trend is geo-tagging of photos. There are already a couple of cameras out there, like this one
here, that have built-in GPS devices. When you take a photograph with one of these cameras, the photo
is augmented with a tag that stores the location and orientation of the camera at the point in time
you shot the photo.
The second trend is the availability of very high resolution and very accurate terrain and building models, such as
Virtual Earth, which we used here. Now, the idea is to use the location of the photo to precisely register the
photo to these models.
When we do this, we get a very accurate depth map. But it's even more than a depth map, because we
have a full geometric model, so we can look behind the things that are visible in the photo. We also get
things like textures from Virtual Earth, which are based on satellite images.
And so this work is about a whole bunch of interesting, exciting applications that are enabled by the geo-
registration of the photographs. I'm going to show a couple of these applications, ranging from
image enhancement to novel view synthesis and information visualization, in this talk.
Now, since the emphasis is clearly on the applications here, I will only talk briefly about the registration
method, which right now is manual. We use a tool for this developed by the Virtual Earth team. We
assume we know the rough location of our photo, and then we ask the user to specify four or more
point-to-point correspondences, as you can see here, for example.
So this is enough to solve a system of equations which finally gives us the position, the pose and the focal
length of our photograph. This is, for example, the result in this case. Now the photo is geo-registered, and
this immediately opens up the applications.
I'll talk first about image enhancement. So I guess you know the situation where you take photographs
of a well known landscape or cityscape, for example, but unfortunately on that particular day
you had bad weather conditions. You see a lot of haze here and the lighting is boring; there are no
shadows in these photographs. With our system, as I'm going to show you in a couple of seconds, you
can take these photographs and enhance them, for example by removing the haze or changing the lighting
in the photographs.
It's a bit dramatized here. Let's start with haze, which is one of the most common defects in outdoor
photography. Haze is due to two effects, which are called outscattering and inscattering. You can nicely
capture them with an analytic model that I'm going to explain.
The first effect, outscattering, means that part of the light that gets emitted or reflected from our object gets
lost on its way to the camera. This is because some of the photons hit air molecules or aerosols and get
scattered out of the line of sight. This reduces the contrast of our scene.
We can model this by simply multiplying our original intensity with an unknown attenuation function, which
depends on the distance of the object.
The second effect is called inscattering, and this is caused by ambient skylight which gets scattered into
the line of sight. This causes distant objects to get brighter. And, again, we can simply model this by
having an airlight coefficient, called A, which we multiply by one minus the attenuation function. It's pretty clear
that if we have a good estimate of F and A, we can easily invert this equation to get the
dehazed intensity of the photo.
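As a minimal sketch of that model and its inversion, with hypothetical toy values; the constant-atmosphere attenuation below is used only to synthesize a test input.

import numpy as np

def dehaze(observed, f, airlight, f_min=0.05):
    """Invert observed = f * original + airlight * (1 - f).
    observed: HxWx3 image in [0,1]; f: HxW attenuation in (0,1]; airlight: 3-vector."""
    f = np.clip(f, f_min, 1.0)[..., None]          # avoid blowing up very distant pixels
    return np.clip((observed - airlight * (1.0 - f)) / f, 0.0, 1.0)

# Toy usage: synthesize haze with f = exp(-beta * depth), then invert it exactly.
h, w = 4, 6
depth = np.linspace(0.1, 5.0, h * w).reshape(h, w)
f = np.exp(-0.4 * depth)
clean = np.full((h, w, 3), 0.3)
airlight = np.array([0.8, 0.85, 0.9])
hazy = f[..., None] * clean + airlight * (1.0 - f[..., None])
print(np.allclose(dehaze(hazy, f, airlight, f_min=0.0), clean))   # True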
One common approach is to assume that we live in a constant atmosphere. In this case we
can analytically solve this equation, because the attenuation function reduces to an exponential function with
a single parameter.
But you see we have some problems with this approach. First of all, the exponential function goes
quickly down to zero, and that means that by inverting it we get numerical problems and some
pixels blow up in the background, as you see right here.
Another problem is that the constant-atmosphere assumption is not true, because in the real world we
have spatially varying mixtures of aerosols, and you see that no matter how I choose this parameter, there's no
setting that works for the whole image at once.
Of course, you can do things to try to avoid these problems. For example, you could use regularization
terms to fight this blow-up, or you can try to estimate spatially varying parameters, for
example from multiple images.
But, luckily, in our scenario there's a much simpler solution. So let's see how this works.
So here's our haze model again. Now we can rewrite this to put F on the left-hand side. If we ignore
A for now, actually the only unknown we have here is the dehazed intensity that we want to recover. We
can't do much with that; we don't know it. But now the trick in our system is that instead of the
dehazed intensity, we can use the model textures that are associated with the Virtual Earth models to get a
good estimate of F.
So these textures here are based on satellite data and aerial images taken from planes, and there are all
kinds of errors: there are color misalignments, they have low resolution in places, and there are artifacts like
shadows in them. But they're still good enough to get a decent estimate of F, and the trick to
avoid suffering from all these artifacts is to average the values of the hazy image and the model texture
over a large depth range.
By doing that, and if we simply set A to one for now, we can recover these haze curves for this
image; this is our attenuation function. If we apply these curves to this image I get this result here.
And you see this is an almost perfectly dehazed image. We have almost glowing colors here and there's no
haze anymore in this image. And it's done completely automatically.
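A rough sketch of that estimation trick under simplifying assumptions: grayscale values, airlight fixed to one, and synthetic data standing in for the photo and the model texture. The point is only that averaging over many pixels per depth range makes the noisy texture usable.

import numpy as np

def estimate_haze_curve(hazy, texture, depth, num_bins=8, airlight=1.0):
    """hazy, texture, depth: flat arrays of gray values / depths. Returns (bin_edges, f per bin)."""
    edges = np.quantile(depth, np.linspace(0.0, 1.0, num_bins + 1))
    f = np.ones(num_bins)
    for b in range(num_bins):
        mask = (depth >= edges[b]) & (depth <= edges[b + 1])
        if mask.sum() == 0:
            continue
        i_mean, j_mean = hazy[mask].mean(), texture[mask].mean()
        if not np.isclose(j_mean, airlight):
            # Solve i = f*j + airlight*(1-f) for f, using the bin averages.
            f[b] = np.clip((i_mean - airlight) / (j_mean - airlight), 0.0, 1.0)
    return edges, f

# Toy data: synthesize haze from a known attenuation and roughly recover it.
rng = np.random.default_rng(0)
depth = rng.uniform(0.1, 5.0, 5000)
texture = rng.uniform(0.1, 0.6, 5000)            # stands in for the model texture
true_f = np.exp(-0.5 * depth)
hazy = true_f * texture + 1.0 * (1.0 - true_f)
edges, f_hat = estimate_haze_curve(hazy, texture, depth)
print(np.round(f_hat, 2))                         # decreasing with depth, as expected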
Here's again the original image, and the dehazed version. So here's another example, an image of Half Dome
taken in Yosemite Valley; if I apply the automatic dehazing method I get this result here.
And here are the original image and the dehazed image. Sometimes when we apply this automatic method,
we experience a slight color shift in the background, and this is because we haven't set the correct
airlight color; we simply set it to one.
But it's not a big deal, actually, because in this case we simply offer the user a standard color picking
dialogue, and it's only a matter of seconds to find the correct setting of the airlight, because you get
interactive feedback. And this is the final result for this image.
And again, this was the original image. Okay. So another cool thing that you can do with these haze
curves is that you can not only use them to remove haze from your photos, you can also use them to add
haze to new objects that you put into your photo.
And so here's a simple interface that I prototyped where I can duplicate a building and move it around. You
can see that on the left-hand side I put haze on this object, and you see it's perfectly immersed into
the scene; on the right-hand side, where we don't change the color of the object, you see it totally sticks
out.
All right. So another very important aspect of taking interesting pictures is the lighting. And I have to admit,
when I go out and take pictures, most of them look like these ones here. They were taken around noon, so
you have pretty dull lighting, no shadows. It just looks boring.
In contrast, most professional photographers take their photos in the so-called golden hour, shortly after
sunrise or before sunset, and they get these much more dramatic photos. And so with our system we
can kind of fake this effect, or approximate what our photos would look like at a different time of day.
For example, here I took this photo of Yosemite again and approximated what it would look like at
this romantic, well, sunset here. And you can even animate this to get an animation like this one here.
So I won't go into all the details here; I'll just show you a quick overview of the pipeline for this effect.
We start with our input photo. The first step is to apply the automatic dehazing method to remove all
the haze from the photo. Then we modulate the colors with a light map that we have computed for the new
sun position.
We change the global color of the photo, change the sky to one that fits the time of day, and finally we
add haze back into the photo. And so this is the final result, and this was the original photo.
And I'm not necessarily saying that this is a better photo here. It's just a different one. But maybe this is
the photo that I really wanted to capture here on this day. So with this system you can kind of go creative
and create all kinds of effects.
So here's another version of this photo. Here's another example; this is a photo of lower Manhattan,
and here's a relit version that I created for it.
And here's another one; here I added (inaudible) to the scene. And another one. And, of course, you
can also create animations where you move the sun around in a circle and keep going.
Here's another photo. I like this one, because it's a perfect example of a totally boring and dull lighting
scene, and with our system you can turn it into a sunny day or a more dramatic day. Or here's another
animation I created, like a Miami-style waterfront animation.
Let me quickly browse through these ones here. One last thing I wanted to say about this: there's a big
difference between relighting a photo and simply putting the Virtual Earth models under the new illumination.
Here you can see a comparison of the relit version of the original photo and the Virtual Earth models
with the Virtual Earth textures put under the same illumination.
Hopefully you agree with me that this one here looks much better than this one with these artifacts.
Okay. So that was the image enhancement application. Now I want to quickly talk about two other
applications. The first one is novel view synthesis, and this is actually pretty much the standard application
that people show you when they're talking about photos with depth.
And the cool thing is that with our system, because we have such accurate depth maps, we get almost
perfect results here. But with our system you can do even more interesting things.
So one thing we can do is use our system to extend the frame of a photograph. So, for example,
you might have this photo of Yosemite, and with our system we can extend its frame and kind of
synthesize, or approximate, what the photo would look like if we had a bigger field of view. And, of course, the
cool thing here is that the new content is not arbitrary, but is based on a real world model that we get from Virtual
Earth.
So what we do here is use the Virtual Earth model to guide the texture synthesis process, where we
use the colors from the Virtual Earth texture to steer the process. We do it on a cylindrical domain,
which means we can turn our head in every direction.
And we also synthesize a depth mesh. This is the data structure that allows us to represent both the visible
and invisible things in our photograph. It means we also have texture for, say, a mountain that is
obscured by another mountain in the photo.
And so here's a video of the system. Only the center part of this is the original
photo; everything else is synthetic. Now I can move around and look in all directions, and I can also change
the viewpoint. And you see that whenever the texture here on the ground would get too distorted, I'm
automatically fading over to the Virtual Earth textures. The same thing happens when I move the camera
up.
And so this kind of allows me to get a better feeling of the geometry I see in our photo. I can get in and
look around and move back out.
So let me use the remaining time to quickly talk about the last application, which is actually a whole
bunch of applications: information visualization.
Because the geo-registration of the photo gives us the exact geo location of every pixel in our
photo, we can actually fuse the photo with all kinds of GIS databases we get off the Internet.
For example, there are huge databases that have the lat long coordinates of all kinds of things, famous
buildings, mountains and so on. There are databases with street networks, and even some Wikipedia
articles are tagged with the lat long coordinates.
We can fuse all this information with our photograph, and here's a prototype interface that we built for
this. It shows next to the photo a map where we highlight the location of the camera, the location we clicked
on, and the field of view of the photo.
And in another view mode we show the depth profile for a horizontal line here. See this line I'm moving
around with the mouse? Here's your depth profile.
So I have to admit this is a bit more useful for landscape photography. It's a bit kind of hard to understand
what's going on in this city scene here.
So here's another interesting thing. Here we overlay the street network over our photo; we show it both
in a top-down view and overlaid on the actual photo.
And whenever a street is obscured by a building, we render it semi-transparently, and we also
highlight the names of the streets under the mouse as we move around.
So this is another cool thing, which my colleague Boris implemented. All these bars are Wikipedia articles
that have associated lat/long coordinates. So I can display them in the photo, I can actually browse
the articles that are visible in this photo, and I can see where, for example, the building from this article is in
the photo; it's here.
So here's another one. So these are labels of building in our photograph. And as I move the mouse
around, it will always show me the 10 or so closest labels here that are visible in the photograph.
All right. So this is the last application, an object picking tool. This is based on the building
models that we have from Virtual Earth. It always highlights the full building that is under the mouse
position. So you can use this, for example, to select a single building and apply further image processing to
it.
Okay. So let me sum up what I showed you today. I showed you a system for combining photographs with
digital terrain models using geo-registration, and I showed you that this enables a whole bunch of
interesting applications. For example, I showed you how to remove haze or add haze to new objects, how to
relight photos, and how to expand the field of view, change the viewpoint or fuse the photo with GIS data.
Actually, all the demos I showed you here were videos. And so tomorrow in the demo
session I'd be happy to show this to you live on my computer.
So let me finish up by talking a bit about future work that we'll do in the future. So actually we believe that
the applications that I showed to you here represent only kind of the tip of the iceberg. And there are many
more interesting things that we could do.
For example, you could use these 3-D models that we have from Virtual Earth for all kinds of computer
vision and imaging tasks: potentially use them to reduce noise in your photo, to sharpen or refocus it after
capturing it, or to recover under- or overexposed areas of the photo. We also only use a small
amount of the available GIS data off the Internet. For example, there are databases that contain different ground
materials, so we could potentially label different things like water, grass and pavement in our photograph,
and then in turn use this information to improve other applications, for example tone manipulation.
Another important thing to do is to think about what we can do with multiple images. All the applications I
showed you here use a single image, but we believe that by using multiple images all these applications
could actually gain a lot.
For example, with image enhancement we could use multiple images to learn the illumination of a scene
and then transfer this illumination to another photograph showing the same scene.
Or, for example, in the novel view synthesis application, we can think of a photo tourism where we use the
technique to provide for better transitions between the images.
And, finally, with the information visualization application, we could use multiple photographs to transfer user-
provided labels between images.
So one thing that I would like to do in the immediate future is to improve the registration. Right now it's
done manually, but I think it will be possible to do this automatically: you would start from the GPS data
and then use a feature-based approach to automatically snap the location of the virtual camera to the right
position.
Also, right now we're using a rigid registration, which means we sometimes get slight
misalignments between our photograph and the models from Virtual Earth. We would like to try to
snap edges from our model to high gradients in our photograph to provide a better registration.
With that I would like to thank you all for listening and if you have any questions. Thanks.
(Applause)
>> Question: How important is it that you don't have cast, strong cast shadows in the original image, can
you simply remove them if you do, or are you mostly using kind of overcast skies that diminish that?
>> Johannes Kopf: Right now we don't do anything to remove shadows from our photograph. You can use
existing techniques to do that but we didn't do it.
>> Question: Hi, I wonder if you thought about adding dynamic elements that you could get off databases
like waves or wind or traffic?
>> Johannes Kopf: Traffic is one I thought of. I think it would be useful to show traffic patterns in photographs.
Maybe waves also, yeah.
>>: Looks like that's it. Let's thank the speaker again.
(Applause)
>>: The last talk in the session will be by Frank Dellaert from Georgia Tech. He's going to talk about ongoing
work on their 4D Cities project.
>> Frank Dellaert: All right. So I'll actually be talking about city capture in general, which is capturing
information about cities from large collections of images.
This is work primarily with my student Grant Schindler, and with other collaborators on the
project at Georgia Tech. Let me see if this works.
The community's goal is really about capturing reality: capturing from images, aerial images, Flickr, the
whole of Flickr, whatever the Microsoft equivalent is of Flickr, what is it? MSN Photo Database, I don't
know. And integrating all this data into a single consistent geometric model and then interacting with that
model. Prime examples are Virtual Earth, of course, and Google Earth, which have really
put this community on the map, literally that is. I'll talk a little bit more about incorporating time in this whole endeavor.
So I'm going to focus on urban capture, because urban capture really speaks to the
imagination of people. Many people live in urban environments, they are interested in the history of their
cities, and so there are lots of applications in urban environments. That's also why a lot of the effort of
Google and Microsoft is actually focused on these urban markets.
So there are lots of applications: radically new interfaces like Photosynth, like on the previous slide; urban
planning; historic preservation and renovating in a historically accurate way; virtual tours; and lots of other
applications.
Just to focus our attention: if you just search for the Atlanta and downtown tags on Flickr, these are just the
first 1600 images you get, and then the question becomes how can we stitch this into a consistent model.
Things like Photo Tourism try to do this automatically.
And it's an extremely hard problem. So we've tried a couple of things to try and at least localize these
images in a model. For example, suppose you have a model like Google Earth or Virtual Earth and you have a
picture which has some tags associated with it; for example, these are pictures that were taken straight
from Flickr and they were tagged by users with the names of these buildings. So in this image there
were three or four tags. And just using those tags, like the G.E. building or the Cantor building, if you have
a model of the city, you can actually figure out, using visibility reasoning, what possible viewpoints
you could be at when somebody tags that image.
So there's only a couple of places where you can see, for example, the G.E. building and the Cantor
building at the same time. So you could at least say, well, it must be taken from around this part of the city
or it must be taken from around this part of the city.
So it's a very rough localization that can at least help you, in the same way as Kristen described, to get the
nearest neighbors of where you actually could be. So it's a little bit like a hash function, in the sense that you
can use those tags to at least put you in roughly the right location.
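As a toy sketch of that idea, with made-up viewsheds standing in for what the city model would give you: candidate camera locations are just the map cells from which all of a photo's tagged buildings are visible.

viewsheds = {
    # Hypothetical precomputed viewsheds: map cells from which each building is visible.
    "ge_building":     {(3, 4), (3, 5), (4, 5), (7, 2)},
    "cantor_building": {(3, 5), (4, 5), (9, 9)},
    "other_tower":     {(1, 1), (4, 5)},
}

def candidate_viewpoints(tags, viewsheds):
    """Intersect the viewsheds of all tagged buildings."""
    cells = None
    for tag in tags:
        cells = viewsheds[tag] if cells is None else cells & viewsheds[tag]
    return cells or set()

print(candidate_viewpoints(["ge_building", "cantor_building"], viewsheds))  # {(3, 5), (4, 5)}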
So that's work which doesn't yield a lot of accuracy. This picture had a number of tags, and we simply
sampled from the possible locations in the city where this picture could have been taken and
then resynthesized the image using a very rough model of the city.
And if the system works well, then the synthesized images should look a little bit like the picture. And so
you can see that the top picture roughly matches, and the bottom picture roughly matches as well.
You can do things like saying well pictures are mostly taken from street level or taken from top of buildings.
And very rarely are they taken from an airplane and very rarely are they taken somewhere halfway
between a building level and street level, because physically it's very hard to take pictures there.
So when we looked at urban environments, we noticed something else which is we wanted more accuracy
than that. So the previous tag based localization only gives you a very rough estimate as to where you are.
If you want to build 3-D models you want to get closer to where you actually are.
So we tried throwing wide baseline matching methods at this problem. In wide baseline matching,
you extract a lot of features from the image, you have a large database where you extracted those
features from many images, and you try to index into that database. And Kristen talked about how you
can do this efficiently to get at least the recall, so that you can then do a more geometrically consistent
match.
Wide baseline matching methods do not like repeated features at all. It turns out that if you take a typical
image from a typical urban environment, most of the structure is repeating. So instead of having lots
of unique features that you can use for indexing into the database, what you have is the same feature 100
times on this building, the same feature 200 times on that building, and you have to try to match them up.
And the typical way of dealing with that is in fact to throw them away: oh, this feature occurs a lot in this
image, it must be a repeating feature, hence I cannot use it, so I throw it away.
Which is, of course, exactly the wrong thing to do if you're looking at this image, because most of it is
repeating. So we asked: could we use the repetition itself to match and index into that database?
So we submitted something to CVPR which used that idea: we simply extract the repeating patterns
from the image, which you can do; there are several papers that tell you how to extract repeating lattices
from images.
And if you have a database of textures, just like what is now available in these interactive 3-D environments,
you can do the same lattice extraction on your 3-D database textures, and then try to match up
the lattice in the query image with the lattices in your 3-D database.
Now, if you do that, you get a match, but there's still a little bit of a problem. This match is between the
lattice here and the lattice here, and you can match the lattices in many different ways. And each of these
different matches yields a hypothesized camera location in space.
Now, luckily, if you have two matches, so you match one building and get a family of
camera locations, and you match another building and get another family, then you have two grids of possible
camera locations, and you can intersect those to find camera locations that are good candidates for the
actual location.
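A toy sketch of that intersection step, with made-up families of hypothesized camera positions: each matched lattice contributes a whole grid of hypotheses, and the candidates are the positions where the two grids nearly coincide.

import numpy as np

def consistent_candidates(family_a, family_b, tol=1.0):
    """family_a, family_b: (N, 2) arrays of hypothesized camera positions."""
    matches = []
    for p in family_a:
        if np.linalg.norm(family_b - p, axis=1).min() < tol:
            matches.append(p)
    return np.array(matches)

# Hypothesized camera-position grids from two matched building facades.
family_a = np.array([[x, 10.0] for x in range(0, 50, 5)], dtype=float)
family_b = np.array([[20.0, y] for y in range(0, 30, 5)], dtype=float)
print(consistent_candidates(family_a, family_b))   # [[20. 10.]]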
And that's exactly what Grant did. This is also work in collaboration with Penn State, in fact;
Yanxi Liu is an expert on extracting repeated patterns from images, and she helped us a lot on this
project.
So if you do that, here is the database of buildings that we have. It's a fairly small database.
You can see that the blue here is the ground truth position and the red is our estimated position, and now instead
of being simply in the right block, we're at least on the right pavement.
Now you can get it more accurate. Just for illustration purposes, here's a query image which extracted
lattices and here's a synthesized view from that location that we got from that using that technique.
So this is all 3-D, and this is about trying to use all these images and extract 3-D models from the data.
And there's a lot of competition in that world. The whole Virtual Earth group at Microsoft is working
on this, and everything's backwards, of course: we in academia are almost running behind the
industry. There's a lot of money funneled into making 3-D capture automatic, and it's inevitable that in academia
we're going to be struggling to keep up with that effort.
So we also try to focus on an aspect that has not traditionally been looked at in industry, although Blaise
hinted at it this morning, which is time.
So all of what I talked about you can do with 3-D, and you can also do it with historical imagery. In the
4D Cities project at Georgia Tech we've been focusing on historical imagery for four to five years now,
with help from the National Science Foundation and also some help from Microsoft Research, where the input
images are taken over time. So here's a bunch of images. My laser pointer doesn't work anymore.
And some of them are 100 years old. So this image is 100 years old, and these images are just recent
images.
And some unique problems start popping up if you deal with historical images. Okay. So last year Grant
presented a paper where we took a bunch of images and assumed that we could do the 3-D reconstruction
well, which is not actually a given; there's still a lot of work there. And then Grant flips the images into
their correct time order.
So here is the idea: you take a bunch of images, you have a 3-D reconstruction, and you try to rearrange
the images into the order in which they were taken. And this is something that you cannot do exactly,
right? Images that are taken 10 minutes apart, like the images in the back here, cannot be distinguished,
because no building went up or was destroyed in the time between them.
So if you used the 3-D structure of the city to order your images, you can only get so far. You can only get
a partial order. But you can actually get quite far.
Here's an example of this with 20 images. This was presented at CVPR. I'm not going to go into detail
about what all these blue and red dots are. In essence, they are constraints about buildings: a building
goes up, exists for a while, and then comes down. And you make use of those constraints of the urban
fabric to order your images in time.
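The constraint itself is simple enough to state in code. Below is a minimal sketch, not Grant's actual
algorithm, of checking that a candidate ordering respects it: each building goes up, stands for a while,
and comes down, so the images in which it is observed standing must form one contiguous run of the
ordering. The function name and the toy data are hypothetical.

def ordering_is_consistent(order, observed_in):
    # order: candidate list of image ids, earliest to latest
    # observed_in: building id -> set of image ids in which that building is seen standing
    position = {img: t for t, img in enumerate(order)}
    for building, imgs in observed_in.items():
        times = sorted(position[i] for i in imgs)
        # contiguous run <=> span of positions equals the number of observations
        if times and times[-1] - times[0] + 1 != len(times):
            return False
    return True

# Toy example: building "B" is seen in images 1 and 3 but not in 2,
# so any ordering that puts 2 between 1 and 3 is ruled out.
obs = {"A": {1, 2, 3}, "B": {1, 3}}
print(ordering_is_consistent([1, 3, 2], obs))  # True
print(ordering_is_consistent([1, 2, 3], obs))  # False

Note that reversing any consistent ordering is also consistent, which is exactly the two-solution ambiguity
mentioned next.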
And so that's very exciting work. By the way, if you use only the 3-D structure, then for every ordering
that we recover here there is immediately also a second ordering, obtained by flipping the time axis
around; in fact, Grant's algorithm nowhere uses the direction of time as a constraint. And so we always
have at least two solutions to this problem.
And here I'm supposed to give a live demo. In fact, it's something Blaise hinted at and I wanted to call
out; I can show it to you as a demo here this afternoon. And tomorrow, during the lunch demo session,
Grant will be available -- let me just restart it -- to let you interact with the applet.
It's something that Grant created really in the last month and a half to show off the results of our research.
Let me see here. So here's a little time animation. This is Atlanta over time. And somebody asked me,
wouldn't it be more interesting to do it in New York or San Francisco as an example of a city, instead of
Atlanta, which most people, I guess, find not so interesting.
But Atlanta is very cool in the sense that if you go back to 1864, there basically was no Atlanta. If you
know your history, Sherman was a Union general -- let me hide everything else -- who came and
conquered Atlanta from the south and proceeded to burn it down. Hence, Atlanta from a research point of
view is interesting because we can start from scratch and see the whole city pop up from 1864. So here's
the oldest picture we have, which is by a photographer who came in with Sherman.
And now we can go through time. So, again, this whole model is built using structure-from-motion
techniques, (inaudible) Photosynth. If we go from one image to the next, we now actually see the
historical photos in context.
We can actually see where they were taken. So here are some other pictures; in a little bit here you see
the refugees fleeing Atlanta. One of the next ones, this is the old station. More refugees. All right. Take a
look at this building here, which is the station, basically, where they kept all the trains.
Here are Sherman's troops back in the middle of the city. And pretty soon now they will start destroying
things. So this is the same station building, but now destroyed.
And you can get a vivid sense of where all this is taken. If you see these images out of context you actually
don't know where they were taken, right?
And I never had a visceral connection with these images, because I never knew where they were taken.
But you can do orbits around and now know that this image was taken right here. That still doesn't tell
you the whole story, but if I turn on the city as it is now, you can see exactly where that image sits in the
modern fabric of the city.
And so with the slider here, I can move through time and add more and more pictures that are taken over
time. And then, if I stop this, I can go to any image here and zoom into it, right?
>> Question: Hit A.
>> Frank Dellaert: And just like Johannes, in fact, we can click on a building, and what happens is it
shows all the images in which this building is visible. So, for example, a building that is no longer with us --
I have to kind of find one; Grant is the demo wizard here -- so I have to try to find a building that was
destroyed. This one is still there. Let me see, go to an older picture here. What about this one?
Now you can see the range of images is much smaller than the 200 in the full database, and this
building existed from here to here, where it was demolished. So this is 1978, and it was built around 1921.
So we can go to the 1921 image, and this is -- the red focuses your attention as we're moving between
images, and it shows you where the image is. And again, because we have the 3-D model, the red in this
image is -- it's not a perfect imitation, because not everything is in the model, but it's almost a perfect
imitation of the pixels in this image.
In a historical context, we know when each building was demolished and when it was not. Although, it
turns out that lots of the dates on these buildings, on these images, which we get from the Atlanta History
Center, are actually wrong, and our model can actually tell that the date on this picture must be wrong
because this building wasn't there yet, et cetera.
So it's interesting: we had a meeting with the Atlanta History Center people and we told them, okay, this
building and this building, or this image and this image, are wrongly labeled.
So I'm not doing this demo justice. Tomorrow Grant, the demo wizard, will show this and allow you to
interact with it.
Let me go back here. So now the part that we've actually been funded for under the RFP, which is another
cool idea that came out of Microsoft Research labs: these gigapixel panoramas that stitch hundreds or
thousands of images into a consistent large panorama, which enables a totally different way of interacting
with a panorama. Instead of simply panning and tilting, we can now go deep, and it's almost like adding a
different dimension to the images.
And unfortunately I am running Mac OS on this laptop at the moment, so I can't show you their HD View.
But I can show you that a whole community has actually sprung up around the idea of gigapixel
panoramas, and people are motivated to go out and capture panoramas and share them with the world.
This is a Carnegie Mellon project called GigaPan, where you can actually interact with these -- I should be online.
This is not a particularly good panorama, because there's a lot of dynamic structure in this scene.
So the mosaicing was really crazy. You can see there's some privacy issues to these things as well.
And in fact, the idea that we got funded for under the RFP was to put one of these gigapixel sensors in the
city, with the idea of trying to capture the evolving city as it is changing.
And the genesis of this idea is that, just like Grant, I always have a camera close to me when I'm driving
around the city. Every time I see something interesting, I try to take pictures so as not to let that moment
in time, that sample in space, go away.
And I realized it's completely impossible to keep up with the growth of a city like midtown Atlanta, where
skyscrapers are constantly going up and coming down. Every time I see something new being built in
Atlanta and I'm not there with my camera, I can't capture it.
So the idea was: let's take 100 of these gigapixel sensors, put them all over the city, and have them
consistently capture the city as it is changing. We've made small inroads into this, and it is very much a
work in progress; we basically have only a semester of work in it. So this is going to be more about what
could be rather than what is at the moment.
But in motivating this project to my students, I made an analogy between the gigapixel sensor and an eye,
right? The sensor that we bought is just a commercial off-the-shelf pan-tilt camera that acts as a server;
you can put it on the net and we can grab images from it.
And the total field of view is actually unlimited, but here's a 56-degree view of the scene where we stuck
the camera. At the highest zoom level, the field of view is one to two degrees, which is in fact exactly
what your eye does.
You have a very large field of view. Almost 180 degrees. In fact, a little more than 180 degrees if you take
both eyes, if you can believe that.
But the fovea, the high resolution part of your eye, only has a field of view of two degrees. So you're
constantly putting your fovea on different parts of the scene and updating your idea of what the scene is
based on what moves, what changes, et cetera.
And we thought that we needed to do this as well, because if you put a gigapixel sensor somewhere in the
urban fabric, almost 100 -- almost 98 percent of the scene is going to be static. What's going to
change is the weather and the 24-hour day-night cycle. Shadows are a big problem there, in fact, when you
think about it.
What we want to do is focus our attention on the things that change and simply record those. There is
also an interesting problem that we have not tackled at all yet, which is that if you capture gigapixel
images of the same scene over and over, and you do this over 10 years, you probably won't store them as
a sequence of separate images; in the compression and storage of these things you can probably make
use of the fact that you're imaging a single scene. Right?
So eye movement makes perception comprehensive over the scene and intentionally focuses on the
changing parts of the scene. The hardware is a Sony pan-tilt camera; I'm not going to go into this slide.
And what we did to try this idea out is we put this camera there; I think we've been continuously recording
two gigapixels a day for about a month. And we take a hierarchical swath through the scene, so we do
this at multiple resolutions: we take two wide-angle shots, eight intermediate-angle shots, and, I think,
about a thousand shots at the highest resolution.
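As a rough sketch of what such a hierarchical sweep schedule could look like in code (the fields of view,
shot counts, and angle ranges below are illustrative guesses, not the actual camera settings):

def hierarchical_schedule(pan_range=(-28.0, 28.0), tilt_range=(-10.0, 10.0)):
    # Each level: (approximate horizontal field of view in degrees, shots across, shots down).
    levels = [
        (30.0, 2, 1),    # wide-angle overview
        (10.0, 4, 2),    # intermediate zoom
        (2.0, 32, 16),   # highest zoom, on the order of hundreds of shots
    ]
    shots = []
    for fov, n_pan, n_tilt in levels:
        for i in range(n_pan):
            for j in range(n_tilt):
                pan = pan_range[0] + (i + 0.5) * (pan_range[1] - pan_range[0]) / n_pan
                tilt = tilt_range[0] + (j + 0.5) * (tilt_range[1] - tilt_range[0]) / n_tilt
                shots.append((pan, tilt, fov))
    return shots

print(len(hierarchical_schedule()))  # number of commanded views per sweep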
And here is a little movie of what one of them looks like; it's just going all the way across. This is probably
level two here, or maybe it's already level three. So probably this is level three, where you are very highly
focused into the scene, highly zoomed in.
Here's a time lapse of the raw imagery; what you see is that the images are constantly shifting around.
Off-the-shelf pan-tilt-zoom cameras, if you tell them to go to a particular pan, tilt, and zoom, don't actually
do it exactly; they add some noise to the process.
So instead of doing the regular gigapixel stitch, where you take all the images and stitch them together,
what we want is a stable base into which to register all the images.
So here's a little time lapse of just taking the first image as the base and registering the other images into
it. Let me show you an animation of what is happening: the first image becomes the base into which the
subsequent images are registered.
This is simply using homography matching, and this is what we continuously do, updating the mosaic.
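A minimal sketch of that per-image registration step, assuming SIFT features and OpenCV; the talk only
says "homography matching," so the feature detector, ratio test, and RANSAC threshold here are my
assumptions:

import cv2
import numpy as np

def register_to_base(base_gray, new_gray):
    # Estimate the homography that warps new_gray into the base frame, then warp it.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(base_gray, None)
    kp2, des2 = sift.detectAndCompute(new_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des2, des1, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe ratio test
    src = np.float32([kp2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    h, w = base_gray.shape
    return cv2.warpPerspective(new_gray, H, (w, h))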
What we have to think about is a recursive gigapixel sensor. In a Bayesian framework, what you want is to
constantly update your model. Your model is a gigapixel representation of the scene given the images at
all times, which is all your pan-tilt-zoom images at every resolution across the whole of time, and you can
simply do it by Bayes' law: a posterior on the gigapixel model is proportional to the likelihood of the
current measurement times your predictive density on your model.
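In my own notation (the symbols are assumptions matching this description, not the actual slide), that
update is the standard recursive Bayes step, with M_t the gigapixel model and z_t the current
pan-tilt-zoom measurement:

P(M_t \mid z_{1:t}) \;\propto\; P(z_t \mid M_t)\, P(M_t \mid z_{1:t-1})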
And you realize that there's something bad about this, and the bad thing is the dimensionality: because it
is a gigapixel image, the model has on the order of a billion dimensions, times three for color.
So this small Bayes' law equation turns into something horribly intractable if you try to do anything but the
simplest probability model.
So that's what we did. In fact, for now we use a model which simply treats each pixel as independent, so
we have a mean and a variance for each pixel. And as we feed more and more images into the scene, you
will see that we are getting more and more confident about some pixels in the scene, because we've seen
them many, many times.
But there are also areas, like a flag in the image, that are constantly changing. And so, given our current
model of the scene, we see that the flag does not fit the image model, and we reset the variances,
meaning we have zero confidence in that pixel at that time.
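A minimal sketch of that independent-pixel model, using Welford-style running means and variances with
a reset for pixels the current model cannot explain. The threshold and the reset rule are my guesses at the
behavior described, not the actual system:

import numpy as np

class PixelModel:
    def __init__(self, shape, prior_var=1e4):
        self.prior_var = prior_var              # large variance = no confidence yet
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)               # running sum of squared deviations
        self.count = np.zeros(shape)

    def variance(self):
        return np.where(self.count > 1, self.m2 / self.count, self.prior_var)

    def update(self, image, k_sigma=3.0):
        resid = image - self.mean
        outlier = resid ** 2 > k_sigma ** 2 * self.variance()
        # Reset pixels the model does not explain (a waving flag, new construction).
        self.mean[outlier] = image[outlier]
        self.m2[outlier] = 0.0
        self.count[outlier] = 0.0
        # Welford running-mean / running-variance update for every pixel.
        self.count += 1
        delta = image - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (image - self.mean)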
So where things are dark -- here is a building going up, in fact -- we see something that is unexpected.
And one of the biggest problems, and I saw some people nodding at the question about shadows:
shadows, of course, always change in the scene, and so we get away with it by taking the panorama
always at roughly the same time during the day.
And then this is my last two slides because I think I'm going over time.
The idea was, then, that if you have an evolving representation, a density over the gigapixel model, then
your next measurement should be the measurement that reduces your uncertainty the most. And that
question, the argmax, is easily answered in theory: we simply want the measurement that is going to
maximize the information in our (inaudible) at the next time step.
And this is just for the next time step; you can imagine larger schemes that do sensor planning over the
long term. But if this density is anything but a very simple density, this becomes a very complicated
question.
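In symbols (again my notation, a one-step greedy criterion rather than long-horizon planning):

z_{t+1}^{*} \;=\; \arg\max_{z}\; I\big(M_{t+1};\, z \mid z_{1:t}\big)

that is, pick the pan-tilt-zoom measurement whose mutual information with the model, given everything
observed so far, is largest.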
So the student -- this is Dan Woo -- did something simple, which is to grab the image where there is the
most variance, shown as black, meaning for instance this flag that is constantly changing. So we
constantly take images there, and once in a while we take a large wide-angle shot, because that has the
highest information value.
And here's actually a building going up; here we're constantly resetting those variances.
So the camera spends a lot of time sampling this part of the image, trying to follow the building as it
goes up. Right?
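A minimal sketch of that heuristic; the tile layout, the occasional-overview probability, and all names are
hypothetical:

import numpy as np

def next_view(var_map, tiles, p_overview=0.05, rng=None):
    # var_map: per-pixel variance of the current mosaic model
    # tiles: view id -> (row_slice, col_slice) that view covers in the mosaic
    rng = rng or np.random.default_rng()
    if rng.random() < p_overview:
        return "overview"  # occasional wide-angle shot: high information value
    scores = {view: float(var_map[rs, cs].sum()) for view, (rs, cs) in tiles.items()}
    return max(scores, key=scores.get)  # otherwise zoom in where the model is least certain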
And that's the animation of the large gigapixel fovea, the eye in the sky, which is actually a very, very
simple thing we did, but it shows the promise of this. In future plans, we want to make the densities more
informative than simply independent pixels, and there's some very interesting work in that area that we're
trying to implement.
But there's also some interesting research about interacting with these gigapixel images that have
independently changing pieces, right? This gigapixel panorama here has a little piece that is constantly
changing. The system could figure that out, because it has put all these images there, and give you a little
player interface in that part of the scene.
And because this building is going up in a different part, that player would be limited to just the piece with
the building, and you can independently play these little embedded movies in the gigapixel panorama,
whereas the rest of the scene is static and is not considered interesting enough to put interaction on.
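One way such a system could find those independently changing pieces automatically is to segment the
variance map into connected blobs, each of whose bounding boxes could host its own embedded player.
A sketch using SciPy connected components; the thresholds and the helper name are hypothetical:

import numpy as np
from scipy import ndimage

def changing_regions(var_map, thresh=50.0, min_pixels=500):
    # Blobs of persistently high variance = parts of the panorama worth an embedded player.
    mask = var_map > thresh
    labels, _ = ndimage.label(mask)
    boxes = []
    for sl in ndimage.find_objects(labels):
        if sl is not None and mask[sl].sum() >= min_pixels:
            rows, cols = sl
            boxes.append((rows.start, rows.stop, cols.start, cols.stop))
    return boxes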
And then, as I said, another interesting thing is we only put in one sensor right now but we'd like to put in at
least three to see what we can do with multi-view capture over time.
Okay. I'm going to stop and acknowledge especially Kevin, who did the tag localization, Grant, who did the
cool demo and the two CVPR papers, and Dan, who implemented all the gigapixel stuff. So I'm going to
leave it there. Thank you.
(Applause)
>> Question: I have one question. In your big panoramas, you want to factor out things that are periodic
with the time of day, I would guess, like shadows and maybe even traffic and what's happening in the sky.
So I wonder if you've thought at all about how you can correlate, say, pixel brightness with just time of day;
you might find some periodic correlations.
>> Frank Dellaert: Right. Absolutely. I mean, that seems to be the way to attack shadows, right? Because
shadows are not perfectly periodic, right? They change with the time of year, so you're not going to totally
get away from it. But you could add a time-of-day model, model the 24-hour cycle and the sun position,
especially if you capture these over many, many years. So "I don't know how to do it" is the answer.
But definitely I think that's the right direction to think in when trying to factor that out.
>> Great. Thank you very much. So we have a 30-minute break. We'll start the next session at exactly
4:00 p.m. sharp. There should be snacks outside.
[End of segment]