>> Moderator: Okay. Good afternoon. My name is (indiscernible) and it's my pleasure to invite
you to the afternoon session. The first presenter today is Craig Knoblock from the University of
Southern California. The subject is integrating online maps with aerial imagery.
>> Craig Knoblock: Okay. Great. Thank you. All right. So I'm going to talk about finding and
integrating maps with aerial imagery. And I want to acknowledge my colleagues on this.
This is joint work with Cyrus Shahabi, who is sitting back there somewhere, Yao-Yi Chiang, Aman
Goel, Matthew Michelson, all from USC. And then Jason Chen, who's at Geosemble
Technologies, which is a spin-off company from USC. And this is research sponsored by a
combination of the National Science Foundation, Air Force Office of Scientific Research and
Microsoft Research.
Okay. So here is the problem we're looking at. The first part of the problem is there's lots of
information out there. And so one of the things we wanted to do is say okay, we'd like to go out
and find maps that we can actually put on top of imagery and find them automatically. And we
can go out to an image search engine like MSN here has an image search engine. And I entered
here Redmond maps. And I get back, as you can see, some things that are maps. Let me try
this pointer. So this is a map and this is a map, but that's not really a map. And so you can see there's some combination of maps and things that are not actually maps in these results. And so one
of the problems we looked at was automatically classifying things into maps. That's the first part
of the problem.
And the reason we wanted to do that was for the second part of the problem, which was then to
take these maps and automatically align them with the imagery. So here's an image that was
automatically processed where we didn't have to tell the system anything about the map itself
except its general location. So you could get it out of a map search engine and then the problem
then was can we actually determine the exact coordinates of the map, find the control points and
then superimpose it on top of Virtual Earth in this case. So that's the problem that I'm going to
talk about today and how we actually solve this problem.
And there's really two parts to my talk. First I'm going to review previous work that we've done in
automatically aligning maps with imagery and I'm going to do this pretty quickly.
This is stuff I talked a little bit about at the last Virtual Earth Summit. So I don't want to spend too
much time on that. And then I'm going to talk about the newer stuff we're doing on automatically
identifying maps.
So let me dive in and start that. Okay. So here's the problem, as I said. On the top we have
some image of the world and on the bottom we have some map that we found from somewhere,
maybe we got it out of one of the image search engines. And this map -- a lot of these maps are
in raster format so we don't actually know the metadata for the map. If you know the metadata,
great, you don't need to go through this process. But a lot of times you might go out and you
might find maps about all kinds of different things. You know, they might be layouts or locations of oil and gas wells, or they could be real estate type maps. Lots of interesting maps are available, but
they're in raster format. And so you don't have any information about the coordinates of the map
or even the details like the scale and locations and things like that.
So we want to put these things together so that you get the combination and you can get the
information and many times one of the most useful types of layers in the maps is simply the text
layer, which has all the labels on the roads and buildings and so on.
Okay. So here's another map. This is a map we got from the Washington, D.C. Transportation website. And it's a PDF. There's no metadata for it. You just go there and you can
download the map. And it has the location of all the bus lines and you say well I'd like to
superimpose that on top of the imagery so I could see the location of all those things. But you'd
have to do it manually today to do an alignment.
Okay. So what we did is we developed an approach where we use the vector layer, which is essentially the road layer, as the glue to align the map with the imagery.
Okay. And so the way this works is we start with a map and an image and we'd like to bring
these two things together. And so we also have some kind of road network layer here and the
first step is we have to take this road network and align it with the imagery. And one of the
problems that we find is that a lot of these kinds of layers and stuff, they're not aligned and we
need fairly accurate control points for this process. So the first step is to actually -- automatically
align the road layer with the image so that we know where all the intersection points are on the
image.
The second step is then to find -- take the map and find all the intersection points on the map.
And we want to do this automatically. We don't want to have to go through a manual process
where the user finds each of these control points because it's just too time consuming to do this
over large numbers of maps. Once we have that then the next step is to actually do some kind of
-- do a process we call point pattern matching where we're going to determine the relationship of
the points here to the points over here. And in some sense what's going on here is we're using
the layout of the intersections as the key or the sort of fingerprint of the map to actually figure out
the relationship to the image. So we know the general area that this is a map of some city, let's
say. But we don't know the exact location. So we're using this layout of the intersections, and intersections over a large enough area are going to tend to form a unique pattern. So we're
exploiting that property here. So we have a point pattern matching algorithm that comes up with
the matching between these, so then we can actually superimpose the map on top of the imagery or vice versa and do a process called conflation, where you actually stretch the map so it fits over the image.
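To make the point-pattern-matching idea concrete, here is a minimal sketch in Python. It is not the authors' algorithm; it is a brute-force illustration that assumes the map is north-oriented (rotation known), hypothesizes a scale and translation from pairs of intersections, and keeps the hypothesis that puts the most map intersections on top of image intersections. The function name, the tolerance, and the exhaustive search are illustrative only.

```python
import itertools
import numpy as np

def match_point_patterns(map_pts, img_pts, tol=5.0):
    """Brute-force sketch: find a scale s and translation t so that
    s * map_pts + t lands on the image intersections (rotation assumed known)."""
    map_pts = np.asarray(map_pts, float)
    img_pts = np.asarray(img_pts, float)
    best_scale, best_shift, best_pairs = None, None, []
    for i, j in itertools.permutations(range(len(map_pts)), 2):
        d_map = np.linalg.norm(map_pts[j] - map_pts[i])
        if d_map < 1e-6:
            continue
        for a, b in itertools.permutations(range(len(img_pts)), 2):
            # Scale hypothesis from the two point pairs, translation from (i -> a).
            s = np.linalg.norm(img_pts[b] - img_pts[a]) / d_map
            t = img_pts[a] - s * map_pts[i]
            # Score: how many transformed map intersections land near an image intersection.
            proj = s * map_pts + t
            pairs = []
            for k, p in enumerate(proj):
                dists = np.linalg.norm(img_pts - p, axis=1)
                m = int(np.argmin(dists))
                if dists[m] < tol:
                    pairs.append((k, m))   # control point pair: map index k <-> image index m
            if len(pairs) > len(best_pairs):
                best_scale, best_shift, best_pairs = s, t, pairs
    return best_scale, best_shift, best_pairs
```

The returned pairs are exactly the kind of control point correspondences the conflation step needs; the real system uses a much more efficient point pattern matching algorithm than this exhaustive search.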
Okay. So here's the first part of the process. So the first problem then I'm going to just very
quickly review is work we published in 2006. But it essentially goes through this process of taking
the road network and aligning it with the imagery. And the basic problem here is that you know
as I said if we just take this image -- take the road network and superimpose it on the imagery it's
actually not aligned. If we go through some automatic conflation process then the intersection
points are actually on intersections, which is important for the next step of the process.
And just to give you a sense -- I'm not going to go through the details, but just to give you a sense
of how the process works we go through a general process here where we first identify a set of
control points. And I'll describe how that works next. But that's really the heart of the process,
which is where we find a set of control point pairs, where we're using the intersections in the vector data
and finding the corresponding intersections here on the image. We do some filtering so that we
get rid of any noise that we might have introduced to this and then go through that final sort of
triangulation rubber sheeting so that we end up with the road network superimposed or correctly
aligned with the imagery.
So the key then, as I said, is this control point detection, and that process essentially starts with the road network. We're really exploiting the fact that we don't have to just do road extraction from the imagery, which we all know is hard. Instead what we're doing is exploiting the fact that we know something about the approximate locations of the road network, so it is close, it's just not aligned perfectly. And we also know something about the general shapes of these intersections. And so, you know, the intersection is here, but the real intersection on the imagery is here, and what we're doing is we're looking for the intersection within some radius of that original intersection.
And so the first part of this process is we're going to take the image and do simple machine learning to basically classify pixels as either on-road pixels or off-road pixels. That's what's shown here on the right, where the white is the road pixels and the black are the off-road pixels. And you can see things like trees create some noise in there and roofs sometimes look like roads. The usual kinds of problems we have there.
Then what we do is we take the vector data of this road network and we build some kind of template here that we can then align with the image. And now we can combine these two, essentially searching for the best match within some radius of that original location. So you can see it goes through a search process and decides, okay, this is the actual best match of that intersection point onto the actual intersection, so that we can get a nice alignment between the two different layers, and that will then tell us where the intersections are on the image. All right.
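As a rough illustration of that search step, here is a small Python sketch, assuming a binary road/off-road mask has already been produced by some pixel classifier. The cross-shaped template, the radius, and the use of OpenCV's normalized cross-correlation are assumptions for the illustration; the real system builds the template from the actual road directions in the vector data.

```python
import cv2
import numpy as np

def refine_intersection(road_mask, vec_xy, radius=30, tmpl_size=21):
    """Snap a vector-road intersection onto the road mask.

    road_mask: uint8 image, 255 for road pixels, 0 for off-road (from a pixel classifier).
    vec_xy:    (x, y) of the intersection in the (misaligned) vector data.
    Returns the refined (x, y) of the best-matching location within `radius`.
    """
    # Illustrative template: a simple cross of road pixels.
    tmpl = np.zeros((tmpl_size, tmpl_size), np.uint8)
    c = tmpl_size // 2
    tmpl[c - 1:c + 2, :] = 255   # horizontal road arm
    tmpl[:, c - 1:c + 2] = 255   # vertical road arm

    x, y = int(vec_xy[0]), int(vec_xy[1])
    x0, y0 = max(x - radius, 0), max(y - radius, 0)
    window = road_mask[y0:y + radius, x0:x + radius]
    if window.shape[0] < tmpl_size or window.shape[1] < tmpl_size:
        return vec_xy  # too close to the border; keep the original point

    # Normalized cross-correlation between the template and the search window.
    scores = cv2.matchTemplate(window, tmpl, cv2.TM_CCORR_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(scores)
    # Convert the top-left template corner back to an intersection center in image coords.
    return (x0 + max_loc[0] + c, y0 + max_loc[1] + c)
```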
So the end result -- so here's a sample of the result of this. The red vector is the original vector data here; I think this comes from the Missouri Department of Transportation. And the blue lines here are the result after alignment with the imagery. You can see the alignment is quite good after we go through the process. We know where the intersections are. Now that brings us to that point -- we've sort of gotten to this point here. So we've got this image here and all the little red dots shown there are the actual intersections on the roads.
Now we go through the other part of this process, where we have to now find the intersections on the map and align it with the image. And just very quickly I want to give you a sense of what is going on here. So this in general is hard, and we don't want to have to train this for every map. So the idea is we go through a process where we first do some kind of automatic thresholding where we can pull out the background color and we end up with just the foreground pixels, which contain the roads and text. Then we're going to remove noisy information, which in this case is the text, because it tends to interfere with finding the road intersections. We go through a set of transformations where, once we remove the text, we end up breaking the lines, so we'll go through some morphological operations to clean that up and reconnect the roads. We find the corner points, which are the things that are likely to be intersections, and then we finally have some way of actually testing which things are really intersections by checking how many roads actually come together there.
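For orientation, here is a heavily simplified Python sketch of that kind of pipeline, not the published algorithm. The text-removal step is omitted, and the specific OpenCV calls, thresholds, and the simple four-direction road-arm check are illustrative assumptions.

```python
import cv2
import numpy as np

def detect_map_intersections(map_bgr, min_arms=3):
    """Very simplified sketch of road-intersection detection on a raster map."""
    gray = cv2.cvtColor(map_bgr, cv2.COLOR_BGR2GRAY)
    # Automatic thresholding to separate foreground (roads, text) from background.
    _, fg = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # (The real system removes text here; that step is omitted in this sketch.)
    # Morphological closing to reconnect road lines broken by the previous steps.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    roads = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, kernel, iterations=2)

    # Candidate corner points that might be intersections.
    corners = cv2.goodFeaturesToTrack(roads, maxCorners=500,
                                      qualityLevel=0.05, minDistance=10)
    if corners is None:
        return []
    intersections = []
    for cpt in corners.reshape(-1, 2):
        x, y = int(cpt[0]), int(cpt[1])
        # Count how many road arms leave a small neighborhood around the candidate point.
        arms = 0
        for dx, dy in [(-5, 0), (5, 0), (0, -5), (0, 5)]:
            yy, xx = y + dy, x + dx
            if 0 <= yy < roads.shape[0] and 0 <= xx < roads.shape[1] and roads[yy, xx]:
                arms += 1
        if arms >= min_arms:   # keep only points where several roads meet
            intersections.append((x, y))
    return intersections
```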
And so what we end up with is a set of detected intersections now which are actually shown on
the map. And there's a whole paper that describes this work and I just don't have time to go into
it today. But I want to give you a sense that this is an automatic process. We feed the map in and it finds the intersections on the map. And now we know where the map intersections are, and the next
step then is to do this point pattern matching where now the problem is that I've got all these
intersections here and all these intersections over here. And I need to figure out the relationship
between these two.
And this is a fairly -- can be a fairly search intensive process but we exploit some properties, so
for example if we know -- or we assume the orientation of the map because most are oriented
north and if they're not they usually indicate what north is. So we assume we know the
orientation. But we need to worry about the translation and scale of the maps. If we know the
scale then we can give it to the system, but if we don't know it then we can still handle that.
So what happens here now is, as I said, you take the points from this and we're going to look for the
corresponding matching to the points over here and find that in fact the best fit for this set of
points is going to be right there and that's going to tell us exactly -- not only where the map goes,
but it's going to give us the -- you know it's going to give us the scale and the translation for this.
And, this is very important, it gives us the control point pairs. So if we figured out what the
mapping was that means that we know the mapping from each intersection point on the map to
each of the corresponding intersection points over here on the image and we can use those for
the final process, which is called conflation, where we're basically doing a triangulation process and rubber sheeting, where we're essentially taking each of these triangles and stretching them to fit on top of the image. So in some cases this means we're going to have to drop some pixels and in some cases we're going to have to fill in a few pixels. But the end result might
look something like this where we had that original map and now we've stretched it to fit on top of
the image. This map is slightly distorted, partly because of the fact that there was probably some cartographer involved who moved things around a little bit to make things line up. But you can
see -- now you can see the relationship with the bus lines directly on top of the imagery for
Washington, D.C.
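As a sketch of that final rubber-sheeting step, the following Python snippet uses scikit-image's piecewise affine transform as a stand-in for the triangulation and stretching; the control point pairs are assumed to come from the point pattern matching, and the library choice and coordinate conventions are assumptions, not the authors' implementation.

```python
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def conflate_map(map_rgb, map_ctrl_pts, img_ctrl_pts, out_shape):
    """Stretch a raster map so its control points land on the image control points.

    map_ctrl_pts, img_ctrl_pts: (N, 2) arrays of (x, y) pairs from point pattern
    matching, in the same order.  out_shape: (rows, cols) of the target imagery.
    """
    tform = PiecewiseAffineTransform()
    # estimate() fits dst = T(src); warp() needs a map from output (image) coords
    # to input (map) coords, so source = image points, destination = map points.
    tform.estimate(np.asarray(img_ctrl_pts, float), np.asarray(map_ctrl_pts, float))
    # Each Delaunay triangle of control points gets its own affine stretch
    # ("rubber sheeting"), so the map is locally deformed to fit the imagery.
    return warp(map_rgb, tform, output_shape=out_shape)
```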
Okay. So that's my very brief review of sort of the general technique for automatically aligning
the maps with the imagery. Now I want to get into the new stuff we've been working on, which is
how we can automatically identify these maps. All right. So what we want to do is basically be
able to go out and harvest these images from the web, or maps from the web. But we really start out with some set of images. So we might use something like the MSN image search, where it will return a set of images and we can specify the area. So one useful technique we use is we often say, okay, here is the name of the city, and we give the keyword "maps", and maybe about half the things that come back are actually maps.
And the problem then is we want to basically classify these things and actually generate a database of the maps, and then we can go on and do some additional processing. The kind of processing is what I just described, where we then go through those processes of actually trying to pull out the intersection points and then doing the alignment. That's an expensive process. So we don't want to do that on a bunch of random documents, which is the reason we do classification first.
Now what we assume here is we have a couple repositories. We have a repository of maps that
we've seen before. So we've done a bunch of classification of things that we say are maps and
we have another repository of you know non-map images. These are things that are not maps.
And when I say maps, I'm talking about street maps. So they have to have a road network on them if we're going to classify them as a map, because there are many maps out there that are just generic kinds of maps.
Okay. So the basic approach that we take to this problem is called -- is using what's called
content-based image retrieval. And the idea is that we're using the similarity of the image that
you want to classify to other images that are in the repositories. And so the nice thing about this
approach is it's fast. It's based on sort of information retrieval techniques and it's quite scalable.
And I'll talk a bit more about that in a minute. But the way this works is that we go and we use the
image we're trying to classify to essentially query the repository of maps and non-maps and it
returns some set of images and in fact what happens is the set of images returned essentially are
voting whether or not it thinks this thing is a map or not a map.
So this is based on a sort of k-nearest-neighbor type search, and I'll show you in the results -- an earlier version of the system we developed used a machine learning technique called support vector machines, and we compared this approach to the SVM and found that it actually works better.
So here's the way content-based image retrieval works. So you have some image set and you have some query image, and then you use a similarity function, and I'll talk about the similarity function we use for maps in a minute. And using this image and the similarity function, it goes out and finds those images that are closest to your image. And it rates them based on the similarity function and returns the closest ones.
Now in terms of the features, we need some similarity function that's actually going to work well on maps, and we've experimented with a variety of them. And the one we've found that works the best is using what are called water filling features. And you can think of these as sort of similar to water going through a set of pipes, where you have forks in the pipes and you have some sort of flow for the amount of water; that's why it is called water filling. And this works well on images that have a very strong edge structure. And one of the properties of maps -- if you think about a map, you can often recognize at least a street map from a distance by the fact that they have this pattern of lines that are very clear. And one of the advantages of using the edge structure instead of things like the color on the map is that we get color invariance. We don't want to have to have seen this map before in order to recognize it as a map, and we found that you basically have maps of every possible color, so we wanted to abstract away from the color itself.
So here, okay, the first step in using this sort of water filling algorithm is we're going to run it on what are called edge maps, and so we use a standard vision algorithm called the Canny edge detector. This just shows an example. So here's a map on the left, and on the right is the result of running the Canny edge detector on that. And you can see the edge structure in this map is very clear. Right. It's really basically a whole bunch of lines that come together in various places with a little bit of text superimposed in different places.
But that's the first step. And then the next step is to actually compute these sort of water filling
features. So let me just explain the water filling by example here. So here is just a set of
snippets of, you know, what could be a map. A straight line like this might have a filling time, which is based on the number of pixels. So if this is 50 pixels in length then this would have a filling time of 50, and then the fork count for this example would be one, simply because in this example there's no branching.
If you go to the next example in the middle here, you can see that, okay, this has an increased filling time because there are more pixels. If you have twice as many pixels, it's going to give you a filling time of 100. And then you have a branch. So each branch is going to increase your fork count by one, and so now we have a fork count of three.
Now on the third example you can see it's a little smaller so maybe this has a filling time of 70,
but a higher fork count and you can see that there's more branching in this particular image. In
fact it is nine. So those are what the features are computing, and actually what we do is we compute a set of aggregate features based on these water filling metrics. So we compute four things. One is the maximum filling time and its corresponding fork count. So the image is sort of broken up into subareas, and for each of the subareas you're going to find whichever one has the maximum filling time and then for that one you're going to compute its fork count. And this is a measure of the longest edge and sort of the complexity of that edge. The second one is the maximum fork count and its filling time. So instead of computing the maximum filling time and its fork count, this is going to give us a measure of the highest complexity of some part of the map and then its corresponding length. And then we also create a filling time histogram and a fork count histogram for the whole image, with the values broken up into bins.
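To make the filling time and fork count idea concrete, here is a rough Python sketch that approximates them by a breadth-first flood over each connected component of a binary edge map. This is only one reading of the description above, not the published water-filling feature definition; the traversal order and the branching rule are simplifying assumptions.

```python
from collections import deque
import numpy as np

def filling_time_and_forks(edge_map):
    """Rough sketch of water-filling style measurements on a binary edge map.

    edge_map: 2D boolean array (e.g. the output of a Canny edge detector).
    Returns (max_filling_time, fork_count_of_that_component).
    """
    H, W = edge_map.shape
    visited = np.zeros_like(edge_map, dtype=bool)
    nbrs = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    best_time, best_forks = 0, 0

    for sy, sx in zip(*np.nonzero(edge_map)):
        if visited[sy, sx]:
            continue
        # "Pour water" at this pixel and flood the connected edge component.
        q = deque([(sy, sx, 0)])
        visited[sy, sx] = True
        time, forks = 0, 1
        while q:
            y, x, d = q.popleft()
            time = max(time, d)          # filling time ~ how far the water travels
            branches = 0
            for dy, dx in nbrs:
                yy, xx = y + dy, x + dx
                if 0 <= yy < H and 0 <= xx < W and edge_map[yy, xx] and not visited[yy, xx]:
                    visited[yy, xx] = True
                    q.append((yy, xx, d + 1))
                    branches += 1
            if branches > 1:             # the water splits: one extra fork per split
                forks += branches - 1
        if time > best_time:
            best_time, best_forks = time, forks
    return best_time, best_forks
```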
Okay. Those are the basic features that get used. Then what happens in the actual content-based image retrieval process is we start with some image we want to classify. And then what we do is we use this CBIR. We have these two repositories, the map repository and the non-map repository. And then we use the content-based image retrieval to find the nine most similar images. I could only fit five here, but imagine there were nine. But in this case we return these first three, which have been previously classified as maps. So they came out of the map repository and we had labels on them; these three were identified as close. Then we also return these other two things that for some reason it decided were close to this. These are non-maps. And essentially what happens is we just vote based on what came back. So these three are voting that it's a map. These two are voting that it's not a map. So we take the majority, and in this case we say, okay, this thing we're going to determine to be a map.
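The voting step is simple enough to write down directly. Here is a minimal sketch, assuming precomputed feature vectors; plain Euclidean distance stands in for whatever similarity function the real system uses over the water-filling features.

```python
import numpy as np

def classify_by_voting(query_feat, map_feats, nonmap_feats, k=9):
    """Content-based retrieval as a k-nearest-neighbor vote (k = 9 as in the talk).

    query_feat:   1D feature vector of the image to classify.
    map_feats:    (N, D) features of the map repository.
    nonmap_feats: (M, D) features of the non-map repository.
    Returns True if the majority of the k most similar repository images are maps.
    """
    feats = np.vstack([map_feats, nonmap_feats])
    labels = np.array([1] * len(map_feats) + [0] * len(nonmap_feats))
    # Similarity here = (negative) Euclidean distance in feature space.
    dists = np.linalg.norm(feats - query_feat, axis=1)
    nearest = labels[np.argsort(dists)[:k]]
    return nearest.sum() > k // 2        # majority vote of the retrieved images
```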
Okay. So how does this compare to doing this machine learning? So in machine learning we do
something similar, where we need to train the system. We need to give it some training data.
And one of the things that comes up in machine learning is okay how many classes do you have?
And we can do it as a binary classification. It's either a map or a non-map. But one problem we had with training the machine learning process to do this was that maps are all very -- I mean, you have a huge range of things that come up on maps and there is a huge variety of them. And using something like support vector machines, it would end up with either low precision or low recall depending on how you did the classification to get, you know, fairly high accuracy.
So the machine learning is trying to use one model to sort of distinguish the maps and the
non-maps and it was challenging to do that. The other thing you could do is try to break it up into
sort of separate models. You say well there's different classes of maps because different maps
look different. But that requires a lot more work to do the labeling. Trying to figure out which
maps form which class, which is actually quite hard. But we did want to compare it to this.
And there's been a -- well, some related work here. So the SVM work done previously for solving this problem of classifying maps was done by another student of mine a few years ago. And there we didn't use water filling. We used a different metric: what are called Laws textures. And these look at the texture of the map. Again we're trying to be color invariant. It is just a different feature, in that instead of looking at the edge features we're really looking at the texture type features in the map.
Other sort of related work here is the CBIR work, CBIR-based k-nearest-neighbor type work, and this was used previously for classifying images in the medical domain. The medical images have different kinds of properties and stuff, and so we found that the features that they were using didn't work that well in our domain.
And the other piece of work here that is relevant: there has been a fair amount of work using water filling features for other kinds of CBIR. And so some of these types of things are features that we may want to look at in the future to basically improve how we're doing. But the edge features, as we'll see in a second, work quite well.
Okay. So let me run briefly through the set of experiments we ran on this. So we took quite a variety of maps. We collected maps for the set of cities shown here, and then we took the Caltech 101 database, which I saw in another talk someone else used as well. It's a very nice image database of things that are all not maps. You can see the total number of images in the repository; it's about 3000. And then we have about 1700 that are maps and about 3000 that are non-maps. I don't know why the numbers don't add up actually, but the experiments we ran were to test two things. First we wanted to show, or test the hypothesis, that the content-based image retrieval is better than the support vector machines. So to test this we want to compare them using the same features. So we ran the machine learning stuff that we had done before using the water filling features. And we took 1600 training images, randomly picked from the repository, and then 1600 testing images, which are distinct from the training ones. And in both cases we took equal numbers of maps and non-maps.
The second hypothesis we wanted to test was that the water filling features actually work better than the textures, and so we compared -- well, we actually ran both the content-based image retrieval and the SVM on both types of features. The Laws textures worked terribly with the content-based image retrieval, so we don't report those results here. But we also compared the support vector machines on both types of features. And these are the results, so just a summary of the results here.
So in the first line here we have the CBIR with water filling, and you see we got the highest precision and recall. So we have an 82% F-measure, which is the combined precision and recall. The support vector machine with water filling, which is the same features we're using here, has slightly higher precision, but the difference there was not statistically significant. It's very close; it's basically the same precision. But you can see the recall is much lower -- almost, well, more than 20% lower on the actual recall, which is really significant for finding maps. So you can see we end up with a lower F-measure for those results.
The second comparison was running SVM across these two types of features. And again the
water filling helped SVM quite a bit. We are able to get better results using water filling features
because the edges seemed to work well for map-type images.
Okay. So you can see, in summary, the CBIR with the water filling really worked the best for this. We are getting pretty good results on this. I think we can do better, so I'll talk about the next steps in a moment. But here is just a quick graph summarizing what happens. On the X axis here we increase the amount of training data. So what's happening along the X axis is that we're increasing the number of maps and non-maps we're training the system on. So 200 here would correspond to 100 maps and 100 images that are not maps, and so on. And we wanted to see the impact of additional training data. The green line on the top is showing the CBIR with water filling, the orange line in the middle there is SVM with water filling, and then the gray is SVM using Laws textures. So you can see they are all improving with additional training data. But we're only going to get so far. It hasn't quite flattened out there, but it's not going to get to 100% using this.
So the question is, all right, what kinds of things do we need to do to really get to the next level of performance, to get up into the 90th percentile? So we started looking at that, and one of the things we discovered for a lot of the maps, or images, that were misclassified is that we had what we call culprit images. Okay. They were things in the database that would cause the system to misclassify a map. And usually a lot of these misclassifications were essentially off by one vote, meaning it was really close. The errors tended not to be on things that were just clearly a map or clearly a non-map. Those were easy. They tended to be in the middle. For things that were misclassified the vote would be 5 to 4 or 4 to 5. And what we discovered was there was a certain set of images in our database that over and over again could actually be held responsible for the misclassification. So here's an example. If you look on the left, okay, you say, okay, well, we see a picture of a person there. But on the right you can see that there is this road network behind his head that caused the system to essentially say, okay, these look like roads to me, connected up, with similar properties to maps and stuff. We found that a lot of these culprit images were essentially caused by these kinds of features in the background that you wouldn't even necessarily pick out normally.
So one of the things we're planning to do -- we have a system we've been building -- is to actually go back and be able to very quickly look at the images it's making mistakes on and reclassify the mistakes, but also, for the ones it's making mistakes on, to start to look to see if there is some pattern in the images that is causing the misclassification and then remove those images from the repository.
Okay. So to conclude here so we essentially have developed this method for automatically
harvesting the maps on the web that's accurate. I mean fairly accurate. I think there's a lot more
we can do in terms of improving the accuracy there, as I said. It's fast and scalable -- scalable in the sense that it's really easy for us to add additional maps to it to improve the overall accuracy without hurting the performance of the system. For future work we're going to look at resolving these things that we call culprit images and at exploring other features. I mentioned other features people have explored in the past, so we will try to incorporate additional features to improve the classification. And then finally we will plug it into our automatic geo-referencing framework, where we take the images and pass them through the whole process, where we then pull out the intersections and do the alignment with the imagery. Our vision then is that you have a whole repository of maps for a given area that you automatically harvested.
That's it. Let me take some questions.
(Applause)
>> Moderator: Thank you. Questions?
>> Question: Hi. In your testing set the (inaudible) images, most of them were photographs or some of them were also --
>> Craig Knoblock: No, there's a whole variety of things. In the Caltech image repository there's a lot of photographs, but there's a lot of other stuff in there, too, that are not photographs.
>> Question: Okay.
>> Craig Knoblock: Might be paintings and other kinds of things. That repository is available online from Caltech.
>> Question: I presume a lot of the maps that you find are going to be geocentric. I mean,
they're not sort of (inaudible) (inaudible) in the system?
>> Craig Knoblock: What? Sorry.
>> Question: Most of the maps you find online are sort of generic digital mapping tools, the
assumptions of geo (inaudible) (inaudible) on a plane.
>> Craig Knoblock: Yes.
>> Question: That's another -- have you considered that pattern as a bias when you're trying to identify something that is a map? I mean, yeah, you will find things in strange projections, but most of the time --
>> Craig Knoblock: Yeah. In fact what would happen with things like strange projections and stuff is that we may be able to identify that it's a map, and then we would probably end up failing to, I mean, it depends on how strange it would be, but we might end up having trouble --
>> Question: (Inaudible).
>> Craig Knoblock: Right. Well, you know, actually, as long as maps are drawn to scale and we can do the processing of the map, we'd be able to do the mapping. One thing that happens with things like the projections and stuff is that, because we are not just looking for the corner points on the map, right, we end up finding control points such that in some sense we can actually correct for the different types of projections and stuff. So you'd end up stretching the map. If you could find the mapping you would end up stretching the map and then it would align with the image that you are looking at. The real problem comes up in the class of maps that are out there that are not to scale, for example. Right. There is a whole set of things -- or, you know, one of the things we haven't dealt with yet is dealing with maps that are at really high levels of abstraction. Right. So you are only seeing sort of the very high level road network and those kinds of things, so those are the places where we're going to have trouble.
Any more questions?
>> Moderator: Anyone else?
>> Craig Knoblock: Okay. Great.
>> Moderator: Thanks.
(applause)
>> Moderator: The next speaker is Horst Bischof from Graz University of Technology Austria and
the subject is semantic enrichment of street side data.
>> Horst Bischof: So thank you for this introduction. So -- lights? Is there someone turning on the -- okay. I don't want you to fall asleep, huh? (Laughter)
>> Is that okay? More?
>> Horst Bischof: I think this is fine. Thank you.
Okay. So thank you for inviting me here and giving me the opportunity to talk here. I have also to
thank the Microsoft technical support that saved my talk, because my laptop crashed and fortunately they were good enough to get the data off the laptop so that we can see a presentation here. So it's really great. Otherwise I would have had to give a presentation without slides, and that would be less interesting, you know. I hope so.
Okay. So what I'm talking about here is semantic enrichment of street side data. And this is work
we have done recently. So you have seen here a lot of presentations, like Steve's slides. Steve Stansel(phonetic) showed what he can do with his technology, generating massive data. So you go there, take photos, do this 3D reconstruction, like here, and you get various sorts of things. But, and I have to thank Wolfgang Walker for his presentation yesterday because he paved my way, this is just data. These are triangles, points, pixels. You can do nothing with the data except look at it, navigate through it, browse through it. And Wolfgang had some nice examples yesterday where he showed, well, what you really want to do is you want to query this data, you want to ask questions. Where are the windows? Where are the parking meters? Where are the cars? Where are the trees? Take me -- move me close to a tree or close to three trees. Which means we need some semantics there.
And Wolfgang had this nice red line there and said well we need to cross this red line and I hope
to show you how we can do that with some of this technology. Of course I'm not
saying we have solved the problem, but I would like to give you some hints how we can do that.
So as I said, data isn't enough. So we have to recognize objects and I will show you something
on street side data we did recently.
So the talk will be split in two parts. First I will speak briefly about recognizing compact objects like cars, pedestrians or, as you have probably seen in the demo, these parking meter things -- where you have really compact things. So this is the type of street furniture you would like to recognize, and I will talk very briefly about it and show you what we can do with one of these approaches we have recently developed, the online boosting tool.
The second half of the talk will be devoted to recognition by shape. This is very recent work, basically done by Michael Donoser in his PhD thesis, where we have developed a new efficient shape matching method called IS Shape -- IS Match, sorry. And I will show you how this works and explain the algorithm, and then I will show you some recent results we have on window detection, which is one of the things you really want to do on street side images.
Okay. Before doing that I would like to report to you an additional success story we had recently
because one of the goals of this Microsoft grant is also to foster additional research. And we
have been very successful in that. We have recently been granted a project from the Austrian Ministry of Science called CityFIT, which stands for High Quality Urban 3D Reconstruction by Fitting Shape Grammars to Images and Derived Texture from 3D Point (inaudible). So, long title. So the story is you really want to generate building grammars from the data you acquire. And this is a three-year project with Microsoft in Graz, the Computer Graphics Institute in Graz and our institute in Graz. And the total project money we get for that is more than 600,000 euros, and with the current euro, since the dollar is so weak, this is more than a million dollars. So this is quite some money.
So basically if you calculate what you invested in this research grant, it basically gives you a factor of 25 -- so for $1 you put into the research grant you now get $25 back. So this is worth the money, I guess.
So the challenge we have here is we really want to do building reconstruction, building grammars, also for these old-style buildings where you have really a lot of the same structures, but very detailed facades, and you would automatically derive grammars that reconstruct the buildings for you. So this is a real challenge and will keep us busy for at least three years, probably a little longer. But this is the goal we have, and this is some of the data we recently acquired in Graz.
And regarding this project, there was a competitive call. I think there were about 30 submissions to this call and only seven of them were granted. And actually we also won the first prize. So this is the certificate; it says first prize in German. So quite a successful project.
Okay. So that was somewhat of an advertisement; now to the real stuff. Let's talk about recognition of compact objects. By a compact object, I mean an object which I can recognize by appearance, where it has a compact shape, like a rectangular type shape, not sort of spread out in space. And typical examples would be cars, parking meters, signs, lampposts, something like that.
And so a common approach to doing that in the machine vision literature is you take some features and then you use a classifier, and typical examples we have also seen in the talk before: you would use for example SVM, support vector machines, to do the classification. Or another approach would be boosting type methods.
The problem with that is if you follow that approach you need a lot of training data, and we know from the literature that when you have enough training data these things work nicely. So if someone is sitting there and labeling a hundred thousand cars and a hundred thousand non-cars for you, perfect. But of course we have a lot of street furniture which we'd like to recognize. So maybe you will find someone doing the cars, but you don't find someone doing all the parking meters and you don't find someone doing all the lamps and all these things. So one of the goals we have here is reducing the training effort. And for doing that we have developed over the years a method called online boosting. So maybe you are familiar with these Viola-Jones type boosting things, which have been very popular in the literature.
So what we can do on top of that is we can incrementally and online train such a classifier as new data arrives. And we use derived features, basically things that can be calculated very fast. Typical examples would be Haar-like features, integral histograms, local binary patterns -- basically everything you can calculate fast with an integral data structure. You can do this in realtime. So this is one of the advantages. And basically what we are doing here, and what we have shown in the demo and in the poster, is we are putting this booster, since we can now run it fast, in an active learning framework. So we have images. We have the online booster. We run the booster, get some detections, and then we have the teacher, in this case a human, that says, well, this is correct, this is not correct. And this provides the labels. And so we then finally update the classifier.
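To show the shape of that loop, here is a schematic Python sketch. A linear online learner from scikit-learn (SGDClassifier with partial_fit) stands in for the online boosting classifier, and the function names extract_features and ask_teacher are placeholders for the feature computation and the human's clicks; none of this is the actual system.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def active_learning_loop(patches, extract_features, ask_teacher, rounds=50):
    """Schematic active-learning loop: detect, ask the teacher, update online.

    patches:          candidate image patches to scan.
    extract_features: patch -> 1D numpy feature vector (stand-in for fast features).
    ask_teacher:      patch -> 1 (object) or 0 (background), i.e. the human's click.
    """
    clf = SGDClassifier(loss="hinge")          # updatable online, standing in for the booster
    classes = np.array([0, 1])
    X = np.vstack([extract_features(p) for p in patches[:2]])
    y = np.array([ask_teacher(p) for p in patches[:2]])   # bootstrap with two labeled patches
    clf.partial_fit(X, y, classes=classes)

    feats = np.vstack([extract_features(p) for p in patches])
    for _ in range(rounds):
        scores = clf.decision_function(feats)
        idx = int(np.argmax(scores))           # show the current strongest detection
        label = ask_teacher(patches[idx])      # teacher: "this is correct" / "this is wrong"
        clf.partial_fit(feats[idx:idx + 1], np.array([label]))   # online update
    return clf
```

The key property the talk emphasizes is visible here: only the patches the current detector is confidently (and perhaps wrongly) firing on ever get shown to the human, so only the necessary data gets labeled.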
The big advantage of doing it that way is you only label the necessary data. You don't label
unnecessary data. So this really reduces the labeling effort quite a lot. Here are some examples
and you have probably seen it in the demo. So here are parking meter images. So if you look closely -- there is a car park here -- you have this type of parking meter we are interested in, like here. And of course if you look in every city, they look different. So you really have a demand here: you cannot build one universal parking meter detector. Every city has to have its own, so there has to be some training effort.
So if you click once -- this is basically like a parking meter, this is the positive response you get -- then you say, well, this is wrong, this is wrong. You click a few clicks more and then you can see you detect it, but you still get some false positives, and as you go along the number of false positives reduces until you finally end up with the perfect detector. And then basically if you do this on this type of images it means like 50 clicks and you already have a fairly good detector, which is much more efficient than doing it offline. This is one of the examples.
We have also been using that on aerial images, from the Vexcel camera, for car detection. So basically here you see cars, and one of the goals is to detect the cars, either for removal or for counting. And so training with one example will give you a lot of false detections; with 10 samples, 58 samples, you eventually get a nearly perfect car detector, and yeah, doing some post-processing like (inaudible) gives you a perfect car detector. So this is a very universal tool which you can now use for different types of things, and I hope you are happy with that; you will see it is something you can do with --
>> Question: (Inaudible).
>> Horst Bischof: Sure. Sure. Sure. So this is one of the tools developed for this type of
compact object.
Let me now come to the second part, which is about shape matching. And shape is another very useful tool. Not all of these objects can be characterized by appearance, by texture. Looking at this type of images, we as humans are very good. We see this is a face. This is a horse. This is a plane. And the only feature we have there, despite all of the occlusions and these things, is shape. And so what we are interested in is a partial shape matcher, a matcher that works when we have partial segmentations, occlusions and that kind of thing. And since we want to apply this on lots of data, meaning street side data, it has to be efficient. So this is the goal we set up.
So what I will present to you is IS Match, which can match these types of shapes, even if they are occluded, and give you good shape matching. So IS Match basically has two types of characteristics. It uses sample points. So the shape representation is: just take points and sample them on a regular basis along the shape. So you don't have to detect high curvature points or interest points along the shapes; you just do regular sampling. It exploits the neighborhood information, so a point plus its neighbors, so it is related to the shape context things. But it also allows for occlusion, which shape context, which basically builds a histogram representation, cannot do. And it is really efficient.
So the goal is to provide competitive matching results, compared to the best shape matchers that are around, but at reduced computational cost. How we do that I will show you. So basically IS Match consists of five steps. The first thing is we have to select a shape representation; as I said, we just sample points regularly along the contour. So you have the shape here and just pick points and say, I want to have 30 points along this contour to sample. So this is straightforward. The second thing is the shape description, and I will show you how we do that, then how we do the matching, and the strength of the detector is really in these two parts: the shape description together with the shape matching gives you really the power of that. Once you have that, the next step is you have to calculate the similarity measure, how similar your shape is to your target shape, and then finally you could also do shape completion: once you have the matches, you transform the target shape onto the shape and complete the shape. This is straightforward, so I won't talk about it.
So I will spend time on these three points. So what does the shape descriptor look like? Remember we have an ordered sequence of sample points along the shape, P1 to PM. So what we do is take a point on the shape. Then we take the other points along the shape, so this is Pi, Pj. And we take a point that is delta points back along the shape from Pj. And we take that angle as a representation. And so basically you start here with the first point and get this triangle; for the first point, of course, the angle is zero, and then you move along and you get this angle, which you enter here. Then you take the second point, the third point, and so you get one row here. The same thing you do with the second point and you get another row of angles. So this is the representation, and another possibility, instead of taking the angle, is you could also take the cross ratio of the lengths of the chords. The only parameter you have here is this delta, meaning how far you go back along the shape. And this is basically like a (inaudible) parameter. If this delta is 1, so you go just one point away, you finally get all the fine details. If you take delta larger, you of course sample the (inaudible).
So you end up with the matrix A with all these angles. Now this is an over-complete representation of the shape, because basically one row would be sufficient to reconstruct the shape, but you have many of these. And it is very important that you have this over-completeness, because in other shape representations what they would do is build a histogram on top of these things, but of course then you can no longer handle occlusions, because you are mixing up the things. So the important thing is that you really have all of that here.
So you have the matrix, and of course depending on where you start, this matrix will look different. So if you start at the first point versus five points away, this will be a different matrix. But the important thing is that the matrix is just a shift of the matrix you would get otherwise.
Since you are taking angles or cross ratios, this is invariant to translation, rotation and scale.
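Here is one possible Python sketch of such a descriptor matrix, following the description in the talk rather than the published IS-Match code. The entry A[i, j] is taken as the angle at Pj between the chords to Pi and to the point delta samples back; the choice of delta and the exact angle convention are assumptions for illustration.

```python
import numpy as np

def shape_descriptor(points, delta=5):
    """Angle matrix A for a closed contour sampled at M ordered points.

    points: (M, 2) array of regularly sampled contour points, in order.
    A[i, j] is the angle at P_j in the triangle (P_i, P_j, P_{j-delta}).
    """
    M = len(points)
    A = np.zeros((M, M))
    for i in range(M):
        for j in range(M):
            a = points[i] - points[j]                  # chord P_j -> P_i
            b = points[(j - delta) % M] - points[j]    # chord P_j -> P_{j-delta}
            na, nb = np.linalg.norm(a), np.linalg.norm(b)
            if na < 1e-9 or nb < 1e-9:                 # degenerate case (e.g. i == j)
                continue
            cosang = np.clip(np.dot(a, b) / (na * nb), -1.0, 1.0)
            A[i, j] = np.arccos(cosang)
    return A
```

Because only angles (or chord-length ratios) are stored, the matrix does not change under translation, rotation or scaling of the contour, and starting the sampling at a different point only cyclically shifts its rows and columns.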
So the next step is: we now have a shape representation as this matrix, so how do we match two shapes together? Since we have these ordered points, it's basically just an order-preserving alignment of the shapes, so we need to solve this order-preserving assignment problem. Formulating that mathematically, here is this angle matrix, and here is this cross-ratio matrix. So you can use one or use both of them; this is not important. What you are doing is you try to minimize this measure, which basically means you are trying to find the submatrices that give you the minimal distance between these two shape representations.
Doing that basically means you have to iterate over three different things. You have the sequence assignment. You have the starting point. And you have the chain length. So if you do that in a trivial way, you would have to program a loop over chain lengths, over starting points, and over sequence assignments. So this is a complex process and would take forever.
But now we can make use of one thing: we can basically use these integral data structures, because what you are doing here is just putting sums over submatrices. So looking at this matrix we have here, this A_sub means you are calculating sums over some submatrices. So if I now put this thing into an integral data structure, I can calculate the sums much more efficiently, just by lookup.
So if for each starting point we calculate such an integral structure, we can very quickly calculate these differences, just by lookups into the shape matrices. So this allows a very efficient descriptor comparison. So we are getting rid of two of these inner loops; we have only one loop left there, just over the starting point. So we can calculate the shape matching measure very efficiently. Yeah. That is basically the whole trick around that. So you can calculate it in a very efficient way.
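The integral-data-structure trick itself is the standard summed-area table. A minimal sketch, with the descriptor-difference matrix as the assumed input:

```python
import numpy as np

def integral_table(D):
    """Summed-area table of a matrix D (e.g. entrywise |A_query - A_target|)."""
    # Pad with a zero row/column so block sums need no boundary special-casing.
    I = np.zeros((D.shape[0] + 1, D.shape[1] + 1))
    I[1:, 1:] = D.cumsum(axis=0).cumsum(axis=1)
    return I

def block_sum(I, r0, c0, r1, c1):
    """Sum of D[r0:r1, c0:c1] in O(1), by four lookups into the integral table."""
    return I[r1, c1] - I[r0, c1] - I[r1, c0] + I[r0, c0]
```

Once the table is built, the cost of scoring any candidate subchain of the contour no longer depends on the chain length, which is where the speedup over the naive triple loop comes from.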
Here you see some of the results. So this would be the target shape and here is the query shape, and you see here these are basically the subshapes that match along this contour.
Last point is shape similarity. You need somehow to assess how similar two shapes are, and what we did here is we just borrowed the measure that (inaudible) used in the shape context work. So he has like four different types of measures: the descriptor difference, which is what we calculated before; then, once you have this alignment, you can basically put the shapes on top of each other, so you get this (inaudible) distance; you can account for nonlinear alignment by calculating (inaudible) energy; and you also penalize for short matching sequences. So this is basically what Belongie used, so we just borrowed that measure from him.
Okay, the question is now: how well does this perform? So there are a few standard databases of shapes and we performed some comparisons on them. So there is one, (inaudible) 25, which consists of 25 shapes in six different classes, and here you see some of the algorithms that have been proposed in the literature. Basically this metric means: for the first retrieved shape, how often did you retrieve it correctly from these 25? Who got 25 correct? And so on as you go along. This is a very simple database; you basically get perfect results on it.
Here, (inaudible) 99 is a more complex database, and currently in the literature the Felzenszwalb(phonetic) type of algorithm is considered to give one of the best shape matching results. And you see we're basically close: the first four or five retrievals are the same as Felzenszwalb(phonetic) gets, and later we get a few mix-ups, but basically we perform about as well.
The important -- so then there is another database, the MPEG-7 database, also a standard shape database. And here again we compared our method, so we are a little bit worse than Felzenszwalb(phonetic), but there are these two shape classes where our method is extremely bad. You always get these wrong, because they have just this inner structure, which we don't account for; so if we remove those we are basically close to that result.
So basically the results are more or less the same, but the important thing is when you compare the timing. So we see this Felzenszwalb(phonetic) method, which is basically performing best, needs like half a second on this; we need just 25 milliseconds using 30 points. So this is really a speedup you get here, and you get basically the same performance.
Once you have that, you can now play with a lot of different things. So now we have an efficient shape matcher, so we can start doing object detection based on shape. So basically how do you do that? Well, you start with the shape you would like to find, say a bottle. Then you extract boundary chains. You connect the boundary chains a little bit together so that you get longer chains, and then you basically use the shape matcher, and then you use these matched shapes for voting, like a Hough transform type of voting, in order to do the object detection. So we ran this on these. We have the image, you get edge detection, you do this boundary chaining, so you see a lot of clutter, and this is basically what the shape matcher then delivers you as shape matching results.
So we have been running this on this ETH database for object detection. They have this nice apple logo in there. So you see basically what the detector delivers you: you detect the bottles, (inaudible) and these things. So it performs -- we haven't done a really (inaudible) evaluation on this type of database, but it gives you quite good detection results. Especially the apple logos are so nice because they are so distinct compared to the others, and you see you are robust to size, translation, also affine transforms, a little bit of linear transform. It gives you quite nice results.
We have also applied this to window detection, because a window can be described by its shape as a rectangle, a very primitive shape. And we have applied this to the recently acquired Graz database. The thing we have done here for preprocessing is we have used MSER, Maximally Stable Extremal Regions, for segmenting the image into parts, and then taken the contours of these Maximally Stable Extremal Regions and plugged them into the shape matcher, and the matching shape was just a rectangle. And these are typical results you get. So you see you detect most of the windows. There are some false detections, some you miss on the border, but overall these are quite nice results.
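As a rough sketch of that preprocessing pipeline in Python: MSER regions are extracted and their contours are tested against a rectangle. Here a crude rectangularity ratio stands in for the actual IS-Match rectangle matching, and the threshold is an illustrative assumption, so this is only meant to show the shape of the pipeline.

```python
import cv2
import numpy as np

def detect_window_candidates(facade_bgr, min_rectangularity=0.8):
    """Sketch: MSER segmentation + a crude 'is this roughly a rectangle?' test."""
    gray = cv2.cvtColor(facade_bgr, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)      # stable regions of the facade image

    windows = []
    for pts in regions:
        hull = cv2.convexHull(pts.reshape(-1, 1, 2))
        area = cv2.contourArea(hull)
        x, y, w, h = cv2.boundingRect(hull)
        if w * h == 0:
            continue
        # A region that nearly fills its axis-aligned bounding box is rectangle-like,
        # which is what the rectangle shape model in the talk is after.
        if area / float(w * h) >= min_rectangularity:
            windows.append((x, y, w, h))
    return windows
```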
So here are some videos. Okay. Now this doesn't play here so I have to show you outside. So running along the facade, here you see windows detected on this video, and you see you miss some of the windows, but you know, if you miss one or the other it really doesn't matter, because you can simply complete it by reasoning: if you miss one, you can complete it by knowing that left and right of it are windows, so it simply doesn't matter. So this would be one example. Here is another example.
So you basically see, you get really nice window detection results from this approach, and of course you can get them very fast. So here is another video. And you see a few of them are missed, but overall, the things look quite okay. So we are quite confident that we can get fairly good detection results.
Okay. So let me come to a conclusion. So what I have shown you is we have a lot of recognition machinery available which we can use on street side data to recognize what is of interest to you, and we have heard during the Summit that there are quite a lot of things to recognize there. Of course these things need to be combined in a workflow. There is also 3D data available that can be used. Of course for this window detection you don't need to work on distorted facades, because it is very simple to align them. You can use the 3D data, you can use more information, which would just increase the recognition results, and a lot of these things we will do in this new CityFIT project. So thank you very much for your attention.
(applause)
>> Moderator: Questions? Okay.
>> Question: So where does it fail? It seems like, looking at the apple logo, you managed to match under a very strong perspective projection there.
>> Horst Bischof: Well, it's not really perspective (inaudible). So if it's a little bit tilted it's okay. There is another thing which we still have to work on. At the moment we are just taking the largest shape match. But of course if you have, like, in the middle some occlusions, what you would like to have is, like, two matches, on the top and the left, so we need to do some reasoning on top in order to really find the complete shape, because at the moment we are just getting the longest segment that matches.
But we have some ideas how to do that.
>> Okay. Thank you.
(applause)