>> Moderator: Okay. Good afternoon. My name is (indiscernible) and it's my pleasure to invite you to the afternoon session. The first presenter today is Craig Knoblock from the University of Southern California. The subject is integrating online maps with aerial imagery. >> Craig Knoblock: Okay. Great. Thank you. All right. So I'm going to talk about finding and integrating maps with aerial imagery. And I want to acknowledge my colleagues on this. This is joint work with Cyrus Shahabi, who is sitting back there somewhere, Yao-Yi Chiang, Aman Goel, Matthew Michelson, all from USC, and then Jason Chen, who's at Geosemble Technologies, which is a spin-off company from USC. And this is research sponsored by a combination of the National Science Foundation, the Air Force Office of Scientific Research and Microsoft Research. Okay. So here is the problem we're looking at. The first part of the problem is there's lots of information out there. And so one of the things we wanted to do is say, okay, we'd like to go out and find maps that we can actually put on top of imagery, and find them automatically. And we can go out to an image search engine -- MSN here has an image search engine -- and I entered "Redmond maps." And I get back, as you can see, some things that are maps. Let me try this pointer. So this is a map and this is a map, but that's not really a map. And so you can see there's some combination of maps and things that are not actually maps in these results. And so one of the problems we looked at was automatically classifying things into maps. That's the first part of the problem. And the reason we wanted to do that was for the second part of the problem, which was then to take these maps and automatically align them with the imagery. So here's an image that was automatically processed where we didn't have to tell the system anything about the map itself except its general location. So you could get it out of a map search engine, and then the problem was, can we actually determine the exact coordinates of the map, find the control points and then superimpose it on top of Virtual Earth, in this case. So that's the problem that I'm going to talk about today and how we actually solve it. And there are really two parts to my talk. First I'm going to review previous work that we've done in automatically aligning maps with imagery, and I'm going to do this pretty quickly -- this is stuff I talked a little bit about at the last Virtual Earth Summit, so I don't want to spend too much time on it. And then I'm going to talk about the newer stuff we're doing on automatically identifying maps. So let me dive in and start that. Okay. So here's the problem, as I said. On the top we have some image of the world and on the bottom we have some map that we found from somewhere -- maybe we got it out of one of the image search engines. And a lot of these maps are in raster format, so we don't actually know the metadata for the map. If you know the metadata, great, you don't need to go through this process. But a lot of times you might go out and find maps about all kinds of different things. They might be layouts or locations of oil and gas wells, or they could be real estate type maps. Lots of interesting maps are available, but they're in raster format. And so you don't have any information about the coordinates of the map, or even details like the scale and locations and things like that. 
So we want to put these things together so that you get the combination and you can get the information, and many times one of the most useful layers in a map is simply the text layer, which has all the labels on the roads and buildings and so on. Okay. So here's another map. We got this map from the Washington, D.C. transportation website, and it's a PDF. There's no metadata for it. You just go there and you can download the map. And it has the location of all the bus lines, and you say, well, I'd like to superimpose that on top of the imagery so I could see the location of all those things. But today you'd have to do that alignment manually. Okay. So what we do is we developed an approach where we essentially use the vector layer, which is essentially the road layer, as the glue to align the map with the imagery. Okay. And so the way this works is we start with a map and an image and we'd like to bring these two things together. And so we also have some kind of road network layer here, and the first step is we have to take this road network and align it with the imagery. And one of the problems that we find is that a lot of these kinds of layers are not aligned, and we need fairly accurate control points for this process. So the first step is to automatically align the road layer with the image so that we know where all the intersection points are on the image. The second step is then to take the map and find all the intersection points on the map. And we want to do this automatically. We don't want to have to go through a manual process where the user finds each of these control points, because it's just too time consuming to do this over large numbers of maps. Once we have that, then the next step is to do a process we call point pattern matching, where we're going to determine the relationship of the points here to the points over here. And in some sense what's going on here is we're using the layout of the intersections as the key, or the sort of fingerprint, of the map to actually figure out the relationship to the image. So we know the general area -- this is a map of some city, let's say -- but we don't know the exact location. So we're using this layout of these intersections, and intersections over a large enough area are going to tend to form a unique pattern. So we're exploiting that property here. So we have a point pattern matching algorithm that comes up with the matching between these, so then we can actually superimpose the map on top of the imagery, or vice versa, and do a process called conflation, where you actually stretch the map so it fits over the image. Okay. So here's the first part of the process. The first problem, which I'm going to just very quickly review, is work we published in 2006. But it essentially goes through this process of taking the road network and aligning it with the imagery. And the basic problem here is that, as I said, if we just take the road network and superimpose it on the imagery, it's actually not aligned. If we go through some automatic conflation process, then the intersection points are actually on intersections, which is important for the next step of the process. And just to give you a sense -- I'm not going to go through the details, but just to give you a sense of how the process works -- we go through a general process here where we first identify a set of control points. 
And I'll describe how that works next. But that's really the heart of the process: we find a set of control point pairs, where we're using the intersections on the vector data and finding the corresponding intersections here on the image. We do some filtering so that we get rid of any noise that we might have introduced, and then go through that final triangulation and rubber sheeting so that we end up with the road network superimposed, or correctly aligned, with the imagery. So the key, as I said, is this control point detection, and that process essentially starts with the road network. And we're really exploiting the fact that we don't have to just do road extraction from the imagery, which we all know is hard. Instead, we're exploiting the fact that we know something about the approximate locations of the road network -- so it is close, it's just not aligned perfectly -- and we also know something about the general shapes of these intersections. And so, you know, the intersection is here, but the real intersection in the imagery is here, and what we're doing is looking for the intersection within some radius of that original intersection. And so the first part of this process is we take the image and do simple machine learning to basically classify pixels as either on-road pixels or off-road pixels, and that's what's shown here on the right, where the white is the road pixels and the black is the off-road pixels. And you can see things like trees create some noise in there, and roofs sometimes look like roads -- the usual kinds of problems we have there. Then what we do is we take the vector data of this road network and build some kind of template here that we can then align with the image. And now we can combine these two, essentially searching for the best match within some radius of that original position. So you can see it goes through a search process and decides, okay, this is the actual best match of that intersection point onto the actual intersection, so that we get a nice alignment between the two different layers, and that will then tell us where the intersections are on the image. All right. So here's a sample of the end result. The red is the original vector data here -- I think this comes from the Missouri Department of Transportation -- and the blue lines are the roads after alignment with the imagery. You can see the alignment is quite good after we go through the process, and we know where the intersections are. Now that brings us to this point here. So we've got this image here, and all the little red dots shown there are the actual intersections on the roads. Now we go through the other part of this process, where we have to find the intersections on the map and align it with the image. And just very quickly I want to give you a sense of what is going on here. This in general is hard, and we don't want to have to train this for every map. So the idea is we go through a process where we first do some kind of automatic thresholding where we can pull out the background color, and we end up with just the foreground pixels, which contain the roads and text. Then we're going to remove noisy information, which in this case is the text, because it tends to interfere with finding the road intersections. 
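As a concrete aside on the image-side step described a moment ago (classify road versus off-road pixels, then search around each vector intersection within a small radius), here is a minimal sketch in Python. The names road_mask, template and search_radius are illustrative assumptions for this sketch, not code from the actual system:

```python
import numpy as np

def localize_intersection(road_mask, template, x0, y0, search_radius=20):
    """Refine one control point: slide a small binary template (built from the
    road directions at this vector intersection) within `search_radius` pixels
    of the original position (x0, y0) and keep the offset that overlaps the
    most road pixels in the road/off-road classification `road_mask`."""
    th, tw = template.shape
    best_score, best_xy = -1, (x0, y0)
    for dy in range(-search_radius, search_radius + 1):
        for dx in range(-search_radius, search_radius + 1):
            y, x = y0 + dy, x0 + dx
            top, left = y - th // 2, x - tw // 2
            if top < 0 or left < 0:
                continue  # template would fall off the image
            patch = road_mask[top:top + th, left:left + tw]
            if patch.shape != template.shape:
                continue
            score = int(np.sum(patch & template))  # agreeing road pixels
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy  # refined control point on the image
```

In the actual system the template encodes the directions of the roads meeting at that intersection; here a plain overlap count stands in for that scoring.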
We go through a set of transformations where, once we remove the text, we end up breaking the lines, so we go through some morphological operations to clean that up and reconnect the roads. We find the corner points, which are the things that are likely to be intersections, and then we finally have some way of actually testing which things are really intersections by checking how many roads are actually coming together there. And so what we end up with is a set of detected intersections, which are shown here on the map. And there's a whole paper that describes this work that I just don't have time to go into today, but I want to give you a sense that this is an automatic process: we feed the map in and find the intersections on the map. Now we know where the map intersections are, and the next step is to do this point pattern matching, where the problem is that I've got all these intersections here and all these intersections over here, and I need to figure out the relationship between the two. And this can be a fairly search intensive process, but we exploit some properties. For example, we assume the orientation of the map, because most maps are oriented north, and if they're not, they usually indicate what north is. So we assume we know the orientation, but we need to worry about the translation and scale of the maps. If we know the scale, we can give it to the system, but if we don't know it, we can still handle that. So what happens now is, as I said, you take the points from the map and look for the corresponding matching to the points over here, and find that in fact the best fit for this set of points is going to be right there. And that's going to tell us not only where the map goes, but it's going to give us the scale and the translation, and, very importantly, it gives us the control point pairs. Because if we figured out what the mapping was, that means we know the mapping from each intersection point on the map to each of the corresponding intersection points over here on the image, and we can use those for the final process, which is called conflation, where we're basically doing a triangulation and rubber sheeting process: we're essentially taking each of these triangles and stretching them to fit on top of the image. So in some cases this means we're going to have to drop some pixels, and in some cases we're going to have to fill in a few pixels. But the end result might look something like this, where we had that original map and now we've stretched it to fit on top of the image. This map is slightly distorted, partly because there was probably some cartographer involved who moved things around a little bit to make things line up. But now you can see the relationship of the bus lines directly on top of the imagery for Washington, D.C. Okay. So that's my very brief review of the general technique for automatically aligning the maps with the imagery. Now I want to get into the new stuff we've been working on, which is how we can automatically identify these maps. All right. So what we want to do is basically be able to go out and harvest these images, or maps, from the web. But we really start out with some set of images. So we might use something like the MSN image search, which will give us a set of images, and we can specify the area. 
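Before moving on, here is a hedged sketch of one simple way to do the point pattern matching step just described, under the assumptions mentioned in the talk (known orientation, unknown scale and translation). It is a RANSAC-style simplification for illustration, not the actual algorithm, and all names are illustrative:

```python
import numpy as np

def match_point_patterns(map_pts, img_pts, tol=5.0, trials=2000, rng=None):
    """Hypothesize a scale + translation from one pair of map intersections and
    one pair of image intersections, then count how many map intersections land
    on an image intersection.  Assumes both point sets share the same
    orientation (maps taken to be north-up), as in the talk."""
    rng = rng or np.random.default_rng(0)
    map_pts, img_pts = np.asarray(map_pts, float), np.asarray(img_pts, float)
    best = (0, None)
    for _ in range(trials):
        i, j = rng.choice(len(map_pts), 2, replace=False)   # hypothesized i->k, j->l
        k, l = rng.choice(len(img_pts), 2, replace=False)
        d_map = np.linalg.norm(map_pts[j] - map_pts[i])
        d_img = np.linalg.norm(img_pts[l] - img_pts[k])
        if d_map < 1e-6:
            continue
        s = d_img / d_map                      # hypothesized scale
        t = img_pts[k] - s * map_pts[i]        # hypothesized translation
        projected = s * map_pts + t
        # count map intersections that land near some image intersection
        d = np.linalg.norm(projected[:, None, :] - img_pts[None, :, :], axis=2)
        inliers = int((d.min(axis=1) < tol).sum())
        if inliers > best[0]:
            best = (inliers, (s, t))
    return best  # (#matched intersections, (scale, translation))
```

The hypothesis with the most matched intersections gives both the map's placement and the control point pairs that feed the conflation step.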
So one useful technique we use is we often say, okay, here is the name of the city, and we give the keyword "maps," and maybe about half the things that come back are actually maps. And the problem then is we want to classify these things and generate a database of the maps, and then we can go on and do some additional processing -- and the kind of processing is what I just described, where we go through those steps of actually trying to pull out the intersection points and then doing the alignment. That's an expensive process, so we don't want to do it on a bunch of random documents, which is the reason we do classification first. Now, what we assume here is we have a couple of repositories. We have a repository of maps that we've seen before -- so we've done a bunch of classification of things that we say are maps -- and we have another repository of non-map images, things that are not maps. And when I say maps, I'm talking about street maps. So they have to have a road network on them if we're going to classify them as a map, because there are many maps that are just generic kinds of maps. Okay. So the basic approach that we take to this problem is using what's called content-based image retrieval. And the idea is that we're using the similarity of the image that you want to classify to other images that are in the repositories. And the nice thing about this approach is it's fast -- it's based on information retrieval techniques -- and it's quite scalable, and I'll talk a bit more about that in a minute. But the way this works is we use the image we're trying to classify to query the repository of maps and non-maps, and it returns some set of images, and in fact the set of images returned essentially votes on whether or not it thinks this thing is a map or not a map. So this is based on a sort of k-nearest-neighbor type search. And we compare this -- and I'll show you in the results -- an earlier version of the system we developed used a machine learning technique called support vector machines, and we compared this approach against the SVM and found that it actually works better. So here's the way content-based image retrieval works. You have some image set and you have some query image, and then you use a similarity function -- I'll talk about the similarity function we use for maps in a minute. And using this image and the similarity function, it goes out and finds those images that are closest to your image, ranks them based on the similarity function, and returns the closest ones. Now, in terms of the features, we need some similarity function that's actually going to work well on maps, and we've experimented with a variety of them. And the one we've found that works the best is using what are called water filling features. And you can think of these as sort of similar to water going through a set of pipes, where you have forks in the pipes and you have some sort of flow for the amount of water -- that's why it is called water filling. And this works well on images that have a very strong edge structure. 
And one of the properties of maps is that, if you think about a map, you can often recognize at least a street map from a distance by the fact that it has this pattern of lines that are very clear. And one of the advantages of using the edge structure instead of things like the color of the map is that we get color invariance. We don't want to have to have seen this map before in order to recognize it as a map, and we found that you basically have maps of every possible color, so we wanted to abstract away from the color itself. So, okay, the first step in using this sort of water filling algorithm is we're going to run it on what are called edge maps, and so we use a standard vision algorithm called the Canny edge detector. And this just shows an example. So here's a map on the left, and on the right is the result of running the Canny edge detector on it. And you can see the edge structure in this map is very clear, right -- it's basically a whole bunch of lines that come together in various places with a little bit of text superimposed in different places. So that's the first step, and then the next step is to compute these water filling features. So let me just explain the water filling by example here. Here is a set of snippets of what could be a map. A straight line like this might have a filling time, which is based on the number of pixels -- so if this is 50 pixels in length then this would have a filling time of 50 -- and then the fork count for this example would be one, simply because in this example there's no branching. If you go to the next example in the middle here, you can see that this has an increased filling time because there are more pixels -- you have twice as many pixels, so it's going to give you a filling time of 100 -- and then you have branches. Each branch is going to increase your fork count by one, so now we have a fork count of three. In the third example you can see it's a little smaller, so maybe this has a filling time of 70, but a higher fork count, because you can see that there's more branching in this particular image -- in fact it is nine. So those are what the features are computing, and what we actually do is compute a set of features based on these water filling measures. We compute four things. One is the maximum filling time and its corresponding fork count. So the image is broken up into subareas, and for each of the subareas you're going to find whichever one has the maximum filling time, and then for that one you're going to compute its fork count. And this is a measure of the longest edge and the complexity of that edge. The second one is the maximum fork count and its filling time. So instead of the maximum filling time and its fork count, this is going to give us a measure of the highest complexity of some part of the map and then its corresponding length. And then we also create a filling time histogram and a fork count histogram for the whole image, and the areas are broken up into two. Okay. Those are the basic features that get used. Then what happens in the actual content-based image retrieval process is we start with some image we want to classify. And then what we do is we use this CBIR. We have these two repositories, the map repository and the non-map repository, and then we use the content-based image retrieval to find the nine most similar images. 
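As a rough illustration of the filling time and fork count idea just described, here is a hedged sketch that floods a connected set of edge pixels (for example from a Canny edge map) breadth-first: the number of waves approximates the filling time, and each split of the advancing front bumps the fork count. This is an illustration of the idea only, not the feature extraction from the actual system:

```python
import numpy as np

def water_fill(edge_map, seed):
    """Flood a connected set of edge pixels starting from `seed` (a (y, x)
    edge pixel).  Returns (filling_time, fork_count): the number of BFS waves
    needed to cover the component, and how often the advancing front branched."""
    h, w = edge_map.shape
    seen = {seed}
    frontier = [seed]
    filling_time, fork_count = 0, 0
    while frontier:
        next_frontier = []
        for (y, x) in frontier:
            nbrs = [(y + dy, x + dx)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0)
                    and 0 <= y + dy < h and 0 <= x + dx < w
                    and edge_map[y + dy, x + dx] and (y + dy, x + dx) not in seen]
            if len(nbrs) > 1:
                fork_count += len(nbrs) - 1   # the front splits here
            for n in nbrs:
                seen.add(n)
                next_frontier.append(n)
        frontier = next_frontier
        filling_time += 1                      # one more "wave" of water
    return filling_time, fork_count
```

The four features described above (maximum filling time with its fork count, maximum fork count with its filling time, and the two histograms) would then be aggregated from such per-component values.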
I could only fit five here, but imagine there were nine. In this case we return these first three, which have been previously classified as maps -- they came out of the map repository, we had labels on them, and they were identified as close. Then we also return these other two things that for some reason it decided were close to this, and these are non-maps. And essentially what happens is we just vote based on what came back. So these three are voting that it's a map, these two are voting that it's not a map, and we take the majority, so in this case we determine this thing to be a map. Okay. So how does this compare to doing machine learning? In machine learning we do something similar in that we need to train the system -- we need to give it some training data. And one of the things that comes up in machine learning is, okay, how many classes do you have? We can do it as a binary classification: it's either a map or a non-map. But one of the problems we had with training the machine learning process to do this was that maps are all very -- I mean, you have a huge range of things that come up on maps and there is a huge variety of them. And using something like support vector machines, we would end up with either low precision or low recall, depending on how you did the classification, to get fairly high accuracy. So the machine learning is trying to use one model to distinguish the maps from the non-maps, and it was challenging to do that. The other thing you could do is try to break it up into separate models -- you say, well, there are different classes of maps because different maps look different -- but that requires a lot more work to do the labeling, trying to figure out which maps form which class, which is actually quite hard. But we did want to compare against this. And there's been some related work here. The SVM work done previously for solving this problem of classifying maps was done by another student of mine a few years ago. And there we didn't use water filling; we used a different feature called Laws' textures. These look at the texture of the map -- again we're trying to be color invariant, it is just a different feature -- so instead of looking at edge features we're really looking at texture type features in the map. Other related work here is the CBIR-based k-nearest-neighbor type work, and this was used previously for classifying images in the medical domain, and medical images have different kinds of properties, so we found that the features they were using didn't work that well in our domain. And the other piece of relevant work is that there has been a fair amount of work using water filling features for other kinds of content-based image retrieval. So some of those are features that we may want to look at in the future to improve how we're doing, but the edge features, as we'll see in a second, work quite well. Okay. So let me run briefly through the set of experiments we ran on this. We took quite a variety of maps -- we collected maps for the set of cities shown here -- and then we took the Caltech 101 database, which I saw someone else use in another talk. It's a very nice image database of things that are all not maps. You can see the total number of images in the repository: it's about 3,000, and then we have about 1,700 that are maps and about 3,000 that are non-maps. 
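A minimal sketch of the nearest-neighbor voting just described, with a plain Euclidean distance over precomputed feature vectors standing in for the water-filling similarity; the function and variable names here are illustrative, not from the system:

```python
import numpy as np

def classify_as_map(query_feat, repo_feats, repo_labels, k=9):
    """repo_feats: (N, D) feature vectors of previously labelled images,
    repo_labels: length-N boolean array (True = map).
    Retrieve the k most similar repository images and let them vote."""
    dists = np.linalg.norm(repo_feats - query_feat, axis=1)
    nearest = np.argsort(dists)[:k]
    votes_for_map = int(repo_labels[nearest].sum())
    return votes_for_map > k // 2   # majority vote, e.g. 5 of 9
```

With k = 9, a 5-to-4 vote is enough to call something a map, which is exactly the kind of borderline case discussed a bit later in the talk.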
I don't know why the numbers don't add up, actually, but the experiments we ran were to test two things. First we wanted to test the hypothesis that content-based image retrieval is better than support vector machines. To test this we want to compare them using the same features, so we ran the machine learning approach we had done before using the water filling features. We took 1,600 training images, randomly picked from the repository, and then 1,600 testing images, which are distinct from the training ones, and in both cases we took equal numbers of maps and non-maps. The second hypothesis we wanted to test was that the water filling features actually work better than the textures, and so we compared -- well, we actually ran both the content-based image retrieval and the SVM on both types of features. The Laws' textures worked terribly with content-based image retrieval, so we don't report those results here, but we also compared the support vector machines on both types of features. And these are the results -- just a summary of the results here. In the first line we have the CBIR with water filling, and you can see we got the highest precision and recall. We have an 82% F-measure, which is the combination of precision and recall. The support vector machine with water filling -- the same features we're using here -- has slightly higher precision, but the difference was not statistically significant; it's basically the same precision. But you can see the recall is much lower -- more than 20% lower -- which is really significant for finding maps, so we end up with a lower F-measure for those results. The second comparison was running SVM across the two types of features, and again the water filling helped SVM quite a bit. We were able to get better results using water filling features because the edges seem to work well for map-type images. Okay. So you can see, in summary, the CBIR with the water filling really worked the best for this. We are getting pretty good results, and I think we can do better -- I'll talk about the next steps in a moment. But here is just a quick graph summarizing what happens. On the X axis we increase the amount of training data. So what's happening along the X axis is that we're increasing the number of maps and non-maps we train the system on -- 200 here would correspond to 100 maps and 100 images that are not maps, and so on -- and we wanted to see the impact of additional training data. The green line on the top is showing the CBIR with water filling, the orange line in the middle is SVM with water filling, and the gray is SVM using Laws' textures. You can see they are all improving with additional training data, but they're only going to get so far. It hasn't quite gotten there, but it's not going to get to 100% using this. So the question is, all right, what kinds of things do we need to do to get to the next level of performance, to get up into the 90-percent range? And one of the things we discovered for a lot of the maps, or images, that were misclassified is that we had what we call culprit images. They were things in the database that would cause the system to misclassify a map. And usually a lot of these misclassifications were essentially off by one, meaning it was really close. 
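For reference, the F-measure quoted above is the usual harmonic mean of precision and recall; a one-line sketch:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, e.g. the 82% figure quoted above."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```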
The errors tended not to be things that were clearly a map or clearly a non-map -- those were easy. They tended to be in the middle: for the things that were misclassified, the vote would be 5 to 4 or 4 to 5. And what we discovered was there was a certain set of images in our database that, over and over again, could be held responsible for misclassifications. So here's an example. If you look on the left, you say, okay, we see a picture of a person there. But on the right you can see that there is this road network behind his head that caused the system to essentially say, okay, these look like roads to me -- they're connected up and have similar properties to maps. We found that a lot of these culprit images were essentially caused by features in the background that you wouldn't even necessarily normally pick out. So one of the things we're planning to do -- we have a system we've been building -- is to go back and be able to very quickly look at the images it is making mistakes on and reclassify the mistakes, but also, for the ones it's making mistakes on, look to see if there is some pattern in the images that is causing the misclassification and then remove those images from the repository. Okay. So to conclude: we have essentially developed this method for automatically harvesting the maps on the web that's fairly accurate -- I think there's a lot more we can do in terms of improving the accuracy there, as I said -- and fast and scalable. Scalable in the sense that it's really easy for us to add additional maps to it to improve the overall accuracy without hurting the performance of the system. For future work we're going to look at resolving these things that we call culprit images, and at exploring other features -- I mentioned other features people have explored in the past, so we will try to incorporate additional features to improve the classification -- and then finally plug it into our automatic geo-referencing framework, where we take the images and pass them through the whole process, pull out the intersections, do the alignment with the imagery, and our vision then is that you have a whole repository of maps for a given area that you automatically harvested. That's it. Let me take some questions. (Applause) >> Moderator: Thank you. Questions? >> Question: Hi. In your testing set, the (inaudible) images -- most of them were photographs, or some of them were also -- >> Craig Knoblock: No, there's a whole variety of things. In the Caltech image repository there are a lot of photographs, but there's a lot of other stuff in there, too, that are not photographs. >> Question: Okay. >> Craig Knoblock: There might be paintings and other kinds of things. That repository is available online from Caltech. >> Question: I presume a lot of the maps that you find are going to be geocentric. I mean, they're not sort of (inaudible) (inaudible) in the system? >> Craig Knoblock: What? Sorry. >> Question: Most of the maps you find online are from sort of generic digital mapping tools, with the assumptions of geo (inaudible) (inaudible) on a plane. >> Craig Knoblock: Yes. >> Question: So have you considered that pattern as a bias when you're trying to identify something that is a map? I mean, yeah, you will find things in strange projections, but most of the time -- >> Craig Knoblock: Yeah. 
In fact, what would happen with things like strange projections is that we may be able to identify that it's a map, and then, depending on how strange the projection is, we might end up having trouble -- >> Question: (Inaudible). >> Craig Knoblock: Right. Well, actually, as long as maps are drawn to scale and we can do the processing of the map, we'd be able to do the alignment. One thing that happens with things like projections is that, because we are not just looking for the corner points of the map -- we end up finding control points all over it -- in some sense we can actually correct for the different types of projections. So you'd end up stretching the map: if you could find the mapping, you would end up stretching the map and then it would align with the image that you are looking at. The real problem comes up in the class of maps out there that are not to scale, for example. There is a whole set of those, and one of the things we haven't dealt with yet is dealing with maps that are at really high levels of abstraction -- where you are only seeing sort of the very high level road network -- so those are the places where we're going to have trouble. Any more questions? >> Moderator: Anyone else? >> Craig Knoblock: Okay. Great. >> Moderator: Thanks. (applause) >> Moderator: The next speaker is Horst Bischof from Graz University of Technology, Austria, and the subject is semantic enrichment of street side data. >> Horst Bischof: So thank you for this introduction. So -- lights? Is there someone turning on the -- okay. I don't want to put you to sleep, huh? (Laughter) >> Is that okay? More? >> Horst Bischof: I think this is fine. Thank you. Okay. So thank you for inviting me here and giving me the opportunity to talk. I also have to thank the Microsoft technical support that saved my talk, because my laptop crashed and fortunately they were good enough to get the data off the laptop so that we can see a presentation here. So it's really great. Otherwise I would have had to give a presentation without slides, and this would have been less interesting, you know. I hope so. Okay. So what I'm talking about here is semantic enrichment of street side data. And this is work we have done recently. So you have seen here a lot of presentations, like Steve's slides -- Steve Stansel(phonetic) showed what he can do with his technology, generating massive data. So you go there, take photos, do the 3D reconstruction, like here, and you get various sorts of things. And I have to thank Wolfgang Walker for his presentation yesterday, because he paved my way. This is just data. These are triangles, points, pixels. You can do nothing with the data except look at it, navigate through it, browse through it. And Wolfgang had some nice examples yesterday where he showed, well, what you really want to do is you want to query this data, you want to ask questions. Where are the windows? Where are the parking meters? Where are the cars? Where are the trees? Take me close to a tree, or close to three trees -- which means we need some semantics there. And Wolfgang had this nice red line there and said, well, we need to cross this red line, and I hope to show you how we can do that with some of this technology. Of course I'm not saying we have solved the problem, but I would like to give you some hints at how we can do that. 
So as I said, data isn't enough. We have to recognize objects, and I will show you something on street side data we did recently. So the talk will be split into two parts. First I will talk briefly about recognizing compact objects like cars, pedestrians or, as you have probably seen in the demo, these parking meter things -- where you have really compact things. So this is the type of street furniture you would like to recognize, and I will talk very briefly about it and show you what we can do with one of the approaches we have recently developed, online boosting. The second half of the talk will be devoted to recognition by shape. This is very recent work, basically done by Michael Donoser in his PhD thesis, where we have developed a new efficient shape matching method called IS-Match. And I will show you how this works and explain the algorithm, and then I will show you some recent results we have on window detection, which is one of the things you really want to do on street side images. Okay. Before doing that, I would like to report to you an additional success story we had recently, because one of the goals of this Microsoft grant is also to foster additional research. And we have been very successful in that. We have recently been granted a project from the Austrian Ministry of Science called CityFit, which stands for High Quality Urban 3D Reconstruction by Fitting Shape Grammars to Images and Derived Texture from 3D Point Clouds. So, a long title. The story is you really want to generate building grammars from the data you acquire. And this is a three-year project with Microsoft in Graz, the Computer Graphics Institute in Graz, and our institute in Graz. And the total project money we get for that is more than 600,000 euros, and with the euro, since the dollar is so weak, this is more than a million dollars. So this is quite some money. So basically, if you calculate what was invested in this research grant, it gives you a factor of 25 -- so for $1 you put into the research grant you now get $25 back. So this is worth the money, I guess. So the challenge we have here is we really want to do building reconstruction, building grammars, also for these old-style buildings where you really have a lot of the same structures, just very detailed facades, and you would automatically derive grammars that construct the buildings. So this is a real challenge and will keep us busy for at least three years, probably a little longer. But this is the goal we have, and this is some of the data we recently acquired in Graz. And regarding this project, there was a competitive call -- I think there were about 30 submissions to this call and only seven of them were granted. And actually we also won the first prize. So this is the certificate -- it says first prize, in German. So quite a successful project. Okay. So much for the advertisement; now to the real stuff. Let's talk about recognition of compact objects. By a compact object I mean an object which I can recognize by appearance, where it has a compact shape -- like a rectangular type of shape, not sort of spread out in space. And typical examples would be cars, parking meters, signs, lampposts, something like that. 
And a common approach to doing that in the machine vision literature is you take some features and then you use a classifier -- typical examples we have also seen in the talk before -- where you would use, for example, SVMs, support vector machines, to classify, or another approach would be boosting type methods. The problem with that is, if you follow that route, you need a lot of training data, and we know from the literature that when you have enough training data these things work nicely. So if someone is sitting there and labeling a hundred thousand cars and a hundred thousand non-cars for you, perfect. But of course we have a lot of street furniture which we'd like to recognize. So maybe you will find someone doing the cars, but you don't find someone doing all the parking meters, and you don't find someone doing all the lamps and all these things. So one of the goals we have here is reducing the training effort. And for doing that we have developed over the years a method called online boosting. So maybe you are familiar with these Viola-type boosting methods which have been very popular in the literature. What we can do on top of that is we can incrementally and online train such a classifier as new data arrives. And we use features that can be calculated very fast -- typical examples would be Haar-like features, integral histograms, local binary patterns -- basically everything you can calculate fast with integral data structures. You can do this in realtime, so this is one of the advantages. And basically what we are doing here, and what we have shown in the demo and in the poster, is we are putting this boosting, since we can now do it fast, in an active learning framework. So we have images, we have the online booster, we run the booster, get some (inaudible), and then we have the teacher, in this case a human, that says, well, this is correct, this is not correct. And this provides the labels, and so we then finally update the classifier. The big advantage of doing it that way is you only label the necessary data -- you don't label unnecessary data -- so this really reduces the labeling effort quite a lot. Here are some examples, and you have probably seen this in the demo. So here are parking meter images. If you look closely, there is a car park, and here you have the type of parking meters we are interested in. And of course, just look in every city -- they look different. So you cannot build one universal parking meter detector; every city has to have its own, so there has to be some training effort. So if you click once -- this is basically a parking meter -- this is the positive response you get. Then you say, well, this is wrong, this is wrong. You click a few more clicks, and then you can see you detect it, but you still get some false positives, and as you go along the number of false positives reduces until you finally end up with the perfect detector. And basically, if you do this on this type of images, it means like 50 clicks and you already have a fairly good detector, which is much more efficient than doing it offline. This is one of the examples. We have also been using this on aerial images, from the Vexcel camera, for car detection. So here you see cars, and one of the goals is to detect the cars, either for removal or for counting. And training with one example will give you a lot of false detections. 
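A hedged sketch of the active learning loop just described: run the current detector, let a human confirm or reject its detections, and update the online boosted classifier with only those labels. The detector interface (detect, update) and the get_user_feedback callback are assumptions for illustration, not the authors' actual API:

```python
def active_learning_loop(images, detector, get_user_feedback, rounds=50):
    """Online-boosting labelling loop: the user only labels what the current
    detector shows them (roughly "50 clicks" in the parking meter example),
    instead of labelling a huge training set up front."""
    for _ in range(rounds):
        for image in images:
            for patch, _score in detector.detect(image):   # current detections
                label = get_user_feedback(image, patch)     # True / False / None
                if label is None:
                    continue                                # user skipped this one
                detector.update(patch, label)               # online boosting update
    return detector
```

Because every user click goes straight into an online update, the false positives visibly drop from round to round, which is the behavior described in the parking meter and car examples.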
With 10 samples, 58 samples, you already get quite a good car detector, and doing some post-processing like (inaudible) gives you a perfect car detector. So this is a very universal tool which you can now use for different types of things, and I hope you are happy with that; you will see it is something you can do with -- >> Question: (Inaudible). >> Horst Bischof: Sure. Sure. Sure. So this is one of the tools developed for this type of compact object. Let me now come to the second part, which is about shape matching. And shape is another very useful cue. Not all objects can be characterized by appearance, by texture. Looking at this type of image, we as humans are very good: we see this is a face, this is a horse, this is a plane. And the only feature we have there, despite all of the occlusions and so on, is shape. And so what we are interested in is a partial shape matcher -- a matcher that works when we have partial segmentations, occlusions, and that sort of thing. And since we want to apply this on lots of data, meaning street side data, it has to be efficient. So this is the goal we set up. So what I will present to you is IS-Match, which can match these types of shapes even if they are occluded and give you good shape matches. So IS-Match basically has two key characteristics. It uses sampled points: the shape representation is just to take points and sample them at regular intervals along the shape. So you don't have to detect high curvature points or interest points along the shape -- you just do regular sampling. It exploits neighborhood information -- the point plus its neighbors -- so it is related to shape context. But it also allows for occlusion, which shape context, which basically uses a histogram representation, cannot handle. And it is really efficient. So the goal is to provide competitive matching results compared to the best shape matchers that are around, but at reduced computational cost. How we do that, I'll show you. So basically IS-Match consists of five steps. The first thing is we have to select a shape representation, as a point set: we just sample points regularly along the contour. So you have the shape here and just pick points and say, I want to have 30 points along this contour to sample. So this is straightforward. Then the second thing is the shape description, and I'll show you how we do that, then how we do the matching -- and the strength of the method is really in these two parts: the shape description together with the shape matching gives you the real power of it. Once you have that, the next step is you have to calculate the similarity measure -- how similar is your shape to your target shape -- and then finally you could also do shape completion: once you have the matches, you transform the target shape onto the matched shape and complete the shape. That is straightforward, so I won't talk about it; I will spend time on these three points. So what does the shape descriptor look like? Remember we have an ordered sequence of sample points along the shape, P1 to PM. So what we do is take a point on the shape. Then we take all the points along the shape, so this is Pi, Pj. And we take the point that is delta points back along the contour from Pj. And we take the angle between those as the representation. And so basically you start here with the first point -- you get this triangle; for the first point against the first point, of course, the angle is zero -- and then you move delta away, so you get this angle, which you enter here. 
Then you take the second point, the third point, and so on, so you get one row of angles here. You do the same thing with the second point and you get another row of angles. So this is the representation, and another possibility, instead of taking the angle, is to take the cross ratio of the lengths of the chords. The only parameter you have here is this delta, meaning how far away you go along the contour. And this is basically like a (inaudible) parameter: if delta is 1, so you go just one point away, you end up getting all the fine details; if you take delta larger, you of course smooth over the (inaudible). So you end up with a matrix A with all these angles. Now, this is an overcomplete representation of the shape, because basically one row would be sufficient to reconstruct the shape, but you have many of them. And it is very important that you have this overcompleteness, because what other shape representations would do is build a histogram on top of this type of thing, but then of course you can no longer handle occlusions, because you are mixing things up. So the important thing is that you really keep all of it here. So you have the matrix, and of course, depending on where you start, this matrix will look different. If you start five points away, this will be a different matrix. But the important thing is that the matrix is just a shift of the matrix you would otherwise get. And since you are taking angles or cross ratios, this is invariant to translation, rotation and scale. So the next step is the matching: we now have the shape representation, this matrix. So how do we match two shapes together? Since we have these ordered points, it's basically just an order-preserving alignment of the shapes, so we need to solve this order-preserving assignment problem. Formulating that mathematically: here is this angle matrix, here is this cross ratio matrix -- you can use one or both of them, this is not important. What you are doing is you try to minimize this measure, which basically means you are trying to find the submatrices that give you the minimal distance between these two shape representations. Doing that basically means you have to search over three different things: you have the sequence assignment, you have the starting point, and you have the chain length. So if you did that in a trivial way, you would have to program a loop over chain lengths, over starting points, and over sequence assignments. So this is a complex process and would take forever. But now we can make use of one thing: we can use integral data structures, because what you are doing here is basically just computing sums over submatrices -- looking at this matrix we have here, this A_sub means you are calculating sums over some submatrices. So if I now put this thing into an integral data structure, I can calculate the sums more efficiently just by lookup. If for each starting point we calculate such an integral structure, we can very quickly calculate these differences just by looking them up in the shape matrices. So this allows a very efficient calculation. We are getting rid of two of those inner loops, so we have only one loop left, just over the starting point. So we can calculate the shape matching measure very efficiently. Yeah, that is basically the whole trick. 
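To make the descriptor concrete, here is a hedged sketch of one reading of the angle matrix described above: sample M points along the contour, and fill A[i, j] with the angle at p_j between the chord to p_i and the chord to the point delta samples back along the contour. This is an illustration based on the description in the talk, not the exact IS-Match formulation:

```python
import numpy as np

def angle_descriptor(points, delta=5):
    """Build the M x M angle matrix for an ordered contour sampling p_1..p_M.
    Entry A[i, j] is the angle at p_j between the chord to p_i and the chord
    to the point `delta` samples back along the (closed) contour."""
    pts = np.asarray(points, float)
    m = len(pts)
    A = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            v1 = pts[i] - pts[j]
            v2 = pts[(j - delta) % m] - pts[j]
            n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
            if n1 < 1e-9 or n2 < 1e-9:
                continue                      # degenerate chord (e.g. i == j), angle stays 0
            c = np.clip(np.dot(v1, v2) / (n1 * n2), -1.0, 1.0)
            A[i, j] = np.arccos(c)
    return A   # angles are invariant to translation, rotation and scale
```

Comparing two such matrices over all starting offsets can then be sped up with a summed-area (integral) table over the element-wise differences, which is the lookup trick described above.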
So you can calculate this very efficiently. Here you see some of the results. So this would be the target shape, and here is the other shape, and you see here the subshapes that match along this contour. The last point is shape similarity. You need to somehow assess how similar two shapes are, and what we did here is we just borrowed the measure that Belongie used with shape context. He has four different types of measures: the descriptor difference, which is what we calculated before; then, once you have this alignment, you can basically put the shapes on top of each other, so you get this (inaudible) distance; you can account for nonlinear alignment by calculating the bending energy; and you also penalize short matching sequences. So this is basically what Belongie used, and we just borrowed the measure from him. Okay, the question now is, how well does this perform? There are a few standard shape databases and we performed some comparisons on them. There is one, Kimia-25, which consists of 25 shapes in six different classes, and here you see some of the algorithms that have been proposed in the literature. Basically, this metric means: for the first retrieved shape, how often did you retrieve it correctly from these 25 -- who got 25 correct -- and so on as you go along. This is a very simple database; you basically get perfect results out here. Kimia-99 is a more complex database, and currently in the literature the Felzenszwalb type of algorithm is considered to give some of the best shape matching results. And you see we're basically close: the first four or five retrievals are the same as Felzenszwalb gets, and later we get a few mix-ups, but basically we perform about as well. Then there is another database, the MPEG-7 database, also a standard shape database, and here again we compared our method: we are a little bit worse than Felzenszwalb, but there are these two shape classes where our method is extremely bad -- it always gets these wrong, because they differ only in inner structure, which we don't account for -- so if we remove those, we are basically close. So basically the results are more or less the same, but the important thing is when you compare the timing: the Felzenszwalb method performs best but needs about half a second per match, whereas we need just 25 milliseconds using 30 points. So this is really the speedup you get here, with basically the same performance. Once you have that, you can now do a lot of different things. Now we have an efficient shape matcher, so we can start doing object detection based on shape. So how do you do that? Well, you start with the shape you would like to detect. Then you extract boundary chains, you connect the boundary chains together a little bit so that you get longer chains, and then you basically use the shape matcher, and then you use the matched shapes for voting, like a Hough-transform type of voting, in order to do the object detection. So we ran this: we have the image, you do edge detection, you do this boundary chaining -- so you see a lot of clutter -- and this is basically what the shape matcher then delivers as shape matching results. So we have been running this on the ETHZ database for object detection. 
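A hedged sketch of that detection pipeline (edge detection, boundary chains, partial shape matching, Hough-style voting for object centres). OpenCV's Canny and findContours are used here just to get candidate chains, and match_contour_to_model is a placeholder standing in for the IS-Match step; none of this is the authors' code:

```python
import cv2
import numpy as np

def detect_by_shape(gray_image, model_points, match_contour_to_model, vote_bin=8):
    """Edge detection -> boundary chains -> shape matching -> voting.
    `match_contour_to_model` is assumed to return (score, object_centre)
    for a chain, or None if the chain does not match the model shape."""
    edges = cv2.Canny(gray_image, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    votes = {}
    for c in contours:
        pts = c.reshape(-1, 2)
        if len(pts) < 20:
            continue                                    # chain too short to match
        result = match_contour_to_model(pts, model_points)
        if result is None:
            continue
        score, centre = result
        cell = (int(centre[0]) // vote_bin, int(centre[1]) // vote_bin)
        votes[cell] = votes.get(cell, 0.0) + score      # accumulate evidence
    # cells with the most accumulated evidence are the detection hypotheses
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
```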
They have these nice apple logos there, so you see basically what the detector delivers: you detect bottles, (inaudible) and these things. We haven't done a really (inaudible) on this type of database, but it gives you quite good detection results -- especially the apple logos, because they are so distinct compared to the others -- and you see it is robust to scale, to translation, and also a little bit to linear transformations. It gives you quite nice results. We have also applied this to window detection, because a window can be described by shape as a rectangle -- a very primitive shape. And we have applied this to the recently acquired Graz database. The thing we have done here for preprocessing is we have used MSER, that is Maximally Stable Extremal Regions, for segmenting the image into parts, and then taken the contours of these MSER regions and plugged them into the shape matcher, and the matching shape was just a rectangle. And these are typical results you get. You see you detect most of the windows. There are some false detections, some you miss on the border, but overall these are quite nice results. So here are some videos. Okay, now this doesn't play here, so I have to show you outside. So running along the facade, here you see windows detected in this video, and you see you miss some of the windows. But if you miss one or the other it really doesn't matter, because you can simply complete it by reasoning: if you know that left and right of it there are windows, it simply doesn't matter. So this would be one example. Here is another example. You basically see you get really nice window detection results from this approach, and of course you get them very fast. So here is another video. And you see a few of them are missed, but overall the things look quite okay. So we are quite confident that we can get fairly good detection results. Okay. So let me come to a conclusion. What I have shown you is we have a lot of recognition machinery available which we can use for street side data to recognize what is of interest to you, and we have heard during the Summit that there are quite a lot of things to recognize there. Of course these things need to be combined in a workflow. There is also 3D data available that can be used. Of course, for this window detection you don't need to work on distorted facades, because it is very simple to rectify them: you can use the 3D data, you can use more information, which would just improve the recognition results, and a lot of these things we will do in this new CityFit project. So thank you very much for your attention. (applause) >> Moderator: Questions? Okay. >> Question: So where does it fail? It seems like, looking at the apple, you managed to match under a very strong perspective projection there. >> Horst Bischof: Well, it's not really perspective (inaudible). So if it's a little bit tilted it's okay. There is another thing which we still have to work on. At the moment we are just getting the largest shape match. But of course, if you have some occlusion in the middle, what you would like to have is, say, two matches, on the top and on the left, so we need to do some reasoning on top in order to find the complete shape, because at the moment we are just getting the longest segment that matches. 
But we have some ideas how to do that. >> Okay. Thank you. (applause)