>> Larry Zitnick: It's my pleasure to welcome Jianxiong Xiao to MSR today. He's a student of Antonio Torralba's at MIT. And he's going to be beginning his third year. In the past he's worked with Long Quan doing a lot of really interesting work on image-based modeling. He's done, I think, segmentation-based modeling, modeling trees, streetside imagery, modeling 3-D buildings. A lot of good SIGGRAPH papers as well.

And then at MIT he switched course, and now he's doing even -- I don't want to say cooler research -- but some new, really great research with Antonio. Just some of the highlights of that.

He's one of the creators of the SUN database, which is a really huge database for scene recognition. I think he'll be talking a little bit about that today. And there's another paper, What Makes Images Interesting. I know that paper made the rounds at MSR and people were interested in it, and he was an author on that as well.

And today I think he's going to be talking about his most recent work on scene understanding.

>> Jianxiong Xiao: Thank you for that introduction. I'm very glad to be here to talk about visual scene understanding.

Now let's start by playing a guessing game. I'll show you a picture for 200 milliseconds, and after that you're going to tell me what you just saw.

So are you ready? First, look at the [inaudible] point. The photo is coming. So what did you see?

>>: A guy in a red shirt.

>>: Ducks.

>> Jianxiong Xiao: Oh. Yes, that's right. It's a park. Obviously you can understand the general gist of the scene. However, in the view of traditional computer vision, many people believe that grouping the image into [inaudible] and doing segmentation must be the first step of understanding the image. But in the guessing game we just played, I don't think you could [inaudible] the image in 200 milliseconds. Object recognition people may come and say that to understand the image you also need to know the names of the objects.

So this is supposed to be the representation of the result of the recognition system running in your brain. But do we really know that much -- how many ducks there are, the shape of the ducks, or even whether there are ducks at all?

By understanding the image, probably what we mean is that it's a park, it's a view of a park. But not all the information -- like the sign here: many people ignore it and they still feed the ducks.

So in 1995, Mary Potter demonstrated that a [inaudible] picture can be instantly understood by a human being, [inaudible] a lot of information has been processed by the brain.

So now imagine that one day you become Superman, you have the power of teleportation, you can just travel around the world in a few seconds. Yet your human vision system can still process the scenes very well. You can handle it very well.

So I believe that understanding the general gist -- which category the scene belongs to -- is a key step in the human visual processing pipeline. For example, you want to understand the whole scene, know that this is [inaudible], this is a park. Even if you cannot recognize that that's my Ph.D. advisor, you are probably fine.

So why does it matter for computers? Imagine that one day we completely solve the AI problem and we can build robots as capable as a human being.

Actually, [inaudible] solving those aside, at the very first step the robot needs to figure out what scene it is in right now. If the robot cannot figure out whether it is standing in a street or running around in a playground, it may get killed by a car -- probably a Google driverless car.

Now, the simplest way to model the scene recognition problem is this: given a picture, we define a list of scene categories and do scene categorization. That's the simplest model we can imagine.

If we want to build a system that does this categorization, we have to feed some data to it. Ten years ago two of my collaborators built this scene category dataset, and later a few more categories were added, and it became the standard benchmark of this field. And a lot of progress has been made.

So let's see where we are after ten years of progress. Today, with the best computer algorithm, we can attain 88 percent accuracy using the available training images. For humans, we can measure 95 percent accuracy on this dataset.

So the gap here is quite small, and if you increase to 200 training images it probably becomes very, very small.

But that doesn't mean we have solved the problem, that we have already nailed down scene recognition. If you look at the dataset, it looks very easy to classify because the categories look very different. For example, you could probably do it just by opening the file manager and looking at the file size of the JPEG file, because most of the time each category has a distinctive JPEG file size.

So we need a better dataset to evaluate the algorithms and to get more intuition of where we are now.

Instead of just collecting a larger database, we want to have a more ambitious goal: we want to collect all scene categories. So how many scene categories are there in the world? In the domain of object recognition, people have tried to estimate the number of visual object categories in the world. They took a dictionary and estimated the total number.

So we decided to do the same thing for scenes. We took a dictionary and went through the words one by one, deciding for each one whether it names a scene and whether it is a synonym of another one. Then we downloaded images from search engines using those names as keywords and manually culled them. So we see what categories we have.

So in the end we reached about 900 categories and 130,000 images, with categories from abbey to zoo. Our scene categories have very subtle differences, like kitchen and kitchenette and all the different kinds of kitchens, and sometimes they have very similar function and visual appearance, like bathroom cabinet and bath interior.

Our scene categories also try to exhaust all the places encountered by human beings.

We cover both natural and [inaudible] scenes, but even with Internet imagery some categories may not have many images. So for the evaluation we use the 397 categories that have more than 100 images per category.

So the first question we want to address is whether our dataset is a good dataset -- whether it is constructed consistently and correctly.

So we want to ask whether humans can do the same classification in this task.

So we want to ask humans to do a 397-alternative forced choice. That's a very difficult task. But fortunately there are very hard-working workers on Mechanical Turk. We paid them one cent per image, showed them the image, and asked them to classify the scene. To help them classify the scene without going through all 400 categories, we designed an over-complete hierarchy: for this image, they first say it is indoor, then they go to this part and say it is home or hotel, and finally they choose that this is a bedroom. We filter out the bad workers, and we have some ways to identify them. In the end, for the good workers, we attain 98 percent accuracy on the first level and 68 percent accuracy on the final level. That means humans attain 68 percent accuracy on the 400-category choice.
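Here is a minimal sketch of how accuracy can be scored at both the top level and the leaf level of such a hierarchy, assuming a hypothetical mapping from each leaf category to its top-level group; the names are illustrative, not the actual SUN hierarchy.

```python
# Score hierarchical answers at two levels, given a leaf-to-top mapping
# such as {"bedroom": "indoor", "beach": "outdoor", ...} (hypothetical).
def hierarchy_accuracy(ground_truth, answers, leaf_to_top):
    """ground_truth, answers: lists of leaf category names, one per image.
    Returns (top-level accuracy, leaf-level accuracy)."""
    n = len(ground_truth)
    leaf_correct = sum(a == g for a, g in zip(answers, ground_truth))
    top_correct = sum(leaf_to_top[a] == leaf_to_top[g]
                      for a, g in zip(answers, ground_truth))
    return top_correct / n, leaf_correct / n

# Workers might be about 98% right at the indoor/outdoor level
# but only about 68% right at the full 397-way leaf level.
```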

So here are some categories that humans can do very well -- they do them perfectly, 100 percent accurate. But humans are not always perfectly accurate. For example, in this category, humans confuse it with something else almost all the time.

The following two columns show the most confusing categories: they confuse it with [inaudible] and coast. But the conclusion is that humans can do it. Our dataset has been constructed consistently, and a human can get it right despite the huge number of categories and despite the fact that we only paid them $0.01 per image.

So we also want to benchmark computational methods. But how do scenes get classified by a computer? Definitely objects play a certain role: if you know the object names and you know the layout, you can identify and differentiate those scenes very well.

But that's not always the case. In these three different images, the objects are the same -- totally the same -- but the functions are different, and [inaudible] quite well. For example, if you figure out you are sitting inside [inaudible], you're not going to be able to go anywhere. But if you are sitting in a conference room, you can just go outside anytime you want.

There are also examples like this from the statistics of perception. The major conclusion is that the observer has a more accurate representation of the whole set they see than of the individual objects. So, for example, a human observer may have a better estimate of the average size of the dots than of the size of any individual dot.

The human vision literature is also very inspiring here. For example, if you look at this encoding locally, it's all just [inaudible] and T junctions -- there's nothing interesting at all. But by putting them together you can see that it is obviously a tree.

But not all information is processed in the brain. For example, I'm going to show you a video, and something is changing in the video. See if you can identify what is changing.

So what did you see? Has anything changed? Now I'm going to show you the first frame and the last frame of the video. You can see the bar is a totally different color, this part disappeared, and even here the whole building changed. But you still feel that, oh, it's the same image.

So the whole system [inaudible] is: we get an image, we extract global information -- that loses information, but we still try to preserve what we care about -- and we put it into a classifier that we train to predict the scene categories.

By local features, I mean that we have an image and we extract local statistics such as the frequency and the gradient. For a global feature, we combine them together in some smart way and get a single feature vector, so that this image ends up more similar to this one than to that one.
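Here is a minimal sketch of one such global descriptor: gradient-orientation histograms pooled over a coarse spatial grid. The grid size and bin count are illustrative choices, not the exact parameters of the SUN baseline features.

```python
import numpy as np

def global_descriptor(gray, grid=4, n_bins=8):
    """gray: 2-D float array (H, W). Returns a grid*grid*n_bins feature vector."""
    gy, gx = np.gradient(gray.astype(np.float64))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), np.pi)          # orientation in [0, pi)
    H, W = gray.shape
    feat = []
    for i in range(grid):
        for j in range(grid):
            ys = slice(i * H // grid, (i + 1) * H // grid)
            xs = slice(j * W // grid, (j + 1) * W // grid)
            # Magnitude-weighted orientation histogram for this cell.
            hist, _ = np.histogram(ori[ys, xs], bins=n_bins,
                                   range=(0, np.pi), weights=mag[ys, xs])
            feat.append(hist)
    feat = np.concatenate(feat)
    return feat / (np.linalg.norm(feat) + 1e-8)      # L2 normalize
```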

So we evaluated a whole list of global features to see how they work on our SUN database. The list includes the color histogram, the tiny image feature, some very strong texture features, the HOG feature, the [inaudible] feature and the [inaudible] feature, and the standard features in this field.

So for a given feature and kernel, we can imagine that all the images are embedded in a certain feature space. A good feature is one that allows a classifier such as a support vector machine to divide one category from the others. For example, the classifier may be able to divide all the images of one category from all the others, or divide living room versus the others.

We can also combine all the different features together to get a better algorithm. We use a technique called multiple kernel combination, which is simply a linear sum of the kernels computed from those feature distances.
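Here is a minimal sketch of this setup, assuming precomputed kernel matrices for each feature type (for example from chi-squared or RBF distances). The equal kernel weights and one-vs-rest SVMs are illustrative choices, not necessarily the exact training procedure used in the paper.

```python
import numpy as np
from sklearn.svm import SVC

def combine_kernels(kernel_list, weights=None):
    """Linear combination of kernel matrices (n_train x n_train, or
    n_test x n_train at prediction time)."""
    if weights is None:
        weights = np.ones(len(kernel_list)) / len(kernel_list)
    return sum(w * K for w, K in zip(weights, kernel_list))

def train_one_vs_rest(K_train, labels, C=1.0):
    """One-vs-rest SVMs on a precomputed combined kernel."""
    models = {}
    for c in np.unique(labels):
        y = (labels == c).astype(int)
        clf = SVC(kernel="precomputed", C=C)
        clf.fit(K_train, y)
        models[c] = clf
    return models

def predict(models, K_test):
    """Pick the category whose SVM gives the highest decision value."""
    cats = list(models)
    scores = np.column_stack([models[c].decision_function(K_test) for c in cats])
    return np.array(cats)[scores.argmax(axis=1)]
```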

So let's see what we get, with 50 training images per category -- and remember we have about 400 categories. We get very low accuracy, around 5 percent, for the baseline feature. The best single performing feature is the feature called two-by-two, which is a concatenation of descriptors over two-by-two neighborhoods.

But combining all of them together using the multiple kernel combination I just talked about, we get 38 percent accuracy, while humans get 68 percent accuracy. Here is a visualization of the result. For example, for [inaudible] cabin we predict these images quite correctly, but it's also confused with some other images like this one -- it looks similar, but it's actually the inside.

And sometimes the mistakes make no sense -- I don't know how those happen. And here, for art gallery, sometimes it makes a mistake on a hotel room: this hotel room looks like an art gallery because there are paintings in there.

Here are a few more examples. Some of the confusions are very reasonable: amusement park with playground, this room that looks like a kitchen, the hospital room confused with a [inaudible] room, and supermarket confused with [inaudible].

Now, that's --

>>: To get a sense of the percentages, how many query images are there? You said these are a couple hundred for each class?

>> Jianxiong Xiao: 415 --

>>: So that's roughly 15 out of 100 images or are there more query images? For supermarket when you say 58 percent, how many --

>> Jianxiong Xiao: We choose 100 images for training and 100 images for testing. And on those 100 testing images we get 58 percent accuracy. Cool. Now, let's take a closer look at the confusion matrix. If we cluster the confusion matrix, we can see that there are obviously two clusters here, and if we read the names, most names here correspond to outdoor scenes and most here correspond to indoor scenes. Let's take one example. For example, for this category the machine gets 75 percent accuracy.

And it confuses it with [inaudible] canal five percent of the time. But humans confuse it with beach, which is slightly more reasonable than confusing it with natural canal.

So we want to look at how the algorithm makes mistakes compared to how humans do. This is the true category, and these are the categories that humans most often confuse with it -- they confuse this interior with the [inaudible] interior.

With the very simple baseline feature, the computer mostly confuses it with something that doesn't make much sense. But with a better feature, you can see bus interior starts to look a little bit bigger. And if we combine all the features together into the best algorithm, we can see that the mistakes make more sense.

The errors made by the algorithm and by humans are quite similar. Here are a few more examples.

Humans confuse the beach with coast, and if we combine all the features, the computer also confuses it with coast. So we can quantify this phenomenon.

Here is the list of features we evaluated, ordered by performance. For each feature, we measure how often the most-confused category is the same for the human and for that feature. With better and better features, we see that the errors made by the computer are more and more likely to be the same as the human errors.
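Here is a minimal sketch of that agreement measure, assuming two confusion matrices of the same shape (rows are true categories, columns are predicted categories), one from human workers and one from a given feature.

```python
import numpy as np

def top_confusion_agreement(conf_human, conf_machine):
    """Fraction of categories whose most-confused (off-diagonal) category
    is the same for humans and for the machine."""
    ch = conf_human.astype(float).copy()
    cm = conf_machine.astype(float).copy()
    np.fill_diagonal(ch, -np.inf)     # ignore correct predictions
    np.fill_diagonal(cm, -np.inf)
    return float(np.mean(ch.argmax(axis=1) == cm.argmax(axis=1)))
```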

To further compare human performance with our computational model, here are some categories where humans perform best and the computer does badly. But humans are not always better: for some categories humans perform worse than the computer. For example, they cannot figure out baseball stadium -- they may confuse it with football stadium.

So our computer algorithm is still far worse than human performance. Does that mean it is too immature to be useful for any real-world application? Probably not. In the second part of my talk, I will show that scene understanding seems to be mature enough for real-world applications in certain domains.

I will quickly demonstrate results on semantic segmentation of street view images, and I will also quickly showcase a few applications of scene understanding. Now, the scenario is that we have a camera mounted on a car, and the car drives along the street. Because the car can only drive on the street, we know that the captured image must be a street scene. That's already known.

We also know that the camera is facing toward the street side; it's not facing in the driving direction of the car. With all this given, the task is to segment the image into meaningful areas with semantic meaning: this is a building, this is a tree, this is ground, this is sky.

So to tackle this task we propose a very simple algorithm. We take a picture like this and, for each local patch, we extract very local features, which may include the responses from an image filter bank, the 2-D pixel location, and the point density and surface orientation estimated from the 3-D point cloud -- because we have an image sequence rather than a single image, we can run a structure-from-motion algorithm and get a 3-D point cloud.
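Here is a minimal sketch of how the two 3-D cues mentioned here, point density and surface orientation, could be computed from a structure-from-motion point cloud. The neighborhood radius is an illustrative parameter, not a value from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def point_density_and_normals(points, radius=1.0):
    """points: (N, 3) array of 3-D points. Returns per-point neighbor
    counts and unit surface normals from PCA of each local neighborhood."""
    tree = cKDTree(points)
    neighbors = tree.query_ball_point(points, r=radius)
    density = np.array([len(idx) for idx in neighbors])
    normals = np.zeros_like(points, dtype=float)
    for i, idx in enumerate(neighbors):
        if len(idx) < 3:
            continue                       # not enough points to fit a plane
        local = points[idx] - points[idx].mean(axis=0)
        # Normal = direction of smallest variance of the local neighborhood.
        _, _, vt = np.linalg.svd(local, full_matrices=False)
        normals[i] = vt[-1]
    return density, normals
```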

All these features are helpful, but the information each one provides is very limited. For example, if a pixel is located at the top of the image, then from the location histogram we know that it's either sky or building; it's very unlikely to be anything else.

So we combine all these features together to form a local feature vector and put it into a classifier to predict the category. We can also enforce spatial consistency.

The idea is that if two pixels are next to each other in the image, they are very likely to have the same label. So we construct a Markov random field, which is a general technique for enforcing spatial consistency inside an image. We can also enforce temporal consistency.

For example, two images have many correspondences, so one part of an image may correspond to a part of another image. So we build a larger graph linking the per-image subgraphs across the video.
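Here is a minimal sketch of the spatial-consistency idea: a grid Markov random field with per-pixel unary costs from a local classifier and a Potts smoothness term. The simple iterated conditional modes (ICM) solver and the smoothness weight are illustrative stand-ins; a real system would typically use a stronger solver and would also add temporal links between frames.

```python
import numpy as np

def icm_segmentation(unary, smoothness=1.0, n_iters=5):
    """unary: (H, W, L) array of per-pixel costs for each of L labels.
    Returns an (H, W) label map that locally minimizes the sum of unary
    costs plus `smoothness` for every disagreeing 4-neighbor pair."""
    H, W, L = unary.shape
    labels = unary.argmin(axis=2)                 # start from the classifier
    for _ in range(n_iters):
        for y in range(H):
            for x in range(W):
                cost = unary[y, x].astype(float).copy()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        # Potts penalty for each candidate label that
                        # differs from the neighbor's current label.
                        cost += smoothness * (np.arange(L) != labels[ny, nx])
                labels[y, x] = cost.argmin()
    return labels
```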

Even with this simple technique, we can attain pretty good results. Here's the visualization: on the left is the image captured by Google's Street View capturing car, on the right is the segmentation result, and on the bottom is the color coding giving the meaning of those colors.

And we can see the results are pretty accurate nonetheless, despite the fact that Google was only allowed to give us some super low resolution images while they have something better. For general semantic segmentation, I have to say that the problem is very difficult.

Some progress has been made, for example by the Microsoft Research group in Cambridge in the UK, but the problem itself is very difficult. In our case, because we focus on street scenes and the capturing environment and everything is very constrained, the problem suddenly becomes easier. After we published this, a few groups followed up and got very similar or even better results.

Furthermore, if we had better image quality and also the 3-D [inaudible] of the street -- if we could put a [inaudible] on the car -- then we could probably get much better results and narrow this problem down.

I'm very glad to see that many companies have realized this opportunity and have started to design systems that produce better results, with some very interesting applications.

One interesting application is Google's driverless car, which is essentially doing something very similar in real time, combining multiple sensors and also doing some detection of objects, but the ideas are very similar. Another interesting application is building reconstruction: if we figure out which part, which area of the image corresponds to a building, we can reconstruct it properly.

Here's also a system we built to automatically reconstruct building models. We use the Google Street View images and run structure from motion to get all the camera poses and the 3-D point cloud, and we run the segmentation to obtain the semantic labels. Now we have the vertical direction, and then we define some heuristics to divide the street into blocks, because we have a huge amount of data.

In this way we divide and conquer the problem and can focus on one block at a time. And even here the algorithm can still figure out that these should be merged together.

For each block, we combine all the 3-D information from the multiple-view images, and we use a technique that we proposed, called inverse [inaudible] computation, to compose all the evidence. We get a composed [inaudible] and a composed texture map. Then we do a [inaudible] to segment all the different areas and fit a rectangle to each area to enforce architectural regularity; otherwise the map might be quite noisy and the result may look funny.

Now we have a model for each block. When we combine them together, we get the building model for the whole city. Here are some rendering results. We can see the building reconstruction is very robust -- maybe ready to deploy massively at city scale.

Thanks to the segmentation, there's a [inaudible] here: even with this 3-D algorithm, the 3-D may still fail. For example, if you have a building full of glass that is highly reflective, you never get 3-D there. But if you can figure out the region from the semantic segmentation, you can figure [inaudible] and still get a very reasonable visual representation.

>>: I think you can actually do some additional reasoning, figuring out what windows, doors --

>> Jianxiong Xiao: Yes, some people are doing that, using grammars and semantic meaning there.

Another interesting application is to predict how memorable an image is. With glossy magazines and browsing the Internet, we are constantly, continually being exposed to thousands of photographs. But we cannot remember all of those images later, and not all images are equal in memory.

Some of them stick in our minds, but the others we forget. For example, this image here and this image are very likely to be remembered, while this landscape image and the house are very likely to be forgotten. So we characterize image memorability by the probability that an observer will detect a repetition of a photograph a few minutes after seeing the photo for the first time.

We measure it with a [inaudible] game: the participant views a sequence of images, each for one second, with a break of 1.4 seconds between them. The task is to press the space key whenever they see an identical repetition of an image they saw before.

So with this we can get a set of images that are memorable and forgettable. For example, here is an image that is very likely to be memorable, and here is an image that is very likely to be forgettable. If we browse through the dataset, we observe that most memorable images contain a human and [inaudible], or they may have some funny objects that you seldom encounter.

And of the forgettable images, most are landscapes. So we can use the image features that we evaluated before to try to predict image memorability. We evaluated a whole list of features, and the conclusion is that, by combining all the features together, the consistency between the computer's prediction and the humans is very close to the consistency between different groups of humans.

The computer's prediction can even come close to the human consistency.
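Here is a minimal sketch of memorability prediction as regression, assuming a feature matrix X (one row per image) and per-image memorability scores y measured from the repeat-detection game. Support vector regression and Spearman rank correlation are plausible stand-ins, not necessarily the exact setup used in the paper.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

def evaluate_memorability_predictor(X, y, seed=0):
    """Train on half the images, predict memorability on the other half,
    and report rank correlation with the measured scores."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                              random_state=seed)
    model = SVR(kernel="rbf", C=1.0)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rho, _ = spearmanr(pred, y_te)   # compare to human-human consistency
    return rho
```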

So here are some visualizations of the images predicted to be most memorable. You can see they all contain humans.

>>: Was one of your features whether a human exists in the image? Or is it just --

>> Jianxiong Xiao: It also has some high-level features: we have labels annotating the objects, so we have the object names.

>>: So you have object labels as well.

>> Jianxiong Xiao: Yeah, yeah, that's here.

>>: Okay. And with that object -- with the object label feature, does that improve things here?

>>: Okay. It helps a lot to have object labels.

>> Jianxiong Xiao: Yeah. Yeah, but not completely. In the paper we also report that if we only have the object labels [inaudible], it doesn't help that much. So the major conclusion is that the human consistency is quite high, so memorability is measurable, and the computer is also good at capturing what is consistent between humans, so we can do the prediction quite well.

So here are some images with typical memorability: sometimes they have humans, sometimes not, but they have an interesting story. And these images are predicted to be least memorable -- the most forgettable, in other words. Most of them are landscape photos.

So if you want people to remember your photo, don't take a landscape. Another interesting application is extrapolating an image. Given an image like this, you know it's a theater, but not only that -- this is a typical theater, and the camera is facing towards the seats, not towards the stage. In a typical theater the layout is like this: you have a stage here and the camera is pointing this way. So we build a model on global features to predict a score over the viewpoint. Here is a visualization of the score: where you have the highest response, that is where the model predicts the most plausible viewpoint for the image.

Now we can superimpose the image on the average panorama image of theaters. We can see that this looks like a stage and here are the seats, and the picture aligns quite well. And here is the nearest neighbor from the training set: you can put the image here and you can see the [inaudible] is extended naturally.

Now we're going to show some initial results of this extension. The image you first see in the center is the input image; the outer parts are the extrapolation results. And here is another example.

The extrapolated view is produced using texture synthesis techniques, using the nearest neighbor from the training examples as the prior to guide the synthesis.
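Here is a minimal sketch of this kind of guided extrapolation, assuming a grayscale canvas with the input pasted at its center, a mask of known pixels, and an aligned guide image (for example the nearest-neighbor panorama). The non-overlapping tile matching is a simplification of real texture synthesis, which would use overlapping patches and seam blending.

```python
import numpy as np

def extrapolate(canvas, known_mask, guide, tile=16):
    """canvas: (H, W) float array with the input in the known region.
    known_mask: (H, W) bool array, True where canvas pixels are given.
    guide: (H, W) float array aligned with the canvas."""
    H, W = canvas.shape
    out = canvas.copy()
    # Candidate tiles come from the known (input) region of the canvas.
    cands = [out[y:y+tile, x:x+tile]
             for y in range(0, H - tile + 1, tile)
             for x in range(0, W - tile + 1, tile)
             if known_mask[y:y+tile, x:x+tile].all()]
    for y in range(0, H - tile + 1, tile):
        for x in range(0, W - tile + 1, tile):
            if known_mask[y:y+tile, x:x+tile].all():
                continue                        # keep the original pixels
            target = guide[y:y+tile, x:x+tile]  # what the prior expects here
            errs = [np.sum((c - target) ** 2) for c in cands]
            out[y:y+tile, x:x+tile] = cands[int(np.argmin(errs))]
    return out
```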

Here is another example. You can actually see that, even from this, you can tell a TV is likely to be here.

>>: Why are these -- are these big ellipses all over the place because of how you're splatting things in? I keep seeing circles everywhere.

>> Jianxiong Xiao: Yes, maybe I walk with them -- not very good.

>>: But there must be something in your algorithm that prefers ellipses in a region. Can you explain why those things look like that? You see what I'm talking about.

>> Jianxiong Xiao: If you match it locally -- if we do it pixel by pixel, you have a patch and match one pixel [inaudible] -- it probably has some bias.

>>: Are your patches circular on some spherical map, so that when you unwrap the spherical map they become elliptical?

>> Jianxiong Xiao: Yeah, they can be. Good point, very good point. But besides potential computer graphics applications, this type of extrapolation also gives us a sense of boundary extension in the human visual system. Boundary extension is a phenomenon in which humans confidently feel they have seen a wider image than was actually presented to them.

For example, if you see a picture of a train here and I ask you to draw it, you will draw a wider view than what was actually in the picture you saw.

Here is another extrapolation result, of a wolf. So, in summary: in the first part of the talk I presented a database that covers an exhaustive set of scene categories. We saw that, with hundreds of categories available for the first time, we are able to evaluate classification algorithms [inaudible], and by evaluating the state-of-the-art algorithms and combining them together we establish a new [inaudible] of performance. We also compared them with human performance.

In the second part of the talk, I showed that scene understanding seems to be mature enough for some real-world applications in certain domains, such as the semantic segmentation of street view images. I also showcased several possible applications of scene understanding, including 3-D reconstruction of building mesh models, prediction of how memorable an image is, and extrapolation of an image beyond its boundary.

I'm glad to acknowledge the efforts of my major collaborators and the following agencies for providing financial support, including Microsoft. I also want to thank Larry for inviting me here to MSR, and I'm looking forward to some interesting discussion.

You may also go to my website for more details about these projects. If you have any questions, I can take them now. Thank you very much.

[applause].

>> Larry Zitnick: Additional questions?

>>: One question I had: you mentioned scene recognition, and there's a big conflict in my head about whether it's important to recognize the objects before you can do scene recognition. In the first example you gave us, it was a park, and you asked what was in it. A lot of us said ducks or lake, that sort of thing.

We actually did object recognition. Do you think it's possible to do scene recognition without recognizing the objects?

>> Jianxiong Xiao: I feel it's the opposite. You need to do scene recognition first, and then the scene recognition can guide the fixation points to search for the objects.

Because right now, for object detection, you do a sliding window and try all possibilities. That's super computationally expensive; I don't believe the human brain can do that.

So first you recognize the scene. For example, you are looking for a pen, so you look at the layout of the space, you feel a pen should be here, and you start to search for the pen. So --

>>: But in the other example we were able to recognize the objects without moving our eyes, without searching.

>>: Right.

>>: We recognized the grass, ducks, et cetera, without moving our eyes, because it was flashed in front of our faces; we can do recognition of objects really quickly. The question is whether the scene comes immediately from the pixels, or whether it takes the recognized objects and then --

>> Jianxiong Xiao: Yeah, it can be a feedback loop. But different people have different views. You get a picture like this, you recognize the scene, you know the layout and you know how to navigate, then you figure out the fixation points in the room -- you are likely looking for a chair to sit on -- and then you have the object recognition pipeline going on. But definitely there's feedback; it happens very fast, and they may well run in parallel. I feel this is a key step: people have focused a lot on fixation point prediction in the human vision literature and on object classification and detection in the computer vision literature, but they are still missing this part, and focusing here will probably benefit all of them.

>> Larry Zitnick: Let's thank the speaker again.

[applause]
