>> Larry Zitnick: Hi. It's my pleasure today to introduce Jia Li to MSR. She's currently at
Stanford, but I think had quite a trek with her advisor, Fei-Fei, technically graduating from
Stanford, right, but has also been to Princeton and University of Illinois. And she's done a lot of
great work in image understanding and semantics.
She had a role in developing ImageNet and has done recent great work on image understanding
called Object Bank, which I think she's going to be talking about today. Great.
>>Jia Li: Thank you. Hi, everybody. Today I'm going to talk about semantic image
understanding from the Web: learning at large scale with real-world, challenging data.
As all of us know, a picture is worth a thousand words. So humans can easily understand a lot of
information from an image.
For example, in this photo, two cat friends are sitting in front of a table having fish and having a
drink in a restaurant. The understanding has different layers. For example, there are objects
involved.
And there is the scene: this is a restaurant, and it's indoors. And there are activities; for example,
these two cats are drinking. As easily as humans can do all these tasks, the computer still has a
long way to go to understand the same information.
So we want to give the computer the same ability to understand an image at the same level as a
human. Understanding an image as a human does is an overarching, very grandiose goal.
Let's see what fundamental computer vision problems are involved. It is related to the classic
problem of object recognition; for example, find the two cats here.
And sometimes we need to give an overall label for the image: this is a dining image, and it's indoors.
But only knowing the objects within an image or a global label of the image is far from enough.
Sometimes we need to segment out the objects, or group the pixels belonging to one object.
That is the segmentation task. And usually we want to connect the pixels to semantic meaning;
that is the annotation task.
So here, the problem I want to solve is to enable the computer to understand every single pixel of
the image. So why do we care about this problem at all?
First, in our physical world, vision is the most fundamental sense for humans. Fifty percent of
our neurons are devoted to visual processing. We use vision to read, to drive, to socialize, to work,
and to enjoy our life.
And visual intelligence becomes very important. Since there is a vast amount of digital data out
there, we want to harness it to gain a better understanding of our visual world.
So my approach is to learn from real-world images, such as those on the Internet, and in turn
apply that to real-world problems, such as robot navigation and perception and autonomous
driving; and sometimes we want it to help us build a safer world.
And to automatically organize the vast amount of multimedia data online or in our personal albums.
We also want such techniques to help us do automatic camera calibration, depending on
what kind of object and what kind of scene we are shooting.
It can also be used to improve our image search results. And eventually we want it to help
the visually impaired or elderly people perceive the beautiful visual world.
Here is an overview of my thesis work. Basically toward the goal of understanding every single
pixel. I have been working on object recognition, event recognition, image representation for
understanding an image and hierarchy construction and scene understanding.
Today due to the time limit, I will sample a couple of them and talk in detail.
So I will first briefly introduce a new way to tackle a classical problem: object recognition. Then I
will briefly talk about another classic problem, image classification, by introducing a new image
representation, Object Bank. And then I will talk in detail about a more ambitious and less
investigated problem: total scene understanding.
So let's start with object recognition. I'll talk about our 2007 and 2009 work on OPTIMOL:
automatic online picture collection via incremental model learning.
So let's look at the traditional problem of object recognition and recognizing a single class of
object. Suppose we want our computer or the smart agent to find a cat within this picture.
As easily as humans can do this, it is a very challenging task for computers. We are
facing a lot of occlusion in this image, the scene is a cluttered restaurant, there is uncertain
lighting, et cetera.
Literally, there are hundreds and thousands of papers published on this topic. There is one
common property among all these papers: they need to start from some manually collected
clean data.
For example, here we are looking for a cat; then we need to collect a cat dataset. And now we
want to look for fish in this image; we need to manually collect a clean fish dataset to learn our
model.
And these are okay, because they're in the Caltech 101 or PASCAL datasets; we can
easily use those images.
But if we want the computer to look for a rare object, such as a Japanese wine cup in this image, we
are in big trouble now. Where do we find our dataset?
And there is a bigger problem: think about how many objects there are in our world,
and whether these methods will be able to scale to all the objects in it.
Thus, in the 2007 work, we tried to harness the vast open Internet and use Internet images to
learn our object model. And it is scalable to any object that has images online.
Now, the Internet is great. We can type a keyword, for example 'chicken', into any search
engine, and we can get a lot of images.
But there are some challenges. For example, there could be multiple types of chicken: the animal
chicken, the cartoon chicken, et cetera. Existing methods such as constellation models or
parametric topic models have fixed complexity.
You need to give them an a priori set of parameters to tell them how many types of chicken there
are, and this number of types will vary over different objects.
So in OPTIMOL, we use a nonparametric topic model, which is flexible in complexity. While I
have no time to dive into the details of the nonparametric topic model, at a high level it can
offer a natural clustering of all these images, and it will be able to capture multiple prototypes
of the object.
And we're also facing some other challenges. For example, there are many incorrect or irrelevant
images, reflecting that the Internet is a very noisy library.
Existing methods are normally limited to hand-cleaned data, which again requires a human-labeled
clean dataset. How do we deal with this? We use a noise rejection system.
Now, it's robust to noise. The idea is, well, we are sifting through all the Internet images. We
only keep the good related images to learn our model while rejecting the noisy images.
So here is a glance at our OPTIMOL system. Basically, starting from a very small handful of
seed images, which we can easily obtain from the first couple of pages of search engine results,
we can learn a model that captures the properties of the chicken object. Now, serving as a
classifier, this model can go back to the Internet and pull more images relevant to it.
The relevant images will be appended to the dataset, and they will be used to incrementally learn
the object model, such that we obtain a larger and larger dataset while getting a more accurate
model through this incremental learning, in a very efficient way.
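To make this loop concrete, here is a minimal Python sketch of an OPTIMOL-style collection loop. The model interface, the relevance threshold and the search function are hypothetical placeholders, not the actual nonparametric model or crawler used in the work:

```python
# Minimal sketch of the incremental collection loop described above.
# `model` and `search_fn` are hypothetical placeholders: any object model with
# fit/update/relevance methods and any keyword image-search interface will do.

def incremental_collection(keyword, seed_images, model, search_fn,
                           threshold=0.5, max_rounds=10):
    dataset = list(seed_images)          # small handful of clean seed images
    model.fit(dataset)                   # initial object model from the seeds
    for _ in range(max_rounds):
        candidates = search_fn(keyword)  # noisy images pulled from the web
        accepted = [im for im in candidates
                    if model.relevance(im) > threshold]   # noise rejection
        if not accepted:                 # stop when nothing useful is left
            break
        dataset.extend(accepted)         # the dataset grows round by round
        model.update(accepted)           # incremental model refinement
    return model, dataset
```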
>>: Do you find that it at all restricts the variability of the images that you collect, then? Does it do
a good job if you have a category which has a lot of different --
>>Jia Li: Yeah, that's the nonparametric model. So it can capture the variability of different
prototypes.
>>: Does a good job of capturing.
>>Jia Li: Yes, later I will show some results related to that. So our system has been tested on
large scale datasets, with no labeled images.
It can accurately recognize and localize the objects. Basically, here the purple bounding box is the
localized object, and it has a label to indicate what kind of object it is.
We also compared multi-way classification with some state-of-the-art algorithms. Again, no
labeled data was used in this setting.
And this goes back to the question that Larry was mentioning. Our model can naturally capture
the multiple prototypes of one object. It fits naturally in the [inaudible] interpretation problem.
So suppose we type in mouse in the Bing or Google image search. We can get a lot of images
about animal mouse, computer mouse and some of them not even related.
Our model can naturally separate them into animal mouse, computer mouse and discard the
unrelated ones.
It also showed superior performance, taking first place in a competition that was designed to
simulate a real-world robot perception scenario.
In this AAAI- and NSF-organized competition, each of the teams was given a list of objects. Then,
within very limited training time, say two hours for over 30 objects, with a single machine,
the computers were required to learn models for all these objects.
No dataset was available, but Internet access was provided, and the list of object names was not
provided before the competition. Now, doesn't this feel like a very familiar setting?
It's basically the OPTIMOL setting for our framework.
Our computer can then go to the Internet, collect a lot of images related to each of these object
concepts, and learn a good model that captures their properties from noisy Internet
images.
And eventually correctly and accurately detect the objects within a cluttered conference room.
So now let's switch gears a little bit and talk about another classical problem, the image
classification problem. Here I'm going to introduce our image representation called Object Bank.
As we know, image classification is a very challenging problem, especially when the classification
is related to semantic meaning.
For example, this image is a sailing event. And similarly, this image is similar in appearance, and
it's also a sailing image. Sometimes, the appearance could be quite different.
For example, the color and the layout of this image are pretty different from the
previous two. And sometimes it gets too challenging, and it's nearly impossible for a computer
to detect this: this is an on-boat shot of a sailing event, and apart from some similarity in small
patches of water and sky, there is very little similarity in appearance to the previous two.
Traditionally, people have used low-level features to describe all these kinds of appearance.
But low-level features carry very little semantic meaning, causing the so-called semantic
gap with respect to high-level tasks. We need a good, semantically meaningful representation to
close the semantic gap.
And as we know, objects can be very useful in complex scene recognition. Let's go through a very
simple thought experiment, using the UIUC sport event dataset as an example. It's an
eight-class event dataset with very complex scene images.
And the best low-level image representation, spatial pyramid matching with SIFT features, can
achieve about 70 percent on this dataset.
Now suppose somebody magically gives us the object labels related to each image. How much
can we achieve in classification?
100 percent in this experiment. So there's a huge gap between the low-level image representation
and the ideal high-level image representation. How do we obtain this kind of object-level, high-level
image representation?
We introduce Object Bank, which encodes the spatial locations of objects and the probability
of those objects appearing in a scene image.
Starting from a scene image, for example this sailing image, let's try to obtain the probability of
a sailboat appearing in this image. So we run an object detector for sailboats on this image.
And different objects have different spatial preferences. For example, if I ask you to look for sky,
most of us will look up, and things like water always appear at the bottom of the image.
So we pool the maximum value of the object probability over a spatial pyramid structure that can
capture all these spatial preferences.
And object sizes can be very different; the same horse object could appear at different sizes in
different images, so we run our detectors over different scales. And we want to describe the
multiple objects that could possibly appear in one image, so we run multiple detectors over an image.
And that's our final Object Bank representation for an image. Specifically, we use around 200
object detectors, and these 200 object names are summarized from famous image
datasets, language corpus analysis and image search results. We had around 200 object
detectors around the time of the NIPS paper, and now we have around 2,000.
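As a rough illustration of how such a representation can be assembled, here is a Python sketch that max-pools detector response maps over a spatial pyramid at several scales. The detector interface, pyramid levels and scales are assumptions for illustration, not the exact Object Bank implementation:

```python
import numpy as np

# Illustrative Object Bank-style feature: for each object detector, run it at
# several scales and max-pool its response map over a spatial pyramid.
# `detectors` is assumed to map an object name to a function that returns a 2D
# response map (likelihood of that object at each location).

def spatial_pyramid_max(response, levels=(1, 2, 4)):
    h, w = response.shape
    feats = []
    for g in levels:                        # 1x1, 2x2, 4x4 grids
        for i in range(g):
            for j in range(g):
                cell = response[i * h // g:(i + 1) * h // g,
                                j * w // g:(j + 1) * w // g]
                feats.append(cell.max())    # strongest response in each cell
    return feats

def object_bank(image, detectors, scales=(1.0, 0.5, 0.25)):
    feature = []
    for name, detect in detectors.items():  # one block per object detector
        for s in scales:                    # handle different object sizes
            response = detect(image, scale=s)
            feature.extend(spatial_pyramid_max(response))
    return np.asarray(feature)              # n_objects * n_scales * 21 dims
```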
>>: Where did you get the categories from?
>>Jia Li: The categories: we basically used WordNet to get the object names, and then we tried
to find the most popular and most important objects by using statistics from image datasets,
from language analysis and from Web image statistics, yeah.
So what's so special about Object Bank representation? Let's see these two images. Humans
will have no problem to differentiate them. One is about mountain and forest and the other is
inside city image.
But let's see from the low-level image representation point of view. With just a filter-bank-based
representation, they look almost identical across these two images.
And with the best state-of-the-art image representation, SPM with SIFT, they also look very similar
to each other.
Now let's see the Object Bank representation. As we can see, the tree response, the probability
of a tree appearing in the image, is very high in one image and very low in the other.
So from this representation we can easily differentiate these two images by using Object Bank.
As a proof of concept, we apply our Object Bank representation to scene classification. Let's start
with a very simple basic-level scene classification task.
The low-level image representations achieve about 50 percent, around that range. And here
we use a very simple off-the-shelf classifier, a linear SVM; Object Bank achieves much better
results than the low-level image representations.
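For reference, running an off-the-shelf linear SVM on precomputed Object Bank features might look roughly like this; scikit-learn is shown as one possible tool, and the random arrays merely stand in for real features and labels:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Stand-in data just so the snippet runs: in practice X would hold precomputed
# Object Bank features and y the scene labels.
rng = np.random.default_rng(0)
X = rng.random((200, 2100))     # 200 images, ~2100 Object Bank dimensions
y = rng.integers(0, 8, 200)     # 8 scene classes (e.g. the sport events)

clf = LinearSVC(C=1.0)          # simple off-the-shelf linear classifier
scores = cross_val_score(clf, X, y, cv=5)
print("mean scene-classification accuracy:", scores.mean())
```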
On a more complex scene class dataset, that's the MIT dataset, the advantage of Object Bank is
more obvious. It outperforms all the low level image representation and it outperforms the
state-of-the-art, which requires more human supervision than Object Bank.
And similarly, for superordinate complex scene classification, our Object Bank again outperforms
the low-level image representations and the state-of-the-art approach.
What's the take-home message? Basically this means with more descriptive image
representation, the burden of designing a complex model can be reduced.
And if we combine the powerful image representation with some more sophisticated modeling, we
can achieve even better results.
When I introduced the Object Bank representation, I brushed aside the fact that it is a very
high-dimensional representation, which can easily lead to problems related to the curse of dimensionality.
The good news is that it's a high-dimensional representation with redundancy. As long as we find
an effective way to remove the redundancy and keep the useful information, we can still
achieve good compression of this image representation.
So we choose to use regularized logistic regression to compress Object Bank. For those of you
who are not very familiar with regularized logistic regression, here is the objective function.
If we only keep the first term, it is vanilla logistic regression for classification; the regularizers
we put here basically try to encourage the important objects, at important scales and important
locations, to be selected.
For example, in this setting, for a sailing image, the sailboat response and the water response will
be selected; the water response at the lower part of the image and the sailboat response at the
middle part of the image will be selected.
In this way, we can compress Object Bank to 1/10 of its original dimension, and the classification
performance is still on par.
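For readers who want the formula, the kind of objective being described is roughly the following sparse-group-regularized logistic regression; this is a sketch, and the exact grouping and weighting used in the actual work may differ:

```latex
\min_{\beta}\;\sum_{i=1}^{N}\log\!\left(1+\exp\!\left(-y_i\,x_i^{\top}\beta\right)\right)
\;+\;\lambda_1\sum_{g\in\mathcal{G}}\lVert\beta_g\rVert_2
\;+\;\lambda_2\lVert\beta\rVert_1
```

Here each group g collects the Object Bank dimensions belonging to one object (across scales and pyramid locations): the group (L1/L2) term switches whole objects on or off, and the L1 term encourages sparsity within the surviving groups.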
And further, because this compression is based on semantic meaning, we can discover some
good interpretability. Let's look at -- yes?
>>: Previous slide, I missed the algorithm. How many different classifiers do you have?
>>Jia Li: How many different classifiers? So basically it's one versus all.
>>: So you have an image you run all these different keyword classifiers on it, you have
thousands of keyword classifiers.
>>Jia Li: That's when we construct the Object Bank representation. And here we already have
the representation. So we just do a scene classification. And it depends on how many scenes
you have, then you have how many classifiers there.
>>: How did you get to run their response on here?
>>Jia Li: That's again from the Object Bank construction part. So basically we tried to
summarize the important objects from the statistics we collected from important object datasets,
from image search results, and from language analysis.
And we find that, okay, not all objects are equally important. So we finally end up
selecting 200 at that time, and it's 2,000 now.
And now with that we can represent our image as Object Bank representation, and it will be the
input for this classification algorithm.
>>: 2000.
>>Jia Li: Currently we run 2,000, yeah.
>>: Explain the regularizer. I can understand the --
>>Jia Li: The regularizer: this L1/L2 term is a group-wise regularizer. It encourages a group of
dimensions, for example the dimensions related to one object, to be jointly zero.
And the second term is a sparsity term, an L1 regularizer. It encourages sparsity within an object,
within the group of nonzero dimensions, pushing individual dimensions to zero.
>>: How do you optimize the function?
>>Jia Li: Just coordinate descent, yeah.
>>: Do you do it this way mainly because that way it gives it some semantic meaning.
>>Jia Li: Exactly.
>>: Or it performs better than PCA?
>>Jia Li: We wanted to preserve some semantic meaning, and that's what I wanted to show here.
PCA won't be able to discover all this. So basically the compression method is based on
semantics. Let's see this beach class.
From the human point of view, we can see that sky, water, sand and grass are
very important objects within this beach class.
This can be automatically discovered by our algorithm, through the semantically based
compression. As I explained just now, it will keep the important objects at their important
locations and important scales.
And, similarly, for this mountain scene, we can find that sky, mountain, tree and rock are
the most important object classes.
Now that we have the most important objects, let's look at their important spatial locations and
scales. Basically, here we have a cloud. As we can see, the cloud is around middle size: it doesn't
occupy the entire image, and it's not too small.
And its location is always at the upper part of the image. This can also be discovered by the
semantically based compression method.
>>: So one quick question. How important is that? I can imagine doing the same thing just --
>>Jia Li: These terms will help you discover the important objects.
>>: Do you have any experiments where you just use the [inaudible] and get the same thing?
>>Jia Li: Yes, we actually did three experiments. One is with the L1 term only. The L1 term will
discover the important dimensions regardless of which object they belong to, right?
The problem with that is it doesn't keep the structure of the Object Bank representation, because
we formed the Object Bank representation based on objects, based on their spatial locations, et
cetera, and different objects have different probabilities of appearing in different classes.
So the L1/L2 term is introduced precisely to capture the importance of objects.
>>: We can take it off line.
>>: Are you familiar with the class [inaudible].
>>Jia Li: Yeah, I know.
>>: Can you briefly state why those --
>>Jia Li: Sure. I think that work is great. Basically, the major difference is that we capture the
spatial locations of the possible objects within an image, and we try to capture semantic
meaning, whereas they claim that their intermediate representation is not necessarily related to
semantic meaning. But with semantic meaning and spatial location information, you can see we
can do a lot of interesting things.
And in the future we can build, say, a segmentation algorithm based on the probability of an
object appearing in an image, from the object responses within the image.
So there are a lot of applications for this. So, are we good? Now I'm going to talk in detail about a
more challenging and less investigated problem: the total scene understanding problem.
In total scene understanding, we try to understand every single pixel of an image. And currently
we are considering classification, segmentation and annotation.
As we know, traditionally these three problems are independent, separate fields in multimedia
and computer vision.
And they have been treated as independent problems. Here we try to solve them together,
towards the goal of total scene understanding, taking advantage of the fact that they
are mutually beneficial to each other.
Traditional methods also require clean training data. Here, similarly to our OPTIMOL approach,
we try to learn from noisy Internet data, so that we can free humans from the heavy labeling,
because segmentation and annotation actually require much more labor than labeling the objects
within an image.
So how do we model these three problems? First, we want to model classification. Here we
use a generative model, which is represented as a graphical model here. A circle represents
a random variable, and a plate represents repetition; we gradually build upon this simple model to
achieve the end goal.
So C here is image class. It can take values such as polo, sailing, rock climbing, et cetera.
It's slightly shaded. That means it's observed in training. And we also want to do segmentation.
How do we do that? We presegment all the images based on low-level cues into coherent
segments.
And the problem is we don't know which segment belongs to which object. We use an object
variable O to represent that here. O can take values such as horse, grass, sky, et cetera. And the
probability of different objects appearing in different image classes is quite different.
We use a multinomial distribution to model that. And the object identity is related to its
appearance. For example, a horse is often brown, and its texture is this skin-like texture, while
grass is always green and has its own characteristic texture.
So we use color, location, texture and shape to represent the region appearance. And again,
different objects have different appearances, so we model this as a multinomial distribution given
the object.
And small patches within each region can sometimes be very discriminative. For example, the
knees and the ears of the horse can help differentiate it from other objects.
So we extract a SIFT representation for these small patches, and model it as a multinomial
distribution depending on the object. Now, this is the visual component, and we want to model
annotation as well.
We want to connect our visual regions to the annotation words.
So T here is again observed in training. As we know, the number of segments in the image and
the number of text words are quite different.
So we use a connector variable to associate each text word with one of the image regions. There
are also some visually unrelated text words, such as saddle, but they can help us differentiate the
polo class from the rock climbing class, et cetera.
So we use a switch variable to connect these visually less salient or visually unrelated words to the
class variable.
And if a word is a visually very salient object, such as horse or human, we connect it with an object
within the image.
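Putting the pieces together, the generative story just described can be written as sampling code. The distribution tables and the switch probability below are hypothetical placeholders; the actual model is a Bayesian hierarchical model whose parameters are learned, not fixed lookup tables:

```python
import numpy as np

# Sketch of the generative process: class -> objects per region -> region
# appearance and patches; tag words attach either to an object (via a
# connector) or directly to the class (via the switch variable).

def generate_image(params, n_regions, n_words, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    c = rng.choice(len(params["class_prior"]), p=params["class_prior"])  # scene class C
    regions = []
    for _ in range(n_regions):
        o = rng.choice(params["n_objects"], p=params["obj_given_class"][c])   # object O
        appearance = rng.choice(params["n_app"], p=params["appearance_given_obj"][o])
        patch = rng.choice(params["n_patch"], p=params["patch_given_obj"][o])
        regions.append((o, appearance, patch))
    words = []
    for _ in range(n_words):
        if rng.random() < params["p_visual"]:            # switch: visually salient word
            o, _, _ = regions[rng.integers(n_regions)]   # connector picks a region
            w = rng.choice(params["n_vocab"], p=params["word_given_obj"][o])
        else:                                            # visually unrelated word, e.g. "saddle"
            w = rng.choice(params["n_vocab"], p=params["word_given_class"][c])
        words.append(w)
    return c, regions, words
```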
So the exact inference of this model is intractable; I derived a collapsed Gibbs sampling method to
do the inference.
And, again, here we try to learn from Internet data, because we already modeled the visually
unrelated text and visually less salient text.
So no human supervision is needed. We can go to Flickr, Picasa and Windows Live to grab a
lot of user images and try to learn from there.
So, for example, here we tested on Flickr user data. We downloaded eight event scene classes;
here, each doublet represents the images from one class.
We can see this dataset is a very challenging one. Objects can be very blurred and there are
occlusions and there are multiple objects, et cetera. This is also a quite large dataset.
It's over 6,000 images, with over 1,200 concepts. Now, I have talked about our three goals;
let's evaluate the performance on these three individual goals. We start with the classification
task.
So first we compare with Fei-Fei et al., a visual-only model, the pioneering topic model from
language modeling that has been applied to scene classification.
They can achieve about 40 percent. Now let's see another model that jointly models the
textual component and the visual appearance. It can do slightly better.
And this is the performance of our model. By jointly modeling the textual and visual appearance,
and by capturing the noisy property of Internet data, we can achieve a much better
result.
And --
>>: How would you [inaudible] the model? I thought it was an unsupervised way of finding
segments; how are you modifying it?
>>Jia Li: The scene class mentioned in the model, the C variable, is --
>>: It's observed.
>>Jia Li: It's observed in training, so we can predict it at test time.
>>: So in testing you basically hide it and you're trying to infer which is the most likely --
>>Jia Li: Yes, exactly. For annotation, we compare with ALIPR, the traditional, very popular
method for multimedia content-based annotation, and with the joint modeling of textual and visual
appearance I mentioned just now. We do comparably well, and our method, by jointly modeling
the textual and visual appearance and, I think most importantly, by considering the noisy nature
of the Internet images, can achieve much better results.
Here is the overall result. And for segmentation, we compare with the state-of-the-art concurrent
object segmentation, where you concurrently segment multiple objects in an image.
That's Cao et al., and our model does better in every single object class.
Now let's see some visualization results. How can we segment this image? What's the most
challenging object here?
Probably this snowboard. Let's see if we can get it. Yes. Although it's a very small object and it
has an unusual viewpoint, et cetera, by learning from a large amount of Internet images and
modeling the Internet image properties, we can successfully segment it.
>>: What was the input from the other? You just said.
>>Jia Li: Flickr images and tags in training.
>>: But when you ran the segmentation, you said snowboard to the system and it --
>>Jia Li: Basically, at test time the input is just one image with no tag. But because we modeled
the tag relationships, the visual relationships and the class relationships so well in the training
phase, it can capture them when you feed an unknown image to it, yeah.
>>: So the output is not just snowboard.
>>Jia Li: It's not.
>>: They are not. Yeah. So what's the base level segmentation you are using?
>>Jia Li: Base level?
>>: You mentioned that the input is not just the objects --
>>Jia Li: The input is small segments, and you can use any low-level segmentation algorithm.
Specifically, we were using superpixels, from the Felzenszwalb algorithm.
>>: [inaudible].
>>Jia Li: Just based on the local appearance, low level appearance, yeah.
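For reference, this kind of low-level oversegmentation is available off the shelf; a minimal example using scikit-image's Felzenszwalb implementation is sketched below (the parameter values are arbitrary, not the ones used in this work):

```python
from skimage import data
from skimage.segmentation import felzenszwalb

# Oversegment an image into superpixels based purely on low-level appearance.
image = data.astronaut()                   # any RGB image
segments = felzenszwalb(image, scale=100, sigma=0.8, min_size=50)
print("number of superpixels:", segments.max() + 1)
```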
We can also detect the human, and sky, cloud, snow and trees, all concurrently and automatically.
And now a more challenging task, to segment out the objects. As we know, segmenting horses is
a very popular topic; there are hundreds of papers published on segmenting horses, and there are
even human-labeled datasets developed just for this object.
Our segmentation algorithm can successfully segment the object concurrently with some other
objects as well.
>>: A lot of that is actually dependent on your superpixel segmentation algorithm; like, the
[inaudible] of the horses are always missing.
>>Jia Li: Exactly. But if you take a look at the original superpixel input, it's very
oversegmented. It doesn't have semantic meaning at all.
Then you need to put together all the segments related to some objects. That's a very difficult
part, because the low level segmentation doesn't carry any semantic meaning at all. Yeah.
So further, let's see the joint segmentation and annotation task. Given this image, our algorithm
can perfectly segment out the objects within it.
And in addition, we can discover some visually not so salient object, such as wind and seaside.
It's because they're so related to the salient class they can be automatically discovered as well.
This is also -- this is largely attributed to the switch variable that can selectively associate one text
word to the visual content or to the image class.
If we tried to associate 'wind' with some visual appearance here, we would be in big trouble. Yeah.
So understanding images is very useful; we can do a lot of things: annotation, segmentation, et
cetera.
Now what else can we do? We can do some fun application. For example, organizing many
images. I don't know about you, but I'm a big travel fan. And I take a lot of pictures during my
travel.
And I'm also a very lazy person. I do the minimum part of organization after my travel.
Now, these are all the folders containing my images. Suppose one day my friend comes to me
and asks me to show her all my birthday party photos.
I will be in big trouble. I will have to go into each of these folders and try to find my birthday party
photos. How tedious is that?
This can be done easily and automatically by building upon the total scene understanding model
and imposing a hierarchical nonparametric prior on it, which is the nested Chinese restaurant
process.
Don't worry, I will not go into detail of this. But if you want to discuss off line I'll be happy to talk
about details.
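For the curious, the prior itself is easy to sketch: each image draws a path down the tree, at each level either following an existing branch in proportion to its popularity or opening a new branch with probability proportional to a parameter gamma. A toy Python sketch of drawing paths from a nested Chinese restaurant process prior (only the prior over tree paths, not the full hierarchical model used for the albums):

```python
import numpy as np

# Toy nested Chinese restaurant process: `tree` maps a node (a tuple path) to
# the visit counts of its child branches; gamma controls how readily new
# branches (new concepts, e.g. a new event type) are created.

def sample_ncrp_path(tree, depth=3, gamma=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    path = ()
    for _ in range(depth):
        counts = tree.setdefault(path, [])            # existing child branches
        total = sum(counts) + gamma
        probs = [c / total for c in counts] + [gamma / total]
        choice = rng.choice(len(probs), p=probs)
        if choice == len(counts):                     # open a brand-new branch
            counts.append(1)
        else:                                         # follow a popular branch
            counts[choice] += 1
        path = path + (int(choice),)
    return path

tree = {}
paths = [sample_ncrp_path(tree) for _ in range(10)]   # 10 images get tree paths
```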
And let's see what we can do with that. All the photos will first be grouped together as photos.
Then within photos we can see event photos, and wedding photos are grouped as a subpart of
event photos.
And within wedding images we can see wedding gown and flower images related to wedding, and
we have birthday images within event. And many other branches containing different concepts.
And within them there could be a garden and holiday, et cetera, et cetera. The most important
thing is all this can be done automatically.
So I've talked about --
>>: You said it could be done automatically. Have you built the system, or is this something --
>>Jia Li: This is already done, yeah.
>>: Where is this described?
>>Jia Li: Huh?
>>: Which of your publications describes this?
>>Jia Li: It's my CVPR 2010 publication. I'll be happy to talk more about that.
I've talked about my current efforts. In the future, towards the goal of semantic image
understanding, detailed understanding of every single pixel within an image, I want to develop a
more powerful image representation. Remember, previously I mentioned the Object Bank.
It's a very good representation. The problem is its high dimensionality, and we use the
supervised method to compress it. And it was effective but sometimes we don't want supervision.
As we know, unsupervised feature learning has been very successful recently. Why don't we
use unsupervised feature learning on top of the Object Bank representation to discover a useful
representation while preserving its original structure and its original object meaning?
That's one of my short-term goals. And in the long term, as we know, there is large-scale
multimedia data available online; I'm sure Microsoft has a lot of it.
So the question is how we effectively take advantage of this data such as GPS information, time
stamp, the user tags, the text information around some images.
I want to do a good analysis. And to integrate all these components and to let them be mutually
beneficial to each other. And help us to manage this multimedia data better in the future.
And a more ambitious goal is to use image as information source to understand human behavior
and society. As we know, different generations might have different flavors of images that they
want to browse.
A little girl would browse a lot of Disneyland images, and travel fan or snowboarding fan will have
a lot of snowboarding images in their personal albums.
So this will reflect their behavior. And we can, based on that, to form useful groups or societies
among them, and also maybe to recommend different products to them.
So in my research, my goal is to understand every single pixel within an image. I've talked about
a new way of tackling a classical problem, object recognition, by learning from noisy Internet
images, where no human supervision is required. And this algorithm is scalable to recognize any
object that has images online.
I've also introduced a new image representation for the classical problem of image classification.
It's a sharp departure from the low-level image representations, carrying rich semantic meaning
and closing the semantic gap.
And finally, I talked about the total scene understanding problem, a very ambitious and less
investigated problem. This is our first attempt to understand every single pixel within an image,
and again we try to learn from real-world images and try to recognize real-world, challenging
images.
That's the end of my talk. I'd like to thank you, everybody here, and thank all my collaborators
and funding resources.
[applause].
>> Larry Zitnick: Any other questions?
>>: How would you improve image classification with Object Bank? There's still some way to go.
>>Jia Li: Exactly. So there are actually two ways to go. As we know, one important component is
the image representation, and another important component is how we use models to capture
the information carried by the representation.
From the representation side, as I mentioned, I want to improve it: basically make it a more
efficient and compact representation, and also improve the object recognition results.
So that the image representation built upon it will be more robust. That's one aspect.
Another aspect is suppose we have the representation here. How are we going to take
advantage of the existing representation and extract the most useful information, useful
components from it?
So two of my other approaches are of that flavor. Basically, you use a good machine learning
algorithm to capture the properties that are embedded in the image representation, so as to
achieve good performance in image classification.
>>: Okay. So I'm kind of curious like what is the most common failure mode right now on Object
Bank?
>>Jia Li: You mean the most --
>>: Why does it fail right now?
>>Jia Li: So currently we have tested on all existing scene datasets, and we also tested on one
object dataset, Caltech 256. I have to say Object Bank actually has the most potential in
recognizing object-centric, object-oriented images.
So if, say, we are doing image deduplication, I don't think it would necessarily be the best choice
to use the Object Bank representation for that.
But if the task is connected to semantic meaning and connected to objects, it's very powerful. So
far we've observed that it gives the best results on UIUC Sports; it's an object-oriented dataset
that carries a lot of semantic meaning. Did that answer your question?
>>: We can do it off line.
>>: I think, just to follow up on Sydney's question, it's impressive that your system is the best. But
even the best is only 60 percent. So what are the failures? Where is it that a human doesn't fail,
but the computer does?
>>Jia Li: As I showed just now, if we have the object labels, we can achieve 100 percent, right?
The current bottleneck is the object recognition part; object detectors haven't yet been developed
to that perfect level.
Once we have perfect object detection, that basically becomes a solved problem.
>>: Since once you have detection and classifiers on top of it, it doesn't matter as much anymore.
>>Jia Li: Exactly. So currently what I show here is object detectors are not perfect. But as some
intermediate representation, it can show a lot of potential in classification. Yeah.
>>: I have a question about the first part, when you do the text search and [inaudible]. What is the
algorithm? You type something and some images show up. And then how much data do you
actually use -- how do you know that you have learned the model?
>>Jia Li: That's a very good question. It's related to some technical detail. So basically we
collect all the possible images related to one concept. And then we do incremental learning.
>>: Images --
>>Jia Li: So I've tested over 400,000 images, over 23 classes. Basically the question is where we
should stop.
>>: Right.
>>Jia Li: So in the iterative learning process, we try to learn a good model while rejecting the bad
images. So we learn until we cannot get any more useful images -- until the model decides, okay,
there are no more related images, or all remaining images are rejected. Then we stop.
Yeah.
>>: You didn't say much about the low-level features you use. So for the OPTIMOL system, where
you're learning your [inaudible] model, what features do you use? And also for Object Bank, you
have your initial filters, so what are you using at the very earliest stages?
>>Jia Li: At the very earliest stage, in OPTIMOL, we use the SIFT image representation, and we
show that with a good machine learning technique and a good vision algorithm, even with a very
simple image representation, you can already see a very positive signal in object recognition.
And in Object Bank, it basically depends on what kind of object detectors, what kind of object
filters we use.
We use different types of low-level features: we have used color and texture, and we have also
used HOG features.
So basically I still think the low level features are very useful features. Object Bank is not
designed to replace the low level image representation. In fact, we did some experiments to
combine the Object Bank representation with the low level image representation.
And it shows significant boost over both Object Bank representation and the low level
representation. And that shows that Object Bank representation is a high level
semantically-based image representation. It's complementary to the low level image
representation. Whereas when we do experiments combining two low-level image representations,
for example GIST and SIFT, their properties are quite similar, so there isn't much of a boost
in those experiments.
>> Larry Zitnick: Anything? Let's thank the speaker again. [applause]