>> Ross Girshick: It is my pleasure to introduce Greg Shakhnarovich, who has one of the most difficult-to-spell last names in computer vision. He got his Ph.D. from MIT, originally working with Paul Viola, who had a stint at MSR a while ago. He is now an assistant professor at TTI-Chicago. He has worked quite a bit on sign language interpretation, both in the past and, I think, more recently or starting up again.
And he's also worked quite a bit on semantic segmentation. And he's
going to be talking about a recent system that uses deep learning to do
semantic segmentation.
>> Greg Shakhnarovich: Yes.
>> Ross Girshick: Please take it away from here.
>> Greg Shakhnarovich: Thanks, Ross. I will try to be original. Hence
the deep learning.
One correction. Actually, the day I arrived at MIT was the day Paul Viola left MIT. I worked with him mostly outside of MIT.
I am going to talk about recent and ongoing work on semantic
segmentation. It is joint work with two of my students, Reza and Payman,
you see here. Since some of you might not be aware of what TTI is, I
felt I should tell you in one slide something about TTI. Some of you
here have spent some time at TTI, but many of you haven't. It's an
independent academic research institute, philanthropically endowed. At
this point we have an endowment of a bit more than a quarter billion
dollars. And it's sort of a completely independent institution. We have
our own Ph.D. program which is fully accredited. We are loosely sort of
allied with the University of Chicago. We are on a University of Chicago
campus, but we are independent for purposes of hiring, admissions, et
cetera. We focus on a relatively narrow set of topics in computer
vision, in computer science. The two main sort of areas are machine
learning and roughly speaking applications of machine learning to various
AI tasks primarily. So vision, language, speech, robotics.
And another big area of interest is theoretical computer science. So we
sort of try to cover those. We currently have ten tenure-track faculty and about a dozen research faculty who come and go; they are in those positions for a while. And we have about 25 Ph.D. students and we keep actively growing, hiring in all positions, recruiting students. We are recruiting summer interns for research projects. It is a very active and vibrant environment. If you have any questions, I can tell you more after the talk.
All right. So semantic segmentation. So in general segmentation is a
very old task in computer vision. Traditionally I think it has been kind
of an evolving notion of what it means to segment an image. In the early days the focus was mostly on the very general and admittedly vague notion of partitioning an image into meaningful regions. You really have to do air quotes when you say meaningful regions because it is not clear exactly what that means. I will talk a little bit about it later because maybe we can
somehow leverage that notion to help us with the second task, which is in
fact the focus of today's talk. That's semantic segmentation.
That is a little bit more well defined. It means taking an image and labeling every pixel with a category-level label, which says what that pixel belongs to in the scene. Of course, there are various issues
along with this as well. We'll discuss those briefly just to help us
understand the challenges. But arguably this is not the most refined or most reasonable segmentation task. A more refined one is instance-level
segmentation that many people here work on. And the distinction is that
if you look at this image you can say, well, I just want to label all
bicycle pixels with green, all person pixels with pink and everything
else with black. But you can also maybe more meaningfully ask: Well,
how many bicycles are there? To answer that you have to label bicycle
number one, bicycle number two, and bicycle number three. That is a more
refined task.
Now, the focus today is on this intermediate level, category level
segmentation. Partially because that is kind of how things have been
evolving and partially because I personally think that it is still a
meaningful task. It should help us towards instance level segmentation
and also even though I am not admittedly going to show any experiments of
that nature here, you can think of categories for which the notion of instance is not meaningful. For example, stuff or regions which are
defined in terms of their texture or their physical properties, and not
an object instance.
So it is in some sense a formal classification task. You want to label each pixel with a single label. We are going to ignore the issues related to hierarchical labeling or the fact that the same pixel might have multiple
labels, multiple categories which overlap. So if you treat it as such,
there is a standard at this point, a standard benchmark task which is
called Pascal VOC dataset. It has 20 object categories plus the catchall amorphous background category, which is everything else. I think the
field is in the process of adopting a new benchmark, COCO, which is
spearheaded by a bunch of people here actually, and a few other
institutions. I think we are still transitioning to making that the main
benchmark. So for now Pascal VOC as of today is still the central
vehicle for evaluating segmentation.
The categories, if you are not familiar with it, are kind of a broad
range of things which are reasonable in everyday life. A few animal
categories, a few furniture categories, a few vehicle categories, and a bunch of other odd categories outside of that.
So there are some examples. Who is familiar with VOC and semantic
segmentation tasks?
So many of you, but also kind of briefly I'll go through this. This is
an example of a few images and the underlying ground truth labeled by
providers of the dataset. You can see it's pretty high quality, in most cases a fairly high-quality outline of objects with some fairly fine details.
It sounds really challenging. You have the cat and the sofa and cat and
sofa are two of the categories of interest. So you really have to sort
of correctly predict that these are pixels of a cat and these are pixels
of a sofa, and those are arguably challenging.
In the bottom row you can see some of the potential issues which we are
going to ignore here, but I want to point them out because I personally
always feel concerned about those. You can argue that these pillows are
not really part of the sofa, but for the purposes of this task we are going to ignore that and go with what the providers of the benchmark
said. They said well, it's all sofa. And here you can see an instance
of something that is kind of hard to see, maybe a little bit Christmas
tree. One of the categories is potted plant. You can say well, is it a
potted plant or not? Maybe it's a plastic Christmas tree. I don't know.
To avoid worrying about that, they in many cases like that just say it's
white, which means do not care. We are not going to penalize you for
predicting anything for those pixels in evaluation. Of course, these
are a few more examples. And they show a couple of issues. First of
all, it's really important to distinguish between semantic segmentation
versus instance level. In semantic segmentation we just care about all
of the purple pixels here being labeled bottle. Whereas if it's instance
level, it is really important for us to distinguish there are four
instances of bottles here.
Another interesting issue which comes up in the same image is that there, and this is kind of maybe hard to see, there is a car on the label of the bottle. So in the ground truth it is labeled as car, and we can debate at length whether it is reasonable or not. You can say it's a picture of a car, but then, well, all of those things are pictures. If I take a picture of a picture of a real car, is it going to be less of a car? It's hard to say. So this is where, for example, you could argue that those pixels are both car and bottle. Or maybe car, bottle and bottle label, if you want to sort of extend this. We can go down that path for a long time. We are going to just back up and ignore all of this.
>> It's not a car.
>> Greg Shakhnarovich: It is not a car? It's an SUV?
>> FYI, it's not a car.
[laughter.]
>> Greg Shakhnarovich: Because what is it?
>> He says it's ... [indiscernible].
>> Greg Shakhnarovich: Oh, it's a truck.
>> No, it's a chair.
>> It's a truck. We finally decided it was not a car.
>> Greg Shakhnarovich: Here we go. For that, we have a meeting devoted to that bottle?
>> Yes, we had many meetings devoted to that bottle.
>> Greg Shakhnarovich: To that bottle? To the content of the bottle.
Anyway, so this is another example where you have the really fine-grained case. Many chairs, and some of them are hard even for us to separate. So I'm just bringing it up to get us thinking about
the challenges here, but we are going to kind of simplify our lives for
purposes of this talk and really focus on this semantic segmentation,
semantic level, category level segmentation. We are going to ignore the
issue of whether it is a car or not.
All right. So I will tell you briefly about some history of segmentation
and how it relates to our work, partially because I think it's good to know if you don't know, and partially because this will help me lead towards the motivation that we had in designing our system. I'll tell you kind of how and why we came up with what we came up with, which is the architecture which we call the zoom-out feature architecture. And I'll tell you how we implemented it using deep learning, in a somewhat, I feel, pretty natural way. I'll tell you a little bit about the results and
kind of where we are taking it now.
So okay. Going back to this original vaguely defined segmentation task.
Often people call it unsupervised segmentation. I put the quotes here
because really it is a misnomer. It is unsupervised only if you don't
learn it from data. Typically people do learn from data. And what
people mean by unsupervised usually it is not class aware. There is no
notion of classes, but rather this partitioning of image into regions.
For a while it was kind of very disorganized, and I guess up until the late '90s people would just take a few images, run their new experiment on those images, put the images in the paper and say look, the segmentation
is great. Someone else will take another five images, run their
algorithm and say our segmentation is even better. So the Berkeley
people starting in the late '90s decided to turn it into a more rigorous, modern, quantitative field. They collected a bunch of images and asked many people per image to label what they considered meaningful boundaries between regions in those images, and that led to the creation of the Berkeley Segmentation Dataset. And as you might imagine, if you show the same image to a bunch of people and just tell them this vaguely defined task, people will do different things. In fact, these are three actual human labelings for the first image.
Someone only outlined fairly coarse kind of boundaries. So the person
versus the background, this very salient column here and the wall on the
boundary of the floor versus the outdoors, and that's pretty much it. A
couple other things.
And someone else was extremely nit-picky and outlined very fine details.
Maybe someone was disturbed, very fine details of the branches of the
trees and almost individual leaves. And someone else did something in
the middle. And of course, people do different things, but there is a lot of system to this madness. It is not chaotic.
In fact, if you combine all those labels, what emerges is some notion of
perceptual strength of boundaries. So ignoring minor details of a few pixels of displacement, you can overlay them and see that pretty
much everyone labels some of the boundaries. Like the person there in
the background. Most people label some other boundaries. And then as
you kind of reduce this threshold you gradually get to boundaries which
only maybe one or two people label. And you can think of those as less perceptually [indiscernible].
Following this insight, a lot of work on non-class-aware segmentation, rather than trying to partition an image into a hard set of regions, has been focusing on hierarchical partitions which correspond to this boundary strength.
So one reason I am bringing it up is that we can arguably think of taking some sort of partition like this, thresholding this boundary map somewhere, and using the regions that it produces to do semantic
segmentation. Saying well, we just need to label the regions. It turns
out it's really hard to do this because we don't have a good way to
establish the threshold which would be good for any, for all categories,
for all images, et cetera.
In fact, it remains a very challenging task, even though we are starting to approach human-level performance in terms of inter-human agreement on this boundary task, in terms of precision-recall. But if you want to get an actual set of regions, it's hard. What we can do really well today is take an image and partition it into something called superpixels, which you can think of as very small segments which tend to be coherent in appearance. Color is probably the most important thing when you look at small regions, and they tend to be spatially compact.
We would like to get a large grid of almost regular small regions, and usually in these algorithms you have a knob which we can turn. And we use an algorithm called SLIC which is particularly good at this. You can turn it from having a really refined partition, in the extreme case just one pixel per region, all the way to a very coarse partition; here you have 25 regions.
Now, what happens when you have this very coarse partition is you start breaking real boundaries. At this point you kind of start breaking things. You chop off this lady's head and connect it to the sky. Part of the building is connected to the grass region, et cetera. At the other extreme is a very, very fine over-segmentation. Of course, you aren't going to do that; if you have a single pixel per region, you haven't done anything. What we are after is some regime in the middle where you have maybe hundreds to a thousand regions. They tend to have high recall for real boundaries, at the cost of maybe low precision. So the
point is that if you reduce the original image from a million pixels to 500 superpixels, but you produce them in such a way that almost all the true boundaries are preserved -- if there is a true boundary, it is going to be a boundary between superpixels, although many of the boundaries between superpixels are not real boundaries -- then you haven't lost that much information in terms of recovering the true boundaries. But you have
dramatically reduced the computational cost of many things you want to do
with those images, right? Instead of labeling a million things you now
have to label 500 things.
Specifically for the VOC benchmark images, we found that with about 500 superpixels per image we can retain almost 95 percent achievable accuracy. What I mean by achievable accuracy is: if you magically knew what category to assign per pixel, you could get up to 95 percent accuracy. I haven't yet told you how we compute accuracy; I'll mention it later. But I can't think of any reason that it shouldn't be more or less true of any reasonable measure of accuracy.
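A minimal sketch of what this achievable-accuracy computation could look like, assuming integer label maps for superpixels and ground truth (illustrative only, not the speaker's actual code):

```python
import numpy as np

def achievable_accuracy(superpixels, ground_truth):
    """Upper bound on pixel accuracy if every superpixel were assigned the
    majority ground-truth class of its pixels.
    superpixels: (H, W) int array of superpixel ids.
    ground_truth: (H, W) int array of class labels (no "don't care" pixels)."""
    correct = 0
    for sp_id in np.unique(superpixels):
        labels = ground_truth[superpixels == sp_id]
        # the best any labeling of this superpixel can do is its most
        # frequent true label
        correct += np.bincount(labels).max()
    return correct / ground_truth.size
```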
So we are going to stick with this. Who is familiar with SLIC
superpixels? Okay, many of you.
I'm going to mention it really briefly because it's a good tool. It is basically a very simple algorithm: k-means over pixels. There are two twists which make it really work well compared to what people tried before. One is that there is a spatial constraint which doesn't allow you to associate a pixel with a cluster mean which is too far in terms of location in the image. The cluster mean has three numbers describing the average color and two numbers describing the average position, so the position can't be too far. And the second twist is that you have a second knob. The first knob tells you how many superpixels you have and the second knob tells you how much you should care about distance in color versus distance in location.
And the idea is that if that knob is very high, then you are going to
mostly care about location and what you are going to get here is mostly
rectangles. So I don't know if anyone here is wondering about this. One person, so far in all the talks I have given, asked me how come it's rectangles and not hexagons. I think if you initialize it on a shifted grid, it actually should be hexagons, but usually people initialize it on a regular grid and it ends up being rectangles. If you set this M to 0, it's going to produce fairly irregularly shaped, but still somewhat constrained in space, clusters -- because we still have the spatial constraint -- which mostly care about color. So with some reasonable intermediate value of this M here, you are going to get superpixels which tend to be regular -- they would like to be regular rectangles -- but they will snap to boundaries based on color differences when the local evidence is sufficiently strong.
So it's a very, very good algorithm, very fast. It takes a couple hundred milliseconds per image at this point. It probably can be made even
faster.
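For reference, a minimal sketch of using SLIC through scikit-image, with the two knobs mentioned above (the exact parameters used in the talk's pipeline are not specified, so these values are just illustrative):

```python
import numpy as np
from skimage import io
from skimage.segmentation import slic

image = io.imread("example.jpg")  # hypothetical input image

# n_segments is the "how many superpixels" knob; compactness is the M knob,
# trading distance in color against distance in location (high -> near-regular
# rectangles, low -> irregular, color-driven regions).
superpixels = slic(image, n_segments=500, compactness=10.0, start_label=0)

print(superpixels.shape, len(np.unique(superpixels)))
```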
All right. So now we have this machinery: we can take an image, we can simplify our life by partitioning it into superpixels. We want to label
them and now we can talk about how we actually do segmentation. And for
a few years until maybe about a year ago, almost all successful
approaches to segmentation were following the general philosophy of
structured prediction. So very briefly, what it means is that we want to classify a bunch of things -- assign a bunch of labels to superpixels -- but we know that they are not independent, right? We will talk shortly about some of the sources of those dependencies, but it is pretty clear there are a lot of relationships between the different labels you want to assign to an image. And this is true in many other prediction tasks and applications of machine learning. So the way structured prediction works is you usually express some sort of score
function. You can think of it as the score is telling you how reasonable
is the particular set of labels X given image I. So segmentation here,
we consider a graph over superpixels or pixels, whatever you are
labeling. So V is a set of superpixels, E is a set of edges. And in the
simplest case you can think of some sort of lattice-like graph where you
have neighbor, notions of neighbors and each superpixel assigned has an
edge connecting it to its neighbors, but you can also think of a really,
really large complete graph where every superpixel is connected to all
other superpixels. And this variable X_S is the class assignment, from 1 to C where C is the number of classes, to superpixel S. And so in the typical structured prediction framework you have two terms that determine this cost function, this score. The first one is a bunch of unary terms. So F_I of X_I basically tells you how reasonable it is to assign label X_I to superpixel I, given the image. So it's very kind of,
this template is very generic. It technically allows you to look at all
kinds of things you want to look at inside the image I. You can think of
it as kind of like a classifier. It doesn't have to be a classifier
probabilistic function, but something that tells you how good this
assignment is for the superpixel. And then pairwise terms tell you how
reasonable it is to assign a particular pair of labels, XI, XJ to a pair
of superpixels I and J which happened to be connected in your graph.
Again, while computing whatever you want technically in this
[indiscernible] from the image. It is a very broad notation.
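Written out, the score being described is presumably of the standard form (notation assumed here, matching the verbal description above):

```latex
F(X; I) \;=\; \sum_{i \in V} f_i(x_i; I) \;+\; \sum_{(i,j) \in E} f_{ij}(x_i, x_j; I),
\qquad x_i \in \{1, \dots, C\}.
```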
Now, once you define the notation, you can think of finding the X which maximizes this function. It's called the MAP assignment. I assume many of you are familiar with this. It is basically maximum a posteriori; this terminology comes primarily from probabilistic thinking about this kind of model [indiscernible]. Specifically, if you think of F as an unnormalized log probability -- the log of an unnormalized probability -- then you can think of this as maximizing the conditional probability of X given the image, of this labeling given the image. If you in fact train the parameters -- there are some parameters of these F_I's hiding inside this generic template -- to maximize the conditional probability of the ground truth labeling given the image over the training data, you have what is called a CRF. If instead you say I don't care about probability, I just want to maximize the score, great, you will get a slightly different learning procedure. Then
the intuition here is that you want to make sure that the ground truth
score is higher than the score of any other labels by some margin. And
the margin depends on how bad the label is. If it's a really almost
perfect labeling, then you are okay with the score being almost as high
as the ground truth. If it's really bad labeling, you will like to
punish it and make sure that the score of that labeling is really
significantly lower than ground truth.
I.e., turning this inequality into a loss function, a hinge loss function, produces what is called a structural SVM. So these are kind of two
different ways to train this model. In most cases, however, a major part of the learning procedure that you have to worry about is doing this MAP inference: finding the X which maximizes F of X for the current setting of parameters. And that is often very hard.
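The two training criteria sketched above can be summarized roughly as follows (a schematic rendering of the verbal description, with Delta denoting the label-dependent margin):

```latex
% CRF: maximize the conditional log-likelihood of the ground truth labeling X*
P(X \mid I) \propto \exp F(X; I), \qquad
\max_{\theta}\; \log P(X^{*} \mid I; \theta)

% Structural SVM: margin-rescaled constraints, turned into a hinge loss
F(X^{*}; I) \;\ge\; F(X; I) + \Delta(X, X^{*}) - \xi \quad \forall X,
\qquad
\text{loss} \;=\; \max_{X}\big[F(X; I) + \Delta(X, X^{*})\big] - F(X^{*}; I)
```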
And to understand why it tends to be hard we should think what kind of
things you want to capture with these score functions. So a unary potential, right, is something which tells us how reasonable it is to
assign a particular label to a given superpixel on an image. Any
suggestions what we might want to capture there? Reasonably?
>> Color.
>> Greg Shakhnarovich: Color. Some consistency with the class. So color, what else?
>> [indiscernible].
>> Greg Shakhnarovich: Texture, right? What else? Ross looks like he knows.
>> So why do you have to give these things names? ... [indiscernible] and
figure it out.
>> As long as it works, that's all we care about.
>> Greg Shakhnarovich: Oh, you want to learn the features? That's
awesome. We should do that. Still, that's -- so okay, everybody, I
never know what is -- maybe everybody here has thought about it deeply.
So color, texture, position in the image -- which is, I would like to see that being learned, but we can, supposedly.
One other thing which may be a little bit less obvious is object size.
So if you care about labeling pixels, you can say if generally some
object tends to be very small, then a priori it is much less reasonable: without looking at the image, you should think that labeling anything with that object is less reasonable than labeling with an object which is really large, right? Again, it is something which can be part of the
unary, this unary term.
Pairwise terms may be a little bit more interesting. What do we capture
there typically? One of the things that has been prevalent in all these approaches is smoothness. You can say well, pixels next to each other, [indiscernible] next to each other, are statistically more likely than not to belong to the same class, because most places in the image are not boundaries. Most places in the image are kind of inside an object. You can refine it a little bit more by saying, well, if they are
not in the same class, some combinations are more reasonable than others,
right? Cow next to grass is reasonable. Cow next to typewriter is less
reasonable perhaps? You can refine it even more saying well, it also
depends where things exactly are relative next to each other. Person
above horse is good. Person below horse is less good. Things like that.
But there are lots of other things you might want to capture with this model, and some of them are notoriously hard to capture when we restrict ourselves to this kind of pairwise features for the potentials. For example, you might want to say, well, we expect some things to co-occur in the image, not necessarily next to each other. Once you want to capture that, you need to look at much broader interactions than just neighboring pairs of pixels. You might want to capture things like the shape of entire regions that you recover, and those things are
superpixels. And so people have been thinking a lot about this. One
example of this, which I like, is the harmony potentials work, which was the winner of the segmentation competition associated with the VOC challenge in 2010. And here the idea is you have this notion of superpixels -- they had superpixels and groups of superpixels -- and on top they have this global potential which is allowed to look at all of the labels you are assigning. What they did here was one of the earlier attempts to leverage image classification to help segmentation, which was somewhat successful. At the time -- this is before the CNN era -- image classification was already considered to be a much easier task than image segmentation. Image classification here means: given an image, tell me if you think that the image includes any instances of airplanes, or any instances of cows, or any instances of persons.
Based on that, first of all, if the classifier says there is no airplane there, you should be very reluctant to assign airplane to any superpixel in the image. And in addition, you can also think of kinds of scene classes. You can say, well, given the distribution of classes which the classifier thinks are there -- if airplanes and cars and, I don't know, birds have high probability and desks and chairs are low -- then intuitively you would say it's an outdoor image and not an indoor image. Other classes which may not explicitly be identified as present by the classifier still get boosted or squashed by this potential.
The idea is to somehow use this global information about the image to help local decisions about the superpixels. That was half a percentage point better than the other methods, which was significant at the time. But it didn't really move forward much more; people tried to improve these [indiscernible] approaches, but without a lot of success. I think the major breakthrough came a couple of years later when people started relying on a very different approach, which you can think
of as a pipeline which separates the process into two stages. The first
stage is producing candidate regions. Saying wait, it's hard to
partition an image. Maybe we can produce a large pool of regions which
are allowed to be overlapping. You would probably hope to have some sort
of diversity there to make sure there are some interesting differences in
regions you have produced. But you could have potentially a couple
thousand regions, maybe even more. And you hope that some of them are
really good matches for the underlying objects. And then you can have a
separate second stage in which you will take some machine which will
score this region saying how likely is it that this region is an entire
object or group of objects which I'm looking for?
And so this led to a significant jump in the accuracy of these segmentation algorithms in a bunch of different papers in the usual venues, and some of them are actually the work that Groer and I did together on producing a diverse set of regions and re-ranking them. And I think that the most recent work in this general line is SDS, the simultaneous detection and segmentation work that Ross participated in. And this certainly was a big improvement, but I still felt somewhat unsatisfied by it because this multistage setup seems kind of unsatisfactory, and it definitely is quite a bit slower than what we would like it to be.
And there is no kind of way to learn the whole thing together, which, as Grover was saying, seems to be the prevalent philosophy today, which I certainly subscribe to.
So when we started this project we wanted to take this structure
prediction approach and say basically just get better unary potentials.
The common intuition in structured prediction, at least in vision, in this kind of setup, is that the unary potentials, the individual terms, are the ones that really drive most of the inference. They tell you kind of what is the general thing you should expect to see. And the pairwise potentials, maybe even higher-order potentials, help you a little bit to improve the result. They smooth things over. They remove some totally unreasonable combinations. But the main meat of this
method seems to be the unary. So let's try to get better unaries. We
felt like the unary potential that people used were inadequate and as
I'll show you shortly it turns out that you can get much farther than at
least we thought we would with just unary potentials.
So the key idea why it would be better here is that you can shift at
least some of the burden of deciding what combinations of labels and what
structure of label space is reasonable from the inference expressed in
the label space into the feature computations. So you want to kind of
shift it to what we compute from the image and hope that some of the
properties we would like to capture which we discussed earlier would
emerge.
And it is pretty clear that if you do it, you need to look beyond just the local set of pixels in the superpixel, because some of the features we talk about rely on information beyond the boundaries of the superpixel. The question
is how far we need to look. As I'll show you, it turns out it is
beneficial to look really far basically, as far as you can.
So here is the general gist of what we are doing. I'm going to kind of
instantiate in the next few slides. So suppose you want to classify this
red superpixel whose boundaries are in red. This happens to be part of
the headlight of a car. If I just show you the pixels there, you
probably would not be able to really guess what that is. Some sort of
shiny object. Maybe it would have, well, it doesn't look like it's part
of an animal, but who knows. And so this very local feature is something
which may be helpful, but not beyond, we don't expect to be too helpful.
Then we are going to start zooming out from that superpixel and look at larger and larger areas of the image. How exactly these boundaries are computed and what we compute from those I am going to defer until a couple of slides later.
But let's just say we extract some useful information from those. By the time I get to this yellow or olive colored region, it is starting to be a little bit clearer. It still may be hard to say what exactly it is, but you can see a bunch of really straight lines which should make you think it is some sort of manmade object. It certainly doesn't look like an animal now. It has some flat
metallic like looking surfaces. Maybe some sort of vehicle or maybe a
piece of furniture. By the time we get to this purple region, most of
you probably if I just cropped this and showed it to you, most of you
would probably say it's a car because you now see the radiator, you see
the wheel. It is kind of pretty clear what it is.
Of course, by the time you get to this larger region, the blue region,
you actually see the -- oh, although it doesn't look blue here for some
reason. You actually see most of the car. It's pretty clear that you
are looking at the car. Remember, all of this is in the context of
classifying that red part of the headlight, right? By the time you look
at the entire image, in this case you don't get much more than what you
had from the intermediate level because the car occupies most of the
image. But in many cases you see things which are other than just the
object you are looking at. You see other objects. You see the kind of
stuff which surrounds it. Here you would say well, it's an urban outdoor
scene. So the car is very likely. So the idea is to extract features
from all of these levels. We call them zoom-out levels because you zoom out from the superpixel all the way back to the entire image. Concatenate them and use them in some kind of classifier to predict the label for the one superpixel you are looking at. And do it for every
superpixel in the set, in the image.
So that is the general gist. Let's try to now decide, define how exactly
we want to do this. So what properties would we like to have from these
features? So one intuition which I think is common in vision and
certainly is captured by CNN, as Grover was suggesting, is as you move
from a very small spatial part of the image to a large spatial support,
you can essentially get more information that allows you to compute, to
extract more complex features because you kind of, more things can happen
there and you have more information to decide what is happening.
So as an example, suppose I show you these two superpixels. Anyone has
any guesses what these are?
>> A cheetah on the bottom?
>> Greg Shakhnarovich: Cheetah on the top? Yes, cheetah is not one of
the classes in VOC, but yes. Cheetah and something else? Okay.
>> [indiscernible].
>> Greg Shakhnarovich: Sorry?
>> Potted plant.
>> Greg Shakhnarovich: Potted plant, very good. Okay. Now, let's zoom
out a little bit more. Any updates to the guesses?
So usually when I saw it first, I couldn't really -- I said I can't tell.
This looks like a wheel.
>> Yeah.
>> Greg Shakhnarovich: Wheel or maybe steering wheel. Okay. What about this? Cheetah?
[Chuckles.]
>> Horse.
>> Greg Shakhnarovich: Horse? Who said horse? Okay.
>> Are these from VOC?
>> Greg Shakhnarovich: Oh, yes, they are from VOC. Okay, so maybe.
Let's zoom out a little bit more. I don't know, horse?
>> Wheel.
>> Greg Shakhnarovich: Wheel?
>> Steering wheel.
>> Greg Shakhnarovich: Steering wheel, and this may be a horse. Zoom out a little bit more? Okay.
>> Oh, a chair.
>> Greg Shakhnarovich: Chair. This is like Antonio Torralba used to show a lot of these kinds of examples for context. So probably --
>> Imagine idea [indiscernible].
>> Greg Shakhnarovich: Exactly.
>> [indiscernible].
>> Greg Shakhnarovich: What is the name of the horse? You get bonus points.
[Chuckles.]
>> Greg Shakhnarovich: Okay, so yes. Maybe it's a donkey though, but anyway. So running through them all: it usually becomes more clear because we see more of the object, we see some surrounding stuff. In this case we don't see the surrounding stuff that much, but we see a lot more of the object. In this case we start seeing surrounding things. Of course, by the time we get to the entire image, everything makes a lot of sense now. It's a chair. In fact you can see it's some sort of room, a dining room. It has other chairs. It has tables. Clearly it is inside a house. Here you, you know, you see much of this horse and see some other animal; even if this horse looks sort of weird, it is clear it's an animal, and having any kind of animal increases the probability it's a horse. You see hay, sky. It's a [indiscernible] image. All of the things we would like to capture.
Obviously anthropomorphizing computer algorithms is dangerous; it's
wishful thinking. We would like to extract all of those things, but it
gives you an idea at least what we would hope to extract and why zooming
out might help us to capture things. And it also emphasizes that, yes, as you zoom out farther from the original superpixel, you should be able to compute more complex features.
Okay. Now, another thing we should think about is how do features we
compute from different zoom-out levels interact for various locations in
the image. So if you consider two locations which are close to each
other, immediate neighbors or almost immediate neighbors, and consider
different spatial extents from which you want to compute the features, it
is pretty clear that as you consider very local zoom-out levels, like individual superpixels and slightly enlarged areas, they could vary very quickly as you move around the image, because they are very local: if there is a strong boundary there, it can dramatically change the color and texture and other things you will compute if you move just a few pixels. As you zoom out more, the areas from which you compute the visual information start overlapping more and more. And so as a result, you start imposing some sort of smoothness. So here, just by this notion of overlapping regions, we get smoothness for free without having to penalize the underlying assignments later in the inference stage.
So if you look at zoom-out levels which are still fairly small, the overlap might be minor. By the time you get to the large, let's say purple, regions here, two superpixels which are not even immediate neighbors start having very large overlap.
This guy which is far from them still has a fairly minor overlap, so we can still allow quite a bit of variation. By the time you get to the very large regions, superpixels which are some part of the image away from each other will likely have very similar distributions of features, unless there is a dramatic change in what is underlying there.
So we get a sort of smoothness which is dynamic and in a sense adaptively varying, depending on what we actually compute from the image, and different at the different levels: it varies much faster at the small levels.
Yes?
>> Do you share computation in the computing [indiscernible]?
>> Greg Shakhnarovich: I'll talk about it later. The answer is that there could be. Would you like there to be?
[Chuckles.]
>> Greg Shakhnarovich: So there could be except that we are still
working on getting it better. But certainly in terms of, it should be
possible to compute this in a shared way.
I'll talk about how we compute the features and it will be all clear how
we can actually share the computation.
And of course, once we zoom out to this global level, the entire image, all superpixels in the image will have exactly the same set of features by definition. Of course, other images will have different features. You can think of it as kind of varying
degree of how shared the features are and as we move up this hierarchy of
zoom-out levels and sort of they become more and more smooth and capture
more and more spatial context, the underlying complexity grows. You can
think of different other kinds of things you can capture with these
features. If you are very local, you capture very local properties. We
talked about color, texture, et cetera.
As you move to the intermediate levels you start capturing maybe some
parts. Maybe even some small objects. Some kind of informative pieces
of boundaries which straddle multiple objects, which should tell you
about statistics of class versus another class.
If you go to even larger regions, you start capturing maybe bigger
objects, large parts of objects, constellations of parts. By the time
you get to the global zoom-out level, you capture properties of a scene
and what kind of image you are looking at, which includes things which are not directly related to any specific object or stuff, but like the distribution of objects, or the type of environment -- you know, lots of straight lines in the image make it look like a manmade environment. It is not something directly tied to any object, but you expect features that capture it to be useful.
So all of this kind of list of properties suggests a particular type of architecture in this day and age, right? And that is [indiscernible] a convolutional net, because that really fits the bill on all counts, right? It computes features of increasing complexity as you increase the receptive fields. It captures things at different semantic levels of representation, as we know from lots of people who have tried to visualize and understand what these networks do. So how can we leverage neural networks to do
this?
So the initial version of this work is still on arXiv; until recently I didn't realize that people expect the arXiv versions to actually be more up to date than the conference versions. So the arXiv version is the preliminary version. People keep asking me if I published it. I said yes, I published it in CVPR.
Initially we combined some features computed by neural networks with some hand-crafted features, and we went through a process which has taught me a lot. The bottom line is that every time we dropped some hand-crafted features, we improved performance. Every time we dropped some decisions we had made and said we'll just use all the layers of the neural network, we improved the performance. The bottom line is, as Fred Jelinek supposedly used to say, every time I fire a linguist my recognition rate goes up. Basically, every time you prohibit yourself from making decisions, apparently it improves the accuracy.
>> [indiscernible].
>> Greg Shakhnarovich: What is that?
>> Every time you fire a Gestaltist --
>> Greg Shakhnarovich: A Gestaltist? Yes, I should hire a Gestaltist and then fire them.
[Laughter.]
>> Greg Shakhnarovich: In other news, there is something called a
Gestaltist.
All right. So let me now describe -- so this is maybe if you want to
remember one slide which summarizes, really gives you very good
understanding of what we do at both conceptual and detailed level, this
is the slide. Let me walk you through it. So this is an example of a toy convolutional neural net which has three convolutional layers and two pooling layers, okay? So the first convolutional layer has, say, 64 filters. We are now thinking of computing a zoom-out representation for this superpixel marked in red.
So the first convolutional layer is going to compute 64 feature maps which, assuming the right padding, are going to have the same size as the input image, right? So now you have 64 numbers for each pixel. We are going to take all the pixels in the superpixel and we are going to average those 64 numbers over them, so we get a single 64-dimensional vector. That is the first zoom-out level feature for the superpixel, eh?
While we do this, it is important to think about the receptive field of this feature, right? What is the receptive field of this feature? It is slightly more refined than just the rectangle of [indiscernible] the filter. I'm thinking of the set of pixels in the image which affect the value of the 64 numbers. In this case, we are basically using three-by-three filters here. Can someone tell me what is the receptive field of this feature? In some concise form?
[There is no response.]
>> Greg Shakhnarovich: So there is a very simple way of thinking about it. It is the exact superpixel dilated with the three-by-three box, right? Because that is how I compute its convolutions. So all the pixels which fall within the dilation with this three-by-three box are going to contribute values. Everything outside does not. So basically it is one pixel outside of the superpixel, almost the same as the original superpixel.
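As a small illustration of that dilation view of the first-level receptive field (a sketch, assuming a binary superpixel mask; not the speaker's code):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def receptive_field_level1(superpixels, sp_id):
    """Receptive field of the first conv layer's pooled feature for one
    superpixel: the superpixel mask dilated by the 3x3 filter footprint."""
    mask = superpixels == sp_id
    return binary_dilation(mask, structure=np.ones((3, 3), dtype=bool))
```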
Then I do some sort of pooling -- it doesn't matter whether it's max or average. That produces a feature map which is half of the resolution of the original image. Now I'm running the next convolutional layer, with 128 filters, let's say. So this will give me 128 numbers for each pixel here. Now, I still would like to describe things in the original image. So I'm going to upsample it by a factor of two to get back to the original resolution; it can be bilinear interpolation. Now do the average pooling. Now I have this 128-dimensional representation for the entire superpixel. Now, I am not going to ask you what the receptive field is. It's a little bit trickier. It turns out -- it is not that tricky. There is a formula, but you have to spend a few minutes figuring out how to compute it. It's a recursive formula that tells you what the size of the receptive field is for this feature, because it is a combination of convolution and the sub-sampling due to pooling. But it is not rocket science.
The general intuition is that it will grow a little bit because of the convolution and significantly because of the pooling. So the receptive field of the feature here is going to extend beyond the original superpixel by more than a few pixels.
Then I go to the next pooling and the next convolutional layer. The same thing happens. Now I have to upsample by a factor of four and average over the superpixel. Now I have an even larger receptive field. If the first receptive field was pretty much the superpixel, the second one was a little bigger, and the third one is probably something like this.
So there is this natural growth in the size of the receptive field as I move up the network, as I compute more complex features. But in the end I attach those features to the superpixel I'm classifying. In this case, since it is a very small network, I'll stop here and concatenate those features -- ignoring the global features, just from the convolutional layers. I get these three components which together give me a 448-dimensional representation, the sum of these three numbers.
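A rough sketch of the per-superpixel zoom-out pooling just described -- upsample each convolutional feature map back to image resolution, average within each superpixel, and concatenate across levels. Function and variable names are made up for illustration; this is not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import zoom

def zoomout_features(feature_maps, superpixels):
    """feature_maps: list of (C_k, h_k, w_k) arrays from successive conv
    layers (e.g. C = 64, 128, 256), possibly at reduced resolutions.
    superpixels: (H, W) int array of superpixel ids.
    Returns an (n_superpixels, sum of C_k) feature array."""
    H, W = superpixels.shape
    sp_ids = np.unique(superpixels)
    per_level = []
    for fmap in feature_maps:
        C, h, w = fmap.shape
        # upsample back to image resolution (order=1 is bilinear-like)
        up = zoom(fmap, (1, H / h, W / w), order=1)
        # average the C activations over each superpixel's pixels
        pooled = np.stack([up[:, superpixels == s].mean(axis=1) for s in sp_ids])
        per_level.append(pooled)
    # concatenate across zoom-out levels, e.g. 64 + 128 + 256 = 448 dims
    return np.concatenate(per_level, axis=1)
```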
>> So [indiscernible] happening in the ... at the same time, between this and hypercolumns --
>> Greg Shakhnarovich: I will talk about those when I talk about results. Both hypercolumns and FCN, the fully convolutional networks -- I guess all of this is concurrent work. There are some interesting differences in how we compute things, what we average from what layers. Our results happen to be a lot better than either of those.
about how to combine things across levels.
Yeah.
>> Do you do any reasoning about the magnitude of the different layers?
>> Greg Shakhnarovich: Magnitude is --
>> As in, for example, in the input image if you are not rescaling, you might have values from zero to 255, or negative 128 to 128.
>> Greg Shakhnarovich: Right.
>> By the time you get to the top layers, you'll have values from
negative one to one. If you are concatenating together different
features from different layers, just the amplitude of the signal might be
very different.
>> Greg Shakhnarovich: Yes, that's something I should take care of when
I classify those. Right?
>> I mean, in theory, if you just ignore it then yeah, the classifier hopefully will sort of learn its way around the fact that there are different amplitudes.
>> Greg Shakhnarovich: Right. So there are a few degrees of depth to this. If I am doing a linear classifier on this and I don't do regularization, it doesn't matter, right? Because with a linear classifier, the only way in which the magnitude affects the results is if you have regularization based on the norm. If you don't, then actually it doesn't matter.
>> Right.
>> Greg Shakhnarovich: Right? You can scale some feature by a factor of a thousand and the linear classifier will just scale the corresponding weights accordingly.
>> The point, I agree it doesn't really matter. I'm just saying the learning problem would be harder.
>> Greg Shakhnarovich: That's right. What we do effectively is we -- so
for now at least conceptually, it is still two stages in a sense. We
compute all these features and then we classify them. So when we
classify them, we treat this -- now, forget how we computed them -- as a feature vector. You would apply the standard normalization tricks you would want to apply in any case, which, for example, would mean taking some sample images, computing the mean and standard deviation, and normalizing.
You could fold it into the process which computes these features, but it is conceptually the same. We do have a multilayer network on top of this. So it is important, and we definitely found that it is crucial to do it right. But the process itself is fairly pedestrian, in that sense. So you have to think about the features having reasonably normalized magnitude, but beyond that normalization, maybe we don't care about this.
>> Right.
>> Greg Shakhnarovich: Okay. Other questions about this slide? It is
kind of important slide. So dwell for a few seconds. If there are no
more questions, we'll move on.
>> So just to clarify, the extent of the zoom-out region is entirely defined by the growing receptive field [indiscernible].
>> Greg Shakhnarovich:
Correct, exactly.
>> Rather than being defined by some grouping of superpixels [indiscernible].
>> Greg Shakhnarovich: Correct, correct. That's my distinction from the
gestalt approach, right? That's right.
So the features and the zoom-out levels are entirely driven by the network. And you basically can think of the regions as obtained by a kind of succession of dilations with three-by-three boxes, because we use three-by-three filters, and the field increasing due to the resizing and pooling layers.
So these are the stats. The numbers here are empirically computed by us by taking a bunch of superpixels and evaluating what the underlying receptive fields were. It is not something we can compute in closed form ahead of time, because all of the sizes actually depend on the original superpixel, right? You start with it and start dilating and increasing the size. But we typically have superpixels whose larger dimension is about 30 pixels. And so the receptive field, if you consider the bounding box around it, starts at 30. It increases, say 30 to 36. Then there is a pooling layer. This is for the 16-layer VGG network which we end up using for most of our interesting experiments. Then there is a jump because of the pooling, then a kind of slow increase until the next pooling, then another jump. And by the time we get to the last layer we really are looking at most of the image. The images are typically 500 by 300 pixels, and by this point we are looking at a large part of the image.
Now, in addition we have this global zoom-out. For the global zoom-out you take the image, run it through the network and take the last fully connected layer, which is the feature representation that is traditionally used, for example, in ImageNet classification to classify a thousand classes.
We take that representation and use it as the global zoom-out representation. And then here we do something which is arguably a hack, but I think it makes sense; I'll explain in a second. We decided, by some heuristic reasoning, on the size of a bounding box around the superpixel. We just take that sub-image and do the same as we do for the global level: we run this entire sub-image through the network and compute the last fully connected layer. Why do we do this? If you
think about that picture with the dining room -- it didn't really happen there, but you can imagine a big window. Clearly the picture is inside of a dining room, but there is a big window, and in the window you see a pasture with sheep, right? So the entire room is clearly indoors, but if you take just some region around that window, mostly it's outdoors. You are looking at something which has very different characteristics. We found that in some images there is this kind of very different subscene which informs what you should do for pixels in that subscene, different from the entire global image.
So that adds a few percentage points. And so we ended up using that as
well.
So all told we have about a twelve and a half thousand dimensional feature representation, if you concatenate all of those things --
>> Why wouldn't the subscene ... [indiscernible] receptive field?
>> Greg Shakhnarovich: Ahh, because for the receptive field, even a large receptive field, we still compute only the features computed by the convolutional layers, which are not quite as complex.
>> [indiscernible].
>> Greg Shakhnarovich: They are simpler. You want to have the really highest-level semantic features, which you can think of as understanding what is going on in the image, but computed only for a limited spatial extent. In some cases it doesn't add anything because it is the same thing as anywhere else -- you take a picture of any part of this and it is going to be similar, except maybe there are no people over here and there are people here. In many images it is not really significant, but in some images it is.
Okay. Some minor issue here, which is kind of annoying, is that there is a huge imbalance between classes. So most pixels are background in this data set. Though for COCO, I would like to know the stats -- what percentage of pixels in COCO are background?
>> I'm sure, but they are in the background as well.
>> Greg Shakhnarovich: I'm sure it's less than this.
>> [indiscernible] would say higher.
>> Greg Shakhnarovich: Sixty? So basically the most common class after background is person. It's almost an order of magnitude smaller. By the time you get to the least common class, bottle, you see it's, in VOC, about a percentage point. It's two orders of magnitude difference. So there are four things we can do, basically. You can ignore this, and we know that it usually produces poor results -- I should probably mention now how we compute accuracy for [indiscernible]. The standard way to compute
accuracy for this task, which has issues, but has been adopted for the
most part for semantic segmentation, is the following. For a given class C, you can think of all pixels which you think are that class, and all pixels which are really that class. So all pixels you think are horse and pixels that are really horse: you take the intersection of those sets over their union, and that gives you a number between zero and one. If the number is one, that means you got a perfectly right set of predictions for every pixel. If the number is small, it could be because you under-predicted the right things -- there are many horse pixels which you fail to say are horse -- or maybe it's because there are many pixels which you think are horse but they are not, or maybe both, right?
So as either type of error increases, you have a lower number. This is something which specifies how well you did for a particular class. What the Pascal VOC benchmark does is average this over classes. So
the effect of this is that if you have a bottle class which is 100 times
less common than a background class, each pixel of the bottle class is
going to hurt you much more if you get it wrong, either way, than the
background. It's really terrible if you misclassify 100 bottle pixels
and it's okay if you misclassify 100 background pixels unless you
misclassify them as bottle, roughly speaking.
This is kind of an artifact of this task, and to some extent it is mitigated once you switch to objects -- I guess averaging over objects -- but only to a limited extent, because objects also tend to have different sizes.
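A compact sketch of this per-class intersection-over-union and its class average (the VOC-style measure), assuming integer label maps and a 255 "don't care" label:

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_label=255):
    """pred, gt: int arrays of class labels over the whole dataset
    (e.g. concatenated images). Pixels equal to ignore_label in gt are skipped."""
    valid = gt != ignore_label
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))  # VOC-style: average IoU over classes
```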
And so since we do want to play this game and have better results, we wanted to optimize for this measure. So you can do four things. You can ignore the imbalance; this is going to produce poor results, and everyone sees this empirically. You can try to impose balance by sub-sampling the more common class. That turns out to be a bad choice. You basically lose a lot of information. The background class is really rich, lots of things are background. If you reduce it from 70 percent to 7 percent to match the person class, you are going to lose a lot of information. Kind of unnecessarily.
You can upsample the less common classes, and that's actually going to be conceptually equivalent to what we are doing, but it is very wasteful. You are going to go many, many times through the pixels of bottles versus once through the pixels of person. So what we do is actually, we think, the best, the fourth choice, which is better than any of those other three: use all the data there is, but weigh the loss. It's a pretty standard thing in machine learning. I don't know why it is not more widely used in segmentation work. It's a trivial change to any code you have, to introduce an asymmetric loss. Basically you weigh the loss on each pixel inversely proportionally to the frequency of its class. So now if you make a mistake on a bottle pixel, it is actually going to cost you more in the objective.
And that turns out to be significant. Not a dramatically huge change,
but a few percentage points improvement if you train it with this loss.
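As a rough illustration of that weighting scheme (not the talk's actual Caffe
implementation; the names and the normalization choice are assumptions of
mine), the loss could look like this:

    import numpy as np

    def class_weights(class_freq):
        # class_freq[c]: fraction of training pixels labeled c (e.g. ~0.7 for background).
        w = 1.0 / np.asarray(class_freq, dtype=float)
        return w / w.mean()  # normalize so the average weight is 1

    def weighted_cross_entropy(probs, labels, weights):
        # probs: (num_pixels, num_classes) softmax outputs; labels: (num_pixels,) int labels.
        p_true = probs[np.arange(len(labels)), labels]
        return float(np.mean(weights[labels] * -np.log(p_true + 1e-12)))

Rare classes such as bottle then contribute a larger per-pixel penalty than
background, which is exactly the asymmetry described above.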
>> [indiscernible] Can you optimize for that loss directly? Right now you have
a loss; at the last layer of your neural network there is a softmax, and right
now you are doing log likelihood... Can you write down the expected
intersection over expected union produced by these scores?
>> Greg Shakhnarovich: It is going to be another surrogate. It is all about --
>> So Sebastian had this paper where he showed the expected intersection over
expected union approximates --
>> Greg Shakhnarovich: Yeah, the surrogate.
[overlapping speakers.]
>> It approximates it up to a factor, and the approximation error goes as order
one over N.
>> Greg Shakhnarovich: Where N is the number of --
>> Number of superpixels.
>> Greg Shakhnarovich: It is not that bad. Well, actually, looking good,
maybe. And it's differentiable.
>> We have this paper that just couldn't get published because it just -- we
didn't have a good segmentation model. But we showed that IoU, you can
differentiate through that and you can train with it.
>> Greg Shakhnarovich: We have a good segmentation model, but okay. We should
talk more about that, maybe. Okay.
[chuckles.]
>> Greg Shakhnarovich: I mean, on the other hand there is maybe
something unseemly about trying to optimize this measure which everybody
criticizes, but --
>> [speaker away from microphone.]
>> Greg Shakhnarovich: We can say how bad it is and then do it, because we
want to win the competition.
>> This is over the entire data set, right?
>> Greg Shakhnarovich: Correct. Oh, yeah, yeah, it is over the entire data
set, and of course that is an approximation which, I would say, you can't
avoid, because otherwise it is going to be hard to optimize. Even if you knew
the ground truth and you had superpixels, finding the optimal assignment of
superpixels under this measure for the entire training set is going to be hard
because, right --
>> The thing is, you have, like, a small, I don't know, a small motorcycle in
an image. If you miss the entire motorcycle, the 20 pixels, who cares? It's
such a small number of pixels, and there are other images of motorcycles.
>> Greg Shakhnarovich: Sure, sure, but if you do it for many pixels, for many
motorcycles.
>> Yeah.
>> Greg Shakhnarovich: It's an approximation of an approximation, right?
There are at least three levels of approximation I can think of. You can also
think of an individual image approximating the data set, and then the pixels in
an image approximating the decision over the entire image. But empirically we
found this to be an improvement over not doing it, and that's the bottom line
here.
I'm just mentioning it because I really don't understand why I found only two
papers on segmentation, out of the hundred or so I looked at, which actually
used this asymmetric loss. In fact, they reported better results than not
using it. So it's kind of strange that it is not more commonly used.
All right. So now some results. We take the 16-layer network I mentioned
before and we extract the features. The first thing we did was take a simple
linear model, a softmax on top of those 12,000 features, and look, by taking
subsets of the features, at how much each level contributes. We didn't run all
possible combinations; we grouped the layers roughly, by the groups of layers
in the 16-layer network before each pooling layer. The first two get you 6
percent average IoU, which is pretty bad -- better than chance, but not by
much. If you go to four layers, it's significantly better; now it's ten. But
by the time you take all 13 convolutional layers, you get to a number that
would have been state of the art four and a half years ago. So if you had a
time machine, you could go back and win.
But then it wouldn't be published because it uses a neural network.
[laughter.]
>> Greg Shakhnarovich: There is no way to win there. Now, this is kind of the
most dramatic thing. If you take the 16 layers and you add the global
representation, you jump a huge amount, from 42 percent to 57.3. And that
actually would have been state of the art, I guess, eight or nine months ago.
In fact, when we were preparing for [indiscernible], that was state of the art.
We were very excited. As we added layers and fixed some bugs and came close to
this number, at some point we passed the current state of the art, and we were
doing it without any region proposals, without any pairwise potentials. So we
were very excited about the idea of unary potentials only. No, let's get rid
of pairwise potentials, structured prediction -- who needs that? We'll just do
everything here.
So then if you use --
>> It is not just another pairwise potential, this is a structured loss?
>> Greg Shakhnarovich: A structured loss, this is true, which is not used
here. So in fact there is no structure at all, right? Each superpixel is
predicted independently.
So let's look at the subscene on its own. If you use only the subscene, it is
significantly worse; that's basically because you fail to capture a lot of
local information in the scene. There are a couple of interesting combinations
missing here, but we did look at a few; if you ask about a particular
combination, I might know the answer. This shows what happens if you ignore
the local features up to the seventh layer -- I guess you then start with
receptive fields which are already quite large, 130 pixels -- and you add the
global features. You get a number which is -- and if you also add the
subscene, it's a little bit better than that. If you add everything together
you get 58.6. So this change is relatively small, but keep in mind the first
seven layers are probably only something like 5 or 6 percent of the entire
feature representation, because the number of feature maps is fairly small
initially. So figure that we get a 0.6 percent improvement from this; at least
for a while, half a percentage point was enough to win and be statistically
significant. One of my advisors used to say: statistically significant, but
maybe not important.
[chuckles.]
>> Greg Shakhnarovich: There is no reason not to do it if we can, especially
given that it's relatively cheap. We didn't expect a lot from these features,
because it is a pretty small number of features. These are all numbers on the
val set, which we didn't touch during any of the training -- the standard
partition into training images and [indiscernible] images. As we expected.
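For orientation, here is a very rough sketch of how a zoom-out feature vector
per superpixel might be assembled before the linear softmax. All names are
hypothetical, the subscene level is folded in with the global descriptor for
brevity, and this is not the talk's actual Caffe pipeline:

    import numpy as np

    def zoomout_features(conv_maps, global_feat, superpixels):
        # conv_maps: list of (H, W, C_i) convolutional feature maps, already resized
        #            to image resolution; global_feat: (D,) image-level descriptor;
        #            superpixels: (H, W) integer map of superpixel ids.
        feats = []
        for s in np.unique(superpixels):
            mask = superpixels == s
            local = [m[mask].mean(axis=0) for m in conv_maps]  # average-pool each level over the superpixel
            feats.append(np.concatenate(local + [global_feat]))
        return np.stack(feats)  # (num_superpixels, total_dim), ~12,000-D in the setup described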
This was a linear model. Well, we don't have to use a linear model; we can use
any classifier we want. Before I talk about that, let me actually show you
graphically what happens here. This is maybe the most typical kind of
demonstration. This is the ground truth: a person. As you go from very local
to larger and larger zoom-out levels, you get a little bit better -- here you
get a significantly better partition into background and foreground, but you
get all these nonsensical predictions: I think this is dog, I think this is
bird. It doesn't really have good enough features at the low convolutional
layers. By the time you add the subscene features, it basically says, well,
there is no dog here, there is no bird, there's a person. So a lot of these
labels go away and are replaced. Maybe the dog is killed here -- not the
actual dog, but the dog labels are killed -- and the next most reasonable thing
is person, so that gets substituted. If you use the full representation, in
this case it is slightly worse in terms of the person boundary, but it
certainly removes a lot of the spurious incorrect labels.
And similar things happen in other images. There are two things you notice
here. First of all, you can see that we know we lose about 5 percent accuracy
because of superpixel boundaries, and that explains this jagged kind of
boundary -- there's actually very little local edge information here about the
bird versus the water, and the superpixels have a really hard time localizing
the boundaries well.
So at the assignment level we still get the right high-level boundary between
bird and background, but locally it is jagged, and it probably would be
improved if we cleaned up the superpixel boundaries.
Another thing you can see here is that even in these segmentations, which use
the full zoom-out representation, there's a lot of noise. And in addition, it
is not just the jagged boundaries: we have some very irregularly shaped
boundaries, even though a straight boundary for the train would be much more
reasonable here. You can argue that this could be improved significantly if
you brought back the previously dismissed idea of structured prediction. In
fact, I'll talk a little bit about the current state of the art in the
[indiscernible], and people do seem to get a lot out of it. I still suspect
that we might not need it, but it remains to be seen.
So going back to the classification: linear classification gets you 58.6
percent. That was about 7 percent better than the published state of the art
when we did the CVPR submission. If you use the seven-layer network instead,
it's a lot worse; so in fact it's an even larger relative gain than what people
typically observe for classification. It seems like it's really important for
segmentation to have more layers, a richer representation. But there is no
reason to restrict ourselves to a linear classifier; we can stick a nonlinear
classifier on top. Given that we already had all the machinery set up, the
obvious choice was just a multilayer neural network. It is not a convolutional
neural network, just what used to be called a multilayer perceptron 30 years
ago.
So what you get here, with a three-layer network, is a huge jump to 68 percent,
a significant improvement. We couldn't squeeze any more out of more layers,
more hidden units, dropout here, et cetera; it basically stopped there.
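As a stand-in for that stage (the talk's actual setup trains a small network
with backprop on features exported from Caffe; this toy version with random
data, scikit-learn, and made-up layer sizes is only an assumption of mine to
show the shape of the idea):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Dummy stand-ins for the precomputed zoom-out features saved to disk:
    # one row of X per superpixel, y holds the superpixel's (majority) label.
    rng = np.random.RandomState(0)
    X = rng.randn(1000, 512)
    y = rng.randint(0, 21, size=1000)   # 20 VOC classes + background

    clf = MLPClassifier(hidden_layer_sizes=(256, 256),  # two hidden layers + output
                        activation='relu', max_iter=20)
    clf.fit(X, y)
    print(clf.predict(X[:5]))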
>> If you run back to [indiscernible].
>> Greg Shakhnarovich: No, no. So this brings me to what we'll talk about in
a second, what we are doing now. This is still set up this way for no good
reason, essentially, just because that's what was easiest for us to do. We
have a separate stage in which we use an interface to Caffe to extract
features. We save them to disk and treat them in a completely separate stage
as features someone gave us -- we gave them to ourselves -- and then run a
classifier on those. The only backprop here is through the three layers.
I should say that the network we use here is exactly the VGG network trained on
ImageNet. It doesn't know anything about segmentation; we literally take it as
is and use it. And it gets us this result.
One interesting thing we tried here, because one of the reviewers for CVPR
asked for it -- we had a rough prediction of what would happen, and it was
mostly what happened. They said, well, why do you need superpixels? What if
you just use rectangular regions? We said, well, we will try. It doesn't
change anything else; it is still the same zoom-out representation. So how
much do we gain from superpixels? It turns out we gain quite a bit. Exactly
the same architecture, same machinery, but with a regular grid of rectangles,
gives us about 64 percent, and the achievable upper bound on accuracy drops by
an even larger amount. So it is a similar kind of drop. Basically, what you
lose here is the ability to localize boundaries well, which you get most of the
time with superpixels.
>> What happens if you change the resolution of the superpixels?
>> Greg Shakhnarovich: We haven't experimented with this extensively. We have
experimented a little bit, and extrapolating between that and my general
intuition, I can tell you what I think will happen.
If you increase the number of superpixels by a factor of two, your achievable
accuracy goes from 94.4 to about 98 percent -- ninety-seven point something.
But you might gain maybe half a percentage point to one percentage point in
actual segmentation accuracy, not more. At that point I think you start
getting all this noise. Look, there are two things you gain from superpixels.
One is expense: it's cheaper to label 500 things than a million.
But the other thing is actually localization of boundaries and simply the
amount of noise.
You lose when you label the whole thing wrong, but if you label the whole thing
right, you get all of its pixels right for free, right? If you now have 20
times more regions, a thousand of them might be wrong; there are a lot more
chances to make mistakes. So I don't know exactly what the right setting of
that knob is, but my feeling is that if you want to work with superpixels, it's
roughly correct. A more interesting thing to do, which we haven't explored, is
to take the other knob in SLIC, make the superpixels even less regular, and see
what happens; it's possible that that will be a better way to do it.
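For reference, the two knobs mentioned here map naturally onto SLIC's
parameters; a minimal sketch with scikit-image, using a random image as a
stand-in for real data:

    import numpy as np
    from skimage import segmentation

    img = (np.random.rand(240, 320, 3) * 255).astype(np.uint8)  # stand-in for a real RGB image
    # n_segments is the resolution knob discussed above; compactness is the "other knob":
    # lower values let superpixels follow image evidence more and be less regular.
    sp = segmentation.slic(img, n_segments=500, compactness=10.0, start_label=0)
    print(sp.max() + 1, 'superpixels')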
But, I mean, one thing that nobody ever calls me on, for some reason: I present
this as a fully feedforward architecture -- we start with the image, compute
the features, classify them. But of course the superpixels are not
feedforward; there is an actual loop in there doing k-means. What you would
like to do is somehow make the superpixels part of the feedforward network.
We're thinking about how to do that. It is not clear at all; if you have any
thoughts, I will be happy to discuss them.
Then we would learn all of this and not have to separately tweak the number of
superpixels or any other knobs.
All right.
So yeah?
>> How much does performance [indiscernible] if you keep the standard structure
of the loss?
>> Greg Shakhnarovich: So our initial version of the paper claimed a huge
drop, but you need to run it for a lot longer. If you use the symmetric loss
and let it run properly, a few more days, you get about three percentage points
lower than this. That's still quite a bit, so there is no reason not to do it,
but we certainly had to walk back a little bit the claim about the importance
of that.
>> The growing of the receptive field is also not learned, right? That's a
consequence of the parameters?
>> Greg Shakhnarovich: It is a consequence of the --
[overlapping speakers.]
>> Greg Shakhnarovich: -- of the choice of architecture of the network. You
could learn it by experiment or something.
Okay. So where does this fall in the state of the art? Initially, before
CVPR, we used the seven-layer network, and we had 59 percent, something like
this, I think, with a lot of hand-crafted choices. I am proud of how I got
this number; it was a very nice, very good result. We have also kept working
on this since: we had 59 percent at the time, then 62. Then there are two
papers from Berkeley, hypercolumns and fully convolutional networks. First of
all, just to give you an idea how things were moving: this is 2010, this is
2012 -- this is the progression of numbers. Then these are, I think, the three
best results published at CVPR; I'm not sure. And I'm fairly certain that what
we actually published at CVPR was this, the deep -- what I just went through,
the 16-layer network and everything, the 69.6 on the test set. I think that's
the best result among the papers published at CVPR. As of lunch time a couple
of days ago, it was 664 [indiscernible]; I haven't checked in the last few
hours. It's possible it went up.
So why -- what explains this gap? I think a few things, and we pretty much
hope that we can get all of them, or many of them, right once we reimplement
this, which is what we are doing now.
>> [speaker away from microphone.]
>> Greg Shakhnarovich: So that is the strongest -- here is the thing. If you
look at the numbers, these numbers are with structured prediction, a CRF on top
of the CNNs, trained on COCO data in addition to Pascal, and everybody,
including these two numbers, fine-tunes the network properly, end-to-end
learning for the segmentation task.
But if you look at the results with no CRF -- sorry, no CRF and only Pascal
data -- I think the best result was 72 percent. That is only two, two and a
half percentage points higher than us. It is entirely conceivable -- we need
to show it -- that once we fine-tune the network properly we will get better
than that. And if you train on COCO as well, you might get a higher number.
I'm still not convinced empirically -- I wouldn't be able to support it -- but
my hunch is that once we do it properly we will get results as good as with a
CRF. But of course, if we add a CRF on top of this, we could get even better
results. It's possible.
Now, why haven't we done all of these things -- trained on COCO, fine-tuned the
networks? The main issue is that we can't do it with this poorly designed,
external setup where we extract the features, save them to disk, et cetera. To
implement it properly, there is a natural implementation of this as a fully
connected network. I'm saying this because, as we all know, all networks are
fully connected.
>> Convolutional.
>> Greg Shakhnarovich: Convolutional. I'm sorry, what did I say? Fully
convolutional. I'm so glad I did not do that before; I escaped the wrath of
Khan at CVPR. I kind of stayed low.
[chuckles.]
>> Greg Shakhnarovich: Anyway, there's a natural way to share computation. As
you said, you just represent this last -- actually we thought about this when
we were writing it, but we didn't quite do it. The main tricky part, if you're
interested in the technical nitty-gritty: if you go back to this picture, you
can Frankenstein this pretty much from existing layers in Caffe, for example.
I don't know if people use Caffe here; certainly they are looking at it. There
is a deconvolution layer which will do upsampling, but it kills the memory,
because if you start upsampling everything to full resolution, by the time you
get to the high convolutional layers that have thousands of channels, it kills
the memory. You would have to hold the representations for all the images in a
batch at full resolution, with thousands of channels. It's just impossible to
do.
Now, you don't actually have to do it. Conceptually you do, but you can fold
all of this upsampling into computing the responses, right? Because upsampling
is equivalent to just weighting the lower-resolution pixels with different
weights.
Implementing that in Caffe has been a somewhat challenging process for us, with
high potential for bugs, but I think we are pretty much getting there now. So
hopefully within a few weeks we will be able to run this experiment and release
the code and everything.
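To illustrate the equivalence being used here (my own minimal sketch, not the
Caffe implementation being described; the names and the dense interpolation
matrix are assumptions): averaging an upsampled feature map over a superpixel
is the same as taking a weighted average of the low-resolution cells, so the
full-resolution maps never need to be materialized.

    import numpy as np

    def superpixel_pool_lowres(feat_lr, upsample_weights, sp_mask_hr):
        # feat_lr: (h, w, C) low-resolution feature map.
        # upsample_weights: (H*W, h*w) matrix; row i holds the (e.g. bilinear)
        #   interpolation weights that high-res pixel i places on low-res cells.
        # sp_mask_hr: (H*W,) boolean mask of one superpixel at high resolution.
        w = upsample_weights[sp_mask_hr].sum(axis=0)  # total weight each low-res cell receives
        w = w / w.sum()                               # normalize to an average
        return w @ feat_lr.reshape(-1, feat_lr.shape[-1])  # (C,) pooled feature

In practice the interpolation weights are sparse and local, so nothing at full
resolution with thousands of channels ever has to be stored.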
But that has been the main challenge. A related thing is that we also tried
building a pooling layer which takes arbitrarily shaped regions. That is also
a good layer to have, but you might not need it if you do it the way I
described.
So that's kind of the main thing we are doing now for this, trying to get
end-to-end training. Once we do it, we can also train on COCO easily, et
cetera.
So the other thing I mentioned is superpixels. Do we need them? It seems like
we are benefiting from them. That's one of the distinctions between our
approach and hypercolumns/FCN. Hypercolumns have the boundaries of an entire
region, which comes from region proposals, the bounding box around it, but they
don't really explicitly look at superpixels. FCN, fully convolutional
networks, doesn't use superpixels at all. The other distinction: hypercolumns
don't have the global representation. They have something almost like the
subscene, but not quite, because it is limited to the bounding box, just the
object itself; it ignores everything else around it. FCN, on the other hand,
has something similar to our subscene, maybe even slightly larger, but it
doesn't actually have the local-level features, because it starts pooling
information at a relatively high level in the network.
But more important, I think, is that FCN doesn't pool the features; it pools
predictions. Basically what they do is take some intermediate convolutional
layer and say, let's try to make predictions here. At the next layer they
upsample them a little bit -- at that point you upsample by a factor of two or
four, maybe eight at most -- and then pool the predictions.
We defer -- it's like early [indiscernible] -- we defer all of the averaging of
features until the very last moment, when we have all the information, and then
we make a decision. I think that's really important, and it probably explains
the significant jump in accuracy. So anyway: the superpixels, we need to learn
how to fold them into the network, and inference -- we are trying to look at
ways, as many people are now doing, to fold inference and CRFs into the
network, representing it as a feedforward process. It's pretty obvious how to
do it at this point, I think. It remains to be seen how important that is.
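A toy contrast of the two orders of operations just described (random numbers,
hypothetical shapes and linear classifiers; only meant to show where the
averaging happens, not either paper's actual architecture):

    import numpy as np

    rng = np.random.RandomState(0)
    levels = [rng.randn(100, c) for c in (64, 256, 512)]  # per-superpixel features from 3 levels

    # Pool predictions (roughly the FCN flavor): classify each level separately,
    # then average the resulting class scores.
    per_level_classifiers = [rng.randn(c, 21) for c in (64, 256, 512)]
    scores_from_preds = np.mean([f @ w for f, w in zip(levels, per_level_classifiers)], axis=0)

    # Pool features (the zoom-out flavor): concatenate all levels first and
    # classify once, so the decision sees all the information at the same time.
    joint = np.concatenate(levels, axis=1)
    joint_classifier = rng.randn(joint.shape[1], 21)
    scores_from_feats = joint @ joint_classifier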
All right. It seems like I've overstayed my welcome a little bit -- hopefully
not too much. Questions?
[applause.]
>> What do you see as the main failure cases right now?
>> Greg Shakhnarovich: I don't know if I have an exact answer; there are
multiple sources of failure. We haven't done a Derek Hoiem-style failure
diagnosis, but from eyeballing lots of images it seems like, okay, we lose a
few percentage points because of the poor localization of the superpixels. Of
course, we lose some percentage due to horses labeled as sheep and similar
category confusions. But the most annoying thing, which I think is really bad,
is noise. Like here, you have the shoe of that guy labeled as motorbike
because there is a motorbike in the scene. The global features are very strong
here, probably -- and we know that a lot of the accuracy comes from them -- and
this shoe really does look similar to the motorbike, so it says motorbike.
Clearly some hierarchical model or CRF would have improved this. There are
many similar cases, so the labeling here is very noisy.
>> Seems like the boundaries have a lot of issues to do with the superpixels as
well. Is Michael shouting in your ear the whole time that we should be
evaluating this based on boundaries, and whether the boundaries are correct,
rather than IoU?
>> Greg Shakhnarovich: I haven't known Michael to shout.
>> [speaker away from microphone.]
>> Greg Shakhnarovich: Right.
[chuckles.]
>> Greg Shakhnarovich: Hmm, I don't think so.
>> I can see that would be nice.
>> Greg Shakhnarovich: Well, we are doing something related with him now,
naturally. Not just evaluating, but also training based on boundaries. So
yeah, there is a possible improvement there.
>> Like shift -- to the field would be sort of evaluating that venture.
>> Greg Shakhnarovich: Then you want to do semantics. Okay, so one thing I
did: there's the paper, Stanford -- Berkeley's, just a sec -- Bharath's paper,
which introduced, in addition to semantic segmentation, there is a name for
that -- semantic boundaries, semantic contours, right? There was a recent
paper at CVPR by [Pirarri] which improved the accuracy on that. So I did the
following baseline. I took our segmentations here, intersected them with
thresholded UCM boundaries, and got something which is about twice as accurate
as that CVPR paper. So it is already --
>> [speaker away from microphone.]
>> Greg Shakhnarovich: Yeah, we discussed this with Michael. We actually came
up with the baseline together, gradually. Yeah, maybe we will do that; you
can't really publish that. The problem is that we had work, and still have
work, on improving semantic boundaries, and we said, well, let's try this
baseline. The baseline completely killed us and everybody else, right? It's
like twice better than the current state of the art, but it shows that once you
have a good segmentation you might actually not --
>> There is meaning to it now.
>> Greg Shakhnarovich: I agree. But the point is, it is already not clear how
much difference there is between the boundary measure and this one. Also
remember that we have the statistic which tells us that if we labeled the
superpixels perfectly, we would only lose 5 percent in intersection over union.
Admittedly it's a very different measure, but my sense is that once you get to
95 percent -- at 100 percent they are equivalent; 100 percent accuracy on one
corresponds to 100 percent accuracy on the other. So I don't know how
important it is.
I think also semantic segmentation is not a real task in general; segmentation
is not a real task. For example, if you go to domains where you actually might
want to do this in a meaningful way -- medical imaging, for example -- then
this noise would be a killer, would probably actually kill patients, because in
some cases apparently one of the main features the doctor looks at is how
jagged the boundaries are: jagged versus smooth is like malignant versus benign
lesions.
So there it is not even -- and that is one more thing: if you are actually 20
pixels off but the shape is perfect, they will make an accurate classification.
If you have 99 percent overlap and the boundary is 99 percent accurate but it
is jagged like this, it is going to be the wrong classification. So it's
really tricky.
So it is hard for me to choose which one to focus on. The noisy labels, which
can be removed by some sort of structured inference, seem to be both the most
appealing to me right now and the lowest-hanging fruit, to some extent. So we
are focusing on that for now.
We'll see about boundaries later.
Other questions? Okay.
>> Okay, thank you.
[applause.]