>> Ross Girshick: It is my pleasure to introduce Greg Shakhnarovich, who has one of the most difficult-to-spell last names in computer vision. He got his Ph.D. from MIT, originally working with Paul Viola, who had a stint at MSR a while ago. He is now an assistant professor at TTI-Chicago. He has worked quite a bit on sign language interpretation, both in the past and I think more recently or starting up again. And he's also worked quite a bit on semantic segmentation. He's going to be talking about a recent system that uses deep learning to do semantic segmentation. Please take it away from here. >> Greg Shakhnarovich: Thanks, Ross. I will try to be original. Hence the deep learning. One correction: actually, the day I arrived at MIT was the day Paul Viola left MIT. I worked with him mostly from outside. I am going to talk about recent and ongoing work on semantic segmentation. It is joint work with two of my students, Reza and Payman, whom you see here. Since some of you might not be aware of what TTI is, I felt I should tell you in one slide something about it. Some of you here have spent some time at TTI, but many of you haven't. It's an independent academic research institute, philanthropically endowed; at this point we have an endowment of a bit more than a quarter billion dollars. It's a completely independent institution. We have our own Ph.D. program, which is fully accredited. We are loosely allied with the University of Chicago — we are on the University of Chicago campus — but we are independent for purposes of hiring, admissions, et cetera. We focus on a relatively narrow set of topics in computer science. The two main areas are machine learning and, roughly speaking, applications of machine learning to various AI tasks: vision, language, speech, robotics. Another big area of interest is theoretical computer science. We currently have ten tenure-track faculty and about a dozen research faculty who come and go; they stay in those positions for a while. We have about 25 Ph.D. students, and we keep actively growing, hiring in all positions, recruiting students. We are recruiting summer interns for research projects. It is a very active and vibrant environment. If you have any questions, I can tell you more about TTI after the talk. All right. So, semantic segmentation. Segmentation in general is a very old task in computer vision. Traditionally it has been an evolving notion of what it means to segment an image. In the early days the focus was mostly on the very general and admittedly vague notion of partitioning an image into meaningful regions. You really have to do air quotes when you say meaningful regions, because it is not clear exactly what that is. I will talk a little bit about it later, because maybe we can somehow leverage that notion to help us with the second task, which is in fact the focus of today's talk: semantic segmentation. That is a little bit more well defined. It means taking an image and labeling every pixel with a category-level label, indicating what that pixel belongs to in the scene. Of course, there are various issues with this as well; we'll discuss those briefly just to help us understand the challenges. But arguably this is not the most refined segmentation task. A more refined one is instance-level segmentation, which many people here work on.
And the distinction is that if you look at this image, you can say, well, I just want to label all bicycle pixels with green, all person pixels with pink, and everything else with black. But you can also, maybe more meaningfully, ask: well, how many bicycles are there? To answer that you have to label bicycle number one, bicycle number two, and bicycle number three. That is a more refined task. Now, the focus today is on this intermediate level, category-level segmentation. Partly because that is how things have been evolving, and partly because I personally think it is still a meaningful task. It should help us towards instance-level segmentation, and also — even though, admittedly, I am not going to show any experiments of that nature here — you can think of categories for which the notion of instance is not meaningful. For example, stuff: regions which are defined in terms of their texture or their physical properties and not as object instances. So in some sense it is a formal classification task: you want to label each pixel with a single label. We are going to ignore the issues related to hierarchical labeling, or the fact that the same pixel might have multiple labels from multiple categories which overlap. If you treat it as such, there is at this point a standard benchmark task, which is the Pascal VOC dataset. It has 20 object categories plus the catchall amorphous background category, which is everything else. I think the field is in the process of adopting a new benchmark, COCO, which is spearheaded by a bunch of people here actually, and a few other institutions. We are still transitioning to making that the main benchmark, so for now Pascal VOC as of today is still the central vehicle for evaluating segmentation. The categories, if you are not familiar with it, are a broad range of things which are reasonable in everyday life: a few animal categories, a few furniture categories, a few vehicle categories, and a bunch of other assorted categories outside of that. So there are some examples. Who is familiar with VOC and the semantic segmentation task? So many of you, but I'll still go through this briefly. This is an example of a few images and the underlying ground truth labeled by the providers of the dataset. You can see it's pretty high quality — in most cases a high-quality outline of objects with some fairly fine details. It is really challenging. You have the cat and the sofa, and cat and sofa are two of the categories of interest, so you really have to correctly predict that these are pixels of a cat and these are pixels of a sofa, and that is arguably challenging. In the bottom row you can see some of the potential issues which we are going to ignore here, but I want to point them out because I personally always feel concerned about them. You can argue that these pillows are not really part of the sofa, but for the purposes of this task we are going to go with what the providers of the benchmark said: it's all sofa. And here you can see an instance of something that is kind of hard to see, maybe a little Christmas tree. One of the categories is potted plant. You can ask, is it a potted plant or not? Maybe it's a plastic Christmas tree. I don't know. To avoid worrying about that, in many cases like this they just mark it white, which means "don't care": we are not going to penalize you for predicting anything for those pixels in evaluation.
These are a few more examples, and they show a couple of issues. First of all, it's really important to distinguish semantic segmentation from instance-level segmentation. In semantic segmentation we just care about all of the purple pixels here being labeled bottle, whereas at the instance level it is really important for us to distinguish that there are four instances of bottles here. Another interesting issue which comes up in the same image — this may be hard to see — is that there is a car on the label of the bottle. In the ground truth it is labeled as car, and we can debate at length whether that is reasonable or not. You can say it's a picture of a car, but then all of these things are pictures. If I take a picture of a picture of a real car, is it any less of a car? It's hard to say. So this is where, for example, you could argue that those pixels are both car and bottle. Or maybe car, bottle, and bottle label, if you want to extend this. We can go down that well for a long time; we are going to just back up and ignore all of this. >> It's not a car. >> Greg Shakhnarovich: It is not a car? It's an SUV? >> FYI, it's not a car. [laughter.] >> Greg Shakhnarovich: Because what is it? >> He says it's ... [indiscernible]. >> Greg Shakhnarovich: Oh, it's a truck. >> No, it's a chair. >> It's a truck. We finally decided it was not a car. >> Greg Shakhnarovich: Here we go. For that, you had a meeting devoted to that bottle? >> Yes, we had many meetings devoted to that bottle. >> Greg Shakhnarovich: To that bottle? Or to the content of the bottle. Anyway, this is another example where you have really fine-grained instances: many chairs, and some of them are hard to separate even for us. I'm just bringing this up to get us thinking about the challenges here, but we are going to simplify our lives for the purposes of this talk and really focus on semantic, category-level segmentation. We are going to ignore the issue of whether it is a car or not. All right. So I will tell you briefly about some history of segmentation and how it relates to our work, partly because I think it's good to know if you don't, and partly because this will help me lead towards the motivation we had in designing our system. I'll tell you how and why we came up with what we came up with, which is the architecture we call the zoom-out feature architecture. I'll tell you how we implemented it using deep learning, in a way that I feel is pretty natural. And I'll tell you a little bit about the results and where we are taking it now. Okay. Going back to this original, vaguely defined segmentation task. Often people call it unsupervised segmentation. I put the quotes here because really it is a misnomer: it is unsupervised only if you don't learn it from data, and typically people do learn from data. What people usually mean by unsupervised is that it is not class-aware. There is no notion of classes, but rather a partitioning of the image into regions. For a while this was very disorganized, and I guess up until the late '90s people would just take a few images, run their new algorithm on those images, put the images in the paper and say, look, the segmentation is great. Someone else would take another five images, run their algorithm and say, our segmentation is even better.
So the Berkeley people, starting in the late '90s, decided to turn this into a more rigorous, modern, quantitative field. They collected a bunch of images and asked many people per image to label what they considered meaningful boundaries between regions in those images, and that led to the creation of the Berkeley Segmentation Dataset. As you might imagine, if you show the same image to a bunch of people and just give them this vaguely defined task, people will do different things. In fact, these are three actual human labelings for the first image. Someone only outlined fairly coarse boundaries: the person versus the background, this very salient column here, and the boundary of the wall and floor versus the outdoors, and that's pretty much it, plus a couple of other things. Someone else was extremely nit-picky and outlined very fine details — maybe someone was disturbed — very fine details of the branches of the trees and almost individual leaves. And someone else did something in the middle. Of course people do different things, but there is a lot of system to this madness; it is not chaotic. In fact, if you combine all those labels, what emerges is some notion of perceptual strength of boundaries. Ignoring minor displacements of a few pixels, you can overlay them and see that pretty much everyone labels some of the boundaries, like the person against the background. Most people label some other boundaries. And as you reduce this threshold, you gradually get to boundaries which only maybe one or two people label, and you can think of those as less perceptually [indiscernible]. Following this insight, a lot of work on non-class segmentation, rather than trying to partition an image into a hard set of regions, has been focusing on a hierarchical partition which corresponds to this boundary strength. One reason I am bringing it up is that we could arguably think of taking some sort of partition like this, by thresholding this boundary map somewhere, and using the regions it produces to do semantic segmentation: saying, well, we just need to label the regions. It turns out it's really hard to do this, because we don't have a good way to establish a threshold which would be good for all categories, for all images, et cetera. In fact, it remains a very challenging task, even though we are starting to approach human-level performance — in terms of inter-human agreement on this boundary task — in terms of precision and recall. But if you want to get an actual set of regions, it's hard. What we can do really well today is take an image and partition it into something called superpixels, which you can think of as very small segments which tend to be coherent in appearance — color is probably the most important thing when you look at small regions — and tend to be spatially compact. We would like to get a large grid of almost regular small regions, and usually in these algorithms you have a knob which you can turn. We use an algorithm called SLIC which is particularly good at this. You can turn it from a really refined partition, in the extreme case just one pixel per region, all the way to a very coarse partition; here you have 25 regions. Now, what happens when you have this very coarse partition is that you start crossing real boundaries. At this point you start breaking things: you chop off this lady's head and connect it to the sky.
Part of the building is connected to the grass region, et cetera. At the other extreme is a very, very fine over-segmentation. Of course you aren't going to break anything that way, but if you have a single pixel per region, you haven't done anything either. What we are after is some regime in the middle, where you have maybe hundreds to a thousand regions. Then they tend to have higher recall for real boundaries, at the cost of maybe low precision. The point is that you reduce the original image from a million pixels to, say, 500 superpixels, but you arrange them in such a way that almost all the true boundaries are preserved: if there is a true boundary, it is going to be a boundary between superpixels, even though many of the boundaries between superpixels are not real boundaries. Then you haven't lost that much information in terms of recovering the true boundaries, but you have dramatically reduced the computational cost of many things you want to do with those images, right? Instead of labeling a million things you now have to label 500 things. Specifically, for the VOC benchmark images we found that with about 500 superpixels per image we can retain almost 95 percent achievable accuracy. What I mean by achievable accuracy is that if you magically knew what category to assign per superpixel, you could get up to 95 percent accuracy. I haven't yet told you how we compute accuracy; I'll mention it later. But I can't think of any reason this shouldn't be more or less true for any reasonable measure of accuracy. So we are going to stick with this. Who is familiar with SLIC superpixels? Okay, many of you. I'm going to mention it really briefly because it's a good tool. It is basically a very simple algorithm: k-means over pixels. There are two twists which make it work really well compared to what people tried before. One is that there is a spatial constraint which doesn't allow you to associate a pixel with a cluster mean that is too far away in terms of location in the image. The cluster mean has three numbers describing the average color and two numbers describing the average position, and the position cannot be too far. The second twist is that you have a second knob. The first knob tells you how many superpixels you get, and the second knob tells you how much you should care about distance in color versus distance in location. The idea is that if that knob is very high, then you are going to mostly care about location, and what you get is mostly rectangles. I don't know if anyone here is wondering about this — one person, so far in all the talks I have given, asked me how come it's rectangles and not hexagons. I think if you initialize it on a shifted grid, it actually should be hexagons, but usually people initialize it on a regular grid and it ends up being rectangles. If you set this M to zero, it is going to mostly care about color and produce fairly irregularly shaped clusters, still somewhat constrained in space because we still have the spatial constraint. With some reasonable intermediate value of this M, you get superpixels which tend to be regular — they would like to be regular rectangles — but which snap to boundaries based on color differences when the local evidence is sufficiently strong. So it's a very good algorithm, very fast; it takes a couple hundred milliseconds per image at this point, and probably can be made even faster. All right. So now we have this machinery: we can take an image.
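To make the two knobs concrete, here is a minimal sketch of producing such an over-segmentation, assuming the SLIC implementation in scikit-image (the n_segments and compactness arguments play the roles of the region-count knob and the M knob; the file name is just a placeholder, and this is not the exact pipeline from the talk):

    # Minimal sketch, assuming scikit-image's SLIC.
    from skimage import io, segmentation

    image = io.imread("example.jpg")         # placeholder input image
    superpixels = segmentation.slic(
        image,
        n_segments=500,      # roughly how many superpixels we want
        compactness=10.0,    # the "M" knob: spatial distance vs. color distance
        start_label=0,
    )
    # 'superpixels' is an H x W integer label map, one label per pixel.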
We can simplify our life by partitioning it into superpixels. We want to label them, and now we can talk about how we actually do segmentation. For a few years, until maybe about a year ago, almost all successful approaches to segmentation followed the general philosophy of structured prediction. Very briefly, what that means is that we want to assign a bunch of labels to superpixels, but we know that they are not independent, right? We will talk shortly about some of the sources of those dependencies, but it is pretty clear there are a lot of relationships between the different labels you want to assign to an image. This is true in many other prediction tasks and applications of machine learning. So the way structured prediction works is you usually express some sort of score function. You can think of the score as telling you how reasonable a particular set of labels X is, given image I. For segmentation, we consider a graph over superpixels, or pixels, whatever you are labeling. So V is the set of superpixels and E is a set of edges. In the simplest case you can think of some sort of lattice-like graph where you have a notion of neighbors and each superpixel has an edge connecting it to its neighbors, but you can also think of a really large complete graph where every superpixel is connected to all other superpixels. The variable X_s is the class assignment, from 1 to C, where C is the number of classes, to superpixel s. In the typical structured prediction framework you have two kinds of terms determining this score function. The first is a bunch of unary terms: f_i of x_i basically tells you how reasonable it is to assign label x_i to superpixel i, given the image. This template is very generic; it technically allows you to look at anything you want inside the image I. You can think of it as kind of like a classifier — it doesn't have to be a classifier or a probabilistic function, but something that tells you how good this assignment is for the superpixel. Then the pairwise terms tell you how reasonable it is to assign a particular pair of labels x_i, x_j to a pair of superpixels i and j which happen to be connected in your graph, again while technically computing whatever you want from the image. It is a very broad notation. Now, once you define this, you can think of finding the X which maximizes this function. It's called the MAP assignment; I assume many of you are familiar with this. It is maximum a posteriori — the terminology comes primarily from probabilistic thinking about these models. Specifically, if you think of F as an unnormalized log probability, then you can think of this as maximizing the conditional probability of this labeling X given the image. There are some parameters of these f's hiding inside this generic template; if you train those parameters to maximize the conditional probability of the ground truth labeling given the image over your training data, you have what is called a CRF. If instead you say, I don't care about probability, I just want the score of the ground truth to be higher, then you get a slightly different learning procedure.
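Written out in the notation just described, the score being maximized has roughly this standard form (a sketch, not copied from the slide):

    F(X, I) = \sum_{i \in V} f_i(x_i; I) + \sum_{(i,j) \in E} f_{ij}(x_i, x_j; I),
    \qquad x_i \in \{1, \dots, C\},

    X^{*} = \arg\max_{X} F(X, I) \quad \text{(the MAP assignment).}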
The intuition there is that you want to make sure that the score of the ground truth is higher than the score of any other labeling by some margin, and the margin depends on how bad that labeling is. If it's an almost perfect labeling, you are okay with its score being almost as high as the ground truth; if it's a really bad labeling, you would like to punish it and make sure its score is significantly lower than the ground truth. Turning this inequality into a loss function, a hinge loss, produces what is called the structural SVM. So those are two different ways to train this model. In most cases, however, a major part of the learning procedure is doing this MAP inference — finding the X which maximizes F for the current setting of parameters — and that is often very hard. To understand why it tends to be hard, we should think about what kinds of things we want to capture with these score functions. The unary potential is something which tells us how reasonable it is to assign a particular label to a given superpixel in an image. Any suggestions for what we might want to capture there? >> Color. >> Greg Shakhnarovich: Color — some consistency of color with the class. What else? >> [indiscernible]. >> Greg Shakhnarovich: Texture, right? What else? Ross looks like he knows. >> So why do you have to give these things names? ... [indiscernible] and figure it out. >> As long as it works, that's all we care about. >> Greg Shakhnarovich: Oh, you want to learn the features? That's awesome. We should do that. Still — okay, maybe everybody here has thought about this deeply. So: color, texture, position in the image — which I would like to see being learned, but we can, supposedly. One other thing which may be a little less obvious is object size. If you care about labeling pixels, and some object generally tends to be very small, then a priori, without even looking at the image, labeling pixels with that object is less reasonable than labeling them with an object which is really large, right? Again, this can be part of the unary term. Pairwise terms may be a little more interesting. What do we typically capture there? One thing that has been prevalent in all these approaches is smoothness. You can say, well, superpixels next to each other are statistically more likely than not to belong to the same class, because most places in the image are not boundaries; most places are inside an object or region. You can refine it a little more by saying, if they are not the same class, some combinations are more reasonable than others, right? Cow next to grass is reasonable; cow next to typewriter is less reasonable, perhaps. You can refine it even more by saying it also depends on where things are relative to each other. Person above horse is good; person below horse is less good. Things like that. But there are lots of other things you might want to capture with this model, and some of them are notoriously hard to capture when we restrict ourselves to these kinds of pairwise potentials. For example, you might want to say, we expect some things to co-occur in the image, not necessarily next to each other.
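The margin condition and the resulting hinge loss mentioned above can be sketched in the standard structural SVM form (with \Delta measuring how bad a labeling is relative to the ground truth X^{gt}):

    F(X^{gt}, I) \;\ge\; F(X, I) + \Delta(X, X^{gt}) \quad \forall X,

    \mathcal{L}(I, X^{gt}) = \max_{X}\big[\,F(X, I) + \Delta(X, X^{gt})\,\big] - F(X^{gt}, I).

Computing the max again requires (loss-augmented) MAP inference, which is why the inference step dominates training.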
Once you want to capture that, you need to look at much broader interactions than just neighboring pairs of superpixels. You might want to capture things like the shape of entire regions that you recover, and those things are notoriously hard to capture by just locally computing features for superpixels. People have been thinking a lot about this. One example, which I like, is the harmony potentials work, which was the winner of the segmentation competition associated with the VOC challenge in 2010. Here the idea is that you have superpixels and groups of superpixels, and on top there is a global potential which is allowed to look at all of the labels you are assigning. What they did was one of the earlier attempts to leverage image classification to help segmentation, which was somewhat successful. At the time — this is before the CNN era — image classification was already considered a much easier task than image segmentation. Image classification here means: given an image, tell me whether you think it includes any instances of airplanes, or any instances of cows, or any instances of persons. Based on that, first of all, if the classifier says there is no airplane there, you should be very reluctant to assign airplane to any superpixel in the image. In addition, you can also think of types of scenes. Given the distribution of classes the classifier thinks are there — if airplanes, cars and, I don't know, birds have high probability and desks and chairs are low — then intuitively you would say it's an outdoor image and not an indoor image. Other classes which may not be explicitly identified by the classifier still get boosted or squashed by this potential. The idea is to somehow use this global information about the image to help local decisions about what the superpixels are. That was half a percentage point better than the other methods, which was significant at the time. But it didn't really move things forward much more; people tried to improve [indiscernible] approaches, but without a lot of success. I think the major breakthrough came a couple of years later, when people started relying on a very different approach, which you can think of as a pipeline that separates the process into two stages. The first stage produces candidate regions. It says: wait, it's hard to partition an image; maybe we can instead produce a large pool of regions which are allowed to overlap. You would probably hope to have some diversity there, to make sure there are some interesting differences among the regions you have produced. You could have potentially a couple thousand regions, maybe even more, and you hope that some of them are really good matches for the underlying objects. Then you can have a separate second stage in which you take some machine that scores each region, saying: how likely is it that this region is an entire object, or a group of objects, that I'm looking for? This led to a significant jump in the accuracy of segmentation algorithms, in a bunch of different papers in the usual venues, and some of that is actually the work that Groer and I did together on producing a diverse set of regions and re-ranking them.
I think the most recent work in this general line is SDS, the simultaneous detection and segmentation work that Ross participated in. This certainly was a big improvement, but I still felt somewhat unsatisfied by it, because the multistage setup seems unsatisfactory and it is definitely quite a bit slower than we would like it to be. And there is no way to learn the whole thing together, which, as Grover was saying, seems to be the prevalent philosophy today, and which I certainly subscribe to. So when we started this project we wanted to take the structured prediction approach and basically just get better unary potentials. A common intuition in structured prediction, at least in vision in this kind of setup, is that the unary potentials — the individual terms — are the ones that really drive most of the inference. They tell you roughly what you should expect to see, and the pairwise potentials, or maybe even higher-order potentials, help you a little bit to improve the result: they smooth things over, they remove some totally unreasonable combinations. But the main meat of these methods seems to be the unaries. So let's try to get better unaries. We felt that the unary potentials people used were inadequate, and as I'll show you shortly, it turns out you can get much farther than at least we thought we would with just unary potentials. The key idea for why this would be better is that you can shift at least some of the burden of deciding what combinations of labels and what structure of the label space is reasonable from the inference, expressed in the label space, into the feature computation. You shift it into what we compute from the image, and hope that some of the properties we would like to capture, which we discussed earlier, will emerge. And it is pretty clear that if you do this, you need to look beyond just the set of pixels in the superpixel, because some of the features we talked about rely on information beyond the boundaries of the superpixel. The question is how far we need to look. As I'll show you, it turns out to be beneficial to look really far — basically as far as you can. So here is the general gist of what we are doing; I'm going to instantiate it in the next few slides. Suppose you want to classify this superpixel whose boundary is drawn in red. It happens to be part of the headlight of a car. If I just show you the pixels there, you probably would not be able to guess what it is — some sort of shiny object; it doesn't look like part of an animal, but who knows. So this very local feature may be helpful, but we don't expect it to be too helpful. Then we are going to start zooming out from that superpixel and look at larger and larger areas of the image. How exactly these boundaries are computed and what we compute from them I will defer until a couple of slides later; let's just say we extract some useful information from those regions. By the time I get to this yellow or olive-colored region, it is starting to be a little clearer. It may still be hard to say what exactly it is, but you can see a bunch of really straight lines, which should make you think it's some sort of manmade object. It certainly doesn't look like an animal now, and it has some flat, metallic-looking surfaces.
Maybe some sort of vehicle, or maybe a piece of furniture. By the time we get to this purple region, most of you, if I just cropped this and showed it to you, would probably say it's a car, because you now see the radiator, you see the wheel; it is pretty clear what it is. Of course, by the time you get to this larger region, the blue region — although it doesn't look blue here for some reason — you actually see most of the car. It's pretty clear that you are looking at a car. Remember, all of this is in the context of classifying that red part of the headlight, right? By the time you look at the entire image, in this case you don't get much more than what you had from the intermediate level, because the car occupies most of the image. But in many cases you see things other than just the object you are looking at: you see other objects, you see the stuff which surrounds it. Here you would say, well, it's an urban outdoor scene, so a car is very likely. So the idea is to extract features from all of these levels — we call them zoom-out levels because you zoom out from the superpixel all the way back to the entire image — concatenate them, and use them in some kind of classifier to predict the label for the one superpixel you are looking at. And you do this for every superpixel in the image. That is the general gist. Let's now try to define how exactly we want to do this. What properties would we like these features to have? One intuition which I think is common in vision, and certainly is captured by CNNs, as Grover was suggesting, is that as you move from a very small spatial part of the image to a large spatial support, you can extract more complex features, because more things can happen there and you have more information to decide what is happening. As an example, suppose I show you these two superpixels. Does anyone have any guesses what these are? >> A cheetah on the bottom? >> Greg Shakhnarovich: Cheetah on the top? Yes — cheetah is not one of the classes in VOC, but yes. Cheetah and something else? Okay. >> [indiscernible]. >> Greg Shakhnarovich: Sorry? >> Potted plant. >> Greg Shakhnarovich: Potted plant, very good. Okay. Now let's zoom out a little more. Any updates to the guesses? Usually when I saw this first, I couldn't really tell; this looks like a wheel. >> Yeah. >> Greg Shakhnarovich: Wheel, or maybe steering wheel. Okay. What about this? Cheetah? [Chuckles.] >> Horse. >> Greg Shakhnarovich: Horse? Who said horse? Okay. >> Are these from VOC? >> Greg Shakhnarovich: Oh, yes, they are from VOC. Okay, so maybe. Let's zoom out a little more. I don't know, horse? >> Wheel. >> Greg Shakhnarovich: Wheel? >> Steering wheel. >> Greg Shakhnarovich: Steering wheel, and this may be a horse. Zoom out a little more? Okay. >> Oh, a chair. >> Greg Shakhnarovich: Chair. This is like the examples Antonio Torralba used to show a lot for context. So probably -- >> Imagine idea [indiscernible]. >> Greg Shakhnarovich: Exactly. >> [indiscernible]. >> Greg Shakhnarovich: What is the name of the horse? You get bonus points. [Chuckles.] Okay, so yes — maybe it's a donkey, but anyway. Running through them all, it usually becomes clearer, because we see more of the object and we see some surrounding stuff.
In this case we don't see the surrounding stuff that much, but we see a lot more of the object. In this case we start seeing surrounding things. Of course, by the time we get to the entire image, everything makes a lot of sense: it's a chair, and in fact you can see it's some sort of dining room — it has other chairs, it has tables, clearly it is inside a house. Here you see much of this horse and you see some other animal; even if this horse looks sort of weird, it is clearly an animal, and having any kind of animal around increases the probability that it's a horse. You see hay, sky; it's a [indiscernible] image. All of these are things we would like to capture. Obviously anthropomorphizing computer algorithms is dangerous — it's wishful thinking — but it gives you an idea of what we would hope to extract and why zooming out might help us capture these things. It also emphasizes that as you zoom out farther from the original superpixel, you should be able to compute more complex features. Okay. Another thing we should think about is how the features we compute from different zoom-out levels interact for various locations in the image. Consider two locations which are close to each other — immediate neighbors or almost — and consider the different spatial extents from which you want to compute features. For very local zoom-out levels, like individual superpixels and slightly enlarged areas, the features can vary very quickly as you move around the image, because they are so local: if there is a strong boundary there, the color and texture you compute can change dramatically when you move just a few pixels. As you zoom out more, the areas from which you compute the visual information start overlapping more and more, and as a result you start imposing some sort of smoothness. So here, just from this notion of overlapping regions, we get smoothness for free, without having to penalize the underlying assignments later in an inference stage. If you look at zoom-out levels which are still fairly small, the overlap might be minor. By the time you get to the large, let's say purple, regions here, two superpixels which are not even immediate neighbors start having a very large overlap; this one which is far from them still has fairly minor overlap, so we still allow quite a bit of variation. By the time you get to the very large regions, superpixels which are even some distance away from each other will likely have very similar distributions of features, unless there is a dramatic change in what is underneath. So we get a sort of smoothness which is dynamic and, in a sense, adaptive, varying depending on what we actually compute from the image and on the zoom-out level: it varies much faster at the small levels. Yes? >> Do you share computation in computing [indiscernible]? >> Greg Shakhnarovich: I'll talk about it later. Would you like there to be? The answer is that there could be. [Chuckles.] >> Greg Shakhnarovich: There could be, except that we are still working on getting it better. But it certainly should be possible to compute this in a shared way. I'll talk about how we compute the features, and it will be clear how we could actually share the computation.
And of course, once we zoom out to the global level, the entire image, all superpixels in the image will have exactly the same set of features by definition; different images, of course, will have different features. You can think of it as a varying degree of how shared the features are, and as we move up this hierarchy of zoom-out levels they become more and more smooth and capture more and more spatial context, while the underlying complexity grows. You can think of the different kinds of things you can capture with these features. If you are very local, you capture very local properties — we talked about color, texture, et cetera. As you move to the intermediate levels you start capturing maybe some parts, maybe even some small objects, some informative pieces of boundaries which straddle multiple objects and should tell you something about the statistics of one class versus another. If you go to even larger regions, you start capturing maybe bigger objects, large parts of objects, constellations of parts. By the time you get to the global zoom-out level, you capture properties of the scene and what kind of image you are looking at, which includes things not directly related to any specific object or stuff: the distribution of objects, the type of environment — you know, lots of straight lines in the image suggest a manmade environment. It is not something directly tied to any object, but you expect features that capture it to be useful. So this whole list of properties suggests a particular type of architecture in this day and age, right? And that is a convolutional net, because it really fits the bill on all counts. It computes features of increasing complexity as you increase the receptive fields, and it captures things at different semantic levels of representation, as we know from lots of people who have tried to visualize and understand what these networks do. So how can we leverage neural networks to do this? The initial version of this work is still on arXiv — until recently I didn't realize that people expect the arXiv versions to actually be more up to date than the conference versions. So the arXiv version is the preliminary version; people keep asking me if I published it, and I say yes, I published it in CVPR. Initially we combined some features computed by neural networks with some hand-crafted features, and we went through a process which taught me a lot. The bottom line is that every time we dropped some hand-crafted features, we improved performance. Every time we dropped some decision we had made and said, we'll just use all the layers of the neural network, we improved performance. The bottom line is, as Fred Jelinek supposedly used to say, every time I fire a linguist my recognition rate goes up. Basically, every time you prohibit yourself from making decisions, apparently it improves the accuracy. >> [indiscernible]. >> Greg Shakhnarovich: What is that? >> Every time you fire a Gestaltist -- >> Greg Shakhnarovich: A Gestaltist? Yes, I should hire a Gestaltist and then fire them. [Laughter.] >> Greg Shakhnarovich: In other news, there is something called a Gestaltist. All right. So let me now describe -- if you want to remember one slide which summarizes and really gives you a very good understanding of what we do, at both a conceptual and a detailed level, this is the slide.
Let me walk you through it. This is an example of a toy convolutional neural net which has three convolutional layers and two pooling layers, okay? The first convolutional layer has, say, 64 filters. We are now computing a zoom-out representation for this superpixel marked in red. The first convolutional layer is going to compute 64 feature maps which, assuming the right padding, are going to have the same size as the input image, right? So now you have 64 numbers for each pixel. We are going to take all the pixels in the superpixel and average those 64 numbers over them, so we have a single 64-dimensional vector. That is the first zoom-out level feature for the superpixel, eh? While we do this, it is important to think about the receptive field of this feature, right? What is the receptive field of this feature? It is a little more refined than just a rectangle determined by the filter: I mean the set of pixels in the image which affect the values of those 64 numbers. In this case we are using three-by-three filters. Can someone tell me what the receptive field of this feature is, in some concise form? [There is no response.] >> Greg Shakhnarovich: There is a very simple way of thinking about it. It is exactly the superpixel dilated with a three-by-three box, right? Because that is how the convolution is computed. All the pixels which fall within the dilation by this three-by-three box contribute to the values; everything outside does not. So basically it extends one pixel outside the superpixel — almost the same as the original superpixel. Then I do some sort of pooling — max or average, it doesn't matter here. That produces a feature map which is half the resolution of the original image. Now I run another convolutional layer with, let's say, 128 filters. This gives me 128 numbers for each pixel here. But I still would like to describe things in the original image, so I'm going to up-sample it by a factor of two to get back to the original resolution — it can be bilinear interpolation — and then do the average pooling. Now I have a 128-dimensional representation for the entire superpixel. Now, I am not going to ask you what the receptive field is. It's a little trickier — actually it is not that tricky. There is a formula, but you have to spend a few minutes figuring out how to compute it; it's a recursive formula that tells you the size of the receptive field for this feature, because it is a combination of convolution and the sub-sampling due to pooling. But it is not rocket science. The general intuition is that it will grow a little bit because of the convolution and significantly because of the pooling. So the receptive field of the feature here is going to extend beyond the original superpixel by a few more pixels. Then I go to the next pooling and the next convolutional layer, and the same thing happens: now I have to up-sample by a factor of four and average over the superpixel, and now I have an even larger receptive field. If the first receptive field was pretty much the superpixel and the second one was a little bigger, the third one is probably something like this.
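For reference, one standard way to write the recursive formula alluded to above (a sketch; r_l is the receptive field size after layer l, j_l the cumulative stride or "jump", k_l the kernel size and s_l the stride of layer l):

    r_l = r_{l-1} + (k_l - 1)\, j_{l-1}, \qquad j_l = j_{l-1}\, s_l, \qquad r_0 = 1,\; j_0 = 1.

So a three-by-three convolution adds 2 j_{l-1} pixels to the receptive field, while a stride-2 pooling doubles the jump, which is why the pooling layers account for most of the growth.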
So there is this natural growth in the size of the receptive field as I move up the network, as I compute more complex features. But in the end I use those features to describe the superpixel I'm classifying. In this case, since it is a very small network, I'll stop here and concatenate those features — ignoring global features, just the convolutional layers — and I get these three components which together give me a 448-dimensional representation, the sum of these three numbers. >> So [indiscernible] happening at the same time, between this and hypercolumns -- >> Greg Shakhnarovich: I will talk about those when I talk about results. Both hypercolumns and FCN, the fully convolutional networks. There are some interesting differences in how we compute and what we average from which layers. Our results happen to be a lot better than either of those, and I think that's partially because of the different choices we make about how to combine things across levels. Yeah. >> Do you do any reasoning about the magnitude of the different layers? >> Greg Shakhnarovich: Magnitude is -- >> As in, for example, in the input image, if you are not rescaling, you might have values from zero to 255, or negative 128 to 128. >> Greg Shakhnarovich: Right. >> By the time you get to the top layers, you'll have values from negative one to one. If you are concatenating together different features from different layers, just the amplitude of the signal might be very different. >> Greg Shakhnarovich: Yes, that's something I should take care of when I classify those, right? >> I mean, in theory, if you just ignore it, then yeah, the classifier hopefully will sort of learn its way around the fact that there are different amplitudes. >> Greg Shakhnarovich: Right. So there are a few degrees of depth to this. If you are doing a linear classifier on this and you don't do regularization, it doesn't matter, right? For a linear classifier, the only way in which the magnitude affects the results is if you have regularization based on the norm. If you don't, then it actually doesn't matter. >> Right. >> Greg Shakhnarovich: Right? You can scale some feature by a factor of a thousand and the linear classifier will just rescale the corresponding weights. >> The point is, I agree it doesn't really matter. I'm just saying the learning problem would be harder. >> Greg Shakhnarovich: That's right. What we do effectively — for now, at least conceptually, it is still two stages in a sense: we compute all these features and then we classify them. When we classify them, forget how we computed them; we have a feature vector, and you apply the standard normalization tricks you would want to apply in any case, which, for example, means taking some sample images, computing the mean and standard deviation, and normalizing. You could fold this into the process which collects the features, but conceptually it is the same. We do have a multilayer network on top of this, so it is important — we definitely found that it is crucial to do this right — but the process itself is fairly pedestrian in that sense. You have to think about the features having reasonably normalized magnitudes, but beyond that we probably don't care. >> Right. >> Greg Shakhnarovich: Okay. Other questions about this slide?
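As a minimal sketch of the per-superpixel pooling just described — up-sample each convolutional feature map back to image resolution, then average it over the pixels of each superpixel and concatenate across levels — assuming hypothetical inputs: feature_maps, a list of (C, h, w) arrays from the network, and the superpixel label map from earlier (not the actual code behind the system):

    # Sketch only: averages up-sampled conv feature maps over superpixels and
    # concatenates the per-level vectors.
    import numpy as np
    from scipy.ndimage import zoom

    def zoom_out_conv_features(feature_maps, superpixels):
        H, W = superpixels.shape
        sp = superpixels.ravel()
        n_sp = sp.max() + 1
        counts = np.bincount(sp, minlength=n_sp).astype(float)
        levels = []
        for fmap in feature_maps:                        # fmap: (C, h, w)
            c, h, w = fmap.shape
            up = zoom(fmap, (1, H / h, W / w), order=1)  # back to image resolution
            pooled = np.stack(
                [np.bincount(sp, weights=up[ch].ravel(), minlength=n_sp) / counts
                 for ch in range(c)],
                axis=1)                                  # (n_sp, C): mean per superpixel
            levels.append(pooled)
        return np.concatenate(levels, axis=1)            # one long vector per superpixel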
It is kind of an important slide, so let's dwell on it for a few seconds. If there are no more questions, we'll move on. >> So just to clarify, the extent of the zoom-out region is entirely defined by the grown receptive field [indiscernible]. >> Greg Shakhnarovich: Correct, exactly. >> Rather than being defined by some grouping of superpixels [indiscernible]. >> Greg Shakhnarovich: Correct, correct. That's my distinction from the gestalt approach, right? So the features and the zoom-out levels are entirely driven by the network. You can basically think of the regions as obtained by a succession of dilations with three-by-three boxes, because we use three-by-three filters, plus increases of the receptive field due to the resizing in the pooling layers. So these are the stats. The numbers here are computed empirically, by taking a bunch of superpixels and evaluating what the underlying receptive fields were. It is not something we can compute in closed form ahead of time, because all of the sizes depend on the original superpixel, right? You start with it and keep dilating and increasing the size. But we typically have superpixels whose larger dimension is about 30 pixels. So the receptive field — if you consider the bounding box of the field — starts at about 30 and increases from 30 to 36. Then there is a pooling layer. This is for the 16-layer VGG network, which we end up using for most of our interesting experiments. There is a jump because of the pooling, a slow increase until the next pooling, then another jump, and by the time we get to the last layer we really are looking at most of the image — the images are typically around 500 by 300 pixels, and by this point we are looking at a large part of the image. Now, in addition, we have the global zoom-out level. For the global zoom-out you take the image, run it through the network, and take the last fully connected layer, which is the feature representation traditionally used, for example, in ImageNet classification to classify a thousand classes. We take that representation and use it as the global zoom-out representation. And then we do something which is arguably a hack, but I think it makes sense; I'll explain in a second. By some heuristic reasoning we decided on the size of a bounding box around the superpixel, and we take that sub-image and do the same as we do for the global level: we run the entire sub-image through the network and compute the last fully connected layer. Why do we do this? Think about that picture of the dining room. It didn't actually happen in that image, but you can imagine a big window: the picture is clearly inside a dining room, but there is a big window, and in the window you see a pasture with sheep, right? The entire room is clearly indoors, but if you take just some neighborhood around that window, it's mostly outdoors. We found that in some images there is this kind of very different subscene which informs what you should do for pixels in that subscene, differently from the entire global image. That adds a few percentage points, so we ended up using it as well. So all told we have about a twelve-and-a-half-thousand-dimensional feature representation if you concatenate all of those things -- >> Why wouldn't the subscene ... [indiscernible] receptive field?
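Putting the pieces together, the full descriptor per superpixel is roughly the concatenation of the convolutional zoom-out levels with the subscene and global fully-connected features. A hedged sketch, reusing zoom_out_conv_features from the snippet above and two hypothetical helpers, fc_features (last fully connected layer of the network for an image or crop) and subscene_box (the heuristic bounding box around a superpixel); these names are placeholders, not a real API:

    # Sketch of assembling the roughly 12.5k-dimensional zoom-out descriptor.
    import numpy as np

    def full_zoom_out(image, superpixels, feature_maps, fc_features, subscene_box):
        conv_part = zoom_out_conv_features(feature_maps, superpixels)  # (n_sp, d_conv)
        global_part = fc_features(image)                               # (d_fc,), shared by all superpixels
        rows = []
        for s in range(conv_part.shape[0]):
            y0, y1, x0, x1 = subscene_box(superpixels, s)
            sub_part = fc_features(image[y0:y1, x0:x1])                # fc features of the crop
            rows.append(np.concatenate([conv_part[s], sub_part, global_part]))
        return np.stack(rows)                                          # one row per superpixel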
>> Greg Shakhnarovich: Because even with a large receptive field, we are still computing only features from the convolutional layers, which are not quite as complex. >> [indiscernible]. >> Greg Shakhnarovich: They are simpler. You want the really highest-level semantic features, which you can think of as understanding what is going on in the image, but computed only for a limited spatial extent. In some cases it doesn't add anything, because it's the same as anywhere else — you take any part of the image and it's going to look similar, except maybe there are no people here and there are people there. In many images it is not really significant, but in some images it is. Okay. A minor issue here, which is kind of annoying, is that there is a huge imbalance between classes. Most pixels are background in this dataset. For COCO I would actually like to know the stats — what percentage of pixels in COCO are background? >> I'm not sure, but they are mostly background as well. >> [indiscernible] I'm sure it's less than this. >> I would say higher. >> Greg Shakhnarovich: Sixty? So basically the most common class after background is person, and it's almost an order of magnitude smaller. By the time you get to the least common class, bottle, you see it's, in VOC, around a percentage point. It's two, maybe three, orders of magnitude difference. There are basically four things we can do. You can ignore the imbalance, and we know that usually produces worse results. I should probably mention now how we compute accuracy for this task. The standard way, which has issues but has been adopted for the most part for semantic segmentation, is the following. For a given class C, consider all pixels which you predict to be that class, and all pixels which really are that class — all pixels you think are horse and all pixels that really are horse. You take the intersection of those sets over their union, which gives you a number between zero and one. If the number is one, you got a perfectly right set of predictions for every pixel. If the number is small, it could be because you under-predicted — there are many horse pixels which you failed to call horse — or because there are many pixels which you think are horse but are not, or both, right? As either type of error increases, the number gets lower. That tells you how well you did for a particular class, and what the Pascal VOC benchmark does is average this over classes. The effect is that if you have a bottle class which is 100 times less common than the background class, each bottle pixel is going to hurt you much more if you get it wrong, either way, than a background pixel. It's really terrible if you misclassify 100 bottle pixels, and it's okay if you misclassify 100 background pixels, unless you misclassify them as bottle, roughly speaking. This is kind of an artifact of the task, and to some extent it is mitigated once you switch to objects and average over objects, but only to a limited extent, because objects also tend to have different sizes. Since we do want to play this game and have better results, we wanted to optimize for this measure. So you can do one of four things. You can ignore the imbalance, which is going to produce worse results — everybody sees this empirically.
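As a quick sketch of the measure just described (per-class intersection over union, averaged over classes, as in the VOC segmentation benchmark):

    # Sketch of mean intersection-over-union; pred and gt are integer label
    # arrays of the same shape (they can cover the whole dataset).
    import numpy as np

    def mean_iou(pred, gt, num_classes):
        ious = []
        for c in range(num_classes):
            p, g = (pred == c), (gt == c)
            union = np.logical_or(p, g).sum()
            if union == 0:
                continue                      # class absent in both: skip it
            inter = np.logical_and(p, g).sum()
            ious.append(inter / union)
        return float(np.mean(ious))           # average over classes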
You can try to impose balance by subsampling the more common classes. That turns out to be a bad choice: you basically lose a lot of information. The background class is really rich — lots of things are background — and if you reduce it from 70 percent to 7 percent to match, say, person, you are going to lose a lot of information, kind of unnecessarily. You can up-sample the less common classes, and that is conceptually consistent with what we are doing, but it is very wasteful: you are going to go many, many times through the pixels of bottles for every pass through the pixels of person. So what we do is what we think is the best, fourth choice, which is better than the other three: use all the data there is, but weigh the loss. It's a pretty standard thing in machine learning, and I don't know why it is not more widely used in segmentation work. It's a trivial change to any code to introduce this asymmetric loss: you weigh the loss at each pixel inversely proportionally to the frequency of its class. Now, if you make a mistake on a bottle pixel, it is going to cost you more in the objective. That turns out to be significant — not a dramatically huge change, but a few percentage points of improvement if you train with this loss. >> [indiscernible] can you optimize for that loss directly? Right now you have a loss; at the top of your neural network there is a softmax and you are doing log likelihood. You can write down the expected intersection over expected union produced by these scores. >> Greg Shakhnarovich: It is going to be another surrogate. It is all about -- >> So Sebastian had this paper where he showed the expected intersection over expected union approximates it -- >> Greg Shakhnarovich: Yeah, the surrogate. [overlapping speakers.] >> -- up to an approximation factor, and the approximation error goes down on the order of one over N. >> Greg Shakhnarovich: Where N is the number of -- >> Number of superpixels. >> Greg Shakhnarovich: It is not that bad. Well, actually that looks good, maybe. And it's differentiable. >> We have this paper that just couldn't get published because we didn't have a good segmentation model. But we showed that you can differentiate through IoU and train it. >> Greg Shakhnarovich: We have a good segmentation model, but okay. We should talk more about that, maybe. Okay. [chuckles.] >> Greg Shakhnarovich: I mean, on the other hand, there is maybe something unseemly about trying to optimize this measure which everybody criticizes, but -- >> [speaker away from microphone.] >> Greg Shakhnarovich: We can say how bad it is and then do it, because we want to win the competition. >> This is over the entire data set, right? >> Greg Shakhnarovich: Correct. Oh, yeah, it is over the entire data set, and of course there is an approximation there you can't avoid, because it is going to be hard to optimize. Even if you knew the ground truth and you had superpixels, finding the optimal assignment of superpixels under this measure for the entire training set is going to be hard, because, right -- >> The thing is, you have, I don't know, a small motorcycle in an image. If you miss the entire motorcycle, those 20 pixels, who cares? It's such a small number of pixels compared to the other images of motorcycles. >> Greg Shakhnarovich: Sure, sure, but if you do it for many pixels, for many motorcycles. >> Yeah. >> Greg Shakhnarovich: It's an approximation of an approximation, right?
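A minimal sketch of the inverse-frequency weighting just described, assuming a softmax classifier over superpixels and written with PyTorch's cross entropy for concreteness (not the actual training code from the talk):

    # Sketch only: class weights inversely proportional to class frequency.
    import torch
    import torch.nn.functional as F

    def weighted_loss(logits, labels, class_frequencies):
        """logits: (N, C) scores; labels: (N,) targets;
        class_frequencies: (C,) fraction of pixels belonging to each class."""
        weights = 1.0 / torch.as_tensor(class_frequencies, dtype=torch.float32)
        weights = weights / weights.sum()     # optional: keep the overall scale sensible
        return F.cross_entropy(logits, labels, weight=weights)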
There are at least three levels of approximation here that I can think of: the individual image approximating the dataset, the pixels in the image approximating the decision over the whole image, and so on. But empirically we found this to be an improvement over not doing it, and that is the bottom line. I am only mentioning it because I really don't understand why, out of the hundred or so segmentation papers I looked at, I found only two that actually used this kind of asymmetric loss. And in fact they reported better results than without it. So it is kind of strange that it is not more commonly used. All right. Now some results. We take the 16-layer network I mentioned before and extract the features. The first thing we did was train a simple linear model, a softmax on top of those roughly 12,000 features, and look at how much each level contributes by taking subsets of features. We didn't run all possible combinations; we grouped the 16 layers roughly into groups, cut before each pooling layer. The first two layers get you 6 percent average IoU, which is pretty bad: better than chance, but not by much. If you go to four layers it is significantly better; now it's ten. By the time you take all 13 convolutional layers, you get to a number that would have been state of the art four and a half years ago. So if you had a time machine, you could go back and win. But then it wouldn't be published, because it uses a neural network.
[laughter.]
>> Greg Shakhnarovich: There is no way to win there. Now, this is the most dramatic thing: if you take the 16 layers and add the global representation, you jump a huge amount, from 42 percent to 57.3. That would have been state of the art, I guess, eight or nine months ago; in fact, when we were preparing for [indiscernible], it was state of the art, and we were very excited. As we added layers and fixed some bugs and came close to this number, at some point we passed the then-current state of the art, and we were doing it without any region proposals and without any pairwise potentials. So we got very excited about the idea of unary potentials only: let's get rid of pairwise potentials, of structured prediction. Who needs that? We'll just do everything here.
>> It is not just another pairwise potential, it is a structured loss?
>> Greg Shakhnarovich: A structured loss, this is true, which is not used here. So in fact there is no structure here at all; each superpixel is labeled independently. Now let's look at what the individual levels contribute. If you only have the subscene and scene levels, this is significantly worse; that's basically because you fail to capture a lot of the local information. There are a couple of interesting combinations missing here, but we did look at a few, so if you are wondering about a particular combination, I might know the answer. This row shows what happens if you ignore the local features up to the seventh layer; at that point, if I remember right, you start with receptive fields that are already quite large, something like 130 pixels, and you add the global features. You get a number that is a bit lower; add the subscene and it is a little bit better. If you add everything together, you get 58.6.
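To fix ideas about what the feature vector for one superpixel looks like in this scheme, here is a rough sketch of concatenating local, subscene, and global levels before the linear softmax; the dimensionalities and names are placeholders, not the actual architecture from the talk.

```python
import numpy as np

def zoomout_vector(local_feats, subscene_feat, global_feat):
    """Concatenate the feature levels for a single superpixel.

    local_feats:   per-superpixel vectors pooled from each group of
                   convolutional layers (increasingly large receptive fields)
    subscene_feat: features of a larger window around the superpixel
    global_feat:   features describing the whole image
    """
    return np.concatenate(list(local_feats) + [subscene_feat, global_feat])

# Hypothetical sizes, only to show the shape of the representation.
local = [np.random.randn(d) for d in (64, 128, 256, 512, 512)]
subscene, whole_image = np.random.randn(4096), np.random.randn(4096)
x = zoomout_vector(local, subscene, whole_image)
print(x.shape)   # one long descriptor per superpixel, fed to the classifier
```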
So this change is relatively small, but keep in mind that the first seven layers contribute only something like five or six percent of the entire feature representation, because the number of feature maps is fairly small initially. So if we get a 0.6 percentage point improvement from them, that matters: for a while, half a percentage point was enough to win, and it is statistically significant. One of my advisors used to say: statistically significant, but maybe not important.
[chuckles.]
>> Greg Shakhnarovich: There is no reason not to do it if we can, especially given that it is a relatively small number of additional features. These are all numbers on val, which we didn't touch during any of the training; the standard partition into training and validation images. And this was a linear model. Well, we don't have to use a linear model; we can use any classifier we want. Before I talk about that, let me show you graphically what happens. This is maybe the most typical kind of demonstration. This is the ground truth, a person. As you go from the very local features to larger and larger zoom-out levels, you get a somewhat better separation of foreground from background, but you also get all these nonsensical predictions: I think this is dog, I think this is bird. The simple features in the convolutional layers can't resolve that. By the time you add the subscene features, it basically says, well, there is no dog here, there is no bird, there is a person. So a lot of these labels go away and are replaced by the right ones. Maybe dog is killed here (not the actual dog, just the dog labels), and the next most reasonable thing is person, so that gets substituted. If you use the full representation, in this case it is slightly worse in terms of the person boundary, but it certainly removes a lot of the spurious incorrect labels. Similar things happen in other images. There are two things you notice immediately here. First, we know that we lose about 5 percent accuracy because of superpixel boundaries, and that explains this jagged look: there is very little local edge information here about the bird versus the water, and the superpixels have a really hard time localizing those boundaries well. At the assignment level we still get the right high-level boundary between bird and background, but locally it is jagged, and it would probably improve if we cleaned up the superpixel boundaries. Second, even in these final segmentations with the full zoom-out representation, there is a lot of noise, and not just jagged boundaries: we get some very irregularly shaped boundaries even where a straight boundary, for the train here, would be much more reasonable. You can argue that this could be improved significantly if you brought back the previously dismissed idea of structured prediction. I'll talk a little bit about the current state of the art in a minute, and people do seem to get a lot out of it. I still suspect that we might not need it, but that remains to be seen. So, going back to classification: linear classification gets you 58.6 percent, which was about 7 points better than the published state of the art when we did the CVPR submission.
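The "lose about 5 percent because of superpixel boundaries" number comes from an oracle that gives every superpixel its best possible single label; a minimal sketch of that upper bound, assuming integer label and superpixel id maps, might look like this.

```python
import numpy as np

def superpixel_oracle(gt, sp, ignore_label=255):
    """Best labeling achievable when each superpixel gets a single label.

    gt: ground-truth label map; sp: superpixel id map of the same shape.
    Each superpixel is assigned its majority ground-truth label; scoring
    the result against gt (e.g. with a mean-IoU routine) gives the
    achievable upper bound the talk refers to.
    """
    out = np.full_like(gt, ignore_label)
    for s in np.unique(sp):
        mask = sp == s
        labels = gt[mask]
        labels = labels[labels != ignore_label]
        if labels.size:
            out[mask] = np.bincount(labels).argmax()
    return out
```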
If you use the seven-layer network instead, it's a lot worse. In fact the relative gain from depth is even larger than what people typically observe for classification; it seems that for segmentation it is really important to have more layers, a richer representation. But there is no reason to restrict ourselves to a linear classifier; we can use a nonlinear one. Given that we already had all the machinery set up, the obvious choice was a multilayer neural network; not a convolutional network, just what used to be called a multilayer perceptron thirty years ago. With a three-layer network you get a huge jump, to 68 percent, a significant improvement. We couldn't squeeze any more out of more layers, more hidden units, dropout, et cetera; it basically stopped there.
>> Do you backprop through the [indiscernible]?
>> Greg Shakhnarovich: No, no. And this brings me to what I'll talk about in a second, what we are doing now. This setup is the way it is for no good reason, essentially, just because it was the easiest thing for us to do. We have a separate stage where we use an interface to Caffe to extract the features, save them to disk, and then, in a completely separate stage, treat them as features someone gave us (we gave them to ourselves) and run a classifier on those. The only thing backpropagation goes through here is the three layers of the classifier. I should say that the network we use here is exactly the VGG network trained on ImageNet. It doesn't know anything about segmentation; we literally take it as is and use it, and it gets us this result. One interesting thing we tried, because one of the CVPR reviewers asked for it: we had a rough prediction of what would happen, and it mostly came true. They said, why do you need superpixels? What if you just use rectangular regions? We said, fine, we'll try. It doesn't change anything else, it is still the same zoom-out representation, so how much do we gain from superpixels? It turns out we gain quite a bit. Exactly the same architecture and machinery, but with a regular grid of rectangles, gives us about 64 percent, and the achievable upper-bound accuracy drops by an even larger amount. So what you lose is basically the ability to localize boundaries well, which superpixels give you most of the time.
>> What happens if you change the resolution of the superpixels?
>> Greg Shakhnarovich: We haven't experimented with this extensively. We have experimented a little bit, and extrapolating between that and my general intuition, I can tell you what I think will happen. If you increase the number of superpixels by a factor of two, your achievable accuracy goes from 94.4 to about 98 percent, ninety-seven point something, but you might gain only about half a percentage point to one percentage point in actual segmentation accuracy, not more. At that point I think you start picking up noise. Look, there are two things you gain from superpixels. One is expense: it's cheaper to label 500 things than a million. But the other is localization of boundaries and the amount of noise. You lose when you label a whole superpixel wrong, but if you label it right, you get all of its pixels right for free. If you have 20 times more of them, a thousand might be wrong; there are a lot more chances to get it wrong.
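The nonlinear classifier described here is just a small multilayer perceptron over the precomputed per-superpixel features; a sketch in PyTorch, with layer sizes that are guesses rather than the reported configuration, might look like this.

```python
import torch
import torch.nn as nn

num_features, num_classes = 12416, 21   # hypothetical zoom-out dimensionality

# Three-layer perceptron over precomputed superpixel descriptors; the
# softmax is folded into the training loss, so the last layer emits scores.
mlp = nn.Sequential(
    nn.Linear(num_features, 1024), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1024, num_classes),
)

x = torch.randn(32, num_features)       # a batch of superpixel descriptors
scores = mlp(x)                         # (32, num_classes) class scores
```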
So I don't know exactly what the right setting of that knob is, but my feeling is that if you want to work with superpixels, this is roughly correct. A more interesting thing to do, which we haven't explored, is to turn the other knob in SLIC and make the superpixels even less regular and see what happens; it is possible that would be a better way to do it. But eventually, one thing that nobody ever calls me on for some reason: I present this as a fully feedforward architecture. We start with the image, compute the features, classify them. Of course, the superpixels are not a feedforward part; there is an actual loop in there doing something like k-means. What you would really like is to make the superpixels somehow a feedforward part of the network. We're thinking about how to do that; it is not clear at all, and if you have any thoughts, I will be happy to discuss them. Then we would learn all of this and not have to separately tweak the number of superpixels or any other knobs. All right. So, yeah?
>> How much does performance [indiscernible] if you keep the standard form of the loss?
>> Greg Shakhnarovich: Our initial arXiv paper claimed a huge drop, but you need to run it for a lot longer. If you use the standard symmetric loss and let it run properly for a few more days, you get about three percentage points lower than this. That is still quite a bit, so there is no reason not to do it, but we certainly had to walk back a little bit the claim about how important it is.
>> The growing of the receptive field is also not learned, right? That's a consequence of the parameters?
>> Greg Shakhnarovich: It is a consequence of the --
[overlapping speakers.]
>> Greg Shakhnarovich: -- of the choice of the architecture of the network. You could learn it, with some experimentation or something. Okay. So where does this fall relative to the state of the art? Initially, before the CVPR submission, we used the seven-layer network, and we had something like 59 percent, I think, with a lot of hand-crafted choices. I am proud of how we got that number; it was a very nice, very good result. We kept working on this; we had 59 percent at the time, and now we have 62. There are two papers from Berkeley, hypercolumns and fully convolutional networks. First, just to give you an idea of how things have been moving: this is 2010, this is 2012, so this is the progression of numbers. Then these are, I think, the three best results published at CVPR; I'm not sure. And I'm fairly certain that ours, the one actually published at CVPR, the one I just went through with the 16-layer network and everything, the [indiscernible].6 on the test set, is the best result among the papers published at CVPR. As of lunchtime a couple of days ago it was 66.4 [indiscernible]; I haven't checked in the last few hours, it's possible it went up. So what explains this gap? I think a few things, and we pretty much hope we can get many of them right once we reimplement this, which is what we are doing now.
>> [speaker away from microphone.]
>> Greg Shakhnarovich: So that is the strongest one. Here is the thing: if you look at the numbers, these numbers are with structured prediction, a CRF on top of the CNN, trained on COCO data in addition to Pascal, and, by everybody including these two, with the network properly fine-tuned, end-to-end learning for the segmentation task.
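For reference, the superpixel knobs mentioned above can be played with directly in scikit-image's SLIC, which is indeed an iterative, k-means-like procedure rather than a feedforward step; the parameter values here are only illustrative.

```python
from skimage import data, segmentation

img = data.astronaut()
# Higher compactness gives more regular, grid-like superpixels; lower
# compactness lets them follow image boundaries more freely.
sp_regular = segmentation.slic(img, n_segments=500, compactness=20.0)
sp_loose = segmentation.slic(img, n_segments=500, compactness=5.0)
```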
But if you look at the results with those pieces removed (we don't fine-tune, so look at the entries with no CRF, trained only on Pascal data), I think the best result was 72 percent. That is only two to two and a half percentage points higher than us. It is entirely conceivable, and we need to show it, that once we properly fine-tune the network we will do better than that. And if you train on COCO as well, you might get an even higher number. I am still not convinced empirically, I wouldn't be able to support it yet, but my hunch is that once we do it properly we will get results as good as with a CRF. And of course, if we then add a CRF on top of this, we could get even better results. It's possible. Now, why haven't we done all of these things, trained on COCO, fine-tuned the network? The main issue is that we can't, because of this poorly designed external setup where we extract features, save them to disk, et cetera. To implement it properly, there is a natural implementation of this as a fully connected network. I'm saying this because, as we all know, all networks are fully connected.
>> Convolutional.
>> Greg Shakhnarovich: Convolutional. I'm sorry, what did I say? Fully convolutional. I'm so glad I did not do that at CVPR; I escaped the Wrath of the Khan there. I kind of stayed low.
[chuckles.]
>> Greg Shakhnarovich: Anyway, there is a natural way to share computation. As you said, you just represent this last -- actually, we thought about this when we were writing the paper, but we didn't quite do it. The main tricky part, if you are interested in the technical nitty-gritty, is this: if you go back to this picture, you can pretty much Frankenstein it from existing layers in Caffe, for example. I don't know if people here use Caffe; certainly they are looking at it. There is a deconvolution layer which will do the upsampling, but it kills the memory: if you start upsampling everything to full resolution, by the time you get to the high convolutional layers, which have thousands of channels, you have to hold the full-resolution representations for all the batch images, and that is just impossible to do. But you don't actually have to do it. Conceptually you do, but you can fold the upsampling into computing the responses, because upsampling is equivalent to just weighting the lower-resolution pixels with different weights. Implementing that in Caffe has been a somewhat challenging process for us, with high potential for bugs, but I think we are pretty much getting there now. Hopefully within a few weeks we will be able to run this experiment and release the code and everything. That has been the main challenge. Another, equivalent thing is that you might want a pooling layer which takes arbitrarily shaped regions; we tried building that too, and it is a good layer to have, but we might not need it if we do it the way I described. So that is the main thing we are doing now: trying to get end-to-end training. Once we have it, we can also train on COCO easily, et cetera. The other thing I mentioned is superpixels. Do we need them? It seems like we are benefiting from them. That is one of the distinctions between our approach and hypercolumns or FCN.
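The observation that upsampling can be folded into the feature lookup, since bilinear upsampling is just a fixed weighting of low-resolution cells, can be sketched as follows; this is not the Caffe implementation being described, only a NumPy illustration of the idea.

```python
import numpy as np

def bilinear_lookup(coarse, y, x, stride):
    """Feature of full-resolution pixel (y, x), read from a coarse map.

    coarse: (C, h, w) feature map computed at a lower resolution; stride is
    how many image pixels one coarse cell covers. The result is a weighted
    combination of four coarse cells, so the full-resolution tensor never
    needs to be materialized. Assumes (y, x) maps inside the coarse grid.
    """
    fy, fx = y / stride, x / stride
    y0, x0 = int(fy), int(fx)
    y1 = min(y0 + 1, coarse.shape[1] - 1)
    x1 = min(x0 + 1, coarse.shape[2] - 1)
    wy, wx = fy - y0, fx - x0
    return ((1 - wy) * (1 - wx) * coarse[:, y0, x0]
            + (1 - wy) * wx * coarse[:, y0, x1]
            + wy * (1 - wx) * coarse[:, y1, x0]
            + wy * wx * coarse[:, y1, x1])
```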
So hypercolumns work with the bounding box of an entire region, which comes from region proposals, but they don't really explicitly look at superpixels. FCN, fully convolutional networks, doesn't use superpixels at all. The other distinction: hypercolumns don't have the global representation. They have something almost like our subscene, but not quite, because it is limited to the bounding box around the object itself; it ignores everything around it. FCN, on the other hand, has something similar to our subscene, maybe even slightly larger, but it doesn't have the local-level features, because it only starts pooling information relatively high in the network. More importantly, I think, FCN doesn't pool features; it pools predictions. Basically what they do is take some intermediate convolutional layer and make predictions there, then at the next layer upsample those a little bit, by a factor of two or four or maybe eight at most, and pool the predictions. We defer; it's like the early versus late fusion distinction. We defer all the averaging of features until the very last moment, when we have all the information, and only then make a decision. I think that is really important, and it probably explains the significant jump in accuracy. So, anyway: the superpixels, we need to learn how to fold them into the network; and for inference, we are looking, as many people are now, at ways to fold inference and CRFs into the network and represent them as a feedforward process. It is pretty obvious how to do that at this point, I think; it remains to be seen how important it is. All right, it seems like I have overstayed my welcome a little bit, hopefully not too much. So, questions?
[applause.]
>> What do you see as the main failure cases right now?
>> Greg Shakhnarovich: I don't know if I have an exact answer; there are multiple sources of failure. We haven't done a [Hoiem]-style failure diagnosis, but from eyeballing lots of images it seems like, okay, we lose a few percentage points because of the imperfect superpixel representation. We lose some percentage to horses labeled as sheep and similarly confusable categories. The most annoying thing, which I think is really bad, is noise, like here, where the shoe of that guy is labeled as motorbike because there is a motorbike in the scene. The global features are probably very strong here, and we know that a lot of the accuracy comes from them; this region looks similar to the motorbike, so it says motorbike. Clearly some hierarchical model or CRF would improve this, and there are many similar cases; this labeling is very noisy. I think that is the main thing I would like to improve.
>> It seems like the boundaries have a lot of issues to do with superpixels as well. Is Michael shouting in your ear the whole time that we should be evaluating this based on boundaries, on whether the boundaries are correct, rather than [IoU]?
>> Greg Shakhnarovich: I haven't known Michael to shout.
>> [speaker away from microphone.]
>> Greg Shakhnarovich: Right.
[chuckles.]
>> Greg Shakhnarovich: Hmm, I don't think so.
>> I can see that would be nice.
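The contrast between pooling predictions early and pooling features late can be written schematically; this is only a toy illustration of the two fusion orders, not either paper's actual architecture, and all names and sizes are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Per-level features for one location, one (hypothetical) linear classifier
# per level, and one classifier over the concatenation.
feats = [np.random.randn(256), np.random.randn(512), np.random.randn(4096)]
per_level_W = [np.random.randn(21, f.size) for f in feats]
W_all = np.random.randn(21, sum(f.size for f in feats))

# "Pool predictions": decide at each level, then average the class scores
# (roughly the early-decision scheme described above for FCN).
pred_early = np.mean([softmax(W @ f) for W, f in zip(per_level_W, feats)],
                     axis=0)

# "Pool features": defer the decision and classify the concatenation once
# (the late fusion the zoom-out approach argues for).
pred_late = softmax(W_all @ np.concatenate(feats))
```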
>> Greg Shakhnarovich: Well, we are actually doing something related with him now, naturally: not just evaluating, but training based on boundaries as well. So yes, there is a possible improvement there.
>> Like a shift for the field would be to actually evaluate on that.
>> Greg Shakhnarovich: Then you want to do semantics. Okay, so one thing I did. There is the Berkeley paper, [Bharath's] paper, which introduced this additional semantic segmentation task; there is a name for it, semantic boundaries, semantic contours. There was a recent CVPR paper by [Pirarri] which improved the accuracy on that. So I ran the following baseline: I took our segmentations and intersected them with thresholded UCM boundaries, and got something about twice as accurate as that CVPR paper. So it is already --
>> [speaker away from microphone.]
>> Greg Shakhnarovich: Yeah, we discussed this with Michael; we actually came up with the baseline together. Maybe we will do something with it; you can't really publish just that. The problem is that we had been working, and still are, on improving semantic boundaries, and we said, well, let's try this baseline. The baseline completely killed us and everybody else; it is twice as good as the current state of the art. It shows that once you have a good segmentation, you might actually not need --
>> There is meaning to it now.
>> Greg Shakhnarovich: I agree. But the point is that it is already not clear how much difference there is between the boundary measure and this one. Also remember we have the statistic that says if we knew how to label the superpixels perfectly, we would only lose 5 percent in intersection over union. Admittedly it is a very different measure, but my sense is that as you approach 100 percent on one you approach 100 percent on the other; at 100 percent they are equivalent. So I don't know how important it is. I also think semantic segmentation is not a real task in general; segmentation is not a real task. If you go to domains where you actually might want to do this in a meaningful way, medical imaging for example, then boundaries would be the killer issue, because you could literally kill patients: in some cases, apparently, one of the main features the doctor looks at is how jagged the boundaries are. Jagged versus smooth is roughly malignant versus benign lesions. There, if you are 20 pixels off but the shape is captured correctly, they will make the right classification; if you have 99 percent overlap and the boundary is 99 percent accurate but has the wrong character, it will be the wrong classification. So it's really tricky, and it is hard for me to choose which one to focus on. The noisy labels, which could be removed by some sort of structured inference, seem both the most appealing to me right now and the lowest hanging fruit to some extent. So we are focusing on that for now; we'll see about boundaries later. Other questions? Okay.
>> Okay, thank you.
[applause.]
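As a footnote to the boundary baseline mentioned in the discussion, a minimal sketch of "intersect segmentation boundaries with thresholded UCM contours" could look like this; the threshold is a placeholder, the per-class bookkeeping of the semantic boundaries benchmark is omitted, and the UCM map is assumed to come from an existing contour detector.

```python
import numpy as np

def label_boundaries(seg):
    """Pixels where the predicted label changes between 4-neighbors."""
    b = np.zeros(seg.shape, dtype=bool)
    b[:-1, :] |= seg[:-1, :] != seg[1:, :]
    b[:, :-1] |= seg[:, :-1] != seg[:, 1:]
    return b

def boundary_baseline(seg, ucm, thresh=0.1):
    """Keep only segmentation boundaries that coincide with strong contours.

    seg: predicted label map; ucm: contour-strength map (e.g. from gPb/UCM)
    of the same shape.
    """
    return label_boundaries(seg) & (ucm > thresh)
```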