>> Phil Chou: All right. Welcome, everyone. Thanks for coming. Today Arrigo and I have the distinct pleasure of hosting Ivana Tošic from Ricoh Innovations in Menlo Park, California. Ivana got her Ph.D., I think around 2009, from EPFL. She spent a few years after that at Berkeley at the Redwood Center for Theoretical Neuroscience, and after leaving there went to Ricoh. So we're very excited to hear what she's got to tell us. Thanks.

>> Ivana Tošic: Thank you, Phil. Thank you for the introduction and for the invitation to come here. It's a great pleasure; it's my first time at Microsoft Research. I am currently with Ricoh Innovations, but everything I'm going to talk about today I did during my post-doc at the Redwood Center. I'm going to talk about dictionary learning for 3-D scene representation, and I will explain during the talk what dictionary learning is. The whole goal of this particular work, and of my research, is: what are the best representations for 3-D scenes, how do we acquire them, and how do we build those representations from the images and information that we acquire?

This is joint work with Bruno Olshausen, who was my post-doc advisor at U.C. Berkeley; Jack Culpepper, a Ph.D. student at U.C. Berkeley who is now with IQ Engines, also in Berkeley; and Sarah Drewes, who was a post-doc at Berkeley in the math department and is now with Deutsche Telekom in Germany.

Before going into 3-D representations, I want to give you a little bit of motivation and look at how we actually capture information about 3-D scenes. There are many ways to do it, and here I show the two most used. One is a network of cameras capturing a 3-D scene, so we have multiple views of that scene from different angles. Another way is to use hybrid image-and-depth sensors. This is a time-of-flight camera from PMD, a German company; it captures depth information, and I will usually show depth maps in color, where red represents closer and blue is further. You also get an intensity map of the object. Or there is the Microsoft Kinect, where you have the depth map and the color image co-registered.

With both ways of capturing we get rich 3-D visual information, and it has many applications: 3-D television, surveillance, robotics, exploration, just to name a few. But what is worth pointing out is that depth, or disparity, is central to both of these approaches to capturing 3-D scenes. In the camera network, depth is contained implicitly, through the parallax between the different views. With hybrid sensors, we have explicit measurements of depth, the distance from the sensor. And we need that depth information if we want to interpolate between distant views: with current technologies we cannot have views from every single point in space, so we sample the space of camera positions and then interpolate in between and extrapolate.

But there are problems with depth, and the main problem is that its measurements are unreliable, so we usually get very noisy data. Here I'm showing you an example from a time-of-flight camera; this is the intensity, or reflectance, image, and this is the depth image. If you zoom in, you can see a lot of salt-and-pepper type noise. This is much different from the noise we see in regular images. Or we can look at laser range scanners; this is a tree in front of a pillar.
And you can see that there are errors in acquisition around the boundaries of the objects. And this is from structured light sensors, where you have missing pixels where there are occlusions between the two views of the scene. So the bottom line is that for different types of depth sensors, we need algorithms for denoising, to remove the noise, or inpainting, to fill in missing information. These are ill-posed inverse problems, and to solve them we need some prior information about the data we are reconstructing, so for depth. And to solve them we actually need good representations.

The idea of using representations to solve inverse problems is not new; it's been used for decades in image processing. Various bases have been used to represent images, such as wavelets, and in the last decade we've seen a lot of people using overcomplete dictionaries, where you have a much larger number of elements in your basis than the dimensionality of your signal. Here I'm just showing an example of a dictionary that's been learned from a database of images. If you vectorize the elements of the dictionary, which we usually call atoms, and put them in a large matrix, you have a big fat matrix, and to reconstruct the signal -- here I'm just showing an image patch -- you multiply it with a tall coefficient vector. And if you want to do this really efficiently, you can say, well, I want this vector to be sparse, to have a small number of non-zero components.

Now let's go back to our problem of noisy images. What we observe is not exactly this signal, but the signal with added noise. So the denoising problem, if we are using representations, becomes estimating this vector A, the vector of coefficients, to reconstruct our original signal. In denoising, for images people usually assume either stationary Gaussian noise or a Poisson distribution of the noise. For the Gaussian it's usually considered to have the same variance throughout the whole image, and for Poisson it's linked to the intensity of the image. So these are pretty well-known models for images.

But what about depth? In depth we have seen that the noise is usually spatially varying, so it's what I will call nonstationary -- not in time, but nonstationary across the image. We could also add the time dimension, but for this talk I'm going to stick to still depth maps. A bigger difference is that the statistics of depth maps differ from the statistics of images. Just by looking at a depth map you can see there's much less texture than in images, and there are many more sharp, oriented edges. So if we do denoising with wavelet thresholding, the simplest representation, we get a lot of ringing effects around the edges. Or we can say, fine, we know orthogonal wavelets are not optimal, so we can use transforms that represent boundaries better, and we get better results. Or we can say we're not going to use representations at all, we're just going to use some local filter or non-local means, and this is non-local means denoising.

>>: So that is the depth from --

>> Ivana Tošic: Yes, this is done only on depth. You can see this is a great result, but if you zoom in for details you can see it lost some details here. This is a part of my hair, and it smoothed it out, along with a lot of the information in the depth map.
So we looked at this problem and said, well, let's see if we can learn sparse representations for that. Not just start from representations that exist for images, like wavelets or curvelets, but learn them from the data. In the rest of the talk, I will first go briefly through sparse representations, just as background, and please interrupt me at any time if you want clarification on anything. Has anyone heard about sparse representations? Is there someone who is not familiar? Okay, good, we can go quickly through that. Then the two parts of the talk will be, first, learning sparse representations for depth only, and second, newer work on learning representations jointly on images and depth, so intensity and depth. For both of these I will show you how we model the data, how we learn the dictionaries, and what results we get on inverse problems.

Okay. Sparse representations are also linear representations, like transform coding with wavelets and curvelets, but the difference is that now we have this big dictionary which is overcomplete: it has a much larger number of basis functions, or atoms, than the dimension of the signal. So our coefficient vector is a long vector, and reconstruction becomes a combinatorial problem. It's not easy to find these coefficients, and we have an infinite solution set. So what people have proposed is to look for a sparse solution, and there have been some suboptimal algorithms like matching pursuit or basis pursuit denoising. I will quickly explain basis pursuit denoising on the next slide, because that's the type of algorithm we use. And then another, bigger problem is: great, we can get the coefficients, but how do we find this big dictionary? Nobody's going to give it to us. For that, people have developed dictionary learning, also called sparse coding methods: from given data, learn the best dictionary for that data.

Okay, so this is just one slide on how we find sparse solutions once we are given a dictionary, and then we'll go into the dictionary learning problem. If we want to find the sparsest solution to our reconstruction problem, basically we want to minimize the L0 norm of our coefficient vector, so that it has the smallest number of nonzero entries, under the quadratic constraint that the approximation error is bounded. And this is NP-hard. So what people propose is to relax the L0 norm to an L1 norm, have that as the objective function, and keep the quadratic constraint, which is convex. It can also be formulated as an unconstrained problem. The advantage, of course, is that this is convex and easy to optimize. That is basis pursuit denoising, proposed by Chen, Donoho and Saunders in 1999. So, great: once we have the dictionary, we know how to find a vector of coefficients. It's not necessarily the same solution as with the L0 norm, but in most cases it's a pretty good approximation.

Then, of course, you wonder: why would we use sparsity? It turns out that it is a really good prior for solving inverse problems, such as denoising and inpainting, because it gives a good generative model of the images. I'm showing here an example, again on images, but I'll show you depth maps in just a second. This is denoised using sparse representation in translation-invariant wavelet frames.
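A minimal numerical sketch of the unconstrained basis pursuit denoising inference described above, written in Python with NumPy. The dictionary D, the patch y, and the sparsity weight lam are stand-ins, and the iterative soft-thresholding (ISTA) loop is just one simple solver among many; this is a sketch, not the solver used in the talk.

    import numpy as np

    def soft_threshold(v, t):
        # Proximal operator of the L1 norm: shrink each entry toward zero by t.
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def basis_pursuit_denoise(D, y, lam=0.1, n_iter=200):
        """Minimize 0.5*||y - D a||^2 + lam*||a||_1 with ISTA.

        D : (n_pixels, n_atoms) overcomplete dictionary (columns are atoms)
        y : (n_pixels,) vectorized noisy patch
        """
        a = np.zeros(D.shape[1])
        L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
        for _ in range(n_iter):
            grad = D.T @ (D @ a - y)           # gradient of the quadratic data term
            a = soft_threshold(a - grad / L, lam / L)
        return a, D @ a                        # sparse coefficients and denoised patch

    # Example with random stand-in data: a 16x16 patch and a 2x overcomplete dictionary.
    rng = np.random.default_rng(0)
    D = rng.standard_normal((256, 512))
    D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
    y = rng.standard_normal(256)
    a, y_hat = basis_pursuit_denoise(D, y)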
By adding overcompleteness to our representation we can get better results than, for example, plain wavelet thresholding. It's a simple example showing that overcompleteness already gives us something, and sparsity as well. But what we can also do, and this is really important, is adapt the dictionary to the signal statistics. So we don't have to use these translation-invariant wavelet frames; we can learn our own. And this is --

>>: You're saying that you learn the distribution of the noise while you're learning the dictionary?

>> Ivana Tošic: At this point I'm just giving you the background, which is learning the statistics of the signal. Let's say you have no noise in your signal, or the noise is subsumed in the approximation error. So you say, I'm going to find a signal model, up to this approximation error, that represents my signal well. How we handle noise in depth I will show you in just a sec.

So people have previously proposed dictionary learning, or sparse coding; it was first proposed by Olshausen and Field in '97. This is maximum likelihood learning, where they said, well, we'll have a linear image model, and for the learning we're going to maximize the log likelihood that natural images arise from the model above, given the dictionary. So they posed this as an optimization problem: maximize the log likelihood over the set of dictionaries. They used training images from the van Hateren database of natural images. Okay, we have a dictionary -- but before I show you what they obtained as the dictionary, I'll show you how they actually solved this problem. They decomposed the likelihood into the conditional probability of the images given the dictionary and the coefficients, times the prior on the coefficients. It's important to put the prior on the coefficients, because that's where we put the sparsity assumption. So you take this prior to be a sparse, heavy-tailed distribution, and this one is just a Gaussian distribution if the noise is assumed to be Gaussian. It turns out that this optimization problem can be cast equivalently as minimizing an energy of this form: the quadratic approximation error plus lambda times the L1 norm of the coefficients. You can see this is exactly the same objective as for basis pursuit denoising, except that here the unknowns are both the dictionary and the coefficients.

To solve it, they proposed a two-step alternating optimization. They have a big set of natural image patches, and they initialize the dictionary randomly. In the first step they fix that dictionary and minimize the objective over the sparse coefficients -- basis pursuit denoising again. Then in the next step, the learning step, they fix the sparse coefficients and minimize the energy over the dictionary. Basically you can do simple gradient descent: take gradient steps in the coefficients and in the dictionary until they converge. And starting from a completely random dictionary, they learned a dictionary of oriented band-pass filters that are Gabor-like. They look a lot like Gabor wavelets, and this was learned just from the statistics of the data.
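A minimal sketch of that two-step alternating scheme (sparse coding with the dictionary fixed, then a gradient step on the dictionary with the coefficients fixed), again in Python with NumPy. The training matrix Y, the step sizes, and the unit-norm renormalization of the atoms are illustrative assumptions, not the exact procedure from the original paper.

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def sparse_code(D, Y, lam, n_iter=100):
        # Inference step: ISTA applied to every column of Y (each column is one patch).
        A = np.zeros((D.shape[1], Y.shape[1]))
        L = np.linalg.norm(D, 2) ** 2
        for _ in range(n_iter):
            A = soft_threshold(A - D.T @ (D @ A - Y) / L, lam / L)
        return A

    def learn_dictionary(Y, n_atoms, lam=0.1, n_epochs=50, eta=0.1):
        """Alternate between sparse coding and a gradient step on the dictionary.

        Y : (patch_dim, n_patches) matrix of vectorized training patches
        """
        rng = np.random.default_rng(0)
        D = rng.standard_normal((Y.shape[0], n_atoms))
        D /= np.linalg.norm(D, axis=0)                 # start from random unit-norm atoms
        for _ in range(n_epochs):
            A = sparse_code(D, Y, lam)                 # step 1: fix D, infer coefficients
            D += eta * (Y - D @ A) @ A.T / Y.shape[1]  # step 2: gradient step on D
            D /= np.linalg.norm(D, axis=0) + 1e-12     # renormalize atoms
        return D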
This was actually an important result in computational neuroscience, because it showed that sparse coding could also be a principle used in encoding information in the brain: the emergent dictionary looked a lot like the receptive fields of neurons in the primary visual cortex. Not just the fact that they're band-pass and Gabor-like, but also their distribution in orientation and in frequency. Okay, so that's the background. Do you have any questions so far? Because from now on I'm just going to build upon it. Okay.

So then we looked into how to learn these dictionaries for depth, and we had a challenge: how can we learn in the presence of this spatially varying noise? We have to learn from noisy data, and the noise is no longer the stationary Gaussian noise that was assumed previously in dictionary learning. So what we propose is to add another term. We had the linear model, dictionary times coefficients, plus the approximation error, which we can take to be stationary. But we have another type of error here which is due to the acquisition device, and we modeled it as a multivariate Gaussian. We wanted to be general here and say, well, let's assume that each pixel has Gaussian noise, but that the variance of that noise can vary across the image. Some pixels can be more noisy, some less noisy. And we're going to infer those statistics while also inferring the sparse coefficients. So we assumed multivariate Gaussian noise, this is its covariance matrix, and we assume the noise is not correlated between pixels. That's one of the limitations of the model.

So how does that change our graphical model? We still have the sparse coefficients, we have the dictionary, and then we have a different noise variable added to each pixel in the depth map. This is just to highlight the differences with respect to the basic dictionary learning, sparse coding method: these are the new variables, the new noise that we have added to be able to deal with the different noise in depth maps.

Okay, how does that change our learning objective? We can still use maximum likelihood. Now when we break down the likelihood, we have three terms. This term is the Gaussian distribution of our approximation noise, so this is fine. Then we have the prior on our coefficients, which we can take to be Laplacian. And a prior on our covariance matrix, the prior on the noise. For the prior on the noise, we decided to use a noninformative Jeffreys prior, so as not to impose any structure on the noise and to stay as general as we can. So the difference with respect to the regular objective in dictionary learning is that we have another set of variables that we need to infer, and another distribution and set of parameters here.

And then the energy function is also slightly modified. Here we have the sparsity term, which remains the same, but in this term we have the log of the variance, which comes from the Jeffreys prior, and then the error at each pixel is divided by two times the variance of the noise at that pixel. These variances are also unknowns that we have to find. So it's a pretty simple formulation, and it can also be modified if you know some specifics of your noise variance -- if you know how it's related to the signal, you can put that in here.

>>: It seems like there are a couple of common models that I've seen for error variance in depth images. One is that it varies with the square of the data.
And the other is more along the lines of a Laplacian model, where the variance is related to the brightness.

>> Ivana Tošic: Yes.

>>: But this doesn't seem to be motivated really by either of those.

>> Ivana Tošic: No. We wanted to keep it as general as we can. You can adapt it to take more specific relations into account, but we just wanted to keep it as broad as possible, so it's not related only to depth and you can use it in other scenarios, too. But if you have that kind of deterministic relation, you can easily put it in, and it's going to simplify finding your variance: if it's related to the signal, you don't have to infer it anymore, you can just relate it to the signal.

>>: So if it's proportional to, say, F, I guess -- that is the depth --

>> Ivana Tošic: Uh-huh. You can still use the same objective, it's just that it might then become harder to optimize.

>>: [inaudible].

>> Ivana Tošic: Yes, uh-huh. And you can also change the prior. It turns out we also wanted to infer this so we know the reliability, how reliable each depth estimate is. Basically we wanted to see, for each depth measurement, how well it fits the signal model. So it's not necessarily always linked to the acquisition device; it could be occlusions or some other noise.

>>: The lambda -- some Lagrange multiplier.

>> Ivana Tošic: Yes.

>>: Can it be spatially varying as well? Let's say I have my depth map and I know exactly where the bad measurements are. Can that be taken into account by making lambda spatially varying as well, and does it make solving the convex problem harder?

>> Ivana Tošic: So what you can do is use solvers that actually find your lambda while finding your solution, for each patch. So it finds the best lambda for each patch; you can already use a solver that deals with that problem. Or if you know how to estimate lambda yourself, then you can put it there and --

>>: Spatially varying does not --

>> Ivana Tošic: Well, you cannot vary it here for each pixel. You can vary it for each patch in the image.

>>: For the prior [indiscernible].

>> Ivana Tošic: Yes, it might be possible to do it per pixel. I think there's some work on that, but I can't really remember right now. Oh, you can vary it per coefficient. Yes, in the work I've seen -- I can point you to the reference -- you can have a different lambda for different coefficients. They call it something like a scale model. That's kind of different from what you were saying.

>>: It seems like your model assumes that the noise [indiscernible] in the pixel [indiscernible], so if it's the same pixel, the same [indiscernible].

>> Ivana Tošic: So right now I'm just doing static depth maps. It's not video depth maps.

>>: [indiscernible].

>> Ivana Tošic: Just static, still depth maps. If you do it frame by frame, you can just find a different variance for each of the pixels in different frames, or you can constrain it to change smoothly from one frame to another. But the point is that it's different per pixel.

So, now how do we solve this, how do we learn the dictionary? We have two parts. Again, we have inference first, so we need to find all the coefficients and we need to find the noise variance for each pixel, and we solve that in an iterative manner. We first initialize our sigmas to be equal and to a very large value.
So we start from the hypothesis that all the pixels are unreliable, and then we infer the coefficients, then we fix the coefficients and solve for the variances, which has a closed-form solution. Usually it takes just a couple of iterations between these two to find the solution for the As and the sigmas. Once we have that, we solve for the dictionary, and this is just to show that it's averaged over all the examples in our image patch dataset. So it's a similar principle: we have inference and learning, and we iterate. It's just that here, in the inference step, we also find the variance of the noise per pixel. Okay.

>>: So sigma I squared in that equation -- it goes to zero if the top, minus [indiscernible], especially?

>> Ivana Tošic: Yes.

>>: [indiscernible].

>> Ivana Tošic: No, because we have the sigma-zero term, which is added to it.

>>: Sigma I -- so it's zero.

>> Ivana Tošic: Yes, good point. So basically, just for ease of presentation, we define this sigma-I hat to be sigma-I squared plus sigma-zero squared, where sigma zero is the constant approximation error term. That's why it's here. And we don't let sigma-I squared go negative: if it would go negative, it becomes zero.

>>: If FI minus [indiscernible], then you are ignoring basically the difference in depth? [indiscernible].

>> Ivana Tošic: Sorry? Can you repeat that?

>>: When the difference between FI and FI hat is big, basically the denominator becomes --

>> Ivana Tošic: Large.

>>: The same as the top.

>> Ivana Tošic: Yes.

>>: So you're just ignoring that?

>> Ivana Tošic: Exactly. So if the measurement doesn't really fit the model, if it's just an isolated point that doesn't fit, then it reduces that term and says, well, this is not a reliable measurement, because it doesn't fit our model well.

>>: Assume you only have -- so here -- I thought [indiscernible] is also a variable.

>> Ivana Tošic: Yes.

>>: That's excellent.

>> Ivana Tošic: That's the learning, yes. So here we fix the dictionary -- we start from random -- and then here we learn it; we optimize over the dictionary.

>>: So there's a lot of variables here: F, and A [indiscernible], and also [indiscernible], to determine the model.

>> Ivana Tošic: Yes, it's hard.

>>: [indiscernible].

>> Ivana Tošic: Yes, exactly.

>>: If it's converging, is it because of the convexity?

>> Ivana Tošic: Each step is convex, because the objective in each step is convex, but the whole objective is not necessarily convex --

>>: [indiscernible].

>> Ivana Tošic: I have never gotten a nonconvergent dictionary, unless you have a really small number of training samples.

>>: Are you learning one dictionary per image?

>> Ivana Tošic: No, one dictionary for a database of images.

>>: Okay, because I thought you were learning this for one unit.

>> Ivana Tošic: No.

>>: Because you assume that -- as before, the noise model, per pixel, [indiscernible].

>> Ivana Tošic: Oh, that's what I meant. No, no, it changes image to image.

>>: It seems different than [indiscernible] -- I even want to say per pixel. So that means different sigmas for different things.

>> Ivana Tošic: Yeah, that's why we infer sigma for each pixel. Per pixel, per image. Yes.

>>: Per image.

>> Ivana Tošic: Per pixel, per image. And here, you see, we average for the dictionary.
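A minimal sketch of this inference step, assuming the energy just described: the per-pixel squared error divided by two times sigma-hat-i squared, plus the log of sigma-hat-i squared from the Jeffreys prior, plus lambda times the L1 norm of the coefficients, with sigma-hat-i squared equal to sigma-i squared plus sigma-zero squared. The closed-form variance update below (residual squared minus sigma-zero squared, clipped at zero) is an assumption consistent with the discussion above, not necessarily the exact formula from the paper.

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def nonstationary_inference(D, f, lam=0.1, sigma0=0.05, n_outer=5, n_ista=100):
        """Jointly infer sparse coefficients a and a per-pixel noise variance.

        D      : (n_pixels, n_atoms) learned depth dictionary
        f      : (n_pixels,) noisy depth patch
        sigma0 : standard deviation of the stationary approximation error
        """
        n_pix, n_atoms = D.shape
        sigma2 = np.ones(n_pix)                # equal initial variances at every pixel
        a = np.zeros(n_atoms)
        for _ in range(n_outer):
            w = 1.0 / (sigma2 + sigma0 ** 2)   # per-pixel weights 1 / sigma_hat_i^2
            # Weighted ISTA: minimize 0.5 * sum_i w_i (f_i - (Da)_i)^2 + lam * ||a||_1
            L = np.linalg.norm(D * np.sqrt(w)[:, None], 2) ** 2
            for _ in range(n_ista):
                grad = D.T @ (w * (D @ a - f))
                a = soft_threshold(a - grad / L, lam / L)
            residual = f - D @ a
            # Assumed closed-form update: large residuals get large variance (down-weighted),
            # residuals below sigma0 are absorbed by the approximation error (clipped at zero).
            sigma2 = np.maximum(residual ** 2 - sigma0 ** 2, 0.0)
        return a, D @ a, sigma2                # coefficients, denoised patch, per-pixel variance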
So this is done for each patch, and because the patches are independent, you can just do sparse coding for each patch. But then for the learning step, we average over all the patches, over all images.

>>: The position of the patches is just regular --

>> Ivana Tošic: Randomly chosen, yes. You take a lot of images and you randomly choose patches. So I can show you: this is from the Middlebury benchmark. We took all the depth maps, I think we had something like 30 depth maps, and we extracted patches of 16 by 16 pixels. These are some examples. Most of the irregularities here are occlusions, so they're missing pixels, and we don't mark them as missing pixels; we just let our algorithm figure out whether they're unreliable data.

So we have learned the dictionary, and I will show it here in grayscale because I think it's easier to see. What we saw is that we have a lot of oriented edges. They're a bit different than for images; they're sharper edges. And we got a couple of slants, like here. We think these are used for the ground, because usually in these depth maps the ground is tilted, since the sensor is looking at it that way. And these can be multiplied by positive or negative coefficients, so it doesn't really matter if it's white in the front or black in the back. But the biggest conclusion is that it is different from the dictionaries people use for images.

So we tried to see how much better these dictionaries do, and we did that on the denoising task. This is just a mathematical formulation of the denoising task. In denoising we now have our learned dictionary, our coefficients, and our noisy image here, and we want to reconstruct the denoised image F hat. What we need to do is infer both the coefficients and the covariance matrix of the noise for that patch. Here the dictionary is fixed, so it's just our inference step again.

So I'll show you an example. First, this is on real depth maps but with synthetic noise, just to have the ground truth and be able to compare against it. This is an original depth map, obtained with a laser range scanner; it's from the Yang and Purves database. We added nonstationary Gaussian noise: we corrupted one percent of the pixels in the depth maps, and the variance was randomly chosen for each of those pixels. We did total variation denoising and got results that smoothed out some parts but still left some of the noise. Curvelet thresholding also gave some really bad results. Non-local means gave nicely smoothed results but lost a lot of the details; this is a forest of trees in the back. And this is the result that we obtained with our nonstationary sparse coding. And --

>>: Just a question. So when -- you're talking like even with these, of course [indiscernible].

>> Ivana Tošic: Uh-huh.

>>: So do you try to [indiscernible].

>> Ivana Tošic: Yes. I tuned the total variation regularizer, the lambda. I chose the best one for non-local means and for curvelets, yes.

>>: So I'm trying to understand what exactly is going on here. I feel that essentially you're trying to learn the dictionary and the reliability of the pixels. In the estimation literature, there's a lot of work on robust estimation for images, where you have techniques to make the estimation robust.
And from the way you threshold, it seems what you're doing is basically just taking away the outliers in the processing.

>> Ivana Tošic: Uh-huh.

>> Phil Chou: The noisy pixels.

>> Ivana Tošic: Well, we also estimate something about those outliers. You're not just thresholding them and saying these are useless; I have a reliability measure.

>>: You were doing that because when it's greater than sigma zero, you basically are throwing it away, because your second term --

>> Ivana Tošic: No, it's smaller. When it's smaller than sigma.

>>: FI minus FI hat, minus sigma 0. So you had a sigma 0 ... so that data has been thrown away.

>> Ivana Tošic: If it's below sigma 0? Yes, if it's subsumed in the approximation error, we don't care about it.

>>: Below sigma 0, then you're using the data. But if it's above sigma 0, then you're just throwing it away.

>> Ivana Tošic: No, it's just appropriately weighted.

>>: The weighting --

>> Ivana Tošic: It's thresholded when it's --

>>: [indiscernible].

>> Ivana Tošic: It's thresholding.

>>: So if it's in the top condition, then it's zero, which leaves sigma 0. So you're weighting every pixel with the same weight.

>> Ivana Tošic: Plus an additional one.

>>: And in the second term, because the denominator is sigma I squared plus sigma 0 squared, you end up with FI minus FI hat squared on the bottom, and the top is FI minus FI hat squared, so you get a one [indiscernible].

>> Ivana Tošic: Oh, I see what you mean. No, because in the first step here you are inferring A, and these sigmas are from the previous iteration. So you're never dividing by the same residual; you solve for the variance afterwards and then go back.

>>: Well, in robust estimation you can say, I threshold, I get rid of 10 percent, estimate [indiscernible], I [indiscernible] for the outliers and search again. Seems very similar.

>> Ivana Tošic: I will show you on an example. So we get per-pixel estimates of the variance, and I'll show you how they look. It's not thresholding; there's some variability in them.

This is just showing the results, averaged over different images in the same database, and this is the nonstationary sparse coding result, which performs much better than the other ones. And here we corrupt it with different numbers of bad pixels. And here maybe you can see what we get from the data. This is the laser range data, the one that I showed before. This is the original; when we put it into the nonstationary sparse coding we get an estimate, the reconstruction, and we also get the inferred variance per pixel. You can see that it has different values for different pixels, and if you wanted to do further processing, you can use this data to put a reliability measure on each of the pixels. And this is what you get from non-local means: you also remove some of the noise, but you smooth out some small details, like here and here. And this noise, which is basically correlated noise, we were not able to denoise, because it doesn't fall under the assumption of the model that the noise is not correlated between pixels. So this is a limitation of the algorithm. But I think the best property is that it gives us both the denoised depth map and the inferred variance.

>>: It seems to be mostly on the edges.

>> Ivana Tošic: Uh-huh, yeah. So we can get some of the correlated noise.
So these are correlated along the edges, so they're not just single-pixel outliers, and when there's a lot of correlation there, it has problems. We have done this also on time-of-flight data. This is the original depth and this is the denoised result -- oops, it doesn't zoom in. Basically we removed all these pixels and kept some of the fine details of the depth map. We can also do inpainting, just as an example, on Kinect maps. Again we can inpaint well around the edges, here. But when we have big pieces of missing information, bigger than our patch size, we cannot fill it in --

>>: What's the patch size?

>> Ivana Tošic: 16 by 16. You can learn it larger, but then you need bigger machines.

>>: Pixel [indiscernible].

>> Ivana Tošic: We learned it just complete. You can also make it more overcomplete and it would be better. So it was 256 elements.

>>: Has anything been done hierarchically, where you start with smaller patch sizes where you get it right, and for bigger ones you just target the area where it's bad?

>> Ivana Tošic: I think yes; I think some people have done it as a kind of multi-resolution dictionary learning. Yes, there's work on that. That's a good point.

Okay. So I will now go through the second part, and it will be a bit quicker, but please stop me if you need more information. Here we decided to go a step further and see if we can learn multi-modal representations of intensity plus depth. It's pretty easy to see that there is a lot of correlation between the intensity and depth from time-of-flight cameras, or between intensity and disparity from structured light. Here we need multi-modal representations: basically we need our dictionary atoms to have two parts, an intensity part and a depth part. And let me quickly tell you why some very simple models that you might think of will not work. Two simple examples -- this is just synthetic, an illustration. One: if you have a 3-D edge observed with a hybrid sensor, you have an edge in the image and a discontinuity in depth. Or another example: if you have a slanted surface with some texture on it, let's say a sine grating, then the image will be a chirp and the depth will just be a gradient.

And you can say, well, let me just stack intensity on top of depth: each atom has its intensity part and its depth part, one vector on top of the other, multiplied by the same coefficient. In the first example you can do that, but what will happen is that because you have the same coefficient for both intensity and depth, you will have to put the variability in contrast within the atoms, which leads to a combinatorial explosion of the dictionary. So we cannot use this stacked model. What people then have proposed is a common dictionary model, often referred to as joint sparsity: you have the same dictionary and two different coefficients. For the first example that would be great; we just put in, say, a Haar wavelet or step function and multiply it with different coefficients, and that accounts for the contrast difference. However, in the second example we won't be able to use this model, because we cannot have one atom that represents both a slant and a chirp. So the conclusion is that we need a model that has different atoms and different coefficients, but still models the correlation between the two modalities.

>>: They did encoding like -- I can see [indiscernible] in the depth issues.
So [indiscernible] [indiscernible].

>> Ivana Tošic: You can do that, but then you have a problem of phase wrapping, right?

>>: Yes.

>> Ivana Tošic: It has been -- I can point you to where it's been done for videos, not for depth but for optical-flow-type problems. What we proposed is to have a sparse model for the image, a sparse model for the depth, and then coupling variables that multiply with the magnitudes to give the coefficients. These are ideally binary, but we relax them to be between 0 and 1, and they just say: if it's 0, we're turning off both the intensity and the depth atom; if it's 1, we're turning them on. In the end we want sparsity on the coupling variables. You can put that into a signal model, so you'll have atoms for intensity and atoms for depth; these are the magnitudes of the coefficients for intensity and depth, and the coupling variables. And you can have different noise for intensity and depth.

So then it becomes a question of how we do inference. If you write it as an optimization problem, we want to minimize the L1 norm of our coupling variables -- since these are positive, this is just a sum. And then we have a set of constraints. This is the quadratic constraint on the representation of the intensity, and here I'm just using the stationary type of Gaussian noise; going to the nonstationary case is a further step. The same for depth. And let's just say that our magnitudes are bounded by certain values; they can be big, whatever. This problem is nonlinear, so we can't optimize it in a straightforward way. But just by doing a simple change of variables, where A is the magnitude times X, and similarly for B, we can get a convex problem. It's a second-order cone program, and we can use existing solvers to find the coefficients. So this optimization problem gives us the values for X and for A and B, and equivalently the magnitudes. We named it joint basis pursuit, because it jointly finds the intensity and depth coefficients.

>>: X I is independent?

>> Ivana Tošic: Yes. For each pair of intensity and depth patches we get different X variables; they're estimated per patch. So X and the magnitudes are hidden variables for each image, and only the dictionaries are estimated for the whole database.

So then, again in a similar way, we have a two-step iterative process to learn the dictionary. We initialize it randomly, then run joint basis pursuit to find the sparse coefficient vectors and X as well, and then use that in the learning step to learn the dictionaries, and iterate between these two steps. Very similar. We have learned it again on the Middlebury benchmark database; now we used both intensity and depth, and we learned a twice overcomplete dictionary. Here, in each of these little panels, I show an atom with its image part and its disparity, or depth, part. Just by looking at it, we can see that there are coinciding edges, which we would expect, because an edge in 3-D induces edges in both intensity and depth, and there is texture. We also saw that sometimes we have a slant in depth and an oriented texture in the image. These types of atoms are a bit more rare than those ones, but they still exist.

And we can also make a scatter plot, where on this axis I evaluated the gradient angle of the depth atoms. So basically this is the gradient angle, the normal: if you see the depth atom as a slant, it's like the normal on the surface.
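Stepping back to the joint basis pursuit inference step described a moment ago, here is a minimal sketch of one way to write that second-order cone program, using the cvxpy modeling library in Python. The bound M, the error tolerances, and the exact constraint set (coupling variables relaxed to [0, 1], coefficient magnitudes bounded by M times the coupling variable) are one reading of the formulation, not the authors' code.

    import numpy as np
    import cvxpy as cp

    def joint_basis_pursuit(D_img, D_dep, y_img, y_dep, eps_img, eps_dep, M=10.0):
        """Sketch of joint sparse inference over an intensity/depth patch pair.

        D_img, D_dep : dictionaries for the intensity and depth modalities
        y_img, y_dep : vectorized intensity and depth patches
        eps_*        : allowed reconstruction error per modality
        """
        K = D_img.shape[1]
        a = cp.Variable(K)                 # intensity coefficients
        b = cp.Variable(K)                 # depth coefficients
        x = cp.Variable(K)                 # coupling variables, relaxed to [0, 1]
        constraints = [
            cp.norm(y_img - D_img @ a, 2) <= eps_img,   # intensity fit
            cp.norm(y_dep - D_dep @ b, 2) <= eps_dep,   # depth fit
            cp.abs(a) <= M * x,            # a_k can be nonzero only if x_k > 0
            cp.abs(b) <= M * x,
            x >= 0, x <= 1,
        ]
        prob = cp.Problem(cp.Minimize(cp.sum(x)), constraints)  # sparsity on the couplings
        prob.solve()
        return a.value, b.value, x.value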
And on the other axis is the texture part, where I fit each atom with a Gabor and found its orientation. So if you do a scatter plot -- here each point corresponds to an atom in the dictionary -- you can see there's a big diagonal structure; all the points are around the diagonal. Basically there's a 90-degree relationship between the depth gradient angle and the Gabor orientation. And just for comparison -- I didn't show it here -- we did a similar experiment with a group lasso algorithm and it gave us a completely uniform distribution; it didn't at all give the same tendency of correlation between these two.

So this was an interesting result, because a lot of people have looked into the statistics of intensity and depth, and they found that closer objects tend to have higher luminance, higher intensity values. But this was, to the best of my knowledge, the first time anyone had shown that there exists some type of correlation between the gradient angle of the depth and the orientation of the texture.

And of course we have done some inpainting experiments, just to show that if we have dictionaries learned on both intensity and depth, we can get slightly better inpainting results than with just the depth dictionary. Here we removed 90 percent of the pixels from the depth map. And here I removed 96 percent, just to see how far we can go. And just from four percent of the pixels, plus the intensity, we can reconstruct this depth map. It's still blurry and it's not perfect, but it's really from four percent of the data.

>>: So that's really missing pixels for the depth value and the image value?

>> Ivana Tošic: No, just the depth. Basically we are relying on intensity to give us the -- it would be a good experiment to try removing both intensity and depth, either at the same locations or at different ones.

>>: The correlation between the moving [inaudible].

>> Ivana Tošic: Yes. It's a prior. So if it sees an oriented texture in the intensity, it will try to fit a slant in depth. Or if it finds an edge, it will try to fit an edge in the depth.

>>: I'm not understanding why that is [indiscernible] -- if it weren't missing anything, what would it be learning?

>> Ivana Tošic: I think it's blurry because these models are essentially linear. They're linearly combining these atoms, so if you're bad at estimating one, you're just removing some of the -- it's not like a layered model of a scene per depth, where you would represent that nicely.

>>: Is lambda turned up so high that it's not sparse?

>> Ivana Tošic: So here, because we have used this solver, we don't even put the lambda in; it finds the best lambda to solve it. For the group lasso comparison, I just ran it for a bunch of lambdas and took the best one. If you turn up lambda, it will be smoother for the group lasso. But here you don't have that option to tune it.

>>: [indiscernible] because you have [indiscernible].

>> Ivana Tošic: Yes, that's true. But it's still a linear model. That's true, it's overlapping.

>>: So here, in your prior, you're still 16 by 16?

>> Ivana Tošic: These are 12 by 12. We had to reduce them because now the dimensionality of the problem increases.

>>: Is this the antagonistic area?

>> Ivana Tošic: Yes.

>>: How many elements?

>> Ivana Tošic: 288.

>>: What's the training set? How many images do you need to train?

>> Ivana Tošic: This was all the Middlebury data that I could download, so I think it's around 30 intensity-depth pairs.
But within each image you take random patches, which are 12 by 12, so you have a lot of data. Yeah, it's just the kind of dataset that I found I can do this type of training on. I'm not aware of any other database that I could use -- maybe, just maybe, the MPEG dataset. Yes.

>>: Your next slide, where you found a relationship between the angles in depth and intensity: do you think that same relationship would still hold if you looked at raw patches from the depth and intensity, rather than the dictionary?

>> Ivana Tošic: That's a good question. I don't know. That's something worth checking. I suppose people would have already looked at something like that, but --

>>: I don't know.

>> Ivana Tošic: I know we tried this. I can show you, we have one slide, I think it's this one. This is the plot with group lasso, and it's completely different: if you do dictionary learning with group lasso, you don't get that. Maybe I can also show you this. So I said that some people have already looked into the relation between luminance and how close objects are. I looked at the same thing for my atoms, and it turns out that in the atoms we also get that closer objects appear brighter. So the dominant case over here is that the bright surface is closer, and the red histogram is when the darker surface is closer. So we got the same tendency that people have already observed.

>>: Just to be clear, if you didn't have any depth pixels at all in a 12-by-12 region, you couldn't inpaint that; there's no propagation from neighboring cells.

>> Ivana Tošic: You would have to learn larger dictionaries. If you don't have any data, you don't have a signal.

>>: Larger as in bigger patches, or larger as in a bigger number of patches?

>> Ivana Tošic: Bigger patches. So, if you don't have any data F, your observations are empty sets; you don't have anything.

>>: Do you get any blocking artifacts?

>> Ivana Tošic: What we usually do is average: we shift the patches and then average. If you don't do that, you can have blocking artifacts, yes. I think on this one I moved every two pixels and then averaged, not every single pixel.

So, I always like to conclude with this representation hierarchy of David Marr, just to show you where I think we are with sparse representations right now. He described representations going from pixel representations, to edges and blobs in images, to two-and-a-half-D representations where you take into account local surface orientation and discontinuities, et cetera. These are all viewer-centered representations, and they are mostly what we have worked on so far; what I presented today is mostly in this block. Going to 3-D representations that are object-centered is still future work for sparse representations. So I think that's it. Thank you.

[applause]

>> Ivana Tošic: So please, if you have any further questions.

>>: In this work, let's say I have prior knowledge of my data -- for example, most of my patches are multi-planar, or [indiscernible] most of my stuff will be on a movie set, I know certain surfaces. Would adding extra terms to the problem that enforce that somehow help, or is it enough if my training set just [indiscernible]?

>> Ivana Tošic: So this is completely nonparametric learning; we're learning pixel by pixel for each image.
If we have a 12 by 12 image patch, we have 144 parameters per patch. If you know you have planar surfaces, you can say, I'm going to learn a surface normal that will completely define my surface, and that has three parameters, right? So you can reduce the number of parameters. I don't know what your objective function would look like, whether it would be convex or not, but you're definitely reducing the number of parameters.

>>: This prior should be scale-invariant, right? So it's equivalent as you get closer to the subject or further away --

>> Ivana Tošic: There are people who have tried to do a multi-scale version, with larger or smaller patches. I have not seen much difference in the way the dictionaries look.

>>: The 12 by 12 could be scaled to any size.

>> Ivana Tošic: Yes.

>>: With the same priors.

>> Ivana Tošic: Yes.

>>: So if you had a 12 by 12 patch that had no depth in it, you could apply your entire dictionary but scale it up, so that you're basically reusing the same [indiscernible] but on a larger scale.

>> Ivana Tošic: Well, you can. You're basically filtering out the higher frequencies from your data if you're just upscaling; you won't be able to reconstruct higher-frequency information from that.

>>: But you would still be able to handle patches that were completely devoid of depth. So for a region with no depth, you could fill that in using the same dictionary.

>> Ivana Tošic: Yes, that's a good point.

>>: It's basically tantamount to shrinking the original image, running your algorithm, and then scaling it back. Then you could inpaint larger regions.

>> Ivana Tošic: Yes, yes, that's it. A good thing to try.

>> Phil Chou: All right. Thank you very much.

[applause]