>> Jin Li: It's our great pleasure to have Oscar Au from HKUST come to Microsoft Research and give us a talk on video denoising. Oscar got his Ph.D. from Princeton University in 1991 and then became a faculty member at HKUST in 1992, where he is currently the director of the Multimedia Technology Research Center and advisor for the computer engineering program. Oscar is a veteran in the MPEG and H.26x series video standards, and I have read many, many good papers from him and his group. And he has been serving in a number of distinguished positions in the society. He is associate editor for IEEE Transactions on Circuits and Systems Part I and IEEE Transactions on Circuits and Systems for Video Technology. He's the chair of the technical committee on multimedia systems and applications and has held many distinguished positions. Without further ado, let's hear Oscar's talk on video denoising. >> Oscar Au: Okay, thanks, Jin, and it's great for me to be here. My honor to be here. And I just want to show this picture, a warm welcome, a warm hello from Hong Kong, okay, so we're on the other side of the globe, okay. Right now Hong Kong is still 20-some degrees. See, we use Celsius, 20-some degrees Celsius, I think it's probably 70-some degrees Fahrenheit. Okay. So compared with here, we're quite a bit warmer. And the weather is nice, okay. And we are fortunate enough -- this is the university. The picture in the center is actually the university, and it is quite pretty, okay, and these are some attractions, tourist attractions there, okay. So we have Disneyland and Ocean Park and stuff like that. So it's kind of interesting, okay. If you haven't been to Hong Kong before, I would love to invite you to come and visit us, too, at the university, okay. So our university is actually quite pretty, so that's kind of nice. Okay. All right. I'm going to go and start the talk, okay. All right. So my name is Oscar and I'm from Hong Kong UST, okay. I am with the department of electronic and computer engineering, okay. What I'm going to show you today, video denoising, is basically the work done by my student, Liwei Guo. He has just finished his Ph.D. and his whole thesis is actually on video denoising and video modelling kind of thing. Okay. All right. So this is what I'm going to do today, okay. Talk a little bit about motivation, talk about contribution. Basically we'll be talking about a multihypothesis motion compensated filter, okay. We will talk about noise-robust motion estimation for temporal denoising, okay, and then integration of temporal denoising into hybrid video codecs, and then I will come to some conclusion and perhaps some demo, okay. But actually, you know what, I think I'm going to go to the demo first, because this will help motivate people to understand what we're talking about, okay. What we're talking about here is basically like, okay, if you look at this video, okay, this is a regular test sequence and we've added noise to it, and when you look at this video, okay, all right, it has a lot of noise, yeah, okay, unpleasant. People don't like it. Not only is it unpleasant, a big problem about noisy video is that it costs you a lot of bits to compress. It costs you a lot of bits to transmit as well. And this is all undesirable, okay. Not only that, imagine when you do compression, for example, MPEG, H.264 kind of thing, basically what you need to do is something like motion estimation, DCT, quantization and stuff like that. Okay.
Imagine doing motion estimation with this kind of video, okay, with all the noise flashing and everything, you're going to encounter a lot of difficulties, and very often the motion vectors that you get out of this kind of video are going to be pretty bad, all right, and cost you quite a lot of bits to encode. As a result, what we are talking about is: can we remove this noise via video denoising, okay? So this is what we have performed, okay. And it looks quite okay, okay. It is not perfect, okay. You can see some flashing here, okay. But on the other hand, okay, what makes this work perhaps a little bit special compared to other existing denoising work is that -- I think we all learn in university, when we are in school, that to do denoising, a very simple way is to just run a low pass filter. I mean it's easy, right? Because noise by nature is high frequency in nature, okay. It should be uncorrelated and stuff like that. Using a low pass filter you'll be able to remove quite some noise, okay. The problem with a low pass filter is that very often the edges will become blurred, okay, and that is undesirable. Okay. We're talking about the difference between a cheap VCD versus a high quality DVD, right, or a regular DVD versus HD, right, something blurry and something sharp. That makes a big difference, okay. People expect something sharp, especially when people pay for it. Okay. And so as a result, it is important that when we do denoising, we are maintaining the edge integrity at the same time, we are preserving the edge sharpness at the same time. And this is our goal in our work: we try to do denoising while preserving the edge sharpness, okay. So this is our goal, okay. You notice earlier on, if I can show this again, okay, the first frame looked pretty bad, okay. Now, what happened is, because our method turns out to be a temporal-domain method, the first frame is pretty bad actually. Okay. You basically have nothing in the past that helps you. So the first frame is pretty bad. But if you go to the next frame, it gets better, next frame it gets better, it gets better. Okay. And then it basically becomes quite okay, all right. So going back to the beginning, just play, okay. It will be okay, okay. And notice that the edges are quite sharp. Notice that, okay. Edges are quite sharp. And this is how it goes. We want the edges to be sharp. Okay? So that is what we are trying to do. Okay? Yes? >>: [inaudible] showed us [inaudible]. >> Oscar Au: Good question. Good question. Okay. All right. Now, what we're showing is a class of temporal filtering, right? Okay. And one prerequisite, we said we want to establish correspondence between the current frame and the previous frames, okay. Something here is also happening in the previous frame and the frame before that. Okay? So what his question is asking is: what if the object is moving so fast that when you find something here, you cannot find it in the past? That's the question. Okay. The answer is, unfortunately, in this case this cannot work well. Okay. Which means basically that you would need to combine this method, perhaps, with some spatial method, okay, because what I'm saying is this method is limited by -- well, this method has a basic assumption that we can establish correspondence between the current frame and the previous frame. Now, this method by itself is not going to die just like that, okay.
Basically what this method would do is that when it detects that the correspondence between the current frame and the previous frame is low, the similarity is low, then basically it's going to take a weighted average between the current frame and the previous frames. Okay? Well, when it detects that the previous frame doesn't look like the current frame, it's going to put a big weight on the current frame. Basically that would mean that you have very little denoising effect. It's not going to damage the video too badly, but it's going to have a locally noisy effect. >>: [inaudible]. >>: When you say correspondence, it reminds me that earlier you said [inaudible] that means you can't get a good correspondence, right, so is there some -- I'm sure you're going to come to that, but it seems to me a little bit of a chicken-and-egg problem. >> Oscar Au: Okay. Yes. Okay. This is exactly why in our simulation we don't use fast algorithms, because fast algorithms make assumptions. We are actually experts in doing fast motion estimation. We have some methods that went into the standard, okay. But then again, we realized that in those fast algorithms we make assumptions which may not be true in noisy video. And so as a result we resorted to full search, okay. When you do full search, actually it is not too bad, it is not too bad. When you actually have correspondence but you are just disturbed by noise, you basically would be able to find it. You basically would be able to find it. It's just slow, okay. But later on, I'm going to talk about a method to make it fast, okay. That's one thing. But unfortunately, if the motion is really going so fast, you have rotation, crazy motion, okay, the eyes blinking and stuff like that, then there's nothing you can do, okay, and the method would basically fall back into doing very little denoising. Okay. So locally you may have some region which has a bit more noise. This is our method. Yeah? >>: [inaudible] will be great and then go up again of the denoiser. So if the [inaudible] between the camera and the decoder you start losing the motion that comes from -- >> Oscar Au: Did you say [inaudible]? >>: [inaudible]. >> Oscar Au: When you drop frames, it doesn't mean you have bad motion or good motion, it's just that you drop frames. >>: It's like [inaudible]. >> Oscar Au: High motion. Oh, okay. When you're in a [inaudible] situation. Okay. >>: [inaudible] and you drop, say, between three and six, and you drop, you don't [inaudible] this information. >> Oscar Au: That's correct. Okay. >>: [inaudible] the quality of the denoiser to go down, then when the frames come in and wouldn't this be [inaudible] to the [inaudible] that's one question. >> Oscar Au: Yes. >>: And the second one is the inter[inaudible], what do you do with it? >> Oscar Au: Okay. All right. Sure. First question. >>: [inaudible] so I mean. >>: [inaudible]. What are you going to do, are you going to start after they introduce it? >> Oscar Au: Okay. All right. One problem at a time, one question at a time. What's the first question again? >>: Drop frame. >> Oscar Au: Drop frame. Yes. Okay. All right. For dropped frames, okay, you are in a fast motion situation, okay. Basically what that will mean is that we're going to fall back to that problem, basically when you cannot establish correspondence between the current frame and the previous frame. So this method will not work very well. And you need to rely on some spatial methods to do it. Okay. But there are some existing spatial methods that can work reasonably well actually.
But one beauty, at least for that particular situation you mention, is that when motion moves fast, your eye cannot see very well. So actually if you blur it a little bit, it doesn't matter. So actually I would tend to use spatial methods to blur it, because keeping the edges sharp is not the top priority when the motion is fast. Right? So I wouldn't worry about that too much. Okay. So basically this method really has to work with spatial methods. What I'm saying is, say I'm looking at you and all of a sudden I change to another scene, a scene change. When you do a scene change, this method would not work at all, because you don't have history, right? So you don't have anything in the past that can help you. So this method actually has to work with spatial methods. Yeah. So. >>: [inaudible]. >> Oscar Au: For [inaudible] content. Okay. Before I talk about that, one thing I do want to mention: we certainly looked at spatial methods, and one problem with spatial methods is that the complexity is very high. There are so many pixels in a frame, and for every single pixel, suppose you just do a three by three low pass filter, a stupid low pass filter -- it turns out that the complexity will be so high, okay, and if you use a larger and larger filter, okay, there will be so many operations you need to perform, and spatial methods are just so slow, okay. The beauty of this method is that we only do a few operations. Actually, you know what, spatial methods average nine numbers for three by three, or for five by five, twenty-five numbers. We're not doing a weighted average of 25 numbers, we're taking a weighted average of three numbers, four numbers, that's it. The current frame and the previous frame and maybe the previous, previous, and that's it. That's all. We're just taking a weighted average of three or four numbers. So complexity-wise the filtering is quite simple. When it comes to the filtering operation it's cheap, but the motion estimation by itself is very slow, okay. So that's why later on I'm going to talk about a fast method to do it. Okay? This method is basically a prefiltering technique, meaning that before you compress, we assume, independent of the codec, you are going to do some prefiltering, denoising, okay, and so as a result, if you want to do the motion estimation by itself, it's going to cost you a lot of computation. Okay. So but, yeah, before -- I should answer the second question. You talked about interlacing, okay. I don't understand the question. The interlacing versus denoising, what the [inaudible]. >>: [inaudible] so when you have a field and the field F plus 1 and there is that [inaudible] between the lines, would this affect the quality of your motion -- of your denoiser or not, before going into the decoder? >> Oscar Au: I don't think it's going to be a problem. I think it will work. Because the same problem happens to frame and field. I don't see field coding being more difficult than frame coding. It should be quite similar. >>: [inaudible] interlaced contents. But that question is [inaudible] sometimes [inaudible] what you call the [inaudible] where things are. >> Oscar Au: Right. >>: From T to T minus 1, T minus 2. >> Oscar Au: Yes. >>: But it depends on the method. Some of them, maybe this one would [inaudible]. I don't know. >>: [inaudible]. >> Oscar Au: [inaudible] in the second part of the talk, I will talk about some robust motion estimation. Perhaps that can address this issue. Basically we understand that in a noisy situation, motion estimation is just hard.
How can you make it robust, how do you make it trustworthy, and hopefully make it fast as well? So we do have a certain way to do it that can cover at least part of this problem. >>: I may be getting ahead, but in your work, do you assume that the noise is white [inaudible]? The reason I'm asking is that the camera has a nonlinear response, so the noise is actually not [inaudible] noise, so if you make that assumption then wherever [inaudible]. >> Oscar Au: Okay. Right. Sure. We use white Gaussian noise in our simulation, but I don't think we need to make the assumption that it is white Gaussian. It doesn't need to be. In the formulation we don't need it to be white Gaussian. There's no need for it to be white Gaussian. So the method should work for other situations, too. Okay. All right. So maybe it's time for us to go back to the PowerPoint and see what we do. Okay. All right. Full screen. Okay. All right. All right. So, motivation. Again, I think we've talked about a lot of the motivation already. Okay. Video quality. Okay. When you have noise, okay, noise is going to corrupt the video, making it look bad, okay, and actually, especially when you are in a high-definition situation, DVD, HD coding kind of thing, because of the very high bit rate, noise actually can survive compression. Noise can survive. So as a result, somebody needs to take care of the noise somewhere. So denoising is a way to handle that. Okay. And not only that, the presence of noise is going to cost bits to represent, and so this is going to make the file bigger. If you denoise it, there is quite a good chance that you can make the file smaller, okay. And so actually there are situations in which, even if the video is not very noisy, you still want to apply denoising, because you want to apply it as a means to reduce the file size. Okay. You do it maybe not to do denoising but rather to make the file small. In other words, you are intentionally using denoising to remove some image detail so that your file can be smaller. Do you know what I'm saying? So denoising is another way to achieve that smaller file size. So this is kind of interesting. Okay? And okay, because of the presence of noise, very often the motion estimation is going to have trouble stopping, okay, and so as a result the presence of noise actually will make your encoder slower -- more comparisons, more things to compute. And not only that, with the presence of noise, okay, your motion estimation, no matter what you do, is not going to do very well, okay, so the residue is going to contain a lot of energy. And so as a result you need to spend so much more complexity to encode all these nonzero DCT coefficients, okay. So actually in our experiment we realized, to our surprise, that denoising not only can make the file smaller, it can make your encoder faster, also -- simply by doing denoising. So it's kind of interesting. Okay. So this is quite a cool thing to apply to your codec, all right, so it's kind of fun. So, okay, in this example we have compressed the foreman sequence. Okay. We have added noise to foreman, okay, and then tried to compress it, okay. And number one, with denoising, the file can be quite a bit smaller, okay. Number two, the compression actually can be much faster, all right. Now, this is a little bit extreme. Of course, we are using all the favorable conditions to make sure we work well, okay, but this gives you the idea.
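For reference, the corruption model he mentioned a moment ago -- additive white Gaussian noise on each frame -- looks roughly like the following minimal Python sketch. The sigma value, seed, and 8-bit clipping range are illustrative assumptions, not values from the talk.

```python
import numpy as np

def add_awgn(frame, sigma=10.0, seed=0):
    """Corrupt an 8-bit grayscale frame with additive white Gaussian
    noise, the model used in the talk's simulations."""
    rng = np.random.default_rng(seed)
    noisy = frame.astype(np.float64) + rng.normal(0.0, sigma, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```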
With denoising you can make the file size smaller and also you can make your coding faster. So that's an advantage. Yes? >>: Do you feel like you're losing some quality when doing that? >> Oscar Au: Yes. Yes. What we are doing here is -- well, you need to understand. If you have an input video, if you treat the input video as holy, you cannot touch anything. If you treat it like that, then, yes, we are causing distortion. But remember, we are doing denoising, we are actually changing the input, okay. We are saying that, no, the input is not holy, the input is actually noisy, there is something undesirable inside. I'm going to make it better while at the same time making the file smaller, making everything faster. So this is a little bit different. But on the other hand, you are right, actually, we are indeed losing a little bit of the detail, and so as a result we may have some problem. For example, if you look at foreman, okay, if you view the foreman sequence, okay, there are quite a few dots on the wall, okay. You will notice, if you compare this with that region, that some of the dots are gone. Some of the dots are gone. Okay. Now, so this is a price you're going to pay, okay. Yes, you're going to lose a bit of the detail. But certainly for the big edges and so on, you are not going to lose that. But for small, fine details, you may lose that, okay. For example, if you look at me, maybe I've got all these dots because of my beard, whatever, okay; after you do denoising maybe you look nicer. Now, is that good or bad? I don't know. Sometimes, you know, if you look at Ugly Betty or something, sometimes the director wants a person to look ugly, and now you make them pretty -- that may not be so good. Sometimes there may be detail that the director wants to keep in the content, right? So in that case you want to keep that. If you fail to keep that, you are not doing the right thing, okay. And so as a result it depends. But yes, indeed, you are going to pay a price for that, yes, you're right. Okay? >>: [inaudible] coding time reduction here, is it tied to the [inaudible]. >> Oscar Au: Two things. Motion estimation and VLC. And VLC, too. >>: So that means that in this particular experiment, the criterion used in motion estimation is that you need to reach a certain upper bound for what the distortion can be between two blocks, right, between the reference frame and the current frame, and once you achieve that, you stop the motion estimation, so filtering helps you basically achieve that threshold? Is that the main gain of the [inaudible]. >> Oscar Au: No. This number may be a little bit misleading, because in your codec, when you have a codec, you may not be using full search. But here in this example we are using full search, brute-force full search, and so as a result you can gain so much in the coding time. But if you are already using fast search, you cannot gain so much. >>: [inaudible] once you achieve a [inaudible] threshold. >> Oscar Au: No, full search is full search, full search is everything, yes. Right. There is no [inaudible] criterion. We basically found out that early termination can get you into the wrong decision, because of the noise. So that's why we don't do early termination. Yeah. So it is [inaudible] combination strategies. Any other questions?
>>: Do you assume that there are no [inaudible] in your original [inaudible]? Because, you know, when you capture video typically it has [inaudible] nobody [inaudible]. >> Oscar Au: Oh, I see. >>: So if the original video, which is [inaudible], has compression artifacts, will your algorithm actually be [inaudible]? >> Oscar Au: Okay. We did do experiments on that. Okay. We assume the video is original with noise, okay. If you have coding artifacts, I think to a certain extent we should be able to suppress them a little bit, okay. But then the assumption is a little bit different, because the noise is no longer so independent, the noise is more systematic. And I am not sure, I'm not sure how good it will be. I think we need to do some experiments on that to try it. But I would guess that you would be able to suppress it somehow. Right. Right. So that's [inaudible] okay. So we'll continue. Okay. So, temporal denoising. Okay. The method we use is called multihypothesis motion compensated filtering. We utilize multiple pixels along the motion trajectory and basically do a linear combination of them to find the best estimate, and what we are doing is actually an optimal solution -- we are doing a linear minimum mean square error (LMMSE) estimate, okay, so you can trust that it should work pretty well. Okay. Because it's an optimal solution. Okay. So let's see. Okay. P1, P2 and so on. So what we are seeing here is, suppose this is the current frame, okay, there's a certain region, noisy, okay. If you manage to find something corresponding in the previous frame, and something corresponding in the frame before, what we are talking about is basically taking a weighted average between this one, this one, and this one, okay. Of course, in the weighted average situation, the coefficients are important, right? And so basically we use LMMSE to determine the coefficients, all right, to find the best coefficients for that particular situation. Okay? All right. So this page looks, you know, like a lot of equations. Basically what we are saying is that we want the denoised version to be equal to a weighted sum of observations plus a DC term, okay. Here W is the weight, O is the observation, okay. Now O contains FN and then P1, P2. Okay. FN is the current frame. P1 is from frame T minus 1. P2 is from frame T minus 2, and so on. So here in this example we are talking about M frames, okay. The current pixel and then the previous few pixels. Here we already assume that we have performed motion estimation; you find the corresponding matched location, okay? So what we are saying is that each of these guys you're going to weight by a certain number, W. So this is just a simple weighted average of the past few numbers, okay? Now, mind you, we usually assume that noise is kind of temporally stationary, meaning that the noise variance tends to be stationary over time, but here we do not assume that; we allow P1, P2 individually to have different noise variances, because, okay, while we need to take the noise variance into account, we also need to take into account the motion mismatch, just like what you said earlier on. Between the current frame and the previous frame, when the motion estimation cannot do so well, you have a mismatch, right? That mismatch is not due to the noise, that mismatch is actually due to the motion, and so on.
And so as a result, we are going to take that motion mismatch into account and put that into the variance of P1 or P2 and so on. All right? So there may be a situation in which for P1 you can find a good match, so the variance of this will be pretty small. Okay? But for P2, maybe motion estimation doesn't work very well; in that case P2 may have a large variance. Okay? So in this case, with this formulation, we would smartly allocate more weight to the guy with smaller variance, right? That's what LMMSE does, right, so this method works quite well, okay? So what we want to minimize is the expected squared error between the denoised version and the original value. Okay? We want to minimize this one, okay? And the solution is the standard, standard LMMSE solution, okay. Covariance matrix, inverse of that, and so on. Okay? This is the standard thing. I mean, if you open a textbook you'll find it. So I did not put a lot of detail here, because this is the standard solution. Okay? And it is not hard to find, okay, so this is the solution. And basically here U is the so-called prediction error vector. The one in the current frame, corresponding to FN, is basically the noisy observation minus the original, so actually this is just the noise, whereas U1, U2, and so on are the temporal prediction errors as well, okay? So, yes, so this is the best solution. Okay. And then the minimum mean square error of the temporal denoising is equal to this, okay. With this solution, you know, it is well known that you can estimate the mean square error, and that is equal to this expression, okay? In the special case that all these guys are independent, then basically this term would involve sigma 1 squared, sigma 2 squared, sigma 3 squared, and so on, something like that. Okay. Yes? >>: [inaudible] noise in the original? >> Oscar Au: We need to estimate -- actually, for this method we need to estimate the noise variance. We need to estimate the noise variance. >>: The variance of the noise. >> Oscar Au: Yes, we need the noise variance. Yes. And not only the noise variance; we also need to estimate the variance due to the motion mismatch. So basically what we do is we still do block-based motion estimation, okay, and with that you can calculate the mean square error between the current block and the predicted block. Basically we use that to estimate the mismatch variance. >>: The noise is the pixel noise, right, it's not [inaudible]. >> Oscar Au: Pixel. >>: Okay. So the noise can vary? >> Oscar Au: Well, when we do simulation, it is white noise, white Gaussian noise with fixed variance throughout the whole sequence. When we do the simulation, we do this, right. >>: That's why I asked that question, because in reality that's not true. >> Oscar Au: Okay. Yes? >>: Because of the camera's [inaudible], the amount of noise is a function of the brightness itself. >> Oscar Au: So does a brighter region have less noise or more noise? >>: I'm sorry? >> Oscar Au: A brighter region has more noise or less noise? >>: No, not necessarily. Actually it goes up and down, a little hump. But, yeah, the point is, if you make the assumption that the noise is constant, then this is not optimal. >> Oscar Au: You're right. You're right. Yes. We cannot assume the noise is constant, but in our noise parameter estimation we do assume the noise is constant when we estimate the [inaudible].
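Since the transcript only paraphrases the slide equations, here is one way to write down the estimator being described -- a sketch with assumed notation, not the exact formulas from the talk:

```latex
% Observations along the motion trajectory: the noisy current pixel F_N
% and the motion-compensated pixels P_1..P_M, each modeled as the true
% value F plus a prediction error u_k with variance \sigma_k^2.
\hat{F} = \mathbf{w}^{\top}\mathbf{o} + d,
\qquad
\mathbf{o} = [\,F_N,\, P_1,\, \dots,\, P_M\,]^{\top}
% The LMMSE weights minimize E[(\hat{F}-F)^2]; the textbook solution is
\mathbf{w} = C_{\mathbf{oo}}^{-1}\,\mathbf{c}_{\mathbf{o}F},
\qquad
d = \mathbb{E}[F] - \mathbf{w}^{\top}\mathbb{E}[\mathbf{o}]
% With the weights constrained to sum to one and zero-mean, mutually
% independent errors u_0,...,u_M, this reduces to inverse-variance
% weighting with residual error
w_k = \frac{1/\sigma_k^2}{\sum_{j=0}^{M} 1/\sigma_j^2},
\qquad
\mathrm{MSE} = \Big(\sum_{j=0}^{M} \tfrac{1}{\sigma_j^2}\Big)^{-1}
```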
This model doesn't require the noise to be constant. Maybe it does. Yes. Sorry. Yes, yes, right. All right. But good point, good point, yes. Perhaps we can make it better by dividing the video into smaller regions and locally estimating something; perhaps that would be a little bit better. >>: [inaudible] estimating the noise power in here? >> Oscar Au: Noise power? >>: Yes. >> Oscar Au: Okay. What we do is that we look at the video and we assume that the video frame contains some parts which have high frequency and some parts which have low frequency, which are smooth, right. And basically when you look at a smooth region, the variance should be almost zero, and if there is any variance it would be due to noise. So basically we look at small local regions. For every region we look at the variance, and we take the smallest three percent or something and take the average of those as the estimate of the noise. >>: [inaudible] spatial [inaudible]. >> Oscar Au: Yes. We use spatial-domain stuff to do it, right. Okay. Okay. Now, when we do this, we have a choice, okay. We can choose to do this temporal filtering in a recursive manner or a non-recursive manner, meaning that suppose you have denoised all the previous frames and now you're trying to denoise the current frame, okay. Now, we want to use time T minus 1, T minus 2 and so on, right, to help you. The question is: are we going to use the original noisy video to help you, or are we going to use the denoised video to help you? The effect is going to be different, right? If you use the original noisy video, this is not recursive, right, but if you use the denoised one, then this is recursive, okay, and we find that if you use the recursive one, the effect actually is much better. And the reason why we average only three numbers or four numbers and get a very good result has a lot to do with this recursive filter, okay, because the previous few frames have been denoised so much already that when you compare with the current frame, you tend to put a big weight on the previous frames, so you have a very big denoising effect. Whereas if you consider spatial filtering, even this three by three, very small filter, you cannot denoise very much, because every one of them has so much noise. But in the recursive situation, the previous guys have been denoised so much that actually you can have a very good denoising effect even with three or four numbers averaged together. >>: So when you do recursive, you face the danger of [inaudible]. >> Oscar Au: Yes. Yes. >>: That was my question on the previous slide: those previous things you reference, were those the original sources or the denoised sources? >> Oscar Au: Okay. >>: Which were they, the denoised -- >> Oscar Au: In my simulation? >>: In your equation on the [inaudible]. [brief talking over]. >>: Are those denoised frames or originally noisy frames? >> Oscar Au: [inaudible] the formulation can apply to both. >>: Which ones do you use? >> Oscar Au: We use the recursive. >>: [inaudible] why you get the [inaudible]. >> Oscar Au: Yes. >>: [inaudible]. >> Oscar Au: So actually we will -- >>: If you look at it, it looks like an IIR [inaudible], essentially it's an IIR filter [inaudible]. >> Oscar Au: Yes. So you think that we're not doing -- yes, we are doing IIR, yes, right.
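As a minimal sketch of the noise-power estimate he describes above -- averaging the variance of the flattest few percent of local regions -- the following Python is one plausible reading; the 8x8 block size and the exact 3% fraction are illustrative assumptions:

```python
import numpy as np

def estimate_noise_variance(frame, block=8, fraction=0.03):
    """Estimate noise variance from the flattest blocks of one frame.

    Smooth regions should have near-zero variance, so any variance
    measured there is attributed to noise.  We therefore average the
    variance of the lowest few percent of local blocks.
    """
    h, w = frame.shape
    variances = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            variances.append(float(frame[y:y+block, x:x+block].var()))
    variances.sort()
    k = max(1, int(len(variances) * fraction))
    return float(np.mean(variances[:k]))
```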
>>: [inaudible] I mean, it could be three, four numbers, but the [inaudible] filter can be quite, quite [inaudible], right? >> Oscar Au: Yes. I did not look very closely into that. Yes. But yes. >>: [inaudible]. >>: But that noise [inaudible] also has mismatch noise in there as [inaudible]. >> Oscar Au: Well, the noise parameter estimation is done spatially, per frame, so we don't need to worry about the motion mismatch for that. Okay. So anyway, yeah. So anyway, when the variance goes down, that means the [inaudible] will go down. Anyhow, this method works quite reasonably well. Okay. That's about it. Okay. To give you an idea of the simulation, okay, all right, now, this is the mobile sequence, the foreman sequence, the news sequence, okay. You can use, you know, zero frames, which means no denoising, okay. You use one frame, so that means you average two guys, okay. M equal to 2, 3, okay. All right. With no denoising, this is the [inaudible] you get. Okay. No denoising. Okay. All right. Using one guy, you can increase it from 22 to 25 dB, okay. One more, you can go from 25 to 26, and then 26 to 26 again. Okay. You see that the major gain comes from one guy, okay. Two guys, you gain a little bit more; three guys, a little bit more. Going further is not going to give you a lot of gain, okay, so that's why from our simulation we tend to believe that you get the maximum benefit out of either two or three. I tend to use two myself, right, but it depends. Yeah. Okay. Complexity, of course: when you go from one to two to three, complexity is going to increase linearly as well, because of the motion estimation. Okay. >>: What is the frame rate? >> Oscar Au: 30 frames per second. Full rate. Yes. Right. Something like that. So okay. Some examples, okay. So this is the noisy video, this is the denoised video with M equal to 2, so that means T, T minus 1 and T minus 2, that's it. Okay. And you can do quite well actually, all right. And yeah, this is quite a bit cleaned up, okay, and this part can still be quite good, okay. When we have good motion correspondence, this method actually works quite well. This method actually works quite well, okay. All right. So okay. Another example, this is flower garden, okay, the original noisy video, M equal to 2, and this is what you get, okay. And the sky looks clean and you still see some textures here, so quite reasonable. You do see around here there's some noise, okay. Yeah, you do see some noise here. Okay. Now, locally, when you have motion mismatch, this is the kind of effect you're going to see with this method, okay. Here motion estimation is doing well. Okay. But here, locally, motion estimation doesn't work too well. So in that case you're going to see locally some residue -- you know, residual noise there. Okay. So basically I think around the house rooftop you see some noisy patches here and there. And this is a place where motion doesn't work so well, and so we don't do so well. >>: What is [inaudible] there's like a [inaudible]. >> Oscar Au: I think -- >>: [inaudible]. >> Oscar Au: I missed the question. >>: So the question is: why around the pole there are you missing the noise so badly, compared to other regions where you get much better [inaudible]. >> Oscar Au: Why is this so bad? Well, basically any place where the motion estimation doesn't do so well, you're going to have that kind of effect. >>: [inaudible].
Actually, going by your model, the frame that you average [inaudible] that frame [inaudible] might have a much higher weight than the current frame. >> Oscar Au: Yes. >>: The danger is that you get the wrong block. So getting a block from the [inaudible], and that's what seems to me to have happened [inaudible]. The pole in the middle, right. [brief talking over]. >> Oscar Au: Okay. All right. If you ask particularly why this thing is happening, I'm not very sure, okay. I'm just pointing out to you that, yes, wherever the motion doesn't work too well, this would happen. You're talking about that specific sequence, why this is happening here, right. >>: Ask you how often [inaudible]. >> Oscar Au: How often it happens? How often it happens? Well, basically, when motion doesn't work so well, that's what would happen. I mean, how often does motion work well or not well? It's actually the same question. Yeah. It's sequence dependent, right, and here we only model translational motion, right? If you change the model, allow yourself to have smaller blocks, allow yourself to have rotational motion and stuff like that, you can do better. But on the other hand, we -- >>: [inaudible]. >>: It happens twice already, right, so the [inaudible]. >> Oscar Au: You're talking about right here, right? >>: And there's one other [inaudible]. >>: Same pole. >> Oscar Au: Same pole here? >>: [inaudible]. >> Oscar Au: Here? >>: Yeah, yeah. >>: [inaudible]. >> Oscar Au: Yes. >>: So if you don't do a good job, you're going to see a lot of [inaudible]. >> Oscar Au: Yes. To solve the problem is actually not very difficult, okay. To solve the problem, you need to combine this with a spatial method. The idea is quite simple. When this method works well, you use this method, because this is actually better than the spatial method. But on the other hand, when this method doesn't work well, you need to put a bigger weight on the spatial method. In other words, you want to do a linear combination between the temporal denoised version and a spatial denoised version. Take a weighted average between them. When this method works well, you trust this guy, give this guy a big weight. When this method doesn't work well, you give the weight to the other guy. Do you know what I'm saying? >>: [inaudible] spatial noise [inaudible]. >> Oscar Au: Yes. Later on, you're going to see, when we look at the robust motion estimation, we are actually going to do spatial denoising as well. We are actually going to do a spatial denoising. And we are going to combine that with this one, because we do that over there already with the spatial denoising. If we combine that one with this one, then you're going to get a better result. >>: [inaudible] this frame. >> Oscar Au: Yes. >>: Is the motion actually [inaudible]. >> Oscar Au: I don't remember -- well, this is obviously hanging like that, right? But I think things have become a little bigger over time, I think. >>: I assume that they do [inaudible]. >> Oscar Au: Yes. >>: If you don't do rotation. >> Oscar Au: No, we don't do rotation. We just do regular motion estimation. >>: So [inaudible] wonder if the motions [inaudible]. >> Oscar Au: I don't exactly remember this sequence. I think this sequence was filmed probably from a car driving along the street, something like that.
And so as a result, there's a scale change here. >>: So -- >> Oscar Au: You don't have just plain, simple translational motion. Things are becoming smaller and smaller. Okay. And so on. So motion estimation has some problems here. >>: [inaudible]. >> Oscar Au: I think we use 16 by 16, just regular. >>: So for a whole 16 by 16 block you assume the same [inaudible], right? So maybe. >> Oscar Au: Same motion vector. >>: Same motion vector. So maybe a very good estimate for, say, the [inaudible] and then the edge corresponds [inaudible]. >> Oscar Au: That's possible. >>: You're forced to use one motion vector for this whole. >> Oscar Au: Process, yes. Yes. >>: Because I didn't notice anything that would suggest -- I'm just wondering -- >> Oscar Au: This particular example doesn't show it, but there are examples in which you see a noisy pattern which is blocky. And this is due to the block-based motion estimation. Yes. Okay. All right. So, noise-robust motion estimation. Here we're addressing that issue now. When you do motion estimation, okay, the presence of noise causes you a lot of trouble, okay. For some situations, okay, in a noise-free situation, when you have no noise, you may have a pretty nice residue surface, right? Error surface. Okay. The presence of noise is going to cause this to become very fluctuating, okay, noise-like. Okay. So this thing is noise-like. And unfortunately, many of the fast motion estimation methods assume some kind of gradient descent, a greedy algorithm, okay, and they're going to fail pretty badly in this kind of situation, right, when the error surface goes crazy. Okay? And early termination can cause problems, too, okay. Sometimes you can be easily trapped in here; sometimes good, sometimes bad, all right. You think it's good enough, but actually it may not be so good after all. So as a result, we find that in a noisy situation it's just hard to do motion estimation. Early termination strategies have problems, greedy local search is going to have problems, and things like that. Okay. And this will greatly increase the motion estimation complexity and also reduce the accuracy as well, okay. And so what we propose to do is two things, okay. We apply a filter first -- we apply an edge-preserving low pass filter, okay -- and then we do fast noise-robust motion estimation. Without this first step, we cannot do a fast algorithm, and we had to use full search. And that makes it very slow. Full search is very slow. After applying this one, we find that it is now not so bad to do a fast algorithm. After you apply the low pass filter, then you do it, then you can do quite a bit better, okay, and after that we just do the regular denoising stuff, okay. >>: Why does it [inaudible]. >> Oscar Au: Because the edge is important. Because the edge is important. >>: [inaudible]. >> Oscar Au: No, no, the edge is very important. The edge will determine whether your motion is going to be correct or not, right? >>: So [inaudible]. >>: I think what he's trying to say is that, assume that the [inaudible] certain filter [inaudible] video. As long as you're consistent with the way it affects the [inaudible] is preserved regardless of whether [inaudible] edges are sharp. >>: No, because you could just do low pass filtering and then the resolution decreases. So then when you do the match, you can't resolve to within that resolution. So you preserve the high frequency edges, [inaudible].
>>: [inaudible]. [brief talking over]. >> Oscar Au: If you recall, one of the goals we have is that we want to preserve edge sharpness after the denoising, so edge precision is actually quite important. If you low pass it, then the location of the edge is not so precise. >>: [inaudible]. [brief talking over]. >> Oscar Au: What is what? I'm sorry, I missed that. >>: Do you actually get a [inaudible]. >> Oscar Au: Yes. Yes. I'm going to show you. Well, we've got the speedup compared to full search. We've got the speedup compared to full search. Okay. Now, we can go a little bit further down to look at the results. The number of search locations: this guy is the full search method, and this is the speedup we're talking about. This one gives you about 50 times speedup, 50, 60 times speedup. This one gives you, I don't know, close to 100 times speedup. Okay. Computation, complexity, number of search locations. This is the regular method, full search, this is our method, and the number of additions is this many, so, I don't know, 20 times? For multiplications, we need to do a little bit more -- we do a little bit more multiplication, but not very much, okay. So, yeah, we manage to reduce the complexity quite a bit. The processing speed: okay, with the original method, on our P4 3.2 GHz, 1 GB RAM, Windows XP platform, using full search, we can only do two frames per second, three frames per second. By using the fast search we can do 28 frames per second, 20-something per second, right. So this is quite a bit of speedup. Quite a bit of speedup. Okay. The motion field becomes smoother. The motion field would become smoother. And also the quality would become better, too, okay. This is noisy, this is the original method, this is the fast method. The fast method actually can give you -- well, okay, in this case, similar [inaudible]. Sometimes you can get better [inaudible]. Okay. Believe it or not, we are using the fast method but we can get better [inaudible], so it's kind of funny. >>: [inaudible]. >> Oscar Au: I'm going to go back to that, yes. Okay. We like that filter, right. In this example, this is the sequence shown, the noisy video, okay. I think this is the amount of noise, three different situations. This is not so much noise, bigger noise, and bigger noise, so it appears now it's smaller, okay. And M equal to 1, 2, 3 -- pretty different situations. This is the original method, this is the fast method. You can see that the fast method has a little bit of gain over the previous method. But in some cases we can actually gain quite a bit, gain quite a bit. All right. So it's kind of interesting. Okay. The most important thing is you don't lose. You don't need to lose the [inaudible] when you use this method. Okay. Interesting. Okay. All right. I'll go back and look at what we do. Okay. Okay. All right. Okay. This is the filter we use -- we look at a three by three window. It's a very small filter. It's not a very big filter, it's a small filter, okay. We like this filter. We have used it quite a few times. This used to be our spatial, spatial-domain low pass filter, our denoising filter. We used to use this as a denoising filter. We used to use this even for inverse halftoning. If you know halftone data, it's very noisy; this method actually can suppress that quite a bit. Quite interesting.
The equation -- my student wrote this equation very poorly, so it's very hard to read, okay. But basically, consider a three by three neighborhood. There are 9 numbers, okay. We're going to take a weighted average of the 9 numbers, that's all. But how do we determine the weights? Okay. Now, especially, we are talking about edge preservation. Very often when you do edge preservation, you need to do some kind of edge detection and then do directional filtering along the edge, right? That's one way to do it. But we don't want to do that, because that is going to incur a lot of complexity, very troublesome, okay. Or if you want to do median filtering, that would preserve edges also, but then you need to do sorting and stuff like that, which is slow. This method doesn't do sorting; this method is a simple three by three weighted average thing. The beauty is really in how you determine the weights. This method is actually quite interesting. The weights, now. Here, imagine this thing as the 9 numbers within this three by three neighborhood. Okay. Each one of them is weighted by a certain number, okay. Those numbers are going to add up to one, all right, as expected. Okay. Basically this number is proportional to something complicated, okay, but the idea of this equation is very, very clear, okay. What you want here is: suppose you have a target value within a certain neighborhood. Suppose you have something called a target value, okay, and if a pixel is close to that value, this guy is trustworthy and you're going to give a big weight to that guy. If it is very far away from the target value, then we give it a small weight. Okay. In a sense this method allows us to suppress [inaudible] basically without doing edge detection. If you can do edge detection, if you know that you are on this side of the edge, then you forget about those other guys, right, you only take this side and take the weighted average, and then you will have no problem, you will be able to preserve the edges. Within the neighborhood, if you know there's an object here, something on this side, something on that side, and you take the weighted average between these values and those values, you could get a very bad value, because those values have nothing to do with you, and when you take the weighted average, those guys [inaudible] can badly affect your results. The basic idea here is: I want to forget about those guys, I only use the guys that are trustworthy, and just take the weighted average, okay. >>: It's basically a bilateral filter. >> Oscar Au: Bilateral filter? >>: [inaudible] where you basically look at the spatial extent and the intensity; in other words, you try to filter locally and look at not only where a pixel is but how different it is in intensity, so that you downweight any pixel. >> Oscar Au: Yes, yes. >>: [inaudible]. >> Oscar Au: Yes. That's the idea. Yes. So we're doing that. Okay. So this filter is a little bit more complicated than the regular three by three low pass filter, okay, because we do need to do a multiplication -- that weight is important, okay -- but on the other hand, it is not so bad, it is relatively simple compared to a lot of other methods, okay, so we like this method a lot. Okay. And eventually it works quite well, okay. So for example, if you look at this example, okay, this is the original video, this is the noisy video. If you apply our filter, this is what you are going to get. So as I told you, this method actually is a denoising filter.
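The transcript doesn't preserve the slide's exact weight formula, so here is a minimal Python sketch in the bilateral-filter spirit the audience member identifies. Taking the center pixel as the "target value" and using a Gaussian similarity weight with parameter sigma are both assumptions, not the talk's exact choices:

```python
import numpy as np

def edge_preserving_3x3(img, sigma=10.0):
    """Bilateral-style 3x3 weighted average.

    Each of the 9 neighbors is weighted by how close its value is to
    the center ("target") value, so pixels on the other side of an
    edge get almost no weight and edges stay sharp -- no explicit edge
    detection or sorting needed.
    """
    img = img.astype(np.float64)
    out = img.copy()
    h, w = img.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = img[y-1:y+2, x-1:x+2]
            diff = patch - img[y, x]          # distance from target value
            wgt = np.exp(-(diff * diff) / (2.0 * sigma * sigma))
            out[y, x] = (wgt * patch).sum() / wgt.sum()
    return out
```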
It is a denoising filter, but the beauty of it is that it can preserve the edges. If you look at this edge, this edge is very, very sharp. Okay. This is what I'm talking about. This filter can preserve sharp edges. For thin edges, it does not preserve them. But for sharp edges, yes, it preserves them. Okay. All right. So yes, these are actually pretty sharp, okay. So this is a pretty good edge-preserving filter. So it's kind of nice. Okay. All right. So after applying the edge-preserving filter, okay, now you've got something which is quite good, all right. Now we are going to use this to perform motion estimation. We don't want to use this one, we want to use that one to do the motion estimation. But later on, when we obtain the motion vector, we will still use the noisy one. We will still use the noisy one to do the denoising. In other words, at this point, we are only using this guy to do the motion estimation, but we will not use it for the denoising. Okay? We will use the original noisy one to do the denoising. Because we do understand that this method actually will suppress some of the image detail, right, so we cannot fully trust it, okay. But for motion estimation, this is good enough. For motion estimation this is good enough. And not only that, as I mentioned earlier on, in the case where the temporal denoising doesn't work, we actually have the option to do a weighted average between that guy and this one. Because this one actually doesn't look too bad; it's just that you lose some detail, okay. When temporal denoising doesn't work, it's better than nothing, all right. So actually, taking the weighted average between this one and the temporal denoising works quite well. Okay. All right. So here, this is the second stage, the motion estimation. What we do is we consider three motion vector predictors, okay, as sketched below. The first one is a median predictor, okay, meaning that for the current block, you have the block on the left, on the top, and the upper left, okay. And they already have motion vectors, and we treat them as, you know, likely predictors. Since they are so close, it's quite likely that the motion vector of this guy is similar to the neighbors'. So we take the median of them, okay, and treat this as the median predictor for the current block. Okay. All right. We also look at -- if this is the current frame, this is the previous frame -- the co-located block. The previous frame also has a motion vector. Given that in the real world a lot of things don't move so fast, the motion vector of the co-located block is perhaps a good predictor of the current block's motion vector as well. So that is another predictor, okay. And then there's also another thing called the temporal predictor, meaning that -- remember, if this is the current frame, this is T minus 1, T minus 2, T minus 3. When you do T minus 1, okay, you do whatever motion estimation. But when you do T minus 2, you already have a motion vector for T minus 1, and it makes sense for us to just double it and use that as a predictor for T minus 2. And so this is the temporal predictor we're talking about. Okay? And, yes, you scale the past motion vector by the temporal distance as a predictor. Okay. So these are the three predictors we use. Okay.
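A minimal sketch of the three predictors just listed; the function names and the (x, y) tuple representation are illustrative assumptions:

```python
def median_predictor(mv_left, mv_top, mv_upper_left):
    """Component-wise median of the three causal neighbors' motion vectors."""
    xs = sorted(v[0] for v in (mv_left, mv_top, mv_upper_left))
    ys = sorted(v[1] for v in (mv_left, mv_top, mv_upper_left))
    return (xs[1], ys[1])

def colocated_predictor(prev_mv_field, block_pos):
    """Motion vector of the co-located block in the previous frame."""
    return prev_mv_field[block_pos]

def temporal_predictor(mv_to_prev, k):
    """Scale the vector already found toward frame t-1 by the temporal
    distance k, predicting the motion toward frame t-k."""
    return (mv_to_prev[0] * k, mv_to_prev[1] * k)
```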
>>: [inaudible]. In your experience here, how sensitive are those methods to resolution? You know, if I start with SD video, well, you know, I mean, this [inaudible] detail, but [inaudible] increasing, going to the HD-class resolutions, clearly the video becomes oversampled, especially in the spatial domain. Is there some [inaudible] you typically like to incorporate in the way you define kernels, you know, depending on the resolution? >> Oscar Au: Okay. There's no particular [inaudible], but one thing I can comment is that these are predictors, and predictors work well at small resolutions. Predictors don't work so well in HD, at large resolutions. They don't. Okay. I do a lot of work on fast motion estimation, and we noticed that when you have small resolution video, very often, just as long as you examine the predictors, you already [inaudible]; it happens quite often, okay. But in an HD kind of situation, the predictors are not so accurate, and almost always you need to rely on some kind of local search to find the optimal location. Yes. So local search is very important in HD. Not so important at small resolutions. >>: [inaudible] high priority [inaudible] but the video is [inaudible] can denoise a lot [inaudible] still doesn't occur [inaudible]. >> Oscar Au: Okay. Sure. >>: So [inaudible] some constant strength for the filtering. >> Oscar Au: Okay. We don't adjust it. We also did not do a lot of simulation on the HD kind of thing. We actually did low resolution, so for HD I cannot say too much. But basically, no, we don't, okay. And after all, remember, this edge-preserving low pass is used to produce something for motion estimation only, so far, over here, okay. So as a result, even if you lose a little bit more detail, it is actually not the end of the world. It's not the end of the world. >>: So my question is, do you know [inaudible]. >> Oscar Au: Okay. When will filtering stop hurting the video quality? >>: [inaudible]. >> Oscar Au: Oh, denoising, when is it going to affect -- oh, I see. Okay. I cannot say, okay; I mean, this is a pretty long story if you want to talk about that. Coding -- how coding interplays with the denoising result, that's -- >>: [inaudible]. Can denoise a lot [inaudible]. >> Oscar Au: How about I draw something on the board? Is it okay? Let me find a pen. When you have noisy video, this is what you're going to get. Normally, when you have clean video, we expect this, right: the bit rate is high, the distortion is not as high, right? This is what you expect for clean video. For noisy video, this is what you're going to see. This is what you are going to see, okay. Now, with no denoising, there's a certain region where you can just lightly compress it, and the compression itself will suppress the noise, okay, because whether you have clean or noisy, you get the same thing. Compression will kill the noise up to a certain bit rate. When the bit rate is high, the bit rate is high enough to preserve even the noise. And then the noise causes the video to become bad. >>: [inaudible]. >> Oscar Au: Yes. Yes. We actually have a paper to address this point. You can calculate this point. You actually can calculate this point, okay, and that's good, all right. Which is very, very good, because what that means is that if I know my bit rate is going to be lower than this point, I can forget about the noise. I just do compression and I get free denoising from the compression. Okay?
But if I know that my bit rate requirement is higher, then what am I going to do? Well, if I know this point, then I would say that even if I want to be here, I would not want to give more bit rate to this guy, because more bit rate means I'm going to get worse performance. I'm going to only go up until here, and that's it. I would never allow my QP to go below a certain point such that I'm going to get worse performance and more bits. This is just stupid. With denoising, what would happen is that you are going to get something like this. With denoising. Okay? You will not be as good as that, but you will be better than this point. You know what I'm saying? You will be better than this point because of denoising. Okay. And now there's a reason for you to increase the bit rate higher. With no denoising, there's no point going beyond this point. All right. So that is the situation. Okay. So coming back, when we do motion estimation, this is the cost function we use, okay: J equals -- the first term is basically SAD, just SAD, plus lambda times -- this is a Lagrangian kind of term, where the second term is basically V minus the median predictor, P median. The idea here is that whatever motion vector you find, we would like it to be close to the median predictor, okay, in order that the motion field becomes smooth, okay. There is reason to believe that whenever you look at a particular block, okay, this block is part of a big region, and so this guy should be similar to the motion vectors in the neighboring region, so it makes sense for the motion field to be smooth. And this will be especially so for large resolutions, HD kind of situations. For low resolutions, it's less true; for HD it will be more true, okay. So then basically we do two kinds of search patterns, large diamond search and small diamond search, in different situations, okay. And, oh, I see, another thing is that because we are doing -- let me see. If this is frame T, this is T minus 1, T minus 2, T minus 3. What we are saying here is that at a certain point, if the [inaudible] squared is less than a certain number, all right, then perhaps you don't need to do anything further, because it is good enough, and the motion search can stop. Otherwise you keep going. Okay. So something like that. Okay. All right. So this is the overall strategy, okay. You start with M equal to 1, okay, and then you search, okay; for M equal to 1 this is what you do, for M equal to 2 you do this, for M equal to 3 you do this, okay. All right. For M equal to 1, we do three searches, we do three searches, okay. We do a large diamond search starting from (0,0), okay, and then after the large diamond search you go to a small diamond search, okay. All right. And then the second thing we do is that we also use the median predictor as a starting point and do a small diamond search. Another thing is that we use the co-located block from the previous frame, the co-located block, PC. Yes. Co-located block, yes. Okay. And from that we start doing a small diamond search, okay, and basically we compare the three guys and see which one is better and choose that as the motion vector, okay. And then for M equal to 2, we just do a large diamond search from PT, the temporal predictor, okay, and then you do the small diamond search. And then for the other guys you use a small diamond search from the temporal predictor, or something like that. Okay.
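A minimal sketch of the regularized cost he writes down; the lambda value is an illustrative assumption, and the L1 distance to the median predictor is one plausible reading of the penalty term:

```python
import numpy as np

def motion_cost(cur_block, ref_block, mv, mv_median, lam=4.0):
    """J = SAD + lambda * |mv - mv_median|: the SAD matching error plus
    a penalty pulling the chosen vector toward the median predictor, so
    the motion field stays smooth even when noise makes the SAD surface
    bumpy."""
    sad = np.abs(cur_block.astype(np.int32) - ref_block.astype(np.int32)).sum()
    penalty = abs(mv[0] - mv_median[0]) + abs(mv[1] - mv_median[1])
    return sad + lam * penalty
```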
So basically this is what we do in the search. And so once again, this is what we got. Okay. We can get quite a bit faster than the original method. Now, mind you this method use brute force full search, okay, and once again because of the presence of the noise we find that the fast method doesn't work so we have to use full search for this. And now we find out that yes, it is possible to do fast search. Okay. And this is the kind of speed we can get. In order to do this, we will need to include a edge preserving low pass filter in it before we do fast search. Now the fast search become truss worthy, the result become more reasonable. Complexity wise, we are quite a bit smaller, right, so -- okay. And the speed up, okay, and the motion field becomes smoother as expected, because of the SAD thing, because of the predictor, it tends to be much smoother, okay. And, yes, so the reason is -- the results are reasonable. Okay? And [inaudible] can be a little bit higher, can be a little bit higher. Okay. Okay. Yeah. Okay. So now so this is -- so, so far I've talked about two things. Number one, I talk about temporal, temporal the interlacing framework using multiple reference frame to do denoising, second part is that for the first part we need to do motion estimation and we blindly use full search. Second part we try to do robust motion estimation but fast, fast and robust motion estimation. So we're doing two things. Third thing, okay. Now, so far we have been talking about denoising as a prefiltering measure, meaning that you don't -- denoising is one thing, codec is another thing. They are separate, okay. But now here we are seeing that is it -is that possibility that we can integrate denoising in the encoder in that means we do both things together, okay. Now the encoder is willing to do something for you, okay. How -- is it possible to combine them together to do something reasonable? All right. Can we do that? Okay. If you imagine then can we later on you should bear this in mind, can we do special filtering with encoder? Can we do other things with encoder? I'm not very sure, okay. But I tend to think that we perhaps are the only method that you can integrate seamlessly into an encoder. It turns out that we are doing exactly the original filter. The temporal filter we are not changing. We don't modify it. We do exactly the same thing. And we can integrate it into the encoder such that if the encoder complexity is this much now, with the denoising the complexities are this much. You just increase a tiny bit and you have the advantage of denoising. Okay. All right. So this is -- we're going to show you how we can do this. This is regular motion estimation, well hybrid coding, okay, MPEG-1, MPEG-2, everybody do this. Input video. You do motion estimation here, you subtract the prediction to get the residue and then you do DCT quantization and stuff like that and then here you do locally decode that and add it back to the predictor to get reconstructed video and so basically this whole thing is the decoder, okay, and so on. Okay. All right. Now, you see that we leave quite a bit of space here because we are going to add something there. Okay. When we said we integrate something. What we are saying is that we want to keep this whole thing except that we modify a little bit such that we achieve the effect of the denoising. Why are doing the whole compression at the same time. We are doing denoising within the whole encoding loop. Okay in all right. Okay. So we have two things. 
We encode -- we integrate the denoising into video encoder. We also can integrate the denoising into the decoder. Okay. So we the video encoder with integrated denoising video decoder, we've integrated denoising. We claim that this is equivalent to cascade scheme, meaning that when we do integrated video and noise encoding and integrating noise we are effectively doing this. We are effectively doing prefiltering followed by decoding. We are getting the same result. For this guy we accept the getting, decoding plus denoising afterwards using our method. But that's a constraint. And the constraint is that in the past remember there's a parameter M. You can use current frame, you can use one frame, two frame, three frame, and I recommend that perhaps two frame is good, right. Using this method you can only use one frame. You cannot use two frame, you cannot use three frame, you can only use 1 frame using this method, okay. So that means it's not going to be -- you are not going to get the maximum benefit. >>: [inaudible]. >> Oscar Au: Huh? >>: Using a cascaded method or using an integrated? >>: Integrated. >> Oscar Au: For integrated method, okay, you can do -- you can do an equivalent to a cascade method but with M equal to 1 only. We cannot integrate M equal to 2 and integrate it, we cannot do that. >>: [inaudible]. >> Oscar Au: Yes. P frame. Uh-huh. >>: [inaudible] frames. >> Oscar Au: Yes. You are right. Yes. Yes, you are right. Yes. Excellent. You are exactly right actually. Yes. Yes. For the case of P frame, we can only do M equal to 1. If you allow yourself to have two reference frame, then we allow, we can do M equal to 2, yes. Yes. Exactly depends on how many reference frame you use for your motion estimation in your codec, you are right. Okay. Well, the denoise video is basically if you look at it, okay, if you recall the previous formulation that the noise video is equal to the noisy video plus the prediction, okay, and the plus sum D, right. This is the original formulation, okay. Multiply by certain weight and this weight obtained by LMMSE kind of thing, okay. Well, and we can -- this is what we're going to do, okay. We are going to add something here, we are going to whatever the residue is, this guy minus the prediction, give you the residue. For the residue we are going to multiply by a certain weight and then add a certain DC to it. Oh, yes, yes, that's right. Oh, yes. One thing -- one thing -- another thing is that we are going to restrict ourselves to have only one weight for the whole block. In the past every single pixel can have a different weight. Here the whole block need to use the same weight. I'm sorry, I forgot to mention this. So it's not fully [inaudible], it's not fully [inaudible] but quite close. Okay. Now, mind you, okay, with this modification it is very, very simple. You know that motion estimation is very, very complicated. After you do your motion compensation, this is the residue. You multiply by simply a number and then you just add a DC to it. So it's very easy. Okay. Actually my student did not do it right, okay. We should have put this after the DCT. Because DCT is linear, you exactly the same, whether you multiply before, after, it's -- yeah, it's linear, right. And DC it turns out that when you add a DC, you don't need to add 16 numbers, you actually can only need two, you only need to add it to the DC value. So this is a little bit faster. So we could have put this thing right after the DCT, and it will be faster. 
Except the equivalent but just faster, okay. And then for the decoder, there is something like that. Okay. All right. Here if this is how we derive the method, okay. Now, residue, let's imagine now let's imagine now we are trying to do this. Imagine now we are doing denoising followed by encoding. Denoising followed by encoding. Okay. In that case, in that case, let's imagine what is the recipe. When I do motion estimation, okay, I am going to have the current frame. I do a search in the previous frame. I find the best guy. I subtract them to get the residue, right? So the residue is equal to some denoise guy minus the prediction. The current frame has been denoised already. The current frame has been denoised all right, right. I'm going to decode it. Well, the current frame is basically equal to the current noisy video plus a weighted version of the prediction plus some DC, right. I mean this is, this is just the current frame denoise, this is how we write it, okay. And it turns out that, well, you see P here and you see P here. You can combine them together. So W1, W0, and W1 add together will be equal to 1. W1 and this guy so this is W1 minus 1, so this is actually 1 minus W1. 1 minus W1 is actually W0. So actually so you -- this guy can simplify into W0 FN minus W0 P plus D, but FN -- FN minus P, FN minus P, what is FN minus P? FN is the original noisy video minus the prediction. In other words, if I just encode original noisy video, if I just encode original noisy video, what will I get? I would be using my noisy video minus prediction to obtain the residue. In other words, to do this cascade denoising followed by encoding, you effectively just like you were encoding the regular noisy video accept that you multiply the original residue by a number and then you add a DC to it. What I'm trying to say here is that if you do the cascade denoising followed by the -- by encoding, this is exactly equivalent to doing forget by the denoising, just do encoding but in the encoding loop, you multiply the residue by a constant, by W and then add a D to it. The two things are equivalent. Yeah? >>: [inaudible] for multiple frames [inaudible] and now notice when you complain the [inaudible] if you have two reference frames, right, you have W1 and W2. These two number may not be the same. And the residual -- I mean B is calculated in some kind of fixed format [inaudible]. >> Oscar Au: That's true. >>: Does not work, right? >> Oscar Au: Good point. Good point. Yes. So you will not be equivalent. It will still work, but it will not be equivalent to the M equal to 2 temporal denoising. You're right. >>: [inaudible]. >> Oscar Au: Yes. Yes, yes, you're right. You're constraining that, correct. Good point. It will still work to a certain extent. It would still work to a certain extent. Okay. >>: Prediction now will be weighted, weighted prediction of two frames and you subtract it from the original, so it becomes -- I think the complexity will be [inaudible] probably extend [inaudible]. >> Oscar Au: Given the fact that while I think you will use B frame instead of P frame when B works better than P, right, in other words, when the B prediction is better than the P prediction and I think this will only be true when you have good motion estimation for the both -- for both of those frames. In other words, I think in that case most likely they are weight naturally when use denoising their weight will be very similar. In other words, we will probably be very close to the optimal situation for M equal to 2. 
This will only be true when B frame works, B block works better than P frame, P blocks, right. And this will only be true when you have actually good motion estimation correspondence. Otherwise you wouldn't do it. So in a way, this method should still work quite well. Okay. But I cannot -- but you are right, yes, I don't think it is equivalent. It's not equivalent, but quite close. So basically with these few simple lines, this established the correspondence between separate purely prefiltering followed by encoding cascade version versus the integrated version. They are equivalent in this case, all right. So that's good. And the beauty of this is that it is very, very simple, very, very simple. In the DCT, after you calculate the residue and then you do the DCT, you just multiply it by a simple threshold, simple scaler, you know, this sing and then add a DC to it. It's very, very simple. Super simple, and you can get the benefit of denoising. >>: That means this might actually be a [inaudible] only need to modify the DC coefficient. >> Oscar Au: This is standard for quantization. This is standard for quantization. >>: [inaudible]. >> Oscar Au: Oh, I see. >>: [inaudible] conditions [inaudible]. [brief talking over]. >> Oscar Au: Different blocks. >>: For each block you only need to [inaudible]. >> Oscar Au: Actually true. True. Yes, right. Yes, I agree. I agree. >>: It will be just adding more conditions of the [inaudible] for these blocks based on the calculations. >> Oscar Au: Yes. >>: [inaudible]. >> Oscar Au: That's right. True. But once again, remember, doing this current denoising you may lose a little bit of the detail in the video, right? We mentioned earlier on. So you may want to think twice before you do it, okay. You're going to you're going to gain something, you're going to lose something. What you gain is that the [inaudible] model, the [inaudible] model. Okay. You're going to have denoising with that. What you lose is perhaps some small image detail that you want to preserve. So we do need to be wise in order to do that. But one thing remember, if you throw away this guy, if you allow W0 to be equal to 1, you get back the original. That means you can -- you can perfectly preserve all the original detail you want, and using this method you get a W0 and if you really want, you can use a brute force method and force yourself to get a value between the value of 1 and 1 and you can get -- you can have a continuous tuning of whatever you want. It is possible. It is possible. Whether you want to do it is something else. But it is possible. >>: [inaudible]. >> Oscar Au: That's about it, I think. Okay. All that we will claim is that, hey, complexity is not very much, okay. Either performance is good and you can also do it basically in the -- and the quality is good, too, okay. Actually not only that, okay, this picture is interesting. Almost done what I'm saying is almost done, okay. This one says that if you could look at this one, this is original cascade method, this method actually is an integrated method and you see that this one looks better than this. And you say how can this be possible? How can this be possible? You are doing integrated -- I mean we just said this is integrated, why can this be better? Okay. We have a reason for this. Okay. 
And the reason is that in the original cascade version what we are doing is that in the current frame you have a previous frame and all the previous frame has been denoised, right, they are quite clean and so when you do weighted average they are good and I tend to give them more weight, right. So I got a certain effect. Well, in this scheme with the current frame, with the previous frame, the previous frame has been denoised and encoded, so notice that they have been encoded as well. An encoding actually give you some is denoising effect as well. And so as a result they are actually even more denoised than the original cascade method. So that's why putting it into the integrated version, okay, you have added benefit of even cleaner than before. So you have a potential to get something a little bit better. But probably losing a little bit of detail, too. I don't know. Right? But anyhow, when you do coding, you are going to lose some detail anyway. You -- it is lousy coding after all, right. So we don't need to beat ourselves to death. So actually sometimes it works quite reasonably well. So anyway, for the de-- for our method, okay, motion estimationwise we don't do motion estimation at all, but forecast indicated method we still need to do motion estimation. So 0 complexity for motion estimation. For the filtering it is a little bit simpler than the cascade method and so the total complexity for us is very small. I think you can easily see that it is very, very simple. Okay. Now, when it comes to the decoder, it is also possible to integrate this into the decoder. The application of this is that perhaps nowadays we have all these DVD player, DVD player, okay. The video has been already had encoded by Hollywood, and I can -- I cannot change the video, okay. But it turns out that maybe their video is a little bit noisy, maybe, okay, especially in that region's, a lot of the rain kind of very dark region you may have a bit more noise to it, okay. And in that case, it is possible to integrate the denoising into the decoder, also. It is possible to do that, okay. Basically one thing we need to face is that the decoder [inaudible] need to be consistent. You cannot modify the decoder look just like that, okay, because otherwise you could have drifting errors, okay. And so as a result we leave the original decoder intact but basically we do a denoise version and for the display we don't display the original decoder thing. We apply this little thing and display only the denoise output, okay. And basically we need to add a loop similar to before. We need to multiply by 00 and then add a D to it, okay. And basically this is still within the same loop, but we need to attack this next to this guy such that we can generate something which has the denoise effect already. So it is possible. But you can see that we don't have as much advantage as before, because this is a bit more complicated than the encoder version, okay. So for the decoder guy, okay, for the decoder guy, the equations are similar. You can show that they are equivalent to M equal to 1, temporal denoising, cascade version, okay, and let me see. Our complexity. Okay. Remember previously our complexity advantage is quite a bit higher than the cascade version. But now we are not as, you know, the cane is not so big. It is a little bit smaller, okay. But then we actually can get these denoising effects. So this is something to consider. This is less attractive than the encoder actually. 
But still we for completely said we also derive the integrated denoising into the decoder. So it is possible to do that. Okay. So with that, I think I end the talk. Right. So maybe we can look at a demo again. Okay. >>: [inaudible] other content to show [inaudible]? >> Oscar Au: Huh? >>: Do you have a different video to show? >> Oscar Au: Sure. >>: Than the one with a little more motion in it? >> Oscar Au: Sure. Yeah. I have foreman. >>: The foreman. Okay. >> Oscar Au: This is noisy foreman and this is denoise foreman. >>: And [inaudible] mention that Professor Au will be here with us today. Any one of you want to talk to him afterwards, I mean, please let me know and I will try to see if I can squeeze [inaudible]. >>: The next one is [inaudible], right? >>: [inaudible]. >>: [inaudible]. >> Oscar Au: Blocky noise. Blocky noise. Okay. And see blocky noise here. So this happens when motion estimation doesn't work so well, right, and so locally this part is clean up, this part is not so clean up. So this part, this method by itself has some problem, and need to be combined with I think special method to make it better. Something like that. >>: [inaudible]. >> Oscar Au: Say that again. >>: Did you do any experiments with [inaudible]. I mean this video was added on ->> Oscar Au: Oh, yeah, yeah, yeah. Okay. Good point. Yes. We tried to -- we tried to the apply this to denoise some TV signal. It turns out our TV's reception is very poor and the TV's very noisy. We tried to use our method to denoise it. Okay. We find that we can only denoise it to a certain extent. We cannot completely ->>: [inaudible] over. >> Oscar Au: Analog TV. Analog TV very noisy. You know, you've seen those. They're noisy. Okay. Yeah. We tried to use this method to denoise that thing and we have some effect, but we cannot -- we cannot get something as clean as this. It is super noisy and it is less noisy but ->>: [inaudible]. >>: [inaudible] with that really noisy [inaudible]. >> Jin Li: The time consideration, let's thank you Professor Au for his [inaudible]. >> Oscar Au: Thank you. [applause]