>> Jin Li: It's our great pleasure to have Oscar Au from HKUST come to
Microsoft Research and give us a talk on video denoising. Oscar got his Ph.D.
from Princeton University in 1991 and then became a faculty member at Hong
Kong UST in 1992, where he is currently the director of the Multimedia
Technology Research Center and advisor for the computer engineering program
at Hong Kong UST.
Oscar is a veteran of the MPEG and H.26x series video standards, and I have
read many, many good papers from him and his group. And he has been serving
in a number of distinguished positions in the society. He is associate editor for
the IEEE Transactions on Circuits and Systems Part I and the IEEE Transactions
on Circuits and Systems for Video Technology. He is the chair of the Technical
Committee on Multimedia Systems and Applications and has held many other
distinguished positions.
Without further ado, let's hear Oscar's talk on video denoising.
>> Oscar Au: Okay, thank you, Jin. It's great for me to be here. My honor to be here.
And I just want to show this picture, a warm welcome, a warm hello from Hong
Kong, okay. So we're on the other side of the globe, okay. Right now Hong Kong
is still 20-some degrees. See, we use Celsius; 20-some degrees Celsius is, I
think, probably 70-some degrees Fahrenheit. Okay. So compared with here,
we're quite a bit warmer. And the weather is nice, okay. And we are fortunate
enough -- the picture in the center is actually the university, and it is quite pretty,
okay, and these are some attractions, tourist attractions there, okay.
So we have Disneyland and Ocean Park and stuff like that. So it's kind of
interesting, okay. If you haven't been to Hong Kong before, I would love to invite
you to come and visit us, too, at the university, okay. So our university is actually
quite pretty, so that's kind of nice. Okay.
All right. I'm going to go and start the talk, okay. All right. So my name is Oscar
and I'm from Hong Kong, okay, UST. I am with the department of electronic and
computer engineering, okay.
What I'm going to show you today, video denoising, is basically the work done by
my student, Liwei Guo. He has just finished his Ph.D., and his whole thesis is
actually on video denoising and video modelling kind of thing. Okay. All right.
So this is what I'm going to do today, okay. Talk a little bit about motivation, talk
about contribution. Basically we'll be talking about a multihypothesis motion
compensated filter, okay. We will talk about noise-robust motion estimation for
temporal denoising, okay, and then integration of temporal denoising into hybrid
video codecs, and then I will come to some conclusions and perhaps some
demo, okay.
But actually, you know what, I think I'm going to go to the demo first, because this
will help motivate people to understand what we're talking about, okay. What
we're talking about here is basically like, okay, if you look at this video, okay, this
is a regular sequence and we've added noise to it, and when you look at this
video, okay, all right, it has a lot of noise, yeah, okay, unpleasant. People don't
like it. Not only is it unpleasant, a big problem about noisy video is that it costs
you a lot of bits to compress. It costs you a lot of bits to transmit as well. And this
is all undesirable, okay.
Not only that, imagine when you do compression, for example MPEG, H.264 kind
of thing. Basically what you need to do is something like motion estimation, DCT,
quantization and stuff like that. Okay. Imagine doing motion estimation with this
kind of video, okay, with all the noise flashing and everything. You're going to
encounter a lot of difficulties, and very often the motion vectors that you get out
of this kind of video are going to be pretty bad, all right, and cost you quite a lot
of bits to encode. As a result, what we are talking about is: can we remove this
noise with video denoising, okay? So this is what we have performed, okay.
And it looks quite okay, okay. It is not that it is perfect, it is not perfect, okay.
You can see some flashing here, okay. But on the other hand, okay, what makes
this work perhaps a little bit special compared to other existing denoising work is
that I think we all learned in university, when we were in school, that to do
denoising, a very simple way is to just run a low pass filter there. I mean it's
easy, right? Because noise is high frequency in nature, okay; it should be
uncorrelated and stuff like that. Using a low pass filter you'll be able to remove
quite some noise, okay.
The problem with a low pass filter is that very often the edges become blurred,
okay, and that is undesirable. Okay. We're talking about the difference between
a cheap VCD versus a high quality DVD, right, or a regular DVD versus HD,
right: something blurry and something sharp. That makes a big difference,
okay.
People expect something sharp especially when people pay for it. Okay. And so
as a result, it is important that when we do denoising, we are maintaining the
edge integrity at the same time, we are preserving the edge sharpness at the
same time. And this is our goal in our work, that we try to do denoising while
preserving the edge sharpness, okay. So this is our goal, okay.
You notice earlier on, if I can show this again, okay, the first frame looked pretty
bad, okay. Now, what happened is, because our method turns out to be a
temporal domain method, the first frame is pretty bad actually. Okay. You
basically have nothing in the past that helps you. So the first frame is pretty
bad. But if you go to the next frame, it gets better, next frame it gets better, it
gets better. Okay. And then it basically becomes quite okay, all right. So going
back to the beginning, just play, okay. It will be okay, okay.
And notice that the edges are quite sharp. Notice that, okay. Edges are quite
sharp. And this is how it goes. We want the edges to be sharp. Okay? So that
is what we are trying to do. Okay? Yes?
>>: [inaudible] showed us [inaudible].
>> Oscar Au: Good question. Good question. Okay. All right. Now, what
we're showing is a class of temporal filtering, right? Okay. And there is one
prerequisite: we said we want to establish correspondence between the current
frame and the previous frames, okay. Something here is also happening in the
previous frame, and in the frame before that. Okay? So what the question is
saying is: what if the object is moving so fast that when you find something here,
you cannot find it in the past? That's the question. Okay. The answer is,
unfortunately, that in this case this method cannot work well.
Okay. Which means basically that you would need to combine this method,
perhaps with some spatial method, okay, because what I'm saying is this method
is limited by -- well, this method has a basic assumption that we can establish
correspondence between current frame and the previous frame. Now, this
method by itself is not going to die just like that, okay. Basically what this method
would do is that when it detects that the correspondence between the current
frame and the previous frame is low, the similarity is low, then basically it's going
to take a weighted average between the current frame and the previous frames.
Okay? Well, when it detects that the previous frame doesn't look like the
current frame, it's going to put a big weight on the current frame. Basically that
would mean that you have very little denoising effect. It's not going to damage
the video too badly, but it's going to have a locally noisy effect.
>>: [inaudible].
>>: When you say correspondence, it reminds me that earlier you said
[inaudible] that means you can't get a good response, right, so is there some --
I'm sure you're going to come to that, but it seems to me that it's a little bit of a
chicken-and-egg problem.
>> Oscar Au: Okay. Yes. Okay. This is exactly why in our simulation we don't
use fast algorithms, because fast algorithms make assumptions. We are actually
experts in doing fast motion estimation. We have some methods that went into
the standard, okay. But then again, we realized that in those fast algorithms
we make assumptions which may not be true in noisy video. And so as a result
we resorted to full search, okay. When you do full search, actually it is not too
bad, it is not too bad. When you actually have correspondence but you are just
disturbed by noise, you basically would be able to find it. You basically would be
able to find it. It's just slow, okay.
But later on, I'm going to talk about a method to make it fast, okay. That's one
thing. But unfortunately, if the motion is really going so fast, you're doing rotation,
crazy motion, okay, the eyes blinking and stuff like that, then there's nothing you
can do, okay, and the method would basically fall back into doing very little
denoising. Okay. So locally you may have some region which has a bit more
noise. This is our method. Yeah?
>>: [inaudible] will be great and then go up again of the denoiser. So if the
[inaudible] between the camera and the decoder you start losing the motion that
comes from --
>> Oscar Au: Did you say [inaudible]?
>>: [inaudible].
>> Oscar Au: When you drop frames, it doesn't mean you have bad motion or
good motion; you just drop frames.
>>: It's like [inaudible].
>> Oscar Au: High motion. Oh, okay. When you're in [inaudible] situation.
Okay.
>>: [inaudible] and you drop say between three and six and you drop, you don't
[inaudible] this information.
>> Oscar Au: That's correct. Okay.
>>: [inaudible] the quality of the denoiser to go down, then when the frames
come in and wouldn't this be [inaudible] to the [inaudible] that's one question.
>> Oscar Au: Yes.
>>: And the second one is the inter[inaudible], what do you do with it?
>> Oscar Au: Okay. All right. Sure. First question.
>>: [inaudible] so I mean.
>>: [inaudible]. What are you going to do, are you going to start after they
introduce it?
>> Oscar Au: Okay. All right. One problem at a time, one question at a time.
What's the first question again?
>>: Drop frame.
>> Oscar Au: Drop frames. Yes. Okay. All right. For dropped frames, okay,
you're in a fast motion situation, okay. Basically what that will mean is that we're
going to fall back to this problem, where you cannot establish correspondence
between the current frame and the previous frame. So this method will not work
very well, and you need to rely on some spatial methods to do it. Okay. But
there are some existing spatial methods that can work reasonably well
actually.
But one beauty, at least for that particular situation you mention, is that when
motion moves fast, your eye cannot see very well. So actually if you blur it a
little bit, it doesn't matter. So actually I would tend to use spatial methods to blur
it, because keeping the edges sharp is not the top priority when the motions are
fast. Right? So I wouldn't worry about that too much.
Okay. So basically this method really should be designed to -- has to work with
spatial methods. What I'm saying is, say I'm looking at you and all of a sudden I
change to another scene, a scene change. When you do a scene change, this
method would not work at all, because you don't have history, right? So you
don't have anything in the past that can help you. So this method actually has to
work with spatial methods. Yeah. So.
>>: [inaudible].
>> Oscar Au: For [inaudible] content. Okay. Before I talk about that, one thing I
do want to mention: we certainly looked at spatial methods, and one problem
about spatial methods is the complexity is very high. There are so many pixels
in the frame, and for every single pixel, suppose you just do a three by three
low pass filter, a stupid low pass filter; it turns out that the complexity will be so
high, okay, and if you use a larger and larger filter, okay, there will be so many
operations you need to perform. Spatial methods are just so slow, okay. The
beauty of this method is that we only do a few operations. Actually, you know
what, we're not averaging nine numbers or, with five by five, 25 numbers; we're
not doing a weighted average of 25 numbers, we're taking a weighted average
of three numbers, four numbers, that's it. The current frame and the previous
frame and maybe the previous, previous, and that's it. That's all. We're just
taking a weighted average of three or four numbers. So complexity-wise it's
quite simple.
When it comes to the filtering operation, the motion estimation by itself is very
slow, okay. So that's why later on I'm going to talk about a fast method to do it.
Okay? This method is basically a prefiltering technique, meaning that before you
compress, we assume, independent of the codec, you are going to do some
prefiltering, denoising, okay, and so as a result if you do the motion estimation
by itself, it's going to cost you a lot of computation. Okay.
So but, yeah, before -- I should answer the second question. You talked about
the interlacing, okay. I don't understand the question. The interlacing versus
denoising, what the [inaudible].
>>: [inaudible] so when you have a field and the field F plus 1 and there is that
[inaudible] between the lines, would this also affect the quality of your motion --
of your denoiser or not, before going into the decoder?
>> Oscar Au: I don't think it's going to be a problem. I think it will work.
Because the same problem happens to frame and field. I don't see field coding
being more difficult than frame coding. It should be quite similar.
>>: [inaudible] interlaced contents. But that question is [inaudible] sometimes
[inaudible] what you call the [inaudible] where things are.
>> Oscar Au: Right.
>>: From T to T minus 1, T minus 2.
>> Oscar Au: Yes.
>>: But it depends on the method. Some of them, maybe this one would
[inaudible]. I don't know.
>>: [inaudible].
>> Oscar Au: [inaudible] in the second part of the talk, I will talk about some
robust motion estimation. Perhaps that can address this issue. Basically we
understand that in noisy situation, motion estimation is just hard. How can you
make it robust, how do you make it trustworthy and hopefully make it fast as
well? So we do have a certain way to do it that can cover at least part of this
problem.
>>: I may be getting ahead, but in your work, do you assume that noise is
what, [inaudible]? The reason I'm asking is that the camera has a nonlinear
response, and so noise is actually not [inaudible] noise, so if you make that
assumption, then wherever [inaudible].
>> Oscar Au: Okay. Right. Sure. We use white Gaussian noise in our
simulation, but I don't think we need to make the assumption that it is white
Gaussian. It doesn't need to be. In the formulation we don't need it to be white
Gaussian. There's no need for it to be white Gaussian. So the method should
work for other situations, too.
Okay. All right. So maybe it's time for us to go back to the PowerPoint and see
what we do. Okay. All right. Full screen. Okay. All right. All right. So,
motivation again; I think we've talked about a lot of the motivation already.
Okay. Video quality. Okay. When you have noise, okay, noise is going to
corrupt the video, making it look bad, okay. And actually, especially when you're
in a high-definition situation, DVD, HD coding kind of thing, because of the very
high bit rate, noise actually can survive compression. Noise can survive. So as
a result, somebody needs to take care of the noise somewhere. So denoising is
a way to handle that.
Okay. And not only that, the presence of noise is going to cost bits to
represent, and so this is going to make the file bigger. If you denoise it, there is
a quite good chance that you can make the file smaller, okay.
And so actually there are situations in which sometimes even if the video is not
very noisy, even if the video is not very noisy, you still want to apply denoising
because you want to apply this as a means to reduce the file size. Okay. You
do it maybe not to do denoising but rather to make the file small.
In other words, you are intentionally using denoising to remove some image
detail, so that your file can be smaller. Do you know what I'm saying? So
denoising is another way to achieve that smaller file size. So this is kind of
interesting. Okay? And okay, because of the presence of noise, very often the
motion estimation is going to have trouble stopping, okay, and so as a result the
presence of noise actually will make your encoder slower -- more comparisons,
more things to compute. And not only that, with the presence of noise, okay,
your motion estimation, no matter what you do, is not going to do very well,
okay, so the residue is going to contain a lot of energy. And so as a result you
need to spend so much more complexity to encode all these nonzero DCT
coefficients, okay.
So actually, in our experiments we realized, to our surprise, that denoising not
only can make the file smaller, it can make your encoder faster, also, simply by
doing denoising. So it's kind of interesting. Okay. So this is quite a cool thing to
apply to your codec, all right, so it's kind of fun.
So, okay, in this example we have compressed the Foreman sequence. Okay.
We have added noise to Foreman, okay, and then tried to compress it, okay.
And number one, with denoising, the file can be quite a bit smaller, okay.
Number two, the compression time actually is much faster, can be faster, all
right. Now, this is a little bit extreme. Of course, we are using all the favorable
conditions to make sure it works well, okay, but this gives you the idea. With
denoising you can make the file size smaller and also you can make your coding
faster. So that's an advantage. Yes?
>>: Do you feel like you're losing some quality when doing that?
>> Oscar Au: Yes. Yes. Well, you need to understand. If you have an input
video and you treat the input video as holy, you cannot touch anything. If you
treat it like that, then, yes, we are causing distortion. But remember, we are
doing denoising; we are actually changing the input, okay. We are saying that,
no, the input is not holy, the input is actually noisy, there is something
undesirable inside. I'm going to make it better while at the same time making the
file smaller, making everything faster. So this is a little bit different.
But on the other hand, you are right, actually, we are indeed losing a little bit of
the detail, and so as a result we may have some problem. For example, if you
look at Foreman, okay, in the Foreman sequence there are quite a few dots on
the wall, okay. You will notice, if you compare this with that region, that some of
the dots are gone. Some of the dots are gone. Okay. Now, so this is a price
you're going to pay, okay. Yes, you're going to lose a bit of the detail.
But certainly for the big edges and so on, you are not going to lose that. But for
small, fine details, you may lose that, okay. For example, if you look at me,
maybe I've got all these dots because of my beard, whatever, okay; after you do
denoising maybe I look nicer. Now, is that good or bad? I don't know.
Sometimes, you know, if you look at Ugly Betty or something, sometimes the
director wants a person to look ugly, and now you make them pretty; that may
not be so good. Sometimes there may be detail that the director wants to keep
in the content, right? So in that case you want to keep that. If you fail to keep
that, you are not doing the right thing, okay. And so as a result, it depends. But
yes, indeed, you are going to pay a price for that, yes, you're right. Okay?
>>: [inaudible] coding type reduction here, is it tied to the [inaudible].
>> Oscar Au: Two things. Motion estimation and VLC. And VLC, too.
>>: So that means that in this particular experiment, the criterion used in motion
estimation is that you need to reach a certain upper bound for what the
distortion can be between two blocks, right, between the reference frame and
the current frame, and once you achieve that, you stop the motion estimation, so
filtering basically helps you achieve that threshold earlier? Is that the [inaudible]
main gain of the [inaudible].
>> Oscar Au: No. This number may be a little bit misleading, because in your
codec, when you have a codec, you may not be using full search. But here in
this example we are using full search, brute force full search, and so as a result
you can gain so much in the coding time. But if you are already using fast
search, you cannot gain so much.
>>: [inaudible] once you achieve a [inaudible] threshold.
>> Oscar Au: No, full search is full search; full search is everything, yes.
Right. It does not [inaudible] criteria. We basically found out that early
termination can get you into the wrong decision, because of the noise. So that's
why we don't do early termination. Yeah. So it is [inaudible] combination
strategies. Any other questions?
>>: Do you assume that there are no [inaudible] in your original [inaudible]
because you know, when you capture video typically it has [inaudible] nobody
[inaudible].
>> Oscar Au: Oh, I see.
>>: So if the original video which is [inaudible] has compression artifacts, will
your algorithm actually be [inaudible]?
>> Oscar Au: Okay. We did do experiments on that. Okay. We assume the
video is original with noise, okay. If you have coding artifacts, I think to a certain
extent we should be able to suppress them a little bit, okay. But then the
assumption is a little bit different, because the noise is no longer so independent;
the noise is more systematic. And I am not sure, I'm not sure how good it will
be. I think we need to do some experiments on that to try it. But I would guess
that you would be able to suppress it somehow. Right. Right. So that's
[inaudible] okay. So we'll continue.
Okay. So, temporal denoising. Okay. So the method we use is called
multihypothesis motion compensated filtering. We utilize multiple pixels along
the motion trajectory and basically do a linear combination of them to find the
best estimate, and what we are doing is actually an optimal solution: we are
doing a linear minimum mean square error (LMMSE) estimate, okay, so you can
trust that it should work pretty well. Okay. Because it's an optimal solution.
Okay. So let's see -- P1, P2 and P0. So what we are seeing here is: suppose
this is the current frame, okay, and there's a certain noisy region, okay. If you
can manage to find in the previous frames something corresponding and
something corresponding, what we are talking about is basically taking a
weighted average between this one, this one, and this one, okay.
Of course in the weighted average situation, the coefficient is important, the
coefficient is important, right? And so basically we use LMMSE to determine the
coefficient, all right, to find the best coefficient for that particular situation. Okay?
All right. So this page looks, you know, like a lot of equations. Basically
what we are saying is that we want the denoised version to be equal to a
weighted sum plus a DC offset, okay. Here W is the weight and O is the
observation, okay; now O contains FN and then P1, P2. Okay. FN is the
current frame. FN is the current frame. P1 is frame T minus 1. P2 is frame
T minus 2 and so on. So here in this example we are talking about M frames,
okay.
The current pixel and then the previous few pixels. Here we already assume
that we have performed motion estimation: you find the corresponding motion,
you find the corresponding matched location, okay? So what we are saying is
that every one of these guys you're going to weight by a certain number, W. So
this is just a simple weighted average of the past few numbers, okay?
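(Reconstructing the slide's equation from that spoken description -- a hedged
transcription, with the DC offset written as d, which is my notation:)

$$\hat{F}_n \;=\; \mathbf{w}^{\top}\mathbf{O} + d \;=\; w_0\,F_n + \sum_{m=1}^{M} w_m\,P_m + d,$$

where $F_n$ is the current noisy pixel and each $P_m$ is the motion-compensated
observation taken from frame $t-m$ along the motion trajectory.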
Now, mind you, we usually assume that noise is kind of temporally stationary,
meaning that the noise variance tends to be stationary over time. But here we
do not assume that; we allow P1 and P2 individually to have different variances,
different noise variances, because, okay, while we need to take the noise
variance into account, we also need to take into account the motion mismatch,
just like what you said earlier on. Between the current frame and the previous
frame, when the motion estimation cannot do so well, you have a mismatch,
right? That mismatch is not due to the noise; that mismatch is actually due to
the motion, and so on. And so as a result, we are going to take that motion
mismatch into account and put it into the variance of P1 or P2 and so on. All
right?
So there may be a situation in which for P1 you can find a good match, so the
variance of this will be pretty small. Okay? But for P2, maybe motion estimation
doesn't work very well; in that case P2 may have a large variance. Okay? So in
this case, with this formulation, we would smartly allocate more weight to the
guy with the smaller variance, right? So that's what LMMSE does, right, so this
method works quite well, okay?
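(As a minimal sketch of that weight allocation, assuming zero-mean, mutually
independent prediction errors, in which case the LMMSE weights reduce to
inverse-variance weights -- the general solution uses the full covariance matrix,
and the names below are illustrative:)

```python
import numpy as np

def lmmse_weights(variances):
    """Inverse-variance weights for independent, zero-mean errors.

    variances[0]: noise variance of the current frame F_n.
    variances[m]: noise plus motion-mismatch variance of hypothesis P_m.
    """
    inv = 1.0 / np.asarray(variances, dtype=float)
    return inv / inv.sum()  # weights sum to 1; small variance -> big weight

# Example: P1 matches well (variance 20), P2 has a big mismatch (400).
w = lmmse_weights([100.0, 20.0, 400.0])
# w ~= [0.16, 0.80, 0.04]: most weight goes to the well-matched P1.
```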
So what do we want to minimize? We want to minimize the expected squared
error between the denoised version and the original value. Okay? We want to
minimize this one, okay? And the solution is the standard, standard LMMSE
solution, okay: covariance matrix, you take one over that, and so on. Okay?
This is the standard thing. I mean, if you open a textbook you'll find it. So I did
not put a lot of detail here, because this is the standard solution. Okay, and it is
not hard to find, okay, so this is the solution. And basically in here, U is the
so-called prediction error vector; the component corresponding to the current
frame FN is basically the noisy observation minus the original, so actually this is
just the noise, whereas U1, U2, and so on are the temporal prediction errors,
okay? So, yes, so this is the best solution. Okay.
And then the minimum mean square error of the temporal denoising is equal to
this, okay. With this solution, you know, it is well known that you can estimate
the mean square error, and that is equal to this expression, okay? In the special
case that all these guys are independent, then basically this term would be
sigma 1 squared plus sigma 2 squared plus sigma 3 squared, and so on,
something like that. Okay. Yes?
>>: [inaudible] noise in the original?
>> Oscar Au: We need to estimate -- actually, for this method we need to
estimate the noise variance. We need to estimate the noise variance.
>>: The variance of the noise.
>> Oscar Au: Yes, we need the noise variance. Yes. And not only the noise
variance; we also need to estimate the variance due to the motion mismatch.
So basically what we do is we still do block-based motion estimation, okay, and
with that you can calculate the mean square error between the current block
and the predicted block. Basically we use that to estimate the mismatch
variance.
>>: The noise is the pixel noise, right, it's not [inaudible].
>> Oscar Au: Pixel.
>>: Okay. So the noise can vary?
>> Oscar Au: Well, when we do the simulation -- it is white noise, white
Gaussian noise with fixed variance throughout the whole sequence. When we
do the simulation, we do this, right.
>>: That's why I asked that question because in reality that's not true.
>> Oscar Au: Okay. Yes?
>>: Because the -- because of the camera not [inaudible] the amount of noise is
a function of the brightness itself.
>> Oscar Au: So brighter region is less noise or more noise?
>>: I'm sorry?
>> Oscar Au: Brighter region has more noise or less noise?
>>: No, not necessarily. Actually it goes up and down, a little hump. But,
yeah, that's why -- because you make the assumption the noise is constant,
then this is not optimal.
>> Oscar Au: You're right. You're right. Yes. We cannot assume noise is
constant, but in our noise parameter estimation we do assume noise is constant
when we estimate the [inaudible]. The model doesn't require the noise to be
constant. Maybe it does. Yes. Sorry. Yes, yes, right. All right. But good
point, good point, yes. Perhaps we can make it better by dividing the video into
smaller regions and locally estimating something; perhaps that would be a little
bit better.
>>: [inaudible] estimating the noise power in here?
>> Oscar Au: Noise power?
>>: Yes.
>> Oscar Au: Okay. What we do is that we look at the video and we look at --
we assume that the video frame contains some parts which have high frequency
and some parts which have low frequency, which are smooth, right. And
basically when you look at a smooth region, the variance should be almost zero,
and if there is any variance it would be due to noise. So basically we look at
locally small regions. For every region we look at the variance, and we take the
minimum, the smallest three percent or something, and take the average of
those as the estimate of the noise.
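(A minimal sketch of that estimator; the block size and the exact percentile are
assumptions, the idea being that the flattest blocks expose the noise floor:)

```python
import numpy as np

def estimate_noise_variance(frame, block=8, frac=0.03):
    """Estimate noise variance from the smoothest blocks of a frame.

    In a smooth region the signal variance is near zero, so the variance
    measured there is attributed to the noise.
    """
    h, w = frame.shape
    variances = sorted(
        frame[y:y + block, x:x + block].var()
        for y in range(0, h - block + 1, block)
        for x in range(0, w - block + 1, block)
    )
    k = max(1, int(frac * len(variances)))  # smallest ~3 percent of blocks
    return float(np.mean(variances[:k]))
```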
>>: [inaudible] spatial [inaudible].
>> Oscar Au: Yes. We use spatial domain stuff to do it, right. Okay. Okay.
Now, when we do this, we have a choice, okay; we can choose to do this
temporal filtering in a recursive manner or a non-recursive manner. Meaning,
suppose you have denoised all the previous frames and now you're trying to
denoise the current frame, okay. Now, we want to use time T minus 1, T minus
2 and so on, right, to help you. The question is: are we going to use the original
noisy video to help you, or are we going to use the denoised video to help you?
The effect is going to be different, right? If you use the original noisy video, this
is not recursive, right, but if you use the denoised one, then this is recursive,
okay, and we find that if you use the recursive one, the effect actually is much
better.
And so the reason why we average only three or four numbers and get a very
good result has a lot to do with this recursive filter, okay, because the previous
few frames have been denoised so much already that when you compare with
the current frame, you tend to put a big weight on the previous frames, so that
you have a very big denoising effect. Whereas if you consider spatial filtering,
even with this three by three, very small filter, you cannot denoise very much,
you cannot denoise very much, because every one of the pixels has so much
noise. But in the recursive situation, the previous guys have been denoised so
much that actually you can have a very good denoising effect even with three or
four numbers averaged together.
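(To make the recursive point concrete, a toy sketch -- it assumes the frames are
already aligned, i.e. no motion compensation, and a known fixed noise variance;
the real filter works along the motion trajectory with per-block variances:)

```python
import numpy as np

def recursive_temporal_denoise(noisy_frames, sigma2):
    """Two-tap recursive (IIR) temporal filter with inverse-variance weights.

    The running estimate gets cleaner every frame, so its variance shrinks
    and it earns a progressively larger weight -- which is why averaging
    just two numbers per pixel can still denoise strongly.
    """
    est = noisy_frames[0].astype(float)
    est_var = sigma2                      # frame 0: no history, no gain yet
    out = [est.copy()]
    for frame in noisy_frames[1:]:
        w_prev = sigma2 / (sigma2 + est_var)   # weight on the clean history
        est = w_prev * est + (1.0 - w_prev) * frame
        est_var = 1.0 / (1.0 / est_var + 1.0 / sigma2)  # variance shrinks
        out.append(est.copy())
    return out
```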
>>: So when you do recursive, you face the danger of [inaudible].
>> Oscar Au: Yes. Yes.
>>: That was my question on the previous slide: those previous things you
reference, were those the original sources or the denoised sources?
>> Oscar Au: Okay.
>>: Which were they, the denoised --
>> Oscar Au: In my simulation?
>>: In your equation on the [inaudible].
[brief talking over].
>>: Are those denoised frames or originally noisy frames?
>> Oscar Au: [inaudible] the formulation can apply to both.
>>: Which ones do you use?
>> Oscar Au: We use the recursive.
>>: [inaudible] why you get the [inaudible].
>> Oscar Au: Yes.
>>: [inaudible].
>> Oscar Au: So actually we will --
>>: If you look at it, it looks like an IIR [inaudible], essentially it's an IIR filter
[inaudible].
>> Oscar Au: Yes. So you think that we're not doing -- yes, we are doing IIR,
yes, right.
>>: [inaudible] I mean, it's -- it could be three, four numbers off but the [inaudible]
filter can be quite, quite [inaudible] right?
>> Oscar Au: Yes. I did not look very closely into that. Yes. But yes.
>>: [inaudible].
>>: But it's that noise [inaudible] also has mismatch noise in there as [inaudible].
>> Oscar Au: Well, the noise parameter estimation is done spatially, per frame,
so we don't need to worry about the motion mismatch for that. Okay. So
anyway, yeah. So anyway, when the variance goes down, that means the
[inaudible] will go down. Anyhow, this method works quite reasonably well.
Okay. That's about it. Okay.
To give you an idea of the simulation, okay, all right: this is the Mobile
sequence, the Foreman sequence, the News sequence, okay. You can use,
you know, zero frames, which means no denoising, okay. You use one frame,
so that means you average two guys, okay. M equal to 2, 3, okay. All right.
With no denoising, this is the [inaudible] you get. Okay. No denoising. Okay.
All right. Using one guy, you can increase it from 22 to 25 dB, okay. One more,
you can go from 25 to 26, and then 26 to 26 again. Okay. You see that the
major gain comes from one guy, okay. With two guys you gain a little bit more;
three guys, a little bit more. Going further is not going to give you a lot of gain,
okay, so that's why from our simulations we tend to believe that you get the
maximum benefit out of either two or three. I tend to use two myself, right, but it
depends. Yeah. Okay. Complexity, of course: when you go from one to two to
three, complexity is going to increase linearly as well, because of the motion
estimation. Okay.
>>: What was the frame rate?
>> Oscar Au: 30 frames per second. Full rate. Yes. Right. Something like
that. So okay. Some examples, okay. So this is the noisy video, and this is
the denoised video with M equal to 2, so that means T, T minus 1 and T minus
2, that's it. Okay. And you can do quite well actually, all right. And yeah, this is
quite a bit cleaned up, okay, and this part can still be quite good, okay.
When we have good motion correspondence, this method actually works quite
well. This method actually works quite well, okay. All right. So okay. Another
example: this is Flower Garden, okay. The original noisy video, M equal to 2,
and this is what you get, okay. And the sky looks clean and you still see some
textures here, so it's quite reasonable.
You do see, you do see around here there's some noise, okay. Yeah, you do
see some noise here. Okay. Now, locally, when you have motion mismatch,
this is the kind of effect you're going to see with this method, okay. Here the
motion estimation is doing well. Okay. But here, locally, motion estimation
doesn't work too well, so in that case you're going to see locally some residual
noise there. Okay. So basically, around the house rooftop I think you see
some noisy patches here and there. And this is a place where motion doesn't
work so well, and so we don't do so well.
>>: What is [inaudible] there's like a [inaudible].
>> Oscar Au: I think --
>>: [inaudible].
>> Oscar Au: I missed the question.
>>: So the question is why around the pole there you're missing the noise so
badly compared to other regions where you get much better [inaudible].
>> Oscar Au: Why is this so bad? Well, basically any place where the motion
estimation doesn't do so well you're going to have that kind of effect.
>>: [inaudible]. Actually copy on your model in the frame that the average
[inaudible] that frame [inaudible] might have a much higher weight than the frame
so.
>> Oscar Au: Yes.
>>: The danger is that you -- get the wrong block. So getting a block from the
[inaudible], and that's what seems to me to have happened [inaudible]. The
pole in the middle, right.
[brief talking over].
>> Oscar Au: Okay. All right. If you ask particularly why this thing is
happening, I'm not very sure, okay. I'm just pointing out to you that, yes,
wherever the motion doesn't work too well, this would happen. You're asking
about this sequence specifically, why this is happening right here.
>>: I was asking how often [inaudible].
>> Oscar Au: How often does it happen? How often does it happen? Well,
basically, when motion doesn't work so well, that's what would happen. I mean,
how often does motion work well or not well? So that's the same question. It's
actually the same question. Yeah. It's sequence dependent, right, and here we
only have translational motion, right? If you change the model, allow yourself to
have smaller blocks, allow yourself to have rotational motion and stuff like that,
you can do better. But on the other hand, we --
>>: [inaudible].
>>: It happens twice already, right, so the [inaudible].
>> Oscar Au: You're talking about right here, right?
>>: And there's one other [inaudible].
>>: Same pole.
>> Oscar Au: Same pole here?
>>: [inaudible].
>> Oscar Au: Here?
>>: Yeah, yeah.
>>: [inaudible].
>> Oscar Au: Yes.
>>: So if you don't do a good job, you're going to see a lot of [inaudible].
>> Oscar Au: Yes. To solve the problem, to solve the problem is actually not
very difficult, okay. To solve the problem, you need to combine this with a
spatial method. The idea is quite simple. When this method works well, you
use this method, because it is actually better than the spatial method. But on
the other hand, when this method doesn't work well, you need to put a bigger
weight on the spatial method. In other words, you want to do a linear
combination between the temporal denoised version and a spatial denoised
version: take a weighted average between them. When this method works
well, you trust this guy; give this guy a big weight. When this method doesn't
work well, you give the weight to the other guy. Do you know what I'm
saying?
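(A minimal sketch of that fallback blend; using the motion-mismatch variance
against the noise variance as the reliability measure is an assumption, a
stand-in for however the actual implementation sets the weight:)

```python
def blend_temporal_spatial(temporal, spatial, mismatch_var, sigma2):
    """Weighted average of temporal and spatial denoised results.

    Good motion match (small mismatch variance): trust the temporal
    result. Bad match: shift the weight onto the spatial result.
    """
    alpha = sigma2 / (sigma2 + mismatch_var)  # in (0, 1]
    return alpha * temporal + (1.0 - alpha) * spatial
```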
>>: [inaudible] spatial noise [inaudible].
>> Oscar Au: Yes. Later on, later on you're going to see, when we look at the
robust motion estimation, we are actually going to do spatial denoising as well.
We are actually going to do a spatial denoising, and we are going to combine
that with this one. Because anyhow, we do that over there already, the spatial
denoising; if we combine that one with this one, then you're going to get a
better result.
>>: [inaudible] this frame.
>> Oscar Au: Yes.
>>: Is the motion actually [inaudible].
>> Oscar Au: I don't remember -- well, the camera is obviously panning like
that, right? But I think things have become a little bigger over time, over -- I
think.
>>: I assume that they do [inaudible].
>> Oscar Au: Yes.
>>: If you don't do rotation.
>> Oscar Au: No, we don't do rotations. We just do a regular motion estimation.
>>: So [inaudible] wonder if the motions [inaudible].
>> Oscar Au: I don't exactly remember this sequence. I think this sequence
was filmed probably by a car driving along the street, something like that. And
so as a result, there's a scale change here.
>>: So --
>> Oscar Au: You don't have just plain, simple translational motion. Things
are becoming smaller and smaller, okay, and so on. So motion estimation has
some problems here.
>>: [inaudible].
>> Oscar Au: I think we use 16 by 16 just regular.
>>: So for all [inaudible] 16 by 16 block you assume the same [inaudible], right?
So maybe.
>> Oscar Au: Same motion vector.
>>: Same motion vector. So maybe a very good estimate for saying the
[inaudible] and then the edge corresponds [inaudible].
>> Oscar Au: That's possible.
>>: You're forced to use one motion vector for this whole.
>> Oscar Au: Process, yes. Yes.
>>: Because I didn't notice anything that would suggest -- I'm just wondering --
>> Oscar Au: This particular example doesn't show it, but there are examples
in which you see a noisy pattern which is blocky. And this is due to the
block-based motion estimation. Yes.
Okay. All right. So, noise-robust motion estimation. So here we're addressing
that issue now. When you do motion estimation, okay, the presence of noise
causes you a lot of trouble, okay. For some situations, okay, when you have a
noise-free situation, when you have no noise, you may have a pretty nice
residue surface. Right? Error surface. Okay. The presence of noise is going
to cause this to become very fluctuating, okay, noise-like. Okay.
So this thing is noise-like. And unfortunately, many of the fast motion
estimation methods assume some kind of gradient descent algorithm, okay,
and that kind of method will fail pretty badly in this situation, right, when the
error surface goes crazy. Okay? And the early termination can cause
problems, too, okay. Sometimes you can be easily trapped in here; sometimes
good, sometimes bad, all right. You think it's good enough, but actually it may
not be so good after all.
So as a result, we find that in a noisy situation it's just hard to do motion
estimation. The early termination strategy has problems; the greedy local
search thing is going to have problems. And things like that. Okay. And this
will greatly increase the motion estimation complexity and also reduce the
accuracy as well, okay.
And so as a result, what we propose to do is two things, okay. We apply a filter
first, we apply an edge-preserving low pass filter, okay, and then we do fast
noise-robust motion estimation. Without this one, we cannot do a fast
algorithm, and we had to use full search. So that makes it very slow. Full
search is very slow.
After applying this one, then actually we find out that it is now not so bad to do
a fast algorithm. After you apply the low pass filter, then you do it, then you can
do quite a bit better, okay, and after that we just do the regular denoising stuff,
okay.
>>: Why does it [inaudible].
>> Oscar Au: Because the edge is important. Because the edge is important.
>>: [inaudible].
>> Oscar Au: No, no, edge is very important. Edge will determine whether your
motion is going to be correct or not, right?
>>: So [inaudible].
>>: I think what he's trying to say is: assume that the [inaudible] certain filter
[inaudible] video. As long as you're consistent with the way it affects the
[inaudible] is preserved, regardless of whether the [inaudible] edges are sharp.
>>: No, because you could just do low pass filtering and then the resolution
decreases. So then when you do the match, you can't resolve to within that
resolution. So you preserve the high frequency edges, [inaudible].
>>: [inaudible].
[brief talking over].
>> Oscar Au: If you recall, one of the goals we have is that we want to
preserve edge sharpness after we do the denoising, so edge precision is
actually quite important. If you low pass it, then the location of the edge is not
so precise.
>>: [inaudible].
[brief talking over].
>> Oscar Au: What is what? I'm sorry, I missed that.
>>: Do you recall get a [inaudible].
>> Oscar Au: Yes. Yes. I'm going to show you. Well, we've got to speed up
compared to full search. We've got to speed up compared to full search.
Okay. Now, we can go a little bit further down to look at the results, the number
of search locations. This guy, this is the full search method, and this is the
speedup we're talking about. This one gives you about 50 times speedup, 50,
60 times speedup. This one gives you, I don't know, close to 100 times
speedup. Okay. About 100 times speedup. Okay.
Computational complexity: number of search locations, computation. This is
the regular method, full search; this is our method, and the number of additions
is this many, so, I don't know, 20 times fewer? We do need to do a little bit
more multiplication, so we do a little bit more multiplication, but not very much,
okay. So, yeah, we manage to reduce the complexity quite a bit. The
processing speed: okay, we run it on our P4, 3.2 gigahertz, 1 GB RAM,
Windows XP platform. The original method using full search can only do two
frames per second, three frames per second.
By using the fast search we can do 28 frames per second, 20-something per
second, right, so this is quite a bit of speedup. Quite a bit of speedup. Okay.
The motion field becomes smoother. The motion field would become smoother,
and also the quality would become better, too, okay. This is noisy, this is the
original method, this is the fast method. The fast method actually can give you
-- well, okay, in this case, similar [inaudible]. Sometimes you can get better
[inaudible]. Okay.
Believe it or not, we are using the fast method but we can get better [inaudible],
so it's kind of funny.
>>: [inaudible].
>> Oscar Au: I'm going to go back to that, yes. Okay. We like that filter, right.
In this example, this is the sequence shown as noisy video with -- okay, I think
this is the amount of noise, three different situations: not so much noise, bigger
noise, and bigger noise, so that it appears now it's smaller, okay. And M equal
to 1, 2, 3, pretty different situations; an original method, and this is the fast
method. You can see that the fast method has a little bit of gain over the
previous method. But in some cases we can actually gain quite a bit, gain quite
a bit. All right. So it's kind of interesting.
Okay. The most important thing is you don't lose. You don't need to lose the
[inaudible] when you use this method. Okay. Interesting. Okay. All right. I'll
go back and look at what we do. Okay. Okay. All right. Okay. This is the filter
we use -- we look at a three by three window. It's a very small filter. It's not a
very big filter, it's a small filter, okay. We like this filter. We have used it quite
a few times. This used to be our spatial domain low pass filter, our denoising
filter. We used to use this as a denoising filter. We used to use this even for
inverse halftoning.
If you know halftone data, it's very noisy; this method actually can suppress it
quite a bit. Quite interesting. The equation -- my student wrote this equation
very poorly, so it's very hard to see, okay. But basically this is a three by three
neighborhood; consider a three by three neighborhood. There are 9 numbers,
okay. We're going to take a weighted average of the 9 numbers, that's all. But
how do we determine the weights? Okay. Now, especially, we are talking
about edge preserving. Very often when you do edge preserving, you need to
do some kind of edge detection and then do directional filtering along the edge,
right? That's one way to do it. But we don't want to do that, because that is
going to incur a lot of complexity; very troublesome, okay. Or if you do median
filtering, it would preserve edges also, but then you need to do sorting and stuff
like that, which is costly. This method doesn't do sorting; this method is a
simple three by three weighted average thing.
The beauty is really in how you determine the weights. This method is actually
quite interesting. The weights now. Here. Imagine this thing as the 9 numbers
within this neighborhood, a three by three neighborhood. Okay. Each one of
them is weighted by a certain number, okay. Well, those numbers are going to
add up to one, all right, as expected. Okay. Basically this number is
proportional to something complicated, okay, but basically the idea of this
equation is very, very clear, okay.
What you want here is that, suppose you have a target value within a certain
neighborhood. Suppose you have something called a target value, okay, and if
you are close to that value, this guy is trustworthy and you're going to give a
big weight to that guy. If you are very far away from the target value, then we
give a small weight. Okay. In a sense this method allows us to suppress our
[inaudible] basically without doing edge detection. If you can do edge
detection, if you know that you are on this side of the edge, then you forget
about those other guys, right; you only take this side and take the weighted
average, and you will have no problem, you will be able to preserve the edges.
Within the neighborhood, if you know there's an object here, something on this
side, something on that side, and you do the weighted average between these
values and those values, you could get a very bad value, because those
values have nothing to do with you, and if you try to do a weighted average,
those guys [inaudible] can badly affect your results. The basic idea here is: I
want to forget about those guys, I only use the guys that are trustworthy and
just take the weighted average, okay.
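(A small sketch in the spirit of that description: a three by three weighted
average whose weights fall off with distance from a target value. Using the
center pixel as the target and a Gaussian fall-off is an assumption; the exact
weight formula in the thesis may differ:)

```python
import numpy as np

def edge_preserving_filter(img, h=10.0):
    """3x3 range-weighted average (intensity term only).

    Pixels whose value is close to the center ("target") value get a big
    weight; pixels across an edge get almost none, so edges stay sharp
    without explicit edge detection or sorting.
    """
    img = img.astype(float)
    out = img.copy()
    H, W = img.shape
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            window = img[y - 1:y + 2, x - 1:x + 2]
            wts = np.exp(-((window - img[y, x]) ** 2) / (2.0 * h * h))
            out[y, x] = (wts * window).sum() / wts.sum()
    return out
```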
>>: It's basically a bilateral filter.
>> Oscar Au: Bilateral filter?
>>: [inaudible] where you basically look at the spatial extent; in other words,
you try to filter locally and look at not only where the pixel is but how different it
is in intensity, so that you downweight any pixel.
>> Oscar Au: Yes, yes.
>>: [inaudible].
>> Oscar Au: Yes. That's the idea. Yes. So we're doing that. Okay. So this
filter is a little bit more complicated than a regular three by three low pass filter,
okay, because we do need to do multiplications; that weight is important, okay.
But on the other hand, it is not so bad; it is relatively simple compared to a lot
of other methods, okay, so we like this method a lot. Okay. And it works quite
well, okay. So for example, if you look at this example, okay, this is the original
video, this is the noisy video. If you apply our filter,
this is what you are going to get. So as I told you, this method actually is a
denoising filter. It is a denoising filter. But the beauty of it is that you can
preserve the edges. If you look at this edge, this edge is very, very sharp. Okay.
This is what I'm talking about: this filter can preserve sharp edges. The thin
edges it does not preserve, but the sharp edges, yes, it will preserve them.
Okay. All right.
So yes, these are actually pretty sharp, okay. So this is a pretty good edge
preserving filter. So it's kind of nice. Okay. All right. So after applying the
edge preserving filter, okay, now you've got something which is quite good, all
right. Now we are going to use this to perform the motion estimation. We don't
want to use this one, we want to use that one to do motion estimation. But later
on, when we obtain the motion vector, we will still use the noisy one. We will
still use the noisy one to do the denoising. In other words, at this point we are
only using this guy to do the motion estimation, but we will not use this for the
denoising. Okay? We will use the original noisy one to do the denoising.
Because we do understand that this method will actually suppress some of the
detail, it actually will suppress some of the image detail, right, so we cannot
fully trust it, okay.
But for motion estimation, this is good enough; for motion estimation this is
good enough. And not only that, as I mentioned earlier on, in the case where
the temporal denoising doesn't work, we actually have the option to do a
weighted average between that guy and this one. Because this one actually
doesn't look too bad; this actually doesn't look too bad, it's just that you lose
some detail, okay. When temporal denoising doesn't work, it's better than
nothing, all right. So actually taking the weighted average between this one
and the temporal denoising works quite well. Okay. All right.
So here, this is the second stage, the motion estimation. So what we do is that
we consider three motion vector predictors, okay. The first one is a median
predictor, okay, meaning that for the current block, you have the block on the
left, on the top, and the upper left, okay. And they already have motion vectors,
and we treat them as, you know, likely predictors. Since they are so close, it's
quite likely that the motion vector of this guy is similar to the neighbors'. So we
take the median of them, okay, and treat this as the median predictor for the
current block. Okay.
All right. We also look at -- if this is the current frame and this is the previous
frame, you look at the co-located block. The previous frame also has a motion
vector. Given that in the real world a lot of things don't move so fast, the motion
vector of the co-located block is perhaps a good predictor of the current block's
motion vector as well. So that is another predictor, okay.
And then there's also another thing called the temporal predictor, meaning that
-- remember, if this is the current frame, this is T minus 1, T minus 2, T minus 3.
When you do T minus 1, okay, you do whatever motion estimation. But when
you do T minus 2, you already have a motion vector for T minus 1, and it
makes sense for us to just double it and use that as a predictor for T minus 2.
And so this is the temporal predictor we're talking about. Okay? And, yes, so
you scale the past motion vector by the temporal distance as a predictor.
Okay. So these are the three predictors we use. Okay.
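(A minimal sketch of the three predictors; taking the median component-wise is
an assumption, and the neighbor labels are generic:)

```python
import numpy as np

def median_predictor(mv_left, mv_top, mv_diag):
    """Component-wise median of the three causal neighbors' vectors."""
    return np.median(np.array([mv_left, mv_top, mv_diag]), axis=0)

def colocated_predictor(prev_frame_mv):
    """Vector of the co-located block in the previous frame."""
    return np.asarray(prev_frame_mv)

def temporal_predictor(mv_to_t_minus_1, m):
    """Scale the vector already found toward frame t-1 by the temporal
    distance m (e.g. double it when searching frame t-2)."""
    return m * np.asarray(mv_to_t_minus_1)
```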
>>: [inaudible]. In your experience here, how sensitive are those methods to
resolution? You know, if I start with an SD video, well, you know, I mean, this
[inaudible] detail, but [inaudible] increasing, going to the HD kind of resolutions,
clearly the video becomes oversampled, especially in the spatial domain. Is
there some [inaudible] you typically like to incorporate in the way you define the
kernels, you know, depending on the resolution?
>> Oscar Au: Okay. There's no particular [inaudible], but one thing I can
comment is that these are predictors, and predictors work at small resolutions.
Predictors don't work so well in HD, at large resolutions. They don't. Okay.
So -- I do a lot of work on fast motion estimation, and we notice that when you
have small resolution video, very often, just as long as you examine the
predictors, you already [inaudible]; it happens quite often, okay. But in the HD
kind of situation, the predictors are not so accurate, and almost always you
need to rely on some kind of local search to find the optimal location. Yes. So
local search is very important in HD. Not so important at small resolutions.
>>: [inaudible] high priority [inaudible] but the video is [inaudible] can denoise a
lot [inaudible] still doesn't occur [inaudible].
>> Oscar Au: Okay. Sure.
>>: So [inaudible] some constant strength for the filtering.
>> Oscar Au: Okay. We don't adjust it. We also did not do a lot of simulation
on the HD kind of thing. We actually did low resolution, so for HD I cannot say
too much. But basically, no, we don't, okay. And after all, remember, this
edge-preserving low pass filter is used to produce something for motion
estimation only, in this part over here, okay. So as a result, even if you lose a
little bit more detail, it is actually not the end of the world. It's not the end of
the world.
>>: So my question is do you know [inaudible].
>> Oscar Au: Okay. Filtering will stop hurting the video quality?
>>: [inaudible].
>> Oscar Au: Oh, denoising, when is it going to affect -- oh, I see. Okay. I
cannot say, okay; I mean, this is a pretty long story if you want to talk about
that. Coding, how coding interplays with the denoising result, that's --
>>: [inaudible]. Can denoise a lot [inaudible].
>> Oscar Au: How about I draw something on the board, is that okay? Let me
find a pen. When you have noisy video, this is what you're going to get.
Normally, when you have clean video, we expect this, right. But when the bit
rate is high, these are not as high, right? This is what you expect for clean
video. For noisy video, this is what you're going to see. This is what you are
going to see, okay.
Now, with no denoising, with no denoising, there's a certain region where you
can just lightly compress it, and the compression will subdue the noise, okay,
because whether you have clean or noisy, you get the same thing.
Compression will subdue the noise up to a certain bit rate. When the bit rate is
high, the bit rate is high enough to preserve even the noise, and then the noise
causes the video to become bad.
>>: [inaudible].
>> Oscar Au: Yes. Yes. We actually have some paper that addresses this
point. You can calculate this point. You actually can calculate this point, okay,
and that's good, all right. This is very, very good, because what that means is
that if I know my bit rate is going to be lower than this point, I forget about the
noise; I just do compression and I get free denoising from the compression.
Okay? But if I know that my bit rate requirement is higher, then what am I
going to do? Well, if I know this point, I would say that even if I want to be
here, I would not want to give the bit rate to this guy, because more bit rate
means I'm going to get worse performance. I'm going to only go up until here
and that's it. I would never allow my QP to go below a certain point where I'm
going to get worse performance and more bits.
This is just stupid. With denoising, with denoising what would happen is that you
are going to get something like this. With denoising. Okay?
You will not be as good as that, but you will be better than this point. You know
what I'm saying? You will be better than this point because of denoising. Okay.
And that now there's a reason for you to increase the bit rate higher. With no
denoising there's no point to go beyond this point. All right. So that is the
situation.
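To put the board sketch in concrete terms, here is a tiny hedged illustration with
made-up numbers: for noisy input the rate-quality curve peaks at some critical
rate, and a rate control could simply refuse to spend bits beyond it.

    def critical_rate(rates, psnr):
        # Return the last bit rate at which quality was still improving;
        # beyond it the extra bits mostly reproduce the noise.
        for i in range(1, len(psnr)):
            if psnr[i] <= psnr[i - 1]:
                return rates[i - 1]
        return rates[-1]

    # Hypothetical measurements for a noisy sequence (Mbps, dB):
    rates = [0.5, 1.0, 2.0, 4.0, 8.0]
    psnr_noisy = [30.1, 32.0, 33.2, 33.0, 32.5]
    cap = critical_rate(rates, psnr_noisy)   # 2.0 -- never spend more than this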
Okay. So coming back, when we do motion estimation, this is the cost function
we use, okay: J equals -- the first term is basically SAD, just SAD -- plus
lambda times the second term, which is basically the distance between V and
the median predictor. The idea here is that whatever motion vector you find, we
would like it to be close to the median predictor, okay, so that the motion field
becomes smooth, okay.

There is reason to believe that whenever you look at a particular block, okay,
this block is part of a big region, and so this guy should be similar to the motion
vectors in the neighboring region, so it makes sense for the motion field to be
smooth. And this will be especially so for large resolution, HD kind of situations.
For low resolutions it is less true; for HD it will be more true, okay.
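A minimal sketch of that cost, with assumed names (my reconstruction, not the
speaker's code; the L1 distance is an assumption): the lambda term penalizes
motion vectors that stray from the median of the neighbors, which is what
smooths the motion field.

    def median_predictor(mv_left, mv_top, mv_topright):
        # Component-wise median of the three neighboring motion vectors,
        # as in H.264-style motion vector prediction.
        med = lambda a, b, c: sorted((a, b, c))[1]
        return (med(mv_left[0], mv_top[0], mv_topright[0]),
                med(mv_left[1], mv_top[1], mv_topright[1]))

    def cost_J(sad_of_v, v, v_med, lam):
        # J(v) = SAD(v) + lambda * |v - v_median|
        return sad_of_v + lam * (abs(v[0] - v_med[0]) + abs(v[1] - v_med[1]))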
So then basically we do two kinds of search patterns, large diamond search and
small diamond search, in different situations, okay. And, oh, I see, another thing
is that because we are doing -- let me see, if this is frame T, this is T minus 1,
T minus 2, T minus 3 -- what we are saying here is that at a certain point, if the
[inaudible] squared is less than a certain number, all right, then perhaps you
don't need to do anything further, because it is good enough and the motion
search can stop. Otherwise you keep going. Okay. So something like that.
Okay.
All right. So this is the overall strategy, okay. You start with M equal to 1, okay,
and then you search, okay; for M equal to 1 this is what you do, for M equal to 2
you do this, and for M equal to 3 you do this, okay. All right. For M equal to 1,
we do three searches, okay. We do a large diamond search starting from (0, 0),
okay, and after the large diamond search you go to a small diamond search,
okay. All right. The second thing we do is that we also use the median predictor
as a starting point and do a small diamond search. And the third thing is that we
use the co-located block from the previous frame -- the co-located block, yes,
okay -- and from that we start doing a small diamond search, okay. And
basically we compare the three guys, see which one is better, and choose that
as the motion vector, okay. Then for M equal to 2, we just do a large diamond
search from the temporal predictor -- P sub T is the temporal predictor -- and
then you do the small diamond search. And for the other guys you use a small
diamond search from the temporal predictor, something like that. Okay.
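Here is a hedged reconstruction of that schedule as a table-driven sketch; the
LDS/SDS labels and the start-point names are mine, not from the paper.

    def plan_search(M):
        # Which (start point, search patterns) to run for reference frame
        # t - M. "LDS" = large diamond search, "SDS" = small diamond search.
        if M == 1:
            return [("zero",          ["LDS", "SDS"]),  # start from (0, 0)
                    ("median_pred",   ["SDS"]),          # median of neighbors
                    ("colocated",     ["SDS"])]          # co-located block, t-1
        elif M == 2:
            return [("temporal_pred", ["LDS", "SDS"])]
        else:
            return [("temporal_pred", ["SDS"])]

    # The best result over all listed starts becomes the motion vector; if its
    # residual energy is already below a threshold, the search over further
    # reference frames stops early.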
So basically this is what we do in the search. And so once again, this is what we
got, okay. We can get quite a bit faster than the original method. Now, mind
you, that method uses brute-force full search, okay, and once again, because of
the presence of the noise, we found that the usual fast methods don't work, so
we had to use full search for it. And now we find out that, yes, it is possible to do
fast search, okay, and this is the kind of speed we can get.

In order to do this, we need to include an edge preserving low pass filter before
we do the fast search. Then the fast search becomes trustworthy and the result
becomes more reasonable. Complexity-wise, ours is quite a bit smaller, right,
so -- okay. And the speed up, okay, and the motion field becomes smoother, as
expected: because of the predictor term in the cost, it tends to be much
smoother, okay. And, yes, so the results are reasonable, okay? And [inaudible]
can be a little bit higher, can be a little bit higher. Okay.
Okay. Yeah. Okay. So, so far I've talked about two things. Number one, I talked
about the temporal denoising framework using multiple reference frames to do
denoising. The second part is that for the first part we need to do motion
estimation, and we blindly used full search, so in the second part we try to do
robust motion estimation, but fast -- fast and robust motion estimation. So we're
doing two things.
Third thing, okay. Now, so far we have been talking about denoising as a
prefiltering measure, meaning that denoising is one thing and the codec is
another thing; they are separate, okay. But now here we are asking: is there a
possibility that we can integrate denoising into the encoder? That means we do
both things together, okay. Now the encoder is willing to do something for you,
okay. Is it possible to combine them together to do something reasonable? All
right. Can we do that? Okay. And you should bear this in mind for later: can we
do spatial filtering within the encoder? Can we do other things within the
encoder? I'm not very sure, okay. But I tend to think that ours is perhaps the
only method that you can integrate seamlessly into an encoder. It turns out that
we are doing exactly the original filter. The temporal filter we are not changing;
we don't modify it, we do exactly the same thing. And we can integrate it into
the encoder such that if the encoder complexity is this much now, with the
denoising the complexity is only this much. You just increase a tiny bit and you
have the advantage of denoising. Okay. All right. So we're going to show you
how we can do this.
This is regular motion estimation, well, hybrid coding, okay -- MPEG-1,
MPEG-2, everybody does this. Input video: you do motion estimation here, you
subtract the prediction to get the residue, and then you do DCT and quantization
and stuff like that, and then here you locally decode that and add it back to the
predictor to get the reconstructed video, and basically this whole thing is the
decoder, okay, and so on. Okay.
All right. Now, you see that we left quite a bit of space here, because we are
going to add something there, okay. When we say we integrate something,
what we are saying is that we want to keep this whole thing, except that we
modify it a little bit such that we achieve the effect of the denoising while doing
the whole compression at the same time. We are doing denoising within the
whole encoding loop. Okay? All right.
Okay. So we have two things. We integrate the denoising into the video
encoder, and we also can integrate the denoising into the decoder. Okay. So we
have a video encoder with integrated denoising, and a video decoder with
integrated denoising.
We claim that this is equivalent to the cascade scheme, meaning that when we
do integrated denoising and encoding, we are effectively doing prefiltering
followed by encoding -- we are getting the same result. For the decoder side, it
is effectively decoding plus denoising afterwards using our method. But there's
a constraint. And the constraint is that, remember, in the past there's a
parameter M: you can use the current frame, you can use one frame, two
frames, three frames, and I recommended that perhaps two frames is good,
right. Using this method you can only use one frame. You cannot use two
frames, you cannot use three frames; you can only use one frame with this
method, okay.

So that means you are not going to get the maximum benefit.
>>: [inaudible].
>> Oscar Au: Huh?
>>: Using a cascaded method or using an integrated?
>>: Integrated.
>> Oscar Au: For the integrated method, okay, you can do the equivalent of the
cascade method, but with M equal to 1 only. We cannot take M equal to 2 and
integrate it; we cannot do that.
>>: [inaudible].
>> Oscar Au: Yes. P frame. Uh-huh.
>>: [inaudible] frames.
>> Oscar Au: Yes. You are right. Yes. Yes, you are right. Yes. Excellent. You
are exactly right, actually. Yes. For the case of P frames, we can only do M
equal to 1. If you allow yourself to have two reference frames, then we can do M
equal to 2, yes. Exactly -- it depends on how many reference frames you use
for the motion estimation in your codec. You are right.
Okay. Well, the denoised video, basically, if you look at it, okay -- if you recall
the previous formulation -- the denoised video is equal to a weighted version of
the noisy video plus a weighted version of the prediction, plus some D, right.
This is the original formulation, okay: each term multiplied by a certain weight,
and the weights obtained by an LMMSE kind of thing, okay.
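As a rough per-block sketch of that formulation -- denoised = w0 * noisy +
w1 * prediction + D, with w0 + w1 = 1 -- here is one way it could look; the
weight rule below is my LMMSE-flavored assumption, not the paper's exact
estimator:

    import numpy as np

    def temporal_filter_block(noisy, pred, noise_var):
        # noisy, pred: same-sized arrays for one block.
        resid = noisy.astype(float) - pred
        resid_var = resid.var()
        # Trust the prediction more when the residue is mostly noise,
        # i.e. when resid_var is close to noise_var (LMMSE-style shrinkage).
        w1 = min(1.0, noise_var / max(resid_var, 1e-9))
        w0 = 1.0 - w1
        d = w1 * resid.mean()   # DC offset term (assumed mean correction)
        return w0 * noisy + w1 * pred + d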
Well, this is what we're going to do, okay. We are going to add something here:
whatever the residue is -- this guy minus the prediction gives you the residue --
for the residue we are going to multiply by a certain weight and then add a
certain DC to it. Oh, yes, yes, that's right. One other thing is that we are going
to restrict ourselves to have only one weight for the whole block. In the past,
every single pixel could have a different weight; here the whole block needs to
use the same weight. I'm sorry, I forgot to mention this.
So it's not fully [inaudible], it's not fully [inaudible], but quite close. Okay. Now,
mind you, okay, with this modification it is very, very simple. You know that
motion estimation is very, very complicated; here, after you do your motion
compensation, this is the residue, you multiply it by simply a number and then
you just add a DC to it. So it's very easy. Okay. Actually, my student did not do
it right, okay: we should have put this after the DCT. Because the DCT is linear,
it's exactly the same whether you multiply before or after, right. And for the DC,
it turns out that when you add a DC, you don't need to add it to all 16 numbers;
you only need to add it to the DC coefficient. So this is a little bit faster. So we
could have put this thing right after the DCT and it would be faster -- exactly
equivalent, but just faster, okay.
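A hedged sketch of that cheaper placement (4x4 blocks assumed; the scaling of
the DC term depends on the DCT normalization, so treat the factor below as
illustrative):

    import numpy as np
    from scipy.fftpack import dct

    def dct2(block):
        # 2-D type-II DCT via two orthonormal 1-D passes.
        return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

    def denoise_residue_in_dct(residue_block, w0, D):
        n = residue_block.shape[0]                 # e.g. 4 for a 4x4 block
        coeffs = dct2(residue_block.astype(float))
        coeffs *= w0                               # one multiply per coefficient
        coeffs[0, 0] += D * n                      # add the DC offset once; the
                                                   # n factor comes from 'ortho'
        return coeffs                              # then quantize as usual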
And then for the decoder there is something similar, okay. All right. Here is how
we derive the method, okay. Now, the residue: let's imagine now we are trying
to do this -- denoising followed by encoding. Denoising followed by encoding,
okay. In that case, let's imagine what the recipe is. When I do motion
estimation, okay, I have the current frame, I do a search in the previous frame, I
find the best guy, and I subtract them to get the residue, right? So the residue is
equal to the denoised guy minus the prediction, because the current frame has
been denoised already, right; I'm going to encode it. Well, the denoised current
frame is basically equal to a weighted version of the current noisy video plus a
weighted version of the prediction plus some DC, right. This is just the current
frame denoised; this is how we write it, okay.
And it turns out that, well, you see P here and you see P here; you can combine
them together. W0 and W1 add together to equal 1. So this term becomes W1
minus 1 times P, and W1 minus 1 is actually minus (1 minus W1), and 1 minus
W1 is actually W0. So this guy simplifies into W0 FN minus W0 P plus D. But
FN minus P -- what is FN minus P? It is the original noisy video minus the
prediction. In other words, if I just encoded the original noisy video, what would
I get? I would be using my noisy video minus the prediction to obtain the
residue. In other words, doing this cascade -- denoising followed by encoding
-- is effectively just like encoding the regular noisy video, except that you
multiply the original residue by a number and then you add a DC to it.

What I'm trying to say here is that if you do the cascade -- denoising followed by
encoding -- this is exactly equivalent to forgetting about the denoising and just
doing the encoding, but in the encoding loop you multiply the residue by a
constant W and then add a D to it. The two things are equivalent. Yeah?
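Written out, the algebra on the slide amounts to this (notation reconstructed
from the talk: F_N is the noisy frame, P the prediction, and the denoised frame
is w0 F_N + w1 P + D with w0 + w1 = 1):

    \begin{aligned}
    r &= \hat{F} - P \\
      &= w_0 F_N + w_1 P + D - P \\
      &= w_0 F_N + (w_1 - 1)\,P + D \\
      &= w_0 F_N - w_0 P + D && (w_1 - 1 = -w_0) \\
      &= w_0\,(F_N - P) + D ,
    \end{aligned}

so the denoised residue is just w0 times the ordinary noisy residue, plus the DC
term D.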
>>: [inaudible] for multiple frames [inaudible], and now notice when you combine
the [inaudible], if you have two reference frames, right, you have W1 and W2.
These two numbers may not be the same. And the residual -- I mean B is
calculated in some kind of fixed format [inaudible].
>> Oscar Au: That's true.
>>: Does not work, right?
>> Oscar Au: Good point. Good point. Yes. So it will not be equivalent. It will
still work, but it will not be equivalent to the M equal to 2 temporal denoising.
You're right.
>>: [inaudible].
>> Oscar Au: Yes. Yes, yes, you're right. You're constraining that, correct.
Good point.
It will still work to a certain extent. It would still work to a certain extent. Okay.
>>: The prediction now will be a weighted prediction of two frames, and you
subtract it from the original, so it becomes -- I think the complexity will be
[inaudible] probably extend [inaudible].
>> Oscar Au: Given the fact that -- well, I think you will use a B frame instead of
a P frame when B works better than P, right? In other words, when the B
prediction is better than the P prediction, and I think this will only be true when
you have good motion estimation for both of those frames. In other words, I
think in that case, most likely, when you use denoising, their weights will
naturally be very similar. In other words, we will probably be very close to the
optimal situation for M equal to 2. This will only be true when the B frame works
-- when B blocks work better than P blocks, right. And this will only be true
when you actually have good motion estimation correspondence; otherwise you
wouldn't do it.
So in a way, this method should still work quite well. Okay. But you are right,
yes, I don't think it is equivalent. It's not equivalent, but quite close.

So basically, with these few simple lines, this establishes the correspondence
between the cascade version -- pure prefiltering followed by encoding -- and
the integrated version. They are equivalent in this case, all right. So that's
good. And the beauty of this is that it is very, very simple. After you calculate
the residue and then you do the DCT, you just multiply it by a simple scalar and
then add a DC to it. It's very, very simple -- super simple -- and you can get the
benefit of denoising.
>>: That means this might actually be a [inaudible] only need to modify the DC
coefficient.
>> Oscar Au: This is standard for quantization. This is standard for quantization.
>>: [inaudible].
>> Oscar Au: Oh, I see.
>>: [inaudible] conditions [inaudible].
[brief talking over].
>> Oscar Au: Different blocks.
>>: For each block you only need to [inaudible].
>> Oscar Au: Actually true. True. Yes, right. Yes, I agree. I agree.
>>: It will be just adding more conditions of the [inaudible] for these blocks based
on the calculations.
>> Oscar Au: Yes.
>>: [inaudible].
>> Oscar Au: That's right. True. But once again, remember, doing this
denoising you may lose a little bit of the detail in the video, right? We mentioned
that earlier on. So you may want to think twice before you do it, okay. You're
going to gain something and you're going to lose something. What you gain is
the [inaudible] model, the [inaudible] model, okay; you're going to have
denoising with that.
What you lose is perhaps some small image detail that you want to preserve.
So we do need to be wise about it. But one thing to remember: if you throw
away this guy -- if you set W0 equal to 1 -- you get back the original. That
means you can perfectly preserve all the original detail you want. Using this
method you get a W0, and if you really want, you can use a brute force method
and force yourself to a value between W0 and 1, and you can have a continuous
tuning of whatever you want. It is possible. Whether you want to do it is
something else, but it is possible.
>>: [inaudible].
>> Oscar Au: That's about it, I think. Okay. All that we will claim is that, hey, the
complexity is not very much, okay, the performance is good, and the quality is
good, too, okay. Actually, not only that, okay: this picture is interesting. I'm
almost done -- what I'm saying is almost done, okay. This one says that if you
look at this one, this is the original cascade method, this one actually is the
integrated method, and you see that this one looks better than that. And you
say, how can this be possible? We just said this is integrated; why can this be
better? Okay. We have a reason for this, okay. And the reason is that in the
original cascade version, what we are doing is that for the current frame you
have the previous frames, and all the previous frames have been denoised,
right; they are quite clean, and so when you do the weighted average they are
good and I tend to give them more weight, right. So I get a certain effect.
Well, in this scheme, with the current frame, the previous frame has been
denoised and encoded -- notice that it has been encoded as well. And encoding
actually gives you some denoising effect as well. And so as a result the
references are actually even more denoised than in the original cascade
method. So that's why, putting it into the integrated version, okay, you have the
added benefit of being even cleaner than before. So you have the potential to
get something a little bit better -- but probably losing a little bit of detail, too. I
don't know, right?

But anyhow, when you do coding, you are going to lose some detail anyway; it
is lossy coding after all, right. So we don't need to beat ourselves to death. So
actually sometimes it works quite reasonably well.
So anyway, for our method, okay, motion estimation-wise we don't do any extra
motion estimation at all, but for the cascaded method you still need to do motion
estimation. So zero complexity for motion estimation. For the filtering, it is a
little bit simpler than the cascade method, and so the total complexity for us is
very small. I think you can easily see that it is very, very simple. Okay.
Now, when it comes to the decoder, it is also possible to integrate this into the
decoder. The application of this is that nowadays we have all these DVD
players, okay. The video has already been encoded by Hollywood, and I cannot
change the video, okay. But it turns out that maybe their video is a little bit noisy,
okay -- especially in dark regions, the grain kind of thing; very dark regions may
have a bit more noise, okay. And in that case, it is possible to integrate the
denoising into the decoder also. It is possible to do that, okay.
Basically, one thing we need to face is that the decoder [inaudible] needs to be
consistent. You cannot modify the decoder loop just like that, okay, because
otherwise you would have drifting errors, okay. And so as a result we leave the
original decoder loop intact, but basically we produce a denoised version, and
for the display we don't display the original decoded thing; we apply this little
step and display only the denoised output, okay. And basically we need to add
a step similar to before: we need to multiply by W0 and then add a D to it, okay.
And basically this is still within the same loop, but we need to attach this next to
this guy such that we can generate something which has the denoising effect
already. So it is possible.
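A minimal sketch of that split (names and shapes are my assumptions): the
reference buffer is fed by the bit-exact reconstruction, while only the display
path gets the scale-and-offset denoising.

    def decode_block(pred, residue, w0, D):
        # Standard reconstruction, kept intact to avoid drift: this is what
        # goes back into the reference frame buffer.
        recon = pred + residue
        # Denoised version, computed alongside and used for display only.
        display = pred + w0 * residue + D
        return recon, display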
But you can see that we don't have as much advantage as before, because this
is a bit more complicated than the encoder version, okay. So for the decoder,
okay, the equations are similar. You can show that it is equivalent to M equal
to 1 temporal denoising, the cascade version, okay. And let me see -- our
complexity. Okay. Remember, previously our complexity advantage was quite
a bit higher than the cascade version; now the gain is not so big. It is a little bit
smaller, okay.
But then we actually can get these denoising effects, so this is something to
consider. It is less attractive than the encoder case, actually, but still, for
completeness, we also derived the integrated denoising for the decoder. So it is
possible to do that. Okay. So with that, I think I will end the talk. Right. So
maybe we can look at a demo again. Okay.
>>: [inaudible] other content to show [inaudible]?
>> Oscar Au: Huh?
>>: Do you have a different video to show?
>> Oscar Au: Sure.
>>: Than the one with a little more motion in it?
>> Oscar Au: Sure. Yeah. I have Foreman.
>>: The Foreman. Okay.
>> Oscar Au: This is noisy Foreman and this is denoised Foreman.
>>: And [inaudible] mention that Professor Au will be here with us today. If any
one of you wants to talk to him afterwards, please let me know and I will try to
see if I can squeeze [inaudible].
>>: The next one is [inaudible], right?
>>: [inaudible].
>>: [inaudible].
>> Oscar Au: Blocky noise. Blocky noise. Okay. And you see blocky noise
here. So this happens when the motion estimation doesn't work so well, right,
and so locally this part is cleaned up and this part is not so cleaned up. So this
method by itself has some problems and needs to be combined with, I think,
spatial methods to make it better. Something like that.
>>: [inaudible].
>> Oscar Au: Say that again.
>>: Did you do any experiments with [inaudible]. I mean this video was added
on --
>> Oscar Au: Oh, yeah, yeah, yeah. Okay. Good point. Yes. We tried to apply
this to denoise some TV signal. It turns out our TV reception is very poor and
the TV is very noisy. We tried to use our method to denoise it, okay. We found
that we can only denoise it to a certain extent. We cannot completely --
>>: [inaudible] over.
>> Oscar Au: Analog TV. Analog TV is very noisy. You know, you've seen
those; they're noisy, okay. Yeah. We tried to use this method to denoise that
thing and we had some effect, but we cannot get something as clean as this. It
is super noisy, and it becomes less noisy, but --
>>: [inaudible].
>>: [inaudible] with that really noisy [inaudible].
>> Jin Li: Due to time considerations, let's thank Professor Au for his
[inaudible].
>> Oscar Au: Thank you.
[applause]