
>> Zhengyou Zhang: Good morning. It's my great pleasure to welcome a
good friend, Professor Wen Gao, and a new friend [indiscernible] to
MSR. So we have two talks, and we'll start with Wen's talk. He's a
professor at Peking University, a member of the Chinese Academy of
Engineering, and very recently a vice president of NSF China. Not NSF
U.S., NSF China; he's in China. He was president of the Institute of
Computing Technology, Chinese Academy of Sciences, and also vice
president of the graduate school of the Chinese Academy of Sciences.
It's a long title, so I won't take too long. He has been doing a lot
of work in video coding, and he will talk about some new research
topics and also some standardization efforts. Wen, please.
>> Wen Gao: Thank you. Thanks, Zhengyou, for inviting me to be here.
So today I want to share with you some things about video coding and a
new direction, or maybe not a new direction, an older direction with
some new results: model-based coding. And there's a showcase: last
year, in June, we published an IEEE standard which uses this new
model. So I will talk a little bit about the scope and the key results
of that standard, and then give a summary.
So video coding is basically about trying to remove the redundancy
from the video signal. We can think about three kinds of redundancy we
can remove: one is temporal redundancy, another is spatial redundancy,
and the other one is what we call coding redundancy. So a standard
system just mixes three kinds of technology to make the system.
Because it uses multiple technologies, the system is normally called a
hybrid video coding system. Currently the system basically works like
this: the video comes in, and the first part uses a transform
technique to remove the spatial redundancy. Then there's a loop that
tries to remove the motion-related redundancy; we call that temporal
redundancy. And the last one is entropy coding, which tries to remove
the coding redundancy.
So this chart shows the first generation, second generation, and third
generation of video coding standards. Whichever standard I start with,
H.26x or MPEG-x or a new standard, the whole chart looks about the
same. The top part is the transform coding, and the bottom part is the
prediction coding, which tries to remove the temporal redundancy, and
entropy coding is used to code the sequences more efficiently.
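A minimal sketch of this three-stage hybrid pipeline in Python,
assuming grayscale frames as NumPy arrays; the zero-motion predictor
and the whole-frame DCT are illustrative simplifications, not the
actual design of any standard:

    import numpy as np
    from scipy.fft import dctn, idctn

    def encode_frame(frame, reference, q_step=16):
        """Toy hybrid coding of one frame against a reference frame."""
        # 1. Prediction loop: remove temporal redundancy (a trivial
        #    zero-motion predictor; real coders search motion vectors).
        residual = frame.astype(np.float64) - reference.astype(np.float64)
        # 2. Transform + quantization: remove spatial redundancy.
        coeffs = dctn(residual, norm="ortho")
        q_coeffs = np.round(coeffs / q_step).astype(np.int32)
        # 3. Entropy coding: remove statistical (coding) redundancy.
        #    Here we only estimate the bit cost from symbol frequencies.
        _, counts = np.unique(q_coeffs, return_counts=True)
        probs = counts / counts.sum()
        bits = -(counts * np.log2(probs)).sum()
        return q_coeffs, bits

    def decode_frame(q_coeffs, reference, q_step=16):
        """Invert quantization and transform, then add the prediction."""
        residual = idctn(q_coeffs * q_step, norm="ortho")
        return np.clip(reference + residual, 0, 255)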
So this is the state of the art for today's coding technology. Video
coding has actually made a big impact on the TV industry, because the
TV industry started in the early days with standard-definition TV,
then moved to HDTV two or three years ago, and from this year to ultra
HD TV. And I think video coding technology is the basis for this
progress.
On the technology side, coding efficiency started from about a 50-to-1
compression ratio. Right now it's around 225 to 1, sometimes we say
around 300 to 1. Then the next generation should be something like 450
to 1.
But if we think about what's the upper bound for this technology,
then, as I will explain later, the upper bound should be very high.
Today's state of the art only reaches quite a low result, so we still
have quite a big room to do some new research. To understand the upper
bound, we can think of a 2D matrix transform. For example, take as the
input matrix one frame of a full HD sequence.
The resolution is 1920 by 1080, so we can simply pad in some zeros to
make a square matrix. Then if we can find a transform that makes the
output Y have only one non-zero element, with all the others zero,
it's very clear that with this transform we can compress very
efficiently.
Such a transform actually exists; one of them is the SVD. Then if we
think about the sequence in three dimensions, the upper bound should
be much, much higher than that, because normally when we do video
coding we can use at least 15 frames for one group of pictures,
sometimes 60. And if we think about much longer sequences, then the
upper bound on compression is much higher still.
So this is the idea: in three dimensions, with longer video sequences,
we can get a much higher bound with a 3D matrix transform.
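A minimal sketch of this upper-bound argument on a single frame, using
NumPy's SVD; the synthetic low-rank "frame" is an assumption standing
in for real video content:

    import numpy as np

    rng = np.random.default_rng(0)
    # A smooth synthetic 1080x1920 "frame": low-rank content plus noise.
    x = np.outer(np.linspace(0, 255, 1080), np.linspace(1, 2, 1920)) \
        + rng.normal(0, 1, (1080, 1920))

    # The SVD is exactly a transform U^T X V that concentrates the
    # energy on the diagonal (the singular values).
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(energy, 0.999)) + 1
    print(f"{k} of {len(s)} singular values carry 99.9% of the energy")
    # Keeping k singular triplets costs about k*(1080+1920+1) numbers,
    # versus 1080*1920 for the raw frame.

For a smooth frame, k is tiny, which is why the theoretical bound is
so high; extending the same argument to the 3D video cube, many frames
per group of pictures, raises it further.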
So the roadmap looks like this. The start is in the year 1994; this is
the first generation. Then this is the second generation, which
finished in 2003; today, I think, most systems use MPEG-4 or H.264.
For the next generation, this year a new standard was finished, called
HEVC or H.265, which, at the HD resolution level, gives around a
300-to-1 compression ratio.
Then what's next? If we look at the roadmap, the next generation of
video coding should reach a compression ratio of 600 to 1 at HD
resolution. Of course, maybe ten years later HD resolution is too low,
maybe it's UHD, so the compression ratio will be a little bit higher
than 600 to 1.
So here is the standard-definition TV compression ratio. Even when the
standard does not change, the compression ratio gets a little bit
better over time. For example, when the MPEG-2 era started, the
bitrate for one SD channel was about five megabits per second per
channel.
But today, with similar technology, we can use about 1.5 to two
megabits to compress the same quality of video.
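A quick back-of-the-envelope check of these numbers, assuming
PAL-resolution SD video in 4:2:0 format (1.5 bytes per pixel):

    # Raw versus coded bitrate for one SD channel.
    width, height, fps = 720, 576, 25
    raw_bps = width * height * 1.5 * 8 * fps   # ~124 Mbit/s uncompressed

    for coded_mbps in (5.0, 2.0, 1.5):         # MPEG-2 era versus today
        ratio = raw_bps / (coded_mbps * 1e6)
        print(f"{coded_mbps} Mbit/s -> about {ratio:.0f}:1 compression")
    # 5 Mbit/s -> about 25:1; 1.5 Mbit/s -> about 83:1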
So let's look at the progress made by optimization of the encoder.
This is not a contribution from the standard itself; it's the
contribution of encoder optimization technology.
People try to get higher compression within a standard by using
certain technologies. One technology is to use bigger blocks of data.
For example, in the early days, only two block sizes were used for the
transform or prediction. But over time, for example in the third
generation, a lot of different block types are used for higher coding
efficiency: 64 by 64, down to 32 by 32, down to 16 by 16, and then
down to eight by eight, with a lot of different combinations.
So you can find the right block size to use; that's for the transform.
For prediction, for motion estimation and so on, we can use very fine
prediction in different directions.
In the early days there were only four directions, so we could only
match the content this much or not. In the second generation, the
directions went from four to nine, and in the third generation the
directions actually go to much, much more: 33, maybe 32 or 36, and so
on.
With a lot of directions, you can match the content better, which can
save bits. So that's the transform and prediction overall. For the
data structure, we can use something like this.
There's a big block and smaller blocks, down to very, very small
blocks, and also different combinations. So this is the basic idea.
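A minimal sketch of this recursive block partitioning (64x64 down to
8x8) in Python; the variance threshold is an illustrative stand-in for
the rate-distortion search a real encoder performs:

    import numpy as np

    def partition(block, x=0, y=0, min_size=8, thresh=100.0):
        """Return (x, y, size) leaf blocks for one square region."""
        size = block.shape[0]
        if size <= min_size or np.var(block) < thresh:
            return [(x, y, size)]      # flat enough: code as one block
        h = size // 2                  # otherwise split into four
        leaves = []
        for dy in (0, h):
            for dx in (0, h):
                leaves += partition(block[dy:dy + h, dx:dx + h],
                                    x + dx, y + dy, min_size, thresh)
        return leaves

    ctu = np.random.default_rng(1).integers(0, 256, (64, 64)).astype(float)
    print(len(partition(ctu)), "leaf blocks chosen for this 64x64 unit")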
This is what the first, second, and third generation coding technology
has used. But if we look at the details of the technology, something
must be missing.
There is one thing we see if we look one step back: today's video
coding standards are mostly targeted at the TV industry, TV
broadcasting and so on, and that industry has a lot of constraints. So
this technology has simply been reused for Internet video, and that's
not good, because the TV industry has a lot of special requirements
that the technology was designed to match.
For example, they count the clock very accurately. If you delay even
one clock tick, the system will crash. That is why the frame rate is
very important: you can't change the frame rate during the encoding
process or the decoding process.
If you want to change the rate, you need to restart. That's why using
this technology for Internet video is not efficient. Of course, our
Internet video coding algorithms have made a lot of changes, but those
are not standard.
They are local, private features. Another missing piece is video
surveillance: surveillance video is actually very unique.
Using this technology for that is also not efficient. That's why we
try to look at different applications and use different combinations
of technology to make much better systems for each special
application.
Of course, it is easy to think that maybe we can use some vision-based
coding, which should be much more efficient.
In the last 50 years, people working in that direction have spent a
lot of time and energy, but the results were not good: no video coding
standard actually ended up using this kind of technology. But a lot of
the research results are already there, so we can use those ideas for
applications like Internet video coding and surveillance video coding.
There's another factor, normally mentioned as perceptual coding. In
any coded video there is distortion.
Given the same distortion, which solution gives the best quality?
People can argue that maybe we can measure the distortion by PSNR, but
the result is not good; maybe we can measure it by some other metrics.
Eventually people realized that because all compressed video is, after
decoding, looked at by human eyes, the human reaction to quality is
more important. This direction is perceptual coding, and a lot of
research has been done there as well.
Part of those results have already been integrated into today's coding
systems. So that's the background on video coding. Then why
model-based? I have explained why it's not good to reuse the
TV-oriented coding standards for Internet video and for surveillance
video, so of course we can think in a different way, for surveillance
video, for example.
If we look at what research has been done on model-based coding, a lot
of good work exists. For example, people have worked on geometric
partitioning for coding, using triangles or mesh networks to describe
the surface, and so on.
I think that work may be mentioned later for some purposes, not for
coding but for image processing; this is one direction people have
gone down. The second one, which is very natural from a computer
vision point of view, is to use good segmentation and use that
information to help the video coding; that is also a good idea.
Of course, a lot of research has been done, but the question is: do
you have a stable segmentation algorithm for coding which doesn't need
people in the loop? Then the third one is object-based coding; I think
this direction is easy to understand. Then there is knowledge-based
coding. By knowledge-based, we mean analysis and synthesis. Basically,
research in this direction has focused on faces and talking heads.
People try to model the face: the expression of a talking face, the
face motion, and use that to save bits. Related to this, and a little
bit higher than knowledge-based, is semantic-based coding, which does
not just describe the surface but also the movement; for the movement
it can use semantic-based technology. And the sixth one is what I
mentioned about the human visual system, which leads to the perceptual
coding idea.
Then the last one is actually quite a new direction: learning-based
coding. Use the data on the Internet, in the cloud; use more data, get
more knowledge, and use that knowledge to support a good coding
algorithm. This direction has been seen as very important because of
big data: if you look at the breakdown of big data, over 80 percent of
the data is image and video.
And especially if you look at the video part, surveillance video is
the majority of the big data; maybe 40 or 50 percent of the data is
surveillance video. So for this kind of data, if we can find a good
compression algorithm, then we can save on storage and save on
transmission cost.
Of course, surveillance video also has a special requirement for what
kind of quality you need to preserve for later pattern recognition.
You will find that most systems today compress surveillance video at
quite a high ratio, much higher than the coding standards used in the
TV industry.
The only reason is that they try to save the cost of storage. But this
is not good, because after you compress, if you want to find
something, find some particular person, find some object in the
surveillance video, you'll find it's very hard to do, because after
compression at a high ratio you lose a lot of features, and then you
cannot figure out the correct result.
Resolution is another issue, of course. For face recognition, if the
resolution is too low, then it is not easy to find the person, the
object.
Of course, for better pattern recognition you need very high
resolution, much higher than today's systems have been using.
Another point is that for surveillance video, compression is normally
not the only target. The target is to compress first and then later do
analysis: what is the behavior, whose behavior is it, who is that
person.
But today's video coding only aims at high resolution and high
compression ratio, while the video content analysis job is done
separately. The two systems are not considered together; they run in
parallel all the way, and that's not good. The reason is that the
coding standard only thinks about how to code the video sequence
efficiently, not how to analyze the video efficiently.
So we try to make a combination, make a standard which takes care of
both steps. If we can do that, there is a very direct line of
thinking: maybe we can get a much higher compression ratio by using a
model, a background/foreground model, which should give higher coding
efficiency. And then later on, since we have the background and the
foreground, and the foreground is related to the objects, the analysis
task becomes much easier.
That's the one thing we tried to put together.
As I mentioned, the quality and resolution of the image are related to
the pattern recognition accuracy. For example, here is the face
recognition rate plotted against how hard you compress the data. The
parameter we normally use in video coding is the QP, which describes
how hard you have compressed the data; a big QP means a much higher
compression ratio. If you look at the plot, you can see that at about
QP equal to 10 you can get a recognition rate of 90 percent and up.
But as the QP gets bigger, the recognition rate drops very fast.
That's why we should understand that if you want to make the analysis
more accurate, you cannot compress everything too hard. The idea is:
maybe you can compress the object much less and compress the
background much harder, and everything else is okay. That's the idea
for how we can compress surveillance video.
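A minimal sketch of this region-of-interest idea: assign a low QP
(light compression) to blocks touching the foreground and a high QP to
background blocks. The 16x16 grid and the QP values 22/38 are
illustrative choices, not values from any standard:

    import numpy as np

    def qp_map(fg_mask, block=16, qp_fg=22, qp_bg=38):
        """Per-block QP grid from a boolean foreground mask (H x W)."""
        h, w = fg_mask.shape
        grid = np.full((h // block, w // block), qp_bg, dtype=int)
        for by in range(grid.shape[0]):
            for bx in range(grid.shape[1]):
                tile = fg_mask[by * block:(by + 1) * block,
                               bx * block:(bx + 1) * block]
                if tile.any():         # any foreground pixel: protect it
                    grid[by, bx] = qp_fg
        return grid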
Okay, so that's the idea. Then what's the showcase we have done? The
IEEE Standards Association: IEEE 1857 is a working group we
initialized, actually, let me see, in the year 2011. Then the working
group officially started work last year, on February 15. I am the
chair, and there is a vice chair and a secretary. After one year of
work, the standard was finished. So these are all the processes we
went through to make this standard.
This is also the process, and this is the publication, the
documentation for this standard.
Basically, a lot of authors were involved in this standard. The
standard covers different things: basically what we call the main
group or main profile, a basic profile, and profiles for special
applications. For example, there's an enhanced profile which is
special for HD movie applications.
There is a mobile profile for mobile applications, and there's a
surveillance profile, which actually has two layers: one is the
baseline layer, and the other we call the S2 layer, a higher layer.
This layer is more efficient for surveillance video coding.
In this standard, IEEE 1857, the video surveillance profile defines
the parameters for how you can code the surveillance video. Within
that, some parameters are very important. For example, there's a
region ID and a region number: how many regions you want to mark as
regions of interest.
If there are two regions, then the number is two. Then for each of the
regions, you can describe where it starts, where it ends, and which
parameters have been used. Later on you can use that to do object
behavior analysis or tracking and so on.
The standard can also support camera parameters, such as the camera's
location, direction, and so on, and can also carry other useful
information like GPS data. Before this, each time you wanted to make a
search using video from multiple cameras, you needed a separate system
to figure out which data was captured where. But with this information
it is very natural: you know where the camera is mounted, because the
GPS information is there.
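An illustrative sketch of the kind of region and camera metadata just
described; the field names here are assumptions for illustration, not
the actual IEEE 1857 syntax element names:

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class RegionOfInterest:
        region_id: int                   # which region this is
        top_left: Tuple[int, int]        # where the region starts (x, y)
        bottom_right: Tuple[int, int]    # where the region ends (x, y)
        qp_offset: int = 0               # how lightly this region is coded

    @dataclass
    class SurveillanceFrameMeta:
        num_regions: int = 0
        regions: List[RegionOfInterest] = field(default_factory=list)
        camera_gps: Tuple[float, float] = (0.0, 0.0)  # mount location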
The current technology for surveillance video coding here is
model-based. The basic idea is: there is the original frame, and from
the sequence we can calculate the background. If we can make the
background very good, then after taking the difference we can figure
out where the foreground is. The foreground we can then use for more
efficient coding and for analysis.
Of course, in the coding structure we can have a switch, and the
switch selects either the original video coding scheme or the
background/foreground coding scheme.
We can also support not only fixed cameras; the camera can rotate and
zoom in and out. For that, the background model is a little bit more
complicated to build.
And for surveillance video the conditions will also change: lighting,
weather, and so on.
We also have some higher-layer parameter sets to describe the
different conditions of the video capture. So here's the basic idea of
how we make the background. If you look at it, there's a white line
here; this is the line, and this is the plane shown here. Basically,
we have changed the data structure from the original one. The original
left view is the X-Y plane, with the third dimension being time. But
we turn that into a view along the time axis, say the Y-T plane. Look
at it here: you can see it's much cleaner than the left one, so it is
much easier to build the background model on it. Basically, all the
calculation to make the background model is done on this plane, in
this space.
Okay, so this is the algorithm we have used. The original sequence we
call the training frames. With these training frames, we don't look in
this direction; we cut the cube and look at it in the other direction,
just like this, so we turn it into a square. Then based on that we can
update the model. If you want to know in detail how that works, we
have published several papers about it; I can give you the papers.
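A minimal sketch of building a background model from training frames,
assuming a per-pixel temporal median as a stand-in for the published
model-update rule (which is not reproduced here):

    import numpy as np

    def background_model(training_frames):
        """training_frames: array of shape (T, H, W), grayscale.

        Looking along the time axis -- the other view of the video
        cube -- each pixel's history is a 1-D signal, and its median
        is a robust estimate of the static background.
        """
        return np.median(training_frames, axis=0)

    def foreground_mask(frame, background, thresh=25):
        """Pixels differing enough from the background are foreground."""
        return np.abs(frame.astype(float) - background) > thresh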
This is the result we got with that algorithm. This is the first frame
-- sorry, that part I wrote in Chinese. And here is frame 118, and
this is the result. If you look at it, the result is not that clean;
something still remains there, but most of the objects have been
removed. So this is the background and this is the original, so you
can take the difference and figure out the foreground.
So the algorithm looks like this: you get the sequence, then you get
the background, then you get the foreground. Then you can code the
background and foreground as different streams and make them more
efficient. This is the format we have used in IEEE 1857.
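A self-contained sketch of that flow: code the background once, then
per frame keep only the foreground pixels. Here zlib stands in for the
real video codec purely to make the point runnable; the savings come
from the zeroed background compressing to almost nothing:

    import zlib
    import numpy as np

    def code_surveillance(frames, train_n=50, thresh=25):
        frames = np.asarray(frames, dtype=np.uint8)    # (T, H, W)
        bg = np.median(frames[:train_n], axis=0).astype(np.uint8)
        bg_stream = zlib.compress(bg.tobytes())        # coded once
        fg_streams = []
        for f in frames:
            mask = np.abs(f.astype(int) - bg.astype(int)) > thresh
            fg = np.where(mask, f, 0).astype(np.uint8) # background zeroed
            fg_streams.append(zlib.compress(fg.tobytes()))
        return bg_stream, fg_streams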
With that, we can easily figure out the regions of interest. Of
course, the objects are sometimes quite tiny and small, and then we
can use an algorithm to grow them into slightly larger regions which
are easier to deal with in decoding. We can use this for real-time
online surveillance video coding. Also, for most of today's systems,
where for example H.264 is already deployed, we can use an offline
transcoding process, which also gives a very efficient result for
storage and for later analysis.
Okay. We have used this to build a recommended video surveillance
system. For example, we can record the video sequence and know where
the foreground is; the foreground regions we keep at much lighter
compression, so that later you can do face recognition more
accurately. We can also use this structure to support human behavior
analysis, so there's a higher-layer event level.
From the regions we can get the object layer, and then we can build a
much higher event layer, which mostly pulls multiple frames together
to make the behavior description. So if you want to do video analysis,
this standard can support it at three layers. The top layer we call
the index layer; with the index layer it's very easy to do retrieval,
analysis, and so on later.
Then we have the object layer; the object layer is the foreground
layer. And we also have the original layer, which we call the
[indiscernible] layer, like the traditional process.
Then some people ask: does this model only work in IEEE 1857? We say
no. [Indiscernible] has done quite a good experiment applying the
model-based algorithm on top of HEVC. We changed nothing; we just put
the foreground/background model there.
And here are the big savings: we save 44.78 percent of the bits
compared to HEVC, with similar video quality. That means we almost
double the performance of HEVC.
This is, of course, for surveillance video. My student has also done
the same experiment on conference video; maybe you are more interested
in this one. The result is not that impressive: it saves only 13.79
percent of the bits compared to HEVC. One of the reasons is that for
surveillance video the background region is the majority, so you can
make it much more efficient.
But for conference video, the majority is actually the human face.
When you move, when you change, the picture is changing, so you cannot
get much advantage from the background.
Okay. So this is the first stage we've already done, and then we have
our next steps. The roadmap looks like this: today we have finished
1857-2013. Next year we plan to make version 2. Version 2, AVS2, is,
from the performance point of view, equivalent to HEVC.
So I believe next year, for video surveillance, IEEE 1857 can have
double the coding efficiency of HEVC on surveillance video
applications. We also have the audio part and the system part. And
then this one is quite unique: we have what we call the description
part, which can support the analysis, and we also have the interface
part.
Okay, so that's the main part I wanted to share with you. For the
summary, I have two parts. One is the future direction. Basically,
today's video coding uses block-based technology to improve coding
efficiency. The model-based coding I talked about somehow belongs to
the knowledge-based direction. Then there's one direction I did not
touch: the cloud-based one, using more data to learn, and so on. And
one thing I also didn't touch on is perceptual coding, which some
people are working on; at this stage quite many papers have been
published, but only a very few results are actually integrated into
the standards.
So I think people will keep up their efforts to make video coding
efficiency higher and higher. Of course, we should not only look at
the broadcast video coding standards; we also need to look at
surveillance video and Internet video.
We have published some papers related to model-based coding; if you
are interested I can give you the list. Before I finish my talk: as
Zhengyou mentioned, this year I have a new job, the job at NSF China.
NSF China is quite young, so maybe you are interested in hearing a
little bit about it.
It's relatively young; it started in 1986. I think the mission is
quite close to that of the NSF in the United States. This is the
organization structure we have today: basically we have eight
departments for different areas.
Area 6 is information science; I'm in charge of this one. There are
also offices: the general office, the planning office, and so on.
The third one is the science policy office; I'm in charge of that
office as well. The funding basically supports three things: one is
research, one is support for people, and the other is for the research
environment.
For people, we have different categories of funding to support young
people, junior people, and senior people, and so on; we have different
categories.
For research projects we also have different programs: the general
program, the key program, the major program, the major research plan,
and so on.
Of course, the different categories for people are matched to
different ages. For the funding, there are the program categories,
general, key, and so on, and the amount of support ranges from some
hundreds of thousands up to a few million RMB, which you can convert
into U.S. dollars.
For the total budget by year: you can see that from the start of
NSFC the funding was not that large, but over time it has grown quite
fast. You can also connect this with the growth of the Chinese
economy.
For the number of proposals: last year was the peak, with 170,000
proposals coming to NSFC in one year. But this year it's a little bit
down, because we have a new policy: anybody who fails two years in a
row has to stop applying for one year. So somebody who has applied
twice and failed automatically stops for one year and tries again the
year after; the thinking is that maybe they need better preparation
before the next application.
For the evaluation process: after a proposal comes in to NSFC --
actually, we don't accept proposals from individuals; we collect them
from institutions, so any faculty member needs to give the proposal to
their institute, and we collect the proposals from the institutions --
it goes to the different divisions or departments, and then there's a
format review. If the format doesn't match, the proposal doesn't go
into the review process; in total, about five percent of proposals
fail the format check.
After that there's a peer review; the peer review selects about 35
percent from the remaining 95 percent. Then those go to the panel
review, and after the panel review around 20 percent of the proposals
are accepted and get support. So that's roughly the whole process at
NSFC.
Thank you very much.
>> Zhengyou Zhang: Thank you.
[applause]