>> Zhengyou Zhang: Good morning. It's my great pleasure to welcome a good friend, Professor Wen Gao, and a new friend, [indiscernible], to MSR. We have two talks, and we'll start with Wen's talk. He is a professor at Peking University, a member of the Chinese Academy of Engineering, and very recently a vice president of NSF China (not the U.S. NSF, the NSF of China). He was also president of the Institute of Computing Technology, Chinese Academy of Sciences, and vice president of the graduate school of the Chinese Academy of Sciences. It's a long list of titles, so I won't take too long. He has been doing a lot of work in video coding, and he will talk about some new research topics and also some standardization efforts. Wen, please.

>> Wen Gao: Thank you. Thanks, Zhengyou, for inviting me here. Today I want to share with you some new work in video coding. Maybe it is not a new direction; it is an older direction, model-based coding, with some new results. As a showcase, in June of last year we published an IEEE standard which uses this new model, so I will talk a little bit about the scope and the key results of that standard, and then give a summary.

I think video coding is basically about removing redundancy from video sequences. We can think about three kinds of redundancy: one is temporal redundancy, another is spatial redundancy, and the third is what we call coding redundancy. A standard system mixes three kinds of technology to address them, and because it uses multiple technologies, such a system is normally called a hybrid video coding system. In current systems the video comes in, and the first part uses a transform technique to remove the spatial redundancy. Then there is a prediction loop that tries to remove the motion-related redundancy; we call that temporal redundancy. And the last stage is entropy coding, which tries to remove the coding redundancy. (A toy sketch of this three-stage pipeline appears below.)

This block diagram is the same across the first, second, and third generations of video coding standards. Whether it is H.26x or MPEG-x, old standard or new, the chart looks the same: the top path is transform coding, the bottom path is prediction coding to remove the temporal redundancy, and entropy coding is used to code the resulting symbols more efficiently. So this is the state of the art of today's coding technology.

Video coding has made a big impact on the TV industry. In the early time it was standard-definition TV, then HDTV two or three years ago, and from this year ultra-HD TV; I think video coding technology is the basis for this progress. In terms of coding efficiency, we started from about a 50-to-1 compression ratio; right now it is around 225-to-1, sometimes we say around 300-to-1, and the next generation should be something like 450-to-1. But if we think about the upper bound of this technology, which I will explain later, it should be very high, so today's state of the art still leaves quite a big room for improvement, and we still need new research.
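To make the three-stage hybrid structure above concrete, here is a toy sketch (not from the talk, and not any standard's actual pipeline) of one prediction, transform, and quantization step on a single 8x8 block; the quantized coefficients would then go to the entropy coder.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def encode_block(block, predicted, qstep=16.0):
    """One hybrid coding step on an 8x8 block:
    prediction (temporal redundancy) -> 2D transform (spatial
    redundancy) -> quantization; the quantized coefficients would
    then be entropy-coded (coding redundancy)."""
    residual = block - predicted
    D = dct_matrix(8)
    coeffs = D @ residual @ D.T
    return np.round(coeffs / qstep)

# A block that barely changed from the previous frame: the prediction
# is good, so almost all quantized coefficients collapse to zero.
prev = np.full((8, 8), 100.0)
curr = prev + np.random.randn(8, 8)
q = encode_block(curr, prev)
print("non-zero quantized coefficients:", int(np.count_nonzero(q)))
```

Each stage maps to one of the three redundancies listed in the talk, which is why the combination is called hybrid coding.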
To see why the upper bound is high, think of the input as a matrix and apply a 2D matrix transform. For example, take one frame of an HD sequence, with resolution 1920 by 1080; we can simply pad it with zeros to make a square matrix. If we can then find a transform that makes this work, the output Y has only one non-zero element and all the others are zero, and with such a transform it is very clear we can compress very efficiently. Such transforms actually exist; one of them is the SVD. Then if we think about the sequence in three dimensions, the upper bound should be much, much higher than that, because normally when we do video coding we use at least 15 frames for one group of pictures, and sometimes 60. And if we think about much longer sequences, the achievable upper bound on compression becomes much higher still. So this is the idea: in three dimensions, over longer video sequences, a 3D matrix transform can reach a much higher bound. (A small sketch of the 2D case appears below.)

The timeline looks like this. The first generation started in 1994. The second generation finished in 2003; today I think most systems use MPEG-4 or H.264 from that generation. This year a new standard was finished, called HEVC or H.265, which at HD resolution gives around a 300-to-1 compression ratio. Then what's next? If we look at the roadmap, the next generation of video coding should reach a compression ratio of 600-to-1 at HD resolution. Of course, maybe ten years later HD resolution will be too low, maybe it will be UHD, so the compression ratio will be a little bit higher than 600-to-1.

Even when the standard does not change, the compression ratio gets a little better over time. For example, in the MPEG-2 era, one channel at SD resolution took about five megabits per second; today, with similar technology, we can use about 1.5 to 2 megabits to compress video of the same quality. So let's look at the progress that comes from encoder optimization: it is not a contribution from the standard itself, but from encoder optimization technology. People achieve higher compression within a standard by using several techniques.

One technique is to use bigger blocks of data. In the early time, only two types of blocks were used for transform and prediction. Over the generations, and especially in the third generation, many different block types are used for higher coding efficiency: 64 by 64, down to 32 by 32, down to 16 by 16, and then down to 8 by 8, with a lot of different combinations, so the encoder can find the right block size to use. That's for the transform. For prediction, for motion estimation and intra prediction, we can predict along many different directions. In the early time there were only four directions; in the second generation the directions went from four to nine; and in the third generation there are many, many more, 33 or so. With that many directions you can almost always find a good match, which saves bits. So overall, for transform and prediction, the data structure looks like this: a big block, smaller blocks, down to very, very small blocks, in many different combinations (see the partition sketch below). This is the basic idea of the coding technology used in the first, second, and third generations.
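The variable block sizes just described are typically organized as a quadtree: a block is split into four sub-blocks wherever a single prediction fits poorly. Here is a minimal sketch of that recursion (my own illustration, not the standard's algorithm), using block variance as a crude stand-in for the rate-distortion cost a real encoder would minimize.

```python
import numpy as np

def partition(block, top, left, min_size=8, var_threshold=100.0):
    """Recursively split a block into four quadrants while it is
    'too complex' (high variance) and still larger than min_size.
    Returns a list of (top, left, size) leaf blocks."""
    size = block.shape[0]
    if size <= min_size or np.var(block) < var_threshold:
        return [(top, left, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            sub = block[dy:dy + half, dx:dx + half]
            leaves += partition(sub, top + dy, left + dx,
                                min_size, var_threshold)
    return leaves

# A 64x64 block: flat background with one busy 16x16 region.
blk = np.zeros((64, 64))
blk[8:24, 8:24] = np.random.rand(16, 16) * 255

for t, l, s in partition(blk, 0, 0):
    print(f"block at ({t},{l}), size {s}x{s}")
```

A real encoder decides each split by comparing the cost of coding the block whole against the cost of coding it split, but the structure is the same: try big blocks, and recurse where the prediction fails.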
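And here is a back-of-the-envelope version of the 2D transform upper-bound argument from the start of this discussion: pad one frame to a square matrix, take the SVD, and measure how much of the signal energy a few components capture. This is a sketch, not from the talk; the frame is random noise just so the script runs, and real frames are far more compressible.

```python
import numpy as np

# One HD luma frame (1920 x 1080); a random stand-in for a real frame.
frame = np.random.rand(1080, 1920)

# Pad with zeros to a square matrix, as described in the talk.
n = max(frame.shape)
square = np.zeros((n, n))
square[:frame.shape[0], :frame.shape[1]] = frame

# SVD: square = U @ diag(s) @ Vt. Keeping the k largest singular
# values gives the best rank-k approximation (Eckart-Young theorem).
U, s, Vt = np.linalg.svd(square, full_matrices=False)

k = 50  # keep only 50 of the 1920 components
approx = U[:, :k] * s[:k] @ Vt[:k, :]

energy_kept = (s[:k] ** 2).sum() / (s ** 2).sum()
print(f"rank-{k} approximation keeps {energy_kept:.1%} of the energy")
```

The truncated SVD measures how far an ideal content-adapted transform could go on one frame; extending the same argument to a whole group of pictures in 3D is what pushes the theoretical bound much higher.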
But if we look at the details of this technology, something must be missing. One thing we notice is that today's video coding standards are mostly targeted at the TV industry, at TV broadcasting and so on, and that industry comes with a lot of constraints. Taking this technology and simply reusing it for Internet video is not good, because the TV industry has a lot of special requirements the standards were designed to match. For example, they keep the clock very accurate: if you delay by even one tick, the system breaks. That is why the frame rate is so important; you cannot change the frame rate during the encoding or decoding process, and if you want to change the rate you have to restart. So using this technology for Internet video is not efficient. Of course, Internet video coding algorithms have made a lot of changes, but those changes are not in the standard; they are local, proprietary features. Another mismatch is video surveillance: surveillance video is very unique, and using this technology for it is also not efficient. That is why we try to look at different applications and use different combinations of technology to build much better systems for each special application.

Of course, it is easy to think that some vision-based encoding should be much more efficient. In the last fifty years people have worked in that direction and spent a lot of time and energy, but the results were not good: none of this kind of technology ended up being used in the video coding standards. Still, a lot of the research results are already there, and we can use those ideas for applications like Internet video coding and surveillance video coding.

There is another factor, usually referred to as perceptual coding. Whenever you code video there is distortion, and the question is: at the same distortion, which solution gives the best quality? People may argue that we can measure the distortion by PSNR, but the results are not good; maybe we can measure it by some other metrics. Eventually people concluded that since all compressed video is looked at by human eyes after decoding, the human reaction to quality is what matters most. That direction is perceptual coding. A lot of research has been done there as well, and part of it has been integrated into today's coding systems.

So that is the background on video coding. Then, why model-based? I have explained why it is not good to reuse the TV-oriented coding standards for Internet video and for surveillance video, so we can think in a different way, for surveillance video for example. If we look at what research has been done on model-based coding, there is a lot of good work. People have worked on geometric partitions for coding, for example using triangles or a mesh network to describe a surface and so on; I think some of that is used today, not for coding but for image processing. That is one direction people have pursued. The second one is very natural from a computer vision point of view: use good segmentation and use that structural information for video coding. That is also a good idea, and a lot of research has been done, but the question is: do you have a segmentation algorithm stable enough for coding that does not need a person in the loop? Then the third one is object-based coding.
I think this direction is easy to understand. Then there is knowledge-based coding, which works by analysis and synthesis. Basically, research in this direction has focused on faces, on talking heads: people try to model the facial expression, or the talking face itself, the facial motion, and use that to save bits. Related to this, and a little higher-level than the knowledge-based approach, is semantic-based coding, which describes not only the surface but also the motion using semantic technology. The sixth one is what I mentioned about the human visual system, which leads to the perceptual coding idea. And the last one is quite a new direction: learning-based coding. Use the data on the Internet, in the cloud; use more data, get more knowledge, and use that knowledge to support a good coding algorithm.

This direction has come to be seen as very important because of big data. If you look at the breakdown, over 80 percent of big data is image and video, and within the video part surveillance video is the majority; maybe 40 or 50 percent of the data is surveillance video. For this kind of data, if we can find a good compression algorithm, we can save storage and transmission cost. Of course, surveillance video also puts a special requirement on what quality you need to preserve for later pattern recognition. Most systems today compress surveillance video at quite a high ratio, much higher than the coding standards used in the TV industry, and the only reason is to save storage cost. But this is not good: after compression, if you want to find something, some specific person, some object in the surveillance video, you will find it is very hard, because at a high compression ratio you lose a lot of features and you cannot recover the correct result. Resolution is another issue: for face recognition, if the resolution is too low, it is not easy to find the person or the object. For good pattern recognition you need very high resolution, much higher than today's systems use.

Another point is that for surveillance video, compression is normally not the only target. The target is to compress first and then later analyze: what is the behavior, whose behavior is it, who is that person. But today's video coding only aims at high compression ratio; the video content analysis is done separately. The two systems are not considered together, they run in parallel all the way, and that is not good. The reason is that the coding standard only thinks about how to code the video sequence efficiently, not how to analyze the video efficiently. So we tried to make a combination, a standard that takes care of both steps. If we can do that, there is a very direct line of thinking: maybe we can get a much higher compression ratio by using a model, a background/foreground model, which should give higher coding efficiency. And then, since we have the background and the foreground separated, the foreground is what relates to the objects.
So with that, the analysis task becomes much easier. That is the thing we tried to put together. As I mentioned, the quality and resolution of the image relate to pattern recognition accuracy. For example, this plot shows the face recognition rate against how hard you compress the data. In video coding we normally use the QP, the quantization parameter, to describe how hard the data is compressed: a big QP means a much higher compression ratio. If you look at the plot, at around QP equal to 10 you can get a recognition rate of 90 percent and up, but as the QP gets bigger the rate drops very fast. So we should understand that if you want the analysis to be accurate, you cannot compress everything too hard. The idea is: compress the objects much less and compress the background much harder, and that is acceptable. That is the idea of how we can compress surveillance video. (A sketch of this region-based QP idea appears after this part.)

Okay, so that's the idea. Then, what is the showcase we have done with the IEEE Standards Association? IEEE 1857 is a working group we initiated; let me see, in the year 2011 we basically initiated the working group, and then the working group officially started work last year, on February 15. I am the chair, and there is a vice chair and a secretary. After one year of work the standard was finished. These slides show the process we went through to make this standard, the publication and the documentation; a lot of authors were involved in this standard.

The standard covers different things: basically what we call a main profile, a basic profile, and profiles for special applications. For example, there is an enhanced profile which is special for HD movie applications, there is a mobile profile for mobile applications, and there is a surveillance profile. The surveillance profile has two layers: one is the baseline layer, and the other we call the S2 layer, a higher layer which is more efficient for surveillance video coding.

In this standard, IEEE 1857, the video surveillance profile defines the parameters for how you code surveillance video, and some of the parameters are very important. For example, there is a region ID and a region number: how many regions you want to mark as regions of interest. If there are two regions, the number is two, and for each region you can describe where it starts, where it ends, and which parameters were used. Later on you can build on that for object behavior analysis, tracking, and so on. The standard can also carry camera parameters, such as the camera location, direction, and so on, and can include other useful information like GPS data. Before this, every time you wanted to search video from multiple cameras, you needed a separate system to figure out which data was captured where. With this information it is very natural: you know where the camera is mounted, because the GPS information is there.

The current technology for surveillance video coding here is model-based.
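Here is a minimal sketch of the region-based idea above: spend bits on the regions of interest and starve the background. The QP values, the block size, and the names (`Region`, `qp_map`) are my own hypothetical choices, just to show the shape of a per-region QP map; a real encoder would take such a map through its rate-control interface.

```python
import numpy as np

# Hypothetical region-of-interest record, in the spirit of the
# region ID / region number parameters described in the talk.
class Region:
    def __init__(self, region_id, top, left, height, width):
        self.region_id = region_id
        self.box = (top, left, height, width)

def qp_map(frame_h, frame_w, regions, qp_background=38, qp_roi=10):
    """Build a per-block QP map: low QP (light compression) inside
    regions of interest, high QP (heavy compression) elsewhere."""
    blocks_h, blocks_w = frame_h // 16, frame_w // 16  # 16x16 blocks
    qp = np.full((blocks_h, blocks_w), qp_background, dtype=int)
    for r in regions:
        top, left, h, w = r.box
        qp[top // 16:(top + h) // 16, left // 16:(left + w) // 16] = qp_roi
    return qp

# Two regions of interest (say, detected faces) in a 1080p frame.
regions = [Region(1, 256, 640, 128, 128), Region(2, 512, 1280, 160, 96)]
m = qp_map(1080, 1920, regions)
print(m.shape, "blocks;", (m == 10).sum(), "blocks kept at low QP")
```

With QP around 10 inside the regions of interest, recognition-relevant detail survives, matching the recognition-versus-QP tradeoff described above, while the high background QP keeps the overall bitrate low.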
The basic idea is this: starting from the original frames, from the sequence we can calculate a background. If we can make the background very good, then after taking the difference we can figure out where the foreground is, and we can use the foreground for more efficient coding and for analysis. Of course, in the coding structure there is a switch, and the switch selects either the original video coding scheme or the background/foreground coding scheme. We can also support not only cameras with a fixed view: the camera can rotate and zoom in and out, and for that case the background model is a little more complicated to build. Also, for surveillance video the conditions change, the lighting, the weather and so on, so we have some higher-layer parameter sets to describe the different capture conditions.

Here is the basic idea of how we build the background. If you look at the slide, there is a white line here; this is the line, and this is the plane shown here. Basically we have changed the data structure from the original one. The original left view is X-Y, with the third axis being time: X, Y, T. But we turn it into another view, say a Y-T view. If you look at the Y-T view here, you can see it is much cleaner than the left one, so it is much easier to build the background model on it. Basically all the computation for the background model is done on this plane, in this space.

Okay, so this is the algorithm we have used. The original frames of the sequence we call training frames. With these training frames, instead of looking in this direction, we cut the volume and look at it in this other direction, like this, turning it into a slice. Based on that we can update the model. If you want to know in detail how this works, we have published several papers about it; I can give you the papers.

This is the result of that algorithm. This is the first frame (sorry, the labels are something I wrote in Chinese), and here is frame 118. This is the resulting background. If you look at it, the result is not that clean, something still remains, but most of the objects have been removed. This is the background and this is the original one, so by taking the difference you can figure out the foreground. The algorithm works just like that: you get the sequence, you build the background, you get the foreground, and then you code the background and foreground as different streams, which makes it more efficient. This is the format we have used in IEEE 1857. (A simplified sketch of this background/foreground pipeline follows below.)

With that we can easily figure out the regions of interest. Of course, the objects are sometimes quite tiny and small, so we use an algorithm to grow them into somewhat larger regions, which are easier to deal with in decoding. We can use this for real-time, online surveillance video coding; and since in most of today's systems H.264 is already there, we can also use an offline transcoding process, which also gives a very efficient result for storage and for later analysis. Okay. We have used this to build a video surveillance system: by recording the video sequences and determining where the foreground is, we can keep the foreground region much less compressed, so that later face recognition is more accurate.
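The talk points to published papers for the actual background modeling algorithm; as a stand-in, here is a minimal sketch of the same pipeline under a much simpler assumption: a per-pixel temporal median over the training frames serves as the background, and thresholded differences give the foreground mask.

```python
import numpy as np

def build_background(training_frames):
    """Per-pixel temporal median over the training frames.
    Working along the time axis is the 'temporal slice' view the
    talk describes; the median ignores transient foreground."""
    stack = np.stack(training_frames, axis=0)  # shape (T, H, W)
    return np.median(stack, axis=0)

def foreground_mask(frame, background, threshold=25):
    """Pixels that differ enough from the background are foreground."""
    return np.abs(frame - background) > threshold

# Synthetic demo: a static gradient background, one bright moving block.
H, W = 120, 160
base = np.tile(np.linspace(0, 200, W), (H, 1))
frames = []
for t in range(15):  # roughly one group of pictures of training frames
    f = base.copy()
    f[40:60, 10 * t:10 * t + 20] = 255  # moving foreground object
    frames.append(f)

bg = build_background(frames)
mask = foreground_mask(frames[7], bg)
print("foreground pixels:", int(mask.sum()))  # roughly the 20x20 block
```

Because the moving block covers each pixel in only a couple of the fifteen frames, the median recovers the clean background, and the difference isolates the foreground region that would then be coded at higher quality.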
We can also use this structure to support human behavior analysis. There is a higher layer for events: from the regions we get the object layer, and then we can build a much higher event layer, where multiple frames are pulled together to make the behavior description. So this standard can support video analysis at three layers. The top layer we call the index layer; with the index layer, later retrieval, analysis and so on become very easy. Then we have the object layer, which is the foreground layer. And we also have the original layer, which we call the [indiscernible] layer, processed like a traditional stream.

Then some people ask: does this model only work in IEEE 1857? We say no. [Indiscernible] has done a quite good experiment applying the model-based algorithm on top of HEVC. The result: we changed nothing else, just put the foreground/background model in, and there are big savings, 44.78 percent fewer bits than HEVC at similar video quality. Saving 44.78 percent of the bits at the same quality means roughly 1/(1 - 0.4478), about 1.8 times the compression ratio, so we almost double the performance of HEVC. This is, of course, for surveillance video. My student has also done the same experiment on conference video; I think this audience is generally more interested in that one. The result is not that impressive: it saves only 13.79 percent of the bits compared to HEVC. One of the reasons is that in surveillance video the background region is the majority, so you can gain a lot; but in conference video the majority is the human face, and when you move, that region keeps changing, so you cannot take much advantage of the background.

Okay. So this is the first stage, which we have already done, and then we have our next steps. The roadmap looks like this: this year we finished IEEE 1857-2013, and next year we plan to make version 2. AVS2 is, from the performance point of view, equivalent to HEVC, so I believe next year, for video surveillance, IEEE 1857 version 2 can double the coding efficiency of HEVC on surveillance video applications. We also have an audio part and a systems part. And this part is quite unique: we call it the description part, which can support analysis and search; and we also have an interface part.

Okay, that is the main part I wanted to share with you. For the summary, I have two points. One is future directions. Basically, today's video coding uses block-based technology to improve coding efficiency. What I talked about, model-based coding, somehow belongs under knowledge-based coding. There is one direction I did not touch, cloud-based coding: use more data to learn, and so on. Another thing I did not cover is perceptual coding, which some people are also working on; at this stage quite a lot of papers have been published, but only a very few results are actually integrated into the standards. So I think people will keep up their efforts to make video coding efficiency higher and higher. Of course, we should not only look at the broadcasting video coding standards; we also need to look at surveillance video and Internet video. We have published some papers related to model-based coding; if you are interested I can give you the list.

Before I finish my talk: as Zhengyou mentioned, this year I have a new job, the job at NSF China. NSF China is quite young.
So let me say a little bit about it. It is relatively young; it started in 1986. I think its mission is quite close to that of the NSF in the United States. This is the organization structure we have today: basically we have eight departments for different areas. Area 6 is information science, and I am in charge of that one. There are also offices, the general office, the planning office and so on; the third one is the science policy office, and I am in charge of that office as well.

The funding basically supports three things: one is research projects, one is people, and the other is the research environment. For people, we have quite a few different categories of funding to support young people, junior and senior people and so on. For projects, we also have different programs: the general program, the key program, the major program, the major research plan and so on. Of course, the people categories are matched to different ages and career stages, and the program categories, general, key and so on, differ in the amount of support, from some tens of thousands up to some hundreds of thousands, counted in RMB, with the corresponding U.S. dollar amounts alongside.

For the total budget by year: from the start of NSFC the funding was not that much, but over time it has grown quite fast, and you can connect that with the growth of the Chinese economy. For the number of proposals: last year we reached the peak, 170,000 proposals coming to NSFC in one year. This year it is a little lower, because we have a new policy: anyone who fails two years in a row must stop for one year before applying again. The thinking is that they can prepare better and then do the application.

For the evaluation process: after the proposals come to NSFC (actually, we do not accept proposals from individuals; we collect them from institutions, so any faculty member submits the proposal to their institute, and we collect the proposals from the institutions), they go to the different divisions or departments, and then there is a format review. If the format does not match, the proposal does not go into the review process; in total about five percent of proposals fail the format check. After that there is peer review, which selects about 35 percent out of the remaining 95; those go to panel review, and after the panel review around 20 percent of the proposals are accepted and get support. So that is the review process at NSFC. Thank you very much.

[applause]

>> Zhengyou Zhang: Thank you.