>> Arjmand Samuel: Okay, so it is my pleasure and honor to introduce Dr. Shahab Baqai. Our paths have crossed a number of times with profound consequences really, ha, ha. We co-founded a startup fifteen years back, looking at network design and software design for some of the problems he's working on right now. Then he went into academia. He was Dean of the School of Science and Engineering and Chair of the Computer Science Department at the Lahore University of Management Sciences. Right now he's on a sabbatical with the University of Illinois at Chicago. He'll talk about some of the research that he's doing currently. So without further ado, let's welcome Dr. Baqai to Microsoft Research. >> Shahab Baqai: Thank you so much. I hope I have sufficient breadth in the talk to appeal to the disparate audience here, and the people sitting online, and hopefully they'll be able to get something out of this talk. So the title says Multimedia Communications: an End-to-End Perspective. Actually I'll be starting off with some applications. Then I will talk about the basic problem, what we have solved, and I'll interpose this with the work that I do and some of the results that we have achieved there. Primarily I will cover two areas because of the time constraints; I don't want to take the whole day trying to tell you about my research for the last decade or so. But I will give you some of the areas that I have looked into, implemented, and gotten some results from. The primary emphasis of this talk will be to present a basic understanding so that the challenges ahead can be identified. That's my target for this talk. So, well, ubiquitous multimedia, that is what we have dreamed of. Some of the applications are giving it to us already. You don't have to look too far: the things that you would have imagined a decade ago are a reality now.
Specifically, GPS combined with mapping and labeling, Google Maps and the like, Microsoft maps, search-based services, etcetera. You can have multimodal interfaces. You can talk to a device, your personal assistant, your handheld phone, and run basic queries. Ten years ago, researchers were dreaming of those things but they weren't a reality; now they are. The computational power in your hand or in your pocket is far beyond the computational power of servers twenty years ago. When I learned programming I used punch cards to enter data, and we would get the output the next day, printed on a piece of paper saying that there was a syntax error at line number six, so your program did not run. So to me, if I just sit back and think about where we have come, the progress has been tremendous. But are we there yet? Well, you and me included, because I like to think of myself as of this generation too, imagine what we could do further. How can we scale things up? If one video-on-demand application is running, how about a million users on it? Is it possible now? Somewhat, but there are constraints: quality, resource constraints, constraints on how much we can scale up, how many problems we can solve, to what level we can solve them. So the challenges are still there. Of course the nature of those challenges has changed, but I'll try to give you a feel for what the current challenges are, in my mind, that need to be addressed. So this is a 2003 map of the world's ISP providers, what kind of backbones they have, things like that. You can't see it, but it's just to give you an idea of the growth of the communication medium that we have, which is the primary motivating factor for the information revolution that we are benefiting from today.
So multiple areas, design and engineering, CAD, CAM, communications, learning, entertainment, business, all have been impacted by that. Okay, so just as an example application, a health care application; everybody can appreciate health care, right? Typically a cohesive application that has the patient record, the prescriptions, the historical data, as well as the evolution of the patient's health over a period of time, so temporal evolution as well. Patients, physicians, caregivers, they can all access the data, maybe in real time. Somebody is sitting at home wearing a sensor and their ECG is being monitored by physicians all over the world, or by caregivers somewhere else. Okay, is it possible now? Yes it is. Is it holistically an application that delivers to the physicians and the customers? No, there is still some work to be done, most of all in the user interfaces, in the communication medium, in the high resolution that is required by diagnostics and treatment. So, are we there yet? Getting there, but of course there is still something to be desired. Will the result of all this data be available to my health insurance providers? Would my premium go up or down based on my health status now, or my health history? What information can I get out of it; could we improve our diagnostic techniques for cancer patients, for example? A group at LUMS is working with a cancer hospital in Lahore trying to figure that out, using data mining techniques to enhance the diagnostic procedures that we have over there. Now, let's do a little more complicated scenario, the Department of Defense information infrastructure. What are the system features that they envision? At the heart of it all is the communication infrastructure, which provides ubiquitous, end-to-end, high-bandwidth, high-resource service to all applications that require it. Hence the anyplace, anytime kind of thing at a global scale.
So that is what we are looking at, and this is just a collage of the various things that they do, or need to do, over here. Since the networks are part of it and they also fall under multimedia communications: sensors can take a sample of the environment, or take real-time video feeds, for things that can be analyzed at a remote location, or consumed for surveillance purposes, or monitoring purposes, or environmental prediction purposes, or any of these combined together, okay. So critical decisions would require, or are anticipated to require, real-time multimedia applications behind them; for example, a cardiogram and multiple doctors coordinating or collaborating together to diagnose a patient, maybe right before open heart surgery, or angiography, or angioplasty. Those are the things that we are looking at. So at the bottom of it, and that is what I do, I try to figure out how to deliver multimedia data with the required playout quality, with the required real-time constraints, from one location to one or more remote destinations. Now we have the cloud services, the middleware layer that hides the network abstractions, or the sources of these services, and provides them as a service to the users. So what are the challenges there, right? We have this network which is heterogeneous but needs to be interoperable. Video, audio, documents, traditional text, or high-resolution video, all of that needs to be communicated there, and these are the constraints. These are my requirements: I require the right quality for the device to play it out, which may not be the same on a handheld as opposed to a tablet, as opposed to a PC, okay. Security of course is paramount; I require security at a number of levels. My payload needs to be encrypted; nobody who's not intended for that communication should be able to hear what I'm saying. Of course I would need to decode and decrypt it for the intended user, okay.
So unintended participants should not be there; eavesdropping should not happen, okay. How do I do that? Sometimes it's just the payload, but at other times I don't even want anyone to know who's talking to whom, at what times. So connection establishment, or negotiation: what are the capabilities, what are the codecs the other person can use? Inherently I want to advertise those. I want to announce it: look, this is my capability, this is what I can use. But I don't want to announce it to other users who could exploit that information, so security at a number of levels. Then I want synchronization: in a cloud there is an audio server and a video server, located at maybe geographically disparate locations. But there has to be lip sync between the audio that is coming in and the video that is going out. So the stream goes out from maybe a different audio server, which is optimized for audio in some way, and a different video server; both have different constraints. Audio has certain bandwidth, latency, and jitter requirements. Video is more tolerant to, let's say, jitter, but requires higher bandwidth. Different requirements, but they need to be synced together over the same underlying networking infrastructure. That is what we'll be talking about here: how do we accomplish that? Concurrency: I have certain data that has been generated, that is available at different locations at the same time. Somebody collaborates over it, changes certain things; who has control over it, those kinds of applications. As an example, this is something that we already see today. We can go to a map, we can use that map, we can get real-time video feeds, we can look at the street view, we can get the traffic. If there is a crash somewhere we can get that information in real time and make decisions based on it, okay. So the end-to-end perspective for communication, qualified by multimedia communications, is quality management across the various systems, across the board.
Heterogeneous systems, okay, with different requirements, different constraints, okay; so we require the network, the databases, the end-system architecture, the security, to all work together to deliver the right experience to the right person, at the right time, in the right location, and I could go on with a couple more of these. Okay, challenges: past, current, and anticipated. We need to model these documents. We need to figure out what resources to allocate. If you want to make it cost effective, the resource allocation, because some of the resources would be expensive, has to be done dynamically, at the right time, for the right user, okay. Okay, so it requires high bandwidth; why don't I just over-provision the network? Why don't I use a dual-core or quad-core machine instead? It has to encrypt, it requires more megaflops, so throw more megaflops at the problem. Sure, that is what the service providers did initially for multimedia applications. But then there was one user, ten users; what about a thousand users using the same data, connecting to the same data stream, or connected to the same servers? In the search business this is what happens. Basically Microsoft Bing search gets millions of hits; this is just a top-of-my-head figure, right, I'm imagining that they do. Okay, I intentionally didn't use Google here. [laughter] So, okay, we need to model the data, and model it in a manner such that the next layer, which provides efficient clustering, indexing, and search kinds of techniques, can use that information, and can determine, okay, at what time do I require what resource. So I need to model the data; I need to figure out what the deadlines are for a video frame to reach the destination for high-quality playout, so that there is no motion jitter or break in the user experience. Okay, and it has to be done in a distributed manner. It has to be done somewhere along the cloud that we have.
Now, network caching and synchronization techniques for this distributed multimedia: does the network handle it, does the end system handle it? There are various ways that people do it, sometimes in the network, sometimes in an overlay network, sometimes at the application level, okay. So the first thing is, how do you model the data, given that you have various requirements? I will talk about synchronization and resource provisioning first, because that is one area that I have investigated in detail. Okay, and multimedia communications is a coming together of various cross-disciplines, as I would call it. You need to know signal processing. You need to know information theory: how to model your source, how to represent your source. So all these codecs; a communication protocol like SIP or H.323 is used, and at the bottom of the layer the payload is encoded in a certain manner for efficient storage and efficient transmission. Are they the same thing? Not necessarily, so they have their own sets of requirements. Are source coding and channel coding the same thing? No, and people have been doing them independently. At the source, what do you want to do for storage and transmission? You want to remove the redundancy in your information, represent it in a summarized, concise manner so that it requires the minimum amount of resources for a certain quality of playout. That is what is required. Now, once you've put it onto a network, let's say it involves a wireless medium, which has a fading channel, so bursty packet drops are possible over there. There, for channel coding, they add redundancy, FEC, forward error correction, and those are important. What do they do? At a higher abstraction level, if I take the top-down view of it, we are adding redundancy to the data so that if I lose a certain amount of data I can reconstruct it from adjacent data, right.
So, hang on: source coding is removing redundancy, channel coding is adding redundancy, independently. So what researchers said is, okay, let's marry them together for a more optimized, more efficient representation. But then I have to encode my, let's say, video to be transmitted over a high-quality link versus a wireless link separately. How do I handle that? Do I have ten different encodings stored there, or in real time do I cater for [indiscernible] encoding, or are they negotiated in real time? People came up with scalable video coding for heterogeneous platforms: you get more layers for higher quality, fewer layers for lower quality. Yes, there is a little negotiating overhead, but a tolerable negotiating overhead. So we have scalable video coding. So, source coding: how do we model the data to encapsulate all of this? For example, if I'm doing video along with audio, lip synchronization has a delay requirement of eighty milliseconds; if I go beyond that, then real-time conversation breaks down. Okay, so people do work on understanding and evaluating the quality of the various media, independently or jointly. I will call them inter-stream and intra-stream; intra-stream means within the stream itself. There's a video stream: what is the quality requirement there? What is the spatial frame size? How many frames per second do I need? Is it NTSC, PAL, or QCIF? Those are just names for different qualities of video, just video. Inter-stream would mean video and audio together. Audio is coded using the MP3 codec, let's say. These are names of codecs; in case something is unclear, please feel free to interrupt me at any time. Okay, so I need to model this data. What modeling methods do we have? I need models for spatio-temporal synchronization, content representation, quality attributes, access, and so many other features; I have just come up with a list. What are the models available out there?
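The eighty-millisecond lip-sync tolerance mentioned above can be made concrete with a small sketch. This is an illustrative check, not any standard's algorithm: the function name, the timestamp lists, and the drift numbers are all made up for the example; only the 80 ms figure comes from the talk.

```python
# Toy inter-stream synchronization check: flag the moments where the
# audio/video skew exceeds a lip-sync tolerance (80 ms per the talk).
# Timestamps are illustrative presentation times in milliseconds.

LIP_SYNC_TOLERANCE_MS = 80

def skew_violations(audio_ts_ms, video_ts_ms, tolerance_ms=LIP_SYNC_TOLERANCE_MS):
    """Pair up corresponding presentation timestamps and report the
    pairs whose absolute skew exceeds the tolerance."""
    violations = []
    for i, (a, v) in enumerate(zip(audio_ts_ms, video_ts_ms)):
        skew = abs(a - v)
        if skew > tolerance_ms:
            violations.append((i, skew))
    return violations

# Example: video drifts 30 ms further behind the audio on every frame,
# so lip sync is lost from the fourth frame onward.
audio = [i * 40 for i in range(10)]  # ideal 25 fps clock
video = [i * 70 for i in range(10)]  # each frame 30 ms later than ideal
print(skew_violations(audio, video))
```

A real player would trigger resynchronization (dropping or repeating frames) at the first reported index rather than just listing violations.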
Language-based models such as SGML, XML now; object-oriented models; graphical models, which I have worked on. So this is the example of a browsing graph, a dynamic browsing graph. Let's look at a certain time frame in that browsing graph. I see that there is audio in there. There is text and video interposed, with certain synchronization requirements. And then the bandwidth profile: I'm just looking at what kind of bandwidth they require. It's a varying bandwidth profile. Whenever I have video along with audio, that is the highest bandwidth capacity that I would require for a network to efficiently carry it from one point to the other. That is what I'm looking at, right. I need to know this so that I can efficiently allocate resources in real time, rather than saying, okay, this is the top capacity that I require, fine, let me give it that capacity. Fine, I can do that for one connection. I can do it for a hundred connections maybe, but what about a million connections? My network would say no more, and that's where I get denial of service, or I need to apply access control: okay, so many users are logged in, sorry, the service is busy, come back and watch this video another time. We all know what happens then; of course, we never go back and watch that video again. Okay, so the content at the right time that I require it; if I don't get it, it doesn't serve my purpose. Okay, so I have worked on a model called a Petri net, and we call it the Object Composition Petri Net. I was part of a group that worked on this model. It is about modeling the experience of a multimedia presentation, okay. Then, how do we provision it for networks? We have various attributes like: this is a video, what quality, what codec. But that is getting fine-grained there, so I'll stop there. So how do I use this? I use it in a database schema.
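To give a flavor of the Petri-net idea, here is a toy sketch of one aspect of it: media objects that start together at a synchronization point, with the next transition firing only when the slowest object finishes. This is my own simplification for illustration, not the published Object Composition Petri Net model; all names and durations are invented.

```python
# Toy flavor of an OCPN-style timeline: each place holds a media
# object with a playout duration; a transition fires only when every
# input place has finished, which models an inter-media sync point.
# This is a deliberate simplification, not the published OCPN.

def sync_point_time(start_time, durations):
    """All objects in an interval start together; the transition to the
    next interval fires when the slowest object finishes."""
    return start_time + max(durations)

def schedule(intervals):
    """intervals: list of lists of object durations (seconds) between
    consecutive sync points. Returns each transition's firing time."""
    t, times = 0, []
    for durations in intervals:
        t = sync_point_time(t, durations)
        times.append(t)
    return times

# Three intervals: {audio 5 s, video 5 s}, {text 2 s, audio 3 s}, {video 4 s}
print(schedule([[5, 5], [2, 3], [4]]))  # → [5, 8, 12]
```

The point of the structure is that the bandwidth profile the talk describes falls out of the same timeline: within each interval you know exactly which objects are concurrent, and therefore what capacity the interval needs.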
I use it for access control, for security, for storage management, I/O scheduling, and an end-to-end synchronization protocol for my networks. Okay, so what do I do? I have this metadata, so I've modeled it now, right. I need it in my network, for my OS, for my database, for my security. I adjust the quality of service of each of those components. Usually quality of service is associated with networks: what they can give you, what the delay, latency, capacity, bandwidth, and burstiness are that a network can handle, right. But I expand it, for the sake of this conversation, and talk about the OS, databases, and security as well, right. Then I say, okay, now I can do the resource management, because I know the profile of my data, I know its [indiscernible], its reliability requirements, its synchronization requirements. So now I can play with those, thanks. Okay, so this is an architecture that we talk about. This is the network part, and we're talking about end-to-end multimedia communications; how do we enable it? So this is, let's say, the network layer, the routing and allocation, the synchronization layer, the configuration management, information, location, identification, and the application layer. Of course, this is just an abstraction for us, trying to understand the various functionality that is required in applications of today. For example, let's pick a typical application that everyone is familiar with: Skype. Where does what happen? At the application layer level, most of the things, right. So Skype is an application that establishes a connection: it looks for a super node, and then figures out the network address translations and uses [indiscernible] to do that. Then it goes through a firewall, establishes a connection, selects a codec, and at the other end there are various presence indicators and things like that.
So all those things, and the OS, the database, the storage, in some distributed fashion, provide that for the application, okay. So basically, how do I determine the resources now? Because I now have the model, and I need to map those resources using that model. So at the top I'll introduce another term called Quality of Presentation, where I specify the synchronization, the frame rate, the buffer size that is required, as well as my presentation requirements. What the service provider, the network, can provide is called the quality of service. So I need a first mapping there. Then I need to talk about end-to-end resource allocation. I say end-to-end because the internet, our packet network, is a hop-to-hop network; the basic architecture traditionally used to say that all data is treated equally. You give me the data, I will try my best effort to take it to the destination. Of course there is a networking architecture, a layered architecture, that does that for us, right. Now we have this multimedia data, and network architects like myself who work in the multimedia domain say that not all data is equal. Some has higher priority; other data, if I lose it, fine, I can still manage. So we talk about those things, and I have worked in the area of how you manage graceful degradation in a resource-constrained environment. Let's say you wanted thirty frames per second video at a certain resolution, okay. But you say, okay, I can't have it because I don't have enough resources for it. Can I settle for a twenty-four frames per second video? Yes, maybe I can, right. So there has to be this kind of negotiation and re-negotiation that can happen. But if I lose a packet in the middle, and it is MPEG-2 encoded or MPEG-4 encoded or [indiscernible] encoded, and I could throw other names out here also.
If I lose some data, how is it going to impact my group of pictures, my slice, my layer, or my macroblocks, or my motion vectors? Those are the things that are used inside H.264 encoding. So I need to have an understanding of that. Not all data is equal there, right? Some data, if you lose it, I lose a group of pictures, I lose a whole frame. With other data, if I lose the same comparable fraction of data, I just lose a macroblock, and I see a black spot in the middle of a frame that is played out for one-thirtieth of a second. So the quality impact of that loss goes from one end of the spectrum to the other, right. So I have to figure that out, right; we need to do it intelligently. So, all data is equal? Not really, okay. So we can exploit, or leverage, that information and design certain protocols such that, okay, in a resource-constrained environment we have to degrade our performance or playout, but we do it in a manner which is least likely to impact the application. So that's an area of our research also, and there's scheduling and, finally, synchronization. The thing that I'll go into in a little bit more detail is end-to-end resource allocation. I've worked on the other parts also, but maybe for another time. After this I'll go into the security aspect of things, which is, in my opinion, one of the most paramount or most impactful things that we have to do today. We have to convey multimedia information reliably and securely, okay. I believe that that is an area of research, especially for multimedia: how do we leverage our knowledge of cryptographic techniques and apply them efficiently to multimedia traffic? Okay, so I'll investigate that part also and present a little bit of my work to you over there. Okay, so let's think of a black box approach.
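The "not all data is equal" point above can be illustrated with a toy drop policy for an MPEG-style group of pictures: shed B frames first, then P, then I, since I frames anchor the most dependent data. The frame sizes and the whole-frame-dropping policy are made up for the example; real encoders work at finer granularity (slices, macroblocks), as the talk notes.

```python
# Toy "not all data is equal" shedding policy: under a byte budget to
# free up, drop whole B frames first, then P, then I. Sizes invented.

DROP_ORDER = ["B", "P", "I"]  # least to most important frame type

def shed(frames, bytes_to_drop):
    """frames: list of (frame_type, size_bytes) tuples. Returns the
    frames kept after dropping, least important and smallest first."""
    kept = list(frames)
    for ftype in DROP_ORDER:
        candidates = sorted((f for f in kept if f[0] == ftype),
                            key=lambda f: f[1])
        for f in candidates:
            if bytes_to_drop <= 0:
                return kept
            kept.remove(f)
            bytes_to_drop -= f[1]
    return kept

gop = [("I", 900), ("B", 150), ("B", 140), ("P", 400), ("B", 160), ("P", 380)]
print(shed(gop, 300))  # sheds B frames before touching any P or I frame
```

The same ordering is what makes graceful degradation graceful: the decoder loses the frames whose absence is cheapest to conceal, rather than a random packet's worth of reference data.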
This is a source, that's a destination, and in between we have a network: of course hop-to-hop, or with resource reservation techniques, RSVP and the like, or an MPLS-provisioned virtual path that provides a certain degree of bandwidth, a bounded latency, and a bounded jitter. That is our quality of service. So, possible quality-of-presentation parameters used to formulate the end-to-end resource allocation problem: jitter, delay, synchronization, reliability, bandwidth, right. Those are the parameters that we use. So if I abstract the communication: I have multiple servers in the cloud somewhere, right; I have a network. Then I have various paths that the traffic would take; it could be the same path for the entire flow, like MPLS, or it could be a path per packet, like the internet abstraction, okay. For a problem like that we have worked on various objective functions, but there is one objective function where we say, let's talk about reliability. How do we reliably take this information and make it available to the destination, okay, end-to-end? So let's look at capacity, for example: it will require a certain bandwidth, and we have looked at the time-varying bandwidth profile; ultimately we need to dynamically allocate our resources to match that, okay. So these are certain parameters: we have N objects that need to be transported; N could be large, and if it is larger then it's a bigger, computationally harder problem to solve. We have the bandwidth requirement for each object, as specified in the model, and we have the reliability requirement of each of those objects. So the total capacity requirement at a given instant in the presentation would be, let's say, gamma: the sum of the requirements of all the simultaneous objects that we need to transport. Simple, okay.
If gamma is greater than the available capacity, right, then a certain degradation has to take place, right. How do we do it gracefully? So theta i is the percentage of the data that has to be dropped from each object, okay. How do we determine theta i for each component? Not all data is equal, but let's simplify the problem for the scope of this talk. Let's say we use an objective function of fairness across all objects. Let's say all objects will be [indiscernible] objects; let's not get into the issue of disparate objects. We say that we want to distribute the penalty in accordance with their reliability requirements. We formulate that as a non-linear programming problem, an NLP. We have ways and means to solve an NLP problem, so we say okay. Let's talk about the network and figure out that there is the path, and then there are the CPU cycles that need to be allocated as resources. So the network, the CPU, the storage part; and let's say theta i L K is the amount of penalty that is going to be imposed at each link, over the end-to-end connection, okay. So we can formulate this as an NLP problem with these objectives, where we want to minimize the mean-square penalty, okay, or distribute it justifiably across all the objects, okay. That's an objective function; we can change it depending on our requirements, right. So this is the minimization problem that is set up, and this is a flow chart for all of it. Allocate I/O capacity by solving the NLP, the I/O NLP, okay. Then the link NLP, and then, if there is an alternate path that needs to be selected, whether it is available. If all of the non-linear problems that we formulated before are satisfied, then we accept the connection, so this is like admission control: saying, for this connection, which would carry the traffic profile that we have modeled, can we provision it?
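To make the formulation concrete, here is a heavily simplified instance under assumptions of my own: a single capacity constraint, a quadratic penalty weighted by a reliability weight w_i (higher weight meaning drops hurt more), and no clipping of theta to [0, 1]. Under those assumptions the Lagrangian gives a closed form, so no general NLP solver is needed; the talk's actual formulation, with per-link penalties and path selection, is richer than this.

```python
# Simplified graceful-degradation allocation: minimize the weighted
# sum of squared drop fractions  sum_i w_i * theta_i**2  subject to
# shedding exactly the capacity deficit  sum_i b_i * theta_i = gamma - C.
# Lagrange multipliers give theta_i = lam * b_i / (2 * w_i).
# b_i = object bandwidth, w_i = reliability weight (both invented here);
# clipping theta_i to [0, 1] is ignored for the sketch.

def drop_fractions(bandwidths, weights, capacity):
    gamma = sum(bandwidths)
    deficit = gamma - capacity
    if deficit <= 0:
        return [0.0] * len(bandwidths)  # everything fits, no degradation
    denom = sum(b * b / (2 * w) for b, w in zip(bandwidths, weights))
    lam = deficit / denom
    return [lam * b / (2 * w) for b, w in zip(bandwidths, weights)]

b = [4.0, 2.0, 2.0]   # Mbps per object (gamma = 8)
w = [4.0, 1.0, 1.0]   # object 0 has the highest reliability requirement
theta = drop_fractions(b, w, capacity=6.0)
print(theta)  # the high-reliability object gets the smallest drop fraction
```

Note how the weighting does the "justifiable distribution" the talk describes: the shed bandwidth sums exactly to the 2 Mbps deficit, and the object with the stricter reliability requirement absorbs the least of it.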
Okay, we do it and then we go to the next decision point. So what are the decision points in the time-varying requirement that we have for capacity, because that is what we are looking at right now, the bandwidth, okay. These are our transition points, the points at which our requirements change. That is how we proceed with this. At these points we have to solve an NLP problem, but then we can make a tradeoff, because we don't want to be solving the NLP problem at every one of these transition points, right. So we can do a coarse-grained resource allocation, and it would go something like this: if there is a significant change in the bandwidth, then we solve the NLP problem and allocate resources accordingly, okay. So this is one of the aspects that I have looked at. I've simplified the problem just so that I can tell you what is required in resource management, or a part of it. Then comes synchronization; then come other factors as well, for making a cohesive application that involves multimodal input. Okay, so before the second part of this talk, should I give you an opportunity to ask questions? Are there any remote people? Can they call in and talk? >> Arjmand Samuel: They can… >> Shahab Baqai: Okay… >> Arjmand Samuel: Yeah, they can send in… >> Shahab Baqai: Formulate a question to you… >> Arjmand Samuel: Then I can also [inaudible]. >> Shahab Baqai: How are we doing for time, Arjmand? >> Arjmand Samuel: [inaudible] >> Shahab Baqai: Okay, okay, no questions; maybe I simplified it too much, or everything is really clear. The fact is that all of us have been impacted by some sort of degradation or quality constraint in the network while using certain applications. I believe all of us have either been watching a movie on Netflix or using conferencing software; and I don't know if you remember that the first adaptive protocol came out with RealAudio.
There was this RealAudio thing, but I don't see it anymore. What used to happen there was that, when you requested a connection, it would say "calculating buffer." So it was dynamically allocating resources. It tried to send out probes and get the current network performance, and that's a challenge in itself: how do you estimate the current network bandwidth and project it for the next allocation? And there are various factors that can dynamically impact that prediction. So it used to calculate the buffer and start the presentation, but then the network's congestion or characteristics would change, and it would hold the presentation and say "recalculating buffer," okay. Now applications don't report it to you, but they still try to do the same thing, and try to do it adaptively in a manner such that any degradation is hidden from the consumer, okay. So that's a challenge in itself: how do you conceal errors without degrading the resolution, the quality, the playout, the color, or whatever my quality of presentation criteria are? Okay, so now I'll switch tracks and talk about encryption. What I will talk about is selective encryption. All of us know cryptography, or all of us know about keys and key exchange; I'm assuming that here because I believe I'm talking to a very enlightened audience. And all of these things we use on a daily basis. We establish a secure connection using TLS, Transport Layer Security, or IPsec. We know about passwords in the clear, NTLM, Kerberos, those kinds of authentication mechanisms, things like that. So let's talk about security attacks on the internet. You'll notice a shift in the presentation style, because this is a presentation that I prepared for, and reused from, another conference.
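The "calculating buffer" step described above reduces to simple arithmetic once you have a throughput estimate: if the link is slower than the playout rate, prefetch enough media up front to cover the accumulated deficit. This sketch is my own illustration of the idea, with invented numbers; it assumes the throughput estimate holds for the whole clip, which is exactly the assumption that forced RealAudio's "recalculating buffer" pauses when conditions changed.

```python
# Toy startup-buffer calculation: given an estimated link throughput
# and the stream's playout bit rate, how many seconds of media must be
# prefetched so playback never underruns (if the estimate holds)?

def startup_buffer_seconds(playout_kbps, throughput_kbps, duration_s):
    """Seconds of media to prefetch before starting playback."""
    if throughput_kbps >= playout_kbps:
        return 0.0  # the network keeps up; start immediately
    # The deficit grows at (playout - throughput) kbps for the whole
    # clip, so the buffer must cover all of it up front.
    deficit_kb = (playout_kbps - throughput_kbps) * duration_s
    return deficit_kb / playout_kbps

# A 300 kbps stream over a 240 kbps estimated link, 60 s clip:
print(startup_buffer_seconds(300, 240, 60))  # → 12.0 seconds
```

Modern adaptive players avoid long prefetches by switching to a lower-rate encoding instead, which is the scalable-coding idea from earlier in the talk applied at the transport level.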
So let’s say we have this internet and there are various applications; some of them are out there that try and get, or eavesdrop, or secure certain information, or obtain certain information to which is not directly addressed or available for them. Okay, but this is available out there, it’s available in the network and they want to get this information. This is possible because initially when the network design was made or, so we didn’t think about security that much. But slowly and gradually we were to realize that this is an important concern there like that. So no longer we can be done in the open. We need security because there are critical applications out there. Our business transactions are out there, our bank accounts, our credit card information; on an individual scale and on larger scale sensitive information that we don’t want everybody to access, okay. So what we do is we use various techniques like that secure these at different levels, okay. So have become the D factor standard, like I was say that I’m assuming that everybody has used them at some point or the other, like that. What about multimedia support? I deal with multimedia so I say okay, we need to secure everything and how do we do it there? Well, multimedia as we had, as we saw in, and I wanted to convey to you also is, has different requirements for different media types but they are definitely more constrained than for let’s say traditional text shield data. There is a timing constraint, there is a reliability constraint, there is that certain information if you lose, the impact on quality is very profound. So how do you address that? 
There is significant overhead. If I apply traditional security protocols to multimedia streams, say a one-hour-long stream, then a roughly thirty percent computational overhead, which is a top-of-the-head estimate for traditional encryption methodology, is significant over a one-hour period. Can we do better? Can we minimize that in terms of computation, in terms of network footprint, in terms of latency? We would like to be able to, okay. So we know that there is high overhead. There is the bandwidth constraint, and there are constraints on latency and jitter. But the security protocols available out there impose all of these costs and impact the quality of the multimedia presentation, so we would like to do something about it, okay. We can talk about faster hardware, proprietary protocols, or selective encryption. I will concentrate on selective encryption: intelligently secure some of the data in a way that gives me privacy equivalent to applying the traditional ciphers to the entire stream, okay. I would like to do that in a manner that minimizes the CPU, the network overhead, and the latency incurred while doing encryption. So this is what I will talk about here. To me it seems like the best option because of my unwavering belief that not all data is equal, especially in multimedia traffic. Some data is more critical to the decoding process, to the play-out process, than other data. When I want to decode something, if some data in the clear does not give me much, I can treat it like noise, or at least evaluate whether I can treat it like noise, okay. So let’s look at a typical VPN: where does the overhead come from?
So let’s say this is a private network that needs to communicate over the internet in some secure manner. There is the usual data link layer, the network layer, the transport layer on top of it, and of course the application layer sits at the top, okay. The virtual networking interface, which enables us to establish a VLAN- or VPN-type configuration, sits alongside these. Basically you have a control plane for it: we need to do a key exchange, so that information goes out there, public and private keys, etcetera, depending on what sort of algorithms I am going to follow. Then there is the data plane, where the information is compressed, encrypted, and authenticated; those are the tasks performed in the data plane. From the private network, data comes in and is rerouted through that cloud rather than going directly from the network layer to the transport layer. The VPN daemon performs the compression, encryption, and authentication, and the data then goes through the transport and network stacks to the internet. Okay, so we want to make this more efficient for multimedia traffic. Why multimedia traffic? Because multimedia traffic is isochronous: a stream that requires high-bandwidth, low-latency, bounded-jitter communication for an extended period of time, right. So we observe that multimedia documents exist in compressed form, because you want to make representation, storage, or transport more efficient, okay. Examples of the codecs out there are MPEG, Ogg Vorbis, Theora, H.264; these are all compression standards, codecs, compressor-decompressors. Okay, they have a property. What does compression do? It removes redundancy from the data.
I concentrate the important information into a smaller footprint, in storage or on the network, and I devise a way of encoding and decoding it reliably, okay. I can talk about lossless or lossy, but all compression techniques want to do that. So, for example, in the Ogg container standard, Vorbis-coded files have the specific decoding information only in the first hundred bytes. The rest is compressed information; if only that is available, without the headers, I will not be able to decode it effectively. But we want to discover how effectively. If I just hide the first hundred bytes and let the other bytes go in the clear, what happens then? This is what we have investigated, and I will use the G.729 codec, which is a widely used codec in audio communications, as a representative example. If there are errors in those chunks, from attackers, or if you imagine those chunks are destroyed, they may make the decoding process impossible. We want to evaluate how impossible. What is the data in each compression standard that is so critical that without it we will not be able to decode the information, or the information is so degraded that we cannot attach any interpretation to it? This is what I want to evaluate. Note there is a difference between degradation from hiding some of the information and the quality of presentation I was talking about in the first part of this talk. There I was saying that if certain degradation has to occur, I try to hide or minimize it and still present the data. Maybe I see a noisy frame, but I can still make out whether it is a house, a car, or a person, whether it is day or night, right? From a security perspective I want to introduce sufficient distortion.
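The “hide only the first hundred bytes” idea can be sketched in a few lines. This is a toy: the SHA-256 counter keystream stands in for a real cipher such as AES, and `HEADER_LEN` simply mirrors the hundred-byte figure from the talk; nothing here is an actual Ogg/Vorbis tool.

```python
import hashlib

HEADER_LEN = 100  # the talk's example: decoding info lives in the first 100 bytes

def keystream(key: bytes, n: int) -> bytes:
    """Toy counter-mode keystream built from SHA-256 (illustrative, not AES)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def selective_encrypt(stream: bytes, key: bytes) -> bytes:
    """Encrypt only the header region; the compressed tail travels in the clear."""
    head = bytes(a ^ b for a, b in
                 zip(stream[:HEADER_LEN], keystream(key, HEADER_LEN)))
    return head + stream[HEADER_LEN:]
```

Because XOR with the same keystream is its own inverse, calling `selective_encrypt` again with the same key recovers the original, which makes the cost asymmetry easy to see: only 100 bytes are ever ciphered, however long the stream.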
That removes all contextual interpretation from the data. From an encrypted stream I should not be able to get any information: not just a deteriorated or noisy image; I want to hide the data completely, right. That is my target. So when I talk about quality, that is from a different perspective. It is the same situation, data hidden or data lost, but from a quality perspective I am trying to minimize the degradation, while from a security perspective I am trying to hide the interpretation of the data from anyone without the necessary key or decoding process, okay. I want to make decoding impossible without my key. The success of these schemes depends on the amount of data that needs to be secured. If it is substantially smaller, then I get certain gains: fewer processing cycles allocated for the encryption and the subsequent decryption, lower latency, and a smaller network footprint, okay. I also need to prove that the privacy is preserved. That’s the hard part; provable security is the hard part. So what I will do is simulate certain attacks and ask: what information did the attacker get out of it? I’ll try to keep it concise and not bore you with too many details. The objectives are that this selective encryption, this data hiding, should be effective, efficient, practical, and economical, and privacy should be ensured throughout. Performance should not be compromised. Design, architecture, and coding requirements should not involve major functional changes, because if I have to implement major functional changes, who will adopt it unless there is some very significant gain? So it should be standards-compliant in some way, okay.
It should also be cheaper than the value of what is being secured: if the resources I allocate to hide the information are so significant, then who cares, right? Well, of course I care, and you care. Okay, anyway, let’s talk about voice communication, with a representative example from the audio compression standards, the G.729 codec. PSTN calls are considered inherently secure. The Public Switched Telephone Network, for those of us who don’t know what PSTN stands for, is a connection-oriented infrastructure. A physical connection is established and all the connections are owned by the telco. If anybody snoops on that, it is the telco itself that can monitor calls and things like that. So let us assume that is not happening. But when we talk about Voice over IP, packet-switched communication over a public infrastructure like the internet, then I want to secure the voice and video calls. Of course I can secure the entire stream using the very robust encryption standards out there, like RSA or other algorithms. But what I have tried to establish is the need for scalable partial encryption, okay. So let’s look at typical call establishment in a SIP scenario. SIP, the Session Initiation Protocol, is one of the most widely used open call-establishment procedures out there. If a user wants to establish a call through the public network to another user, it sends a SIP request, depending on whether the VoIP server is acting as a proxy or just facilitates presence, monitoring, and things like that. The request goes to the server, user translation takes place through SIP, and there is a phone number associated with each user that can be dialed, which the server knows about.
The user accepts the call, codec negotiation takes place using SDP, the Session Description Protocol, the session is established, and then payload data flows. Very simplified, but for the purposes of our understanding this is what happens, okay. My interest right now is to secure this payload part. Of course, who is talking to whom, and when, is an important consideration as well; I want to hide that too, but let’s leave it for another time and keep things simple: payload only, okay. So let’s look at one of the most widely used codecs, G.729. Why is it the most widely used? It has about a six-to-ten kilobits per second network footprint, low complexity, and real-time implementations are available, though they do demand a certain royalty, and it has toll-quality voice. For a telephonic conversation this is fine; it is not a high-fidelity codec, but for Voice over IP communication it is the most widely used codec out there, okay. Of course there are wideband as well as narrowband codecs out there too, but I decided to use this one as an example. We have worked with other codecs as well; while summing up I’ll talk about what’s out there and what the next steps are. So what this codec does is take ten-millisecond audio frames; each frame corresponds to eighty samples, okay. That is the sampling, and then I look at the bit allocation in the G.729 stream: across bits zero to sixty-four, what parameters are being encoded where? So the first step is to identify the important parameters that are required to reconstruct an audio signal, to decompress an audio bit stream compressed with this codec. That is the first step.
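Assuming the standard 8 kHz narrowband telephony sampling rate (not stated explicitly in the talk), the ten-millisecond, eighty-sample frame figures above work out as follows; the 80-bit compressed frame matches the eight-frames-of-eighty-bits batching described later.

```python
# Frame arithmetic for the G.729 figures quoted in the talk.
SAMPLE_RATE_HZ = 8000   # standard narrowband telephony rate (assumed, not stated)
FRAME_MS = 10           # frame duration, from the talk
BITS_PER_FRAME = 80     # bits in one compressed frame (eight such frames = 640)

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000     # 80 samples per frame
frames_per_second = 1000 // FRAME_MS                      # 100 frames per second
bitrate_kbps = BITS_PER_FRAME * frames_per_second / 1000  # 8.0 kbps payload rate
```

The 8 kbps payload rate sits inside the six-to-ten kilobits per second network footprint mentioned above once transport headers are added.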
There were two prior approaches proposed for encrypting this partially. Servetti and De Martin identified that if we secure forty-five percent of this data, it is equivalent to full encryption. Compared with encrypting the whole bit stream, forty-five percent is a fifty-five percent gain, and that’s good, okay. They also found that if we secure thirty percent, it removes the intelligibility of the speech. I can still probably make out whether the speaker is male or female, the pitch, or when there has been a pause, but I cannot understand what is being spoken. Sometimes the requirement is just to hide the intelligible speech, okay; we’re not worried about pauses and the other information that can be extracted from a speech signal. So thirty percent, not bad, okay. They also identified, for each bit of the sixty-four-bit packet, the ten-millisecond samples, what mean opinion score (MOS) value it corresponds to; this is how you measure the quality of a voice call, okay. They determined: if I remove this bit there is this much impact, and if I encrypt that bit, what happens? Another work was done by Hae Yang, who has a patented codec with five levels of selective encryption; they evaluate what to hide and how much encryption to do. So we said, okay, can we use their work? Two levels of protection from Servetti and De Martin, five levels of protection from the other work; actually those two levels correspond to a class-three and a class-two scheme in terms of equivalent privacy. Can we do better? So our result was: encrypt only five percent of the content, rather than the forty-five or the thirty percent.
We based it on direct hash-map shuffling, and the shuffling key changes after every eight G.729 frames, okay. Using this, and changing the encrypted hash table under some secure mechanism, enables us to encrypt only five percent of the content while, we claim, obtaining privacy equivalent to full encryption. To us that is a significant result. Okay, so I’ll talk about the important and unimportant fields that we have experimented with and identified. This is just a listing; I can go back and talk about each of these. They correspond to the pitch, the PCM samples, and the tone parameters. We said these fields are important and these fields are unimportant, okay. If we look at it, all the fields on this side are unimportant, and over here we have the very important and medium-important fields. So what is our encryption technique? We buffer frames, collecting a number of them so we have sufficient data; otherwise a brute-force attack would work. If I had only eight bits to encrypt, there would be only 256 combinations, okay. So I need a sufficient payload size, but I should not buffer so many frames that latency suffers. Remember, if I have a lip-sync requirement, my latency cannot exceed eighty milliseconds. So I have to decide how many frames to buffer so that it does not impact the voice quality of the application. Then we extract the critical and non-critical data, the fields we have identified, and we do a hash-table lookup and shuffling, which diffuses the information over the entire stream so that it becomes unintelligible, okay.
So we get the G.729 audio stream; this is the streaming stack. We group together eight frames of eighty bits each, and this is our batch encryption: six hundred and forty bits go to the splitter, which splits the critical and non-critical fields, okay. We take the critical fields, do a hash-table shuffling, and basically give them to standard AES encryption, okay. Then we reassemble the bit stream to be standards-compliant. That is steps one through four of phase one, okay. In the next phase the critical fields are now AES-encrypted and the rest are in the clear; those eight frames are reassembled into the six hundred and forty bits to conform to the standard, okay. Then we apply a hash-based shuffler with a key extracted from the non-critical bits. Of course the key changes every batch; those non-critical bits we treat like noise, because even if they are available in the clear they do not contribute to deciphering or to the intelligibility of the bit stream, okay. The shuffled bits are given to the assembler and that output is transported. So, phases one, two, three of the second stage, and then on to the transport layer, okay. Decoding each set of eight frames therefore requires the following: the 256-bit key that I extracted from the unimportant bits, which are transported in the clear; the 304-byte-long entry of the hash table; and decryption of the thirty-two bits of selectively encrypted data in the important fields, for which I use standard AES. I use a standard cipher to secure that, which means the strength of the scheme now depends on that cipher and on the extraction of the 256-bit key.
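A minimal sketch of the two-phase batch scheme described above, with loud caveats: the critical-bit positions are invented for illustration (the real map is the G.729 field table discussed in the talk), Python’s `random` stands in both for AES and for the hash-table permutation, and, to keep the round trip self-contained, the shuffle here permutes only the critical positions, keyed by a hash of the clear non-critical bits, rather than the full 640-bit block.

```python
import hashlib
import random

FRAME_BITS = 80                          # one G.729 frame, per the talk
BATCH_FRAMES = 8                         # frames grouped per batch
BATCH_BITS = FRAME_BITS * BATCH_FRAMES   # 640 bits per batch
# Hypothetical "important" positions; the real map is G.729's field table.
CRITICAL = [i for i in range(BATCH_BITS) if i % 16 < 4]
CRIT_SET = set(CRITICAL)

def _clear_bits(bits):
    """The non-critical bits, in order; they travel in the clear."""
    return bytes(bits[i] for i in range(BATCH_BITS) if i not in CRIT_SET)

def encrypt_batch(bits, secret):
    out = list(bits)
    # Phase 1: encipher only the critical bits (random() stands in for AES).
    ks = random.Random(secret)
    for i in CRITICAL:
        out[i] ^= ks.getrandbits(1)
    # Phase 2: shuffle the enciphered critical bits among their own positions,
    # keyed by a hash of the clear bits, so the shuffle key changes per batch.
    order = list(range(len(CRITICAL)))
    random.Random(hashlib.sha256(_clear_bits(out)).digest()).shuffle(order)
    vals = [out[CRITICAL[j]] for j in order]
    for pos, v in zip(CRITICAL, vals):
        out[pos] = v
    return out

def decrypt_batch(bits, secret):
    out = list(bits)
    # Re-derive the per-batch shuffle key from the untouched clear bits.
    order = list(range(len(CRITICAL)))
    random.Random(hashlib.sha256(_clear_bits(out)).digest()).shuffle(order)
    shuffled = [out[p] for p in CRITICAL]
    restored = [0] * len(CRITICAL)
    for k, j in enumerate(order):
        restored[j] = shuffled[k]
    for pos, v in zip(CRITICAL, restored):
        out[pos] = v
    # Undo phase 1 with the same keystream.
    ks = random.Random(secret)
    for i in CRITICAL:
        out[i] ^= ks.getrandbits(1)
    return out
```

The point the sketch preserves is the cost structure: only the critical subset is ever ciphered, the non-critical bits double as per-batch key material, and the output stays the same length as a standard batch.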
Okay, so how do we verify that this is provably secure? We set up a test bed of three servers: one located in Washington DC, one in Chicago, and one in Lahore. The Voice over IP implementation used Asterisk, and we established a long bit stream through the three legs. We monitored the stream with the selective encryption in place, we simulated an insider attack, a packet sniffer, and a man-in-the-middle attack, and we did a brute-force analysis. The key mapped to the hash table changes every eight G.729 frames, and recovering the shuffling patterns stored in the hash table would require exploring 304 factorial permutations, a very large number, okay. The attacker would require on the order of two to the power of fourteen hundred operations to perform brute force, right. So against brute-force attack this is equivalent to, rather better than, full encryption; though of course we could do full encryption using the same technique and claim the same results, okay. So this is what we did with partial security, selective encryption. But we didn’t just stop there, because, as I demonstrated, for the G.729 codec certain important fields are there in the bit stream. What about the G.726 codec? What about Speex? What about other audio codecs? What about video? For each codec we would need a selective encryption scheme, a whole library of algorithms. Just as a user negotiates a codec, saying my capabilities are G.729, G.726, G.721, [indiscernible], what are yours?, and we agree on a certain codec, or we transcode, or we have a gatekeeper somewhere in the middle that translates one codec to the other, we could at the same time also negotiate a security scheme associated with that codec and communicate it; but that is sending out too much information.
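The size of the shuffle keyspace quoted above can be sanity-checked: log2(304!) bounds the work of brute-forcing a 304-entry shuffle table. A straight computation gives an exponent somewhat above the two-to-the-fourteen-hundred figure quoted in the talk, which may already discount structure available to the attacker; this block only shows how the factorial keyspace translates into equivalent key bits.

```python
import math

def log2_factorial(n):
    """log2(n!) via the log-gamma function, avoiding huge intermediate integers."""
    return math.lgamma(n + 1) / math.log(2)

HASH_ENTRIES = 304  # shuffle-table entries, per the talk
key_bits = log2_factorial(HASH_ENTRIES)  # equivalent key length in bits
```

For comparison, AES-256 offers 256 bits; any keyspace in the thousands of bits is far beyond brute force, so the per-batch key change, not keyspace size, is the interesting part of the design.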
So our next step was to investigate what the commonalities are: can we find certain information which is critical to the decoding process and common to all codecs? Yes, we did. What is it? Entropy encoding. All of the compression standards at some point or another employ entropy encoders, like arithmetic coders and similar schemes. So securing, or diffusing, the information needed by that decoding stage gives us coverage across the board, and there is some work that we have published associated with that as well. Not just that: what does SRTP do? It secures the Real-time Transport Protocol stream. In multimedia communication, either RTP or Ogg or some such wrapper stream is used to give us the synchronization, the bandwidth utilization, the quality of experience, RTP along with RTCP, okay. With SRTP we can secure the payload; but can we do partial SRTP? The answer is yes. Using the same approach we identify the elements that are common to all RTP streams, assuming either RTP or Ogg. So we have reduced the number of variations we need in order to provide a holistic security mechanism which is more efficient and maps easily onto the heterogeneous devices out there. Because from handheld devices to large devices we have all of these, with different computational and network constraints, and we would like to map these algorithms onto all of them, right. The final advantage we get is less power consumption, because less computational overhead means less power overhead, and that is very critical in certain applications. Okay, so, just winding up. I gave you two samples of my work in multimedia communications. I do other work also; in network resource management I posed a very simple problem and presented it to you, and hopefully you all understood and appreciated it a little bit.
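The claim above, that entropy-coded data is fragile and its early decoding state is disproportionately critical, is easy to demonstrate with any DEFLATE stream. zlib here is just a convenient stand-in for the audio and video codecs discussed; it is not one of them, but it uses Huffman-based entropy coding in the same spirit.

```python
import zlib

# A highly redundant "stream", compressed with DEFLATE (Huffman entropy coding).
payload = b"multimedia streams repeat themselves " * 50
packed = zlib.compress(payload)

# Corrupt only the very first byte, where the decoder's setup state lives.
broken = bytes([packed[0] ^ 0xFF]) + packed[1:]

def decodes(blob):
    """True if the blob still decompresses; False if decoding is impossible."""
    try:
        zlib.decompress(blob)
        return True
    except zlib.error:
        return False
```

One flipped byte at the front renders the whole stream undecodable, while the same flip in the raw payload would damage one character; that asymmetry is exactly what selective encryption exploits.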
I work on scheduling for synchronized delivery, source-based scheduling, and what happens in the network. Next-generation networks have intelligent processing and storage mechanisms, like the content-centric networks, CCN, that are one of the NSF initiatives to develop next-generation networks. How do these protocols, this architecture for enabling multimedia, translate to such networks? What are the resource management constraints and requirements? How can we leverage this on next-generation networks? Then security, and graceful degradation; those are the different areas I investigate in multimedia communication. Recently, on my sabbatical, I am exploring semantics, attaching meaning to video and audio: activity classification and recognition, using tools and motion-based cues to index, retrieve, and classify activities in a video sequence, especially applied to surveillance videos. With that I wind up my talk. Any questions? Thank you, and [indiscernible], that’s my language, so, if there are any questions please go ahead. >>: Question, so you were mentioning encrypting certain bits. I assume those bits exist in essentially every packet, or a group of packets… >> Shahab Baqai: Yes, yes. >>: Because if you are doing packet-level selective encryption, I would assume that cryptanalysis would expose at least some content… >> Shahab Baqai: Right, right. We have done cryptanalysis too; I just didn’t show the results. There is linear cryptanalysis and differential cryptanalysis: what statistical information can you get out of the stream? That is why the key shuffling needs to happen, to diffuse the pattern. Good ciphers are the ones where, if I encrypt the same stream multiple times, I get a different diffused stream each time, right?
That defeats differential cryptanalysis; otherwise the attacker can extract the pattern. As I said, these are long-duration streams; if a stream lasts thirty minutes, then twenty-five minutes in I could extract the pattern and decode the twenty-sixth minute. But because we shuffle the key, the result is not the same pattern every time, okay. >>: That leads to my next question, which is: what happens under packet loss then? >> Shahab Baqai: Okay… >>: Do you start losing… >> Shahab Baqai: Right, so if you lose a packet, which is a collection of eight frames, how do you recover? This is something we haven’t investigated so far: you lose all eight frames because they are grouped together and ciphered together. So if you lose the important part, through packet loss, bursty traffic characteristics, or fading channels, that is something we will be investigating further. We have done work on graceful degradation, on bursty packet loss and fading channels and those things: you spread or shuffle your information so that successive packet losses do not compound, so the degradation is graceful. That is work I didn’t present here, but we haven’t applied it to encryption yet, so we will. Thank you. >>: [inaudible] whole packet [inaudible] end to end? >> Shahab Baqai: Right, all eight frames. >>: Eight frames. >> Shahab Baqai: Right, because then you decode that batch and then the next one, the next phase. But I want to be able to do better, and can I? The answer is yes, by shuffling some of the header bits with the data. But right now we wanted to be standards-compliant.
Anybody looking at that stream would see a G.729-encoded stream, but would not be able to decode it, okay. >>: I have a question about the selection of critical and non-critical data within the stream. My understanding is it is closely tied to the codec that you’re… >> Shahab Baqai: Yes. >>: But do you have any suggestion of what kind of data becomes critical and what kind of… >> Shahab Baqai: Okay, so for example in audio: the PCM samples and the pitch, and this would be different for Speex and G.729. For video it is entirely different; I do more work in video and my collaborators work in audio. If you hide the DC level in a macro-block, because most of the energy is concentrated there, and this is codec-specific, so for H.264, MPEG-4 Part 10, codecs the macro-block’s DC level is critical to decoding the macro-block, right? But if you hide only that, you can still get some edges, and we humans are very good at interpretation, so we can tell it is a tree, a texture, a house, those kinds of things. The work we have done on video codecs revolves around playing with the entropy fields, or the RTP fields, so that you are not able to decrypt the payload of an RTP stream: hiding or scrambling the RTP headers in a way that makes you lose synchronization and recoverability of the stream, but in a manner such that the intended recipient can recover it. Of course there are all the other factors I didn’t cover here: how do you do the key exchange without compromising the key, or exchange the hash table without compromising it? Here we are assuming that we use standard public-private key encryption for that; we do it once and move to the next step, okay. There are details in all of this that I am happy to share, and I can refer you to our publications in case you are interested. >> Arjmand Samuel: Anything else?
Let us thank the speaker. Thank you very much. [applause] >> Shahab Baqai: Thank you very much for taking the time.