
>> Arjmand Samuel: Okay, so it is my pleasure and honor to introduce Dr. Shahab Baqai. Our paths
have crossed a number of times with profound consequences really, ha, ha. We co-founded a
startup fifteen years back, looking at network design and software design for some of the
problems he's working on right now. Then he went into academia. He was Dean of the School of Science
and Engineering and Chair of the Computer Science Department at Lahore University of
Management Sciences. Right now he's on a sabbatical with the University of Illinois at Chicago. He'll talk about some of
the research that he's doing currently. So without further ado, let's welcome
Dr. Baqai to Microsoft Research.
>> Shahab Baqai: Thank you so much. I hope the talk has sufficient breadth to appeal to the disparate
audience here, and to the people sitting online, and hopefully they'll be able to get something out of this talk.
So the title says Multimedia Communications: An End-to-End Perspective. Actually I'll be starting off with
some applications. Then I will talk about the basic problem, what we have solved, and I'll interpose this
with the work that I do, and some of the results that we have achieved there.
So primarily I will cover two areas, because of the time constraints and other factors. I don't want to take the
whole day trying to tell you about my research for the last decade or so. But I will give you some of the
areas that I have looked into, and implemented, and gotten some results in. The primary emphasis of
this talk will be to present to you a basic understanding so that the challenges ahead can be
identified. So that's my target for this talk.
So, well, ubiquitous multimedia, that is what we have dreamed of. Some of the applications are giving it
to us already. Don't go too far back; the things that you would have imagined a decade ago
are a reality now. Specifically, GPS combined with mapping and labeling, Google Maps
and the like, Microsoft maps, the search base, etcetera. You can have multi-modal interfaces. You can
talk to a device, your personal assistant, your handheld phone, and run basic queries. Ten years ago,
if you had told me that, well, researchers were dreaming of those things but they weren't a reality;
now they are. The computational power in your hand or in your
pocket is far beyond the computational power of servers twenty years ago.
When I learned programming I used punch cards to enter data and we would get the output the next day,
printed on a piece of paper saying that there was a syntax error at line number six. So your program
did not run, anyway. So to me, if I just sit back and think about where we have come, the
progress has been tremendous. But are we there yet? Well, you guys and me included, because I like to
think of myself as of this generation too, imagine what we could do further. How can we scale up things? If one
video on demand application is running, how about a million users on it? Is it possible now?
Somewhat, but there are constraints: quality, resource constraints, constraints on how much we can scale
up, how many problems we can solve, to what level we can solve them. So the challenges are still there. Of
course the nature of those challenges has changed. But I'll try to give you a feel for the
current challenges, in my mind, that need to be addressed.
So this is a 2003 map of the world's ISP providers, what kind of
backbones they have, and things like that. You can't see it, but it's just to give you an idea about the
growth of the communication medium that we have, which is the primary motivating factor for the information
revolution that we are benefiting from today.
So multiple areas, design and engineering, CAD, CAM, communications, learning, entertainment,
business, all have been impacted by that. Okay, so just as an example application, a health care
application. Everybody can appreciate health care, right. Typically a cohesive application that has the
patient record, the prescriptions, the historical data, as well as the evolution of the patient's health over a
period of time, so temporal evolution as well. Patients, physicians, caregivers,
they can all access the data, maybe in real time. Somebody is sitting at home wearing a sensor and
their ECG is being monitored by physicians anywhere in the world, or by caregivers somewhere else.
Okay, is it possible now? Yes it is. Is it holistically an application that delivers to the physicians and the
customers? No, there is still some work to be done, most of all in the user interfaces, in the communication
medium, in the high resolution that is required by diagnostics and treatment. So, are we
there yet? Getting there, but of course there is still something to be desired. Will the result of all this
data be available to my health insurance providers? Would my premium go up or down based on my
health status now, or my health history? What information can I get out of it; could we improve our
diagnostic techniques for cancer patients, for example? So a group at LUMS is working with a cancer
hospital in Lahore trying to figure that out, using data mining techniques to
enhance the diagnostic procedures that we have over there.
Now, let's do a little bit more complicated scenario, the Department of Defense information infrastructure.
What are the system features that they envision? At the heart of it all is the communication infrastructure, which provides ubiquitous end-to-end, high bandwidth, high resource service to all
applications that require it. Hence the anyplace, anytime kind of thing at a global scale. So that is what
we are looking at, and this is just a collage of the various things that they do or need to do over here.
Since the networks are part of it and they also fall under multimedia communications, sensors can
take a sample of the environment, or take real time video feeds that can be analyzed
at a remote location, or consumed for surveillance purposes, or monitoring purposes, or
environmental prediction purposes, or any of these combined together, okay. So critical decisions
would require, or are anticipated to require, real time multimedia applications behind them,
like a cardiogram and multiple doctors coordinating or collaborating together to diagnose a patient,
maybe right before open heart surgery, or angiography, or angioplasty.
So those are the things that we are looking at. At the bottom of it, that is what I do. I try to figure
out how to deliver multimedia data with the required playout quality, with the required real time
constraints, from one location to one or more remote destinations. Now we have the
cloud services, the middleware layer that hides the network abstractions or the sources, and
provides them as a service to the users. So what are the challenges there, right?
So we have this network which is heterogeneous but needs to be interoperable. Video, audio,
documents, traditional text, or high resolution video, all of that is required to be communicated there, and
these are the constraints. These are my requirements: I require the right quality for the device to play it
out, which may not be the same on a handheld, as opposed to a tablet, as opposed to a PC, okay.
Security of course is paramount; I require security at a number of levels. My payload needs to be
encrypted; nobody should be able to hear what I'm saying, nobody who's not intended for that
communication. Of course the intended user needs to be able to decode and decrypt it, okay.
So unintended participants should not be there, eavesdropping should not be there, okay. How do I do
that? Sometimes it's just the payload, but at other times I don't even want anyone to know who's talking to
whom, at what times. So connection establishment or negotiation: what are the capabilities, what are
the codecs the other person can use? Inherently I want to advertise those. I want to announce, look,
this is my capability, this is what I can use. But I don't want to announce it to other users who can
exploit that information, so security at a number of levels.
Then I want synchronization: in a cloud there is an audio server and a video server, located maybe at
geographically disparate locations. But there has to be a lip sync between the audio that is coming in and
the video that is going out. Maybe it is a different audio server, which is
optimized for audio in some way, and a different video server; both have different constraints. Audio
has different bandwidth, latency, and jitter requirements. Video is more tolerant to, let's say,
jitter, but requires higher bandwidth. Different requirements, but they need to be synched
together over the same underlying networking infrastructure.
That is what we'll be talking about here. How do we accomplish that? Concurrency: I have certain data
that has been generated and is available at different locations at the same time. Somebody
collaborates over that, changes certain things; who has control over it? Those kinds of applications. As an
example, this is something that we already see today. We can go to a map, we can use that map,
we can get real time video feeds, we can look at the street view, we can get the traffic. If there is a
crash somewhere we can get that information in real time and make decisions based on it, okay.
So the end-to-end perspective for multimedia communications is quality
management across various systems, across the board. Heterogeneous systems, okay, with different
requirements, different constraints, okay. So we require the network, the databases, the end-system
architecture, the security, to all work together to deliver the right experience, to the right person, at
the right time, in the right location, and I can go on with a couple of these things.
Okay, challenges, past, current, and anticipated, okay. We need to model these documents. We need
to figure out what resources to allocate. If you want to make it cost effective, the resource allocation,
because some of the resources will be expensive, has to be done dynamically, at the right time, for the
right user, okay. Okay, so it requires high bandwidth; why don't I over-provision the network? Why don't
I just use a dual-core or quad-core machine instead? Throw more megaflops at the problem. Sure, that is
what the service providers did initially for
multimedia applications. But then there was one user, ten users; what about a thousand users using
the same data, connecting to the same data stream, or connected to the same servers? In the search
business this is what happens. Microsoft Bing search gets millions of hits; this is just a top-of-my-head
figure, right, I'm imagining that they do. Okay, I intentionally didn't use Google here.
[laughter]
So, okay, we need to model the data, and model it in a manner so that the next layer, which is going to
do efficient clustering, indexing, search kinds of techniques, can use that information, which can
determine, okay, at what time do I require what resource? So I need to know the data, I need to
model it, I need to figure out what are the deadlines for a video frame to reach the destination for
high quality playout, so that there is no motion jitter or break in the user experience. Okay, and
it has to be done in a distributed manner. It has to be done somewhere along the cloud that we
have.
Now, network caching and synchronization techniques for this distributed multimedia: does the network
handle it, does the end system handle it? There are various ways that people do it, sometimes in the
network, sometimes in the overlay network, sometimes at the application level, okay. So the first thing
is how do you model the data, given the various requirements. I will talk about synchronization and
resource provisioning first, because that is one area that I have investigated in detail.
Okay, and multimedia communications is a coming together of various disciplines, as I would call it. You
need to know signal processing. You need to know information theory. How to model your source, how
to represent your source; so all these codecs. A communication protocol like SIP or H.323 uses,
at the bottom of the layer, a payload encoded in a certain manner for efficient transmission,
efficient storage. Are they the same thing? Not necessarily, so they have their
own sets of requirements. Are source coding and channel coding the same thing? No, and people have
been doing them independently.
At the source, what do you want to do for storage and transmission? You want to remove the
redundancy in your information, represent it in a summarized, concise manner so that it requires the
minimum amount of resources for a certain quality of playout. That is what is required. Now, once
you've put it onto a network, let's say it involves a wireless medium, which has a fading channel, or
bursty packet drop is possible over there. Therefore, for channel coding, they add redundancy: FEC,
forward error correction. What does it do? At a high abstraction level, if I take
the top-down view of it, we are adding redundancy to the data so that if I lose a certain amount of data I
can reconstruct it from adjacent data, right.
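To make the channel-coding idea concrete, here is a minimal Python sketch of the simplest such scheme, a single XOR parity packet per group. The packet contents are made up for illustration, and real systems use far stronger codes such as Reed-Solomon or LDPC; this only shows how added redundancy lets one lost packet be rebuilt from the adjacent data.

```python
# Toy forward error correction: one XOR parity packet per group lets the
# receiver reconstruct any single lost packet. Illustrative sketch only.

def add_parity(packets: list[bytes]) -> list[bytes]:
    """Append an XOR parity packet computed over equal-length packets."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return packets + [bytes(parity)]

def recover(packets: list[bytes | None]) -> list[bytes]:
    """Reconstruct at most one missing packet (marked None) via XOR."""
    missing = [i for i, p in enumerate(packets) if p is None]
    assert len(missing) <= 1, "XOR parity can repair only one loss per group"
    if missing:
        rebuilt = bytearray(len(next(p for p in packets if p is not None)))
        for p in packets:
            if p is not None:
                for i, b in enumerate(p):
                    rebuilt[i] ^= b
        packets[missing[0]] = bytes(rebuilt)
    return packets[:-1]  # drop the parity packet

group = add_parity([b"voice frame 1...", b"voice frame 2...", b"voice frame 3..."])
group[1] = None  # simulate a bursty drop of the second packet
assert recover(group)[1] == b"voice frame 2..."
```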
So, hang on: source coding removes redundancy, channel coding adds redundancy, independently. So
researchers said, okay, let's marry them together for a more optimized, more efficient
representation. But then I have to encode my, let's say, video to be transmitted over a high quality link
versus a wireless link separately. How do I handle that? Do I have ten different encodings stored there,
or in real time do I cater for [indiscernible] encoding, or are they negotiated in real time? People came up
with scalable video encoding for heterogeneous platforms: you encode it in layers. You get
more layers for higher quality, fewer layers for lower quality. Yes, there is a little
negotiating overhead, but a tolerable negotiating overhead. So we have scalable video coding.
So, source coding. How do we model the data to encapsulate all of this; that, okay, if I'm doing video
along with audio, lip synchronization has a delay requirement of eighty milliseconds? If I go beyond that,
then real time conversation breaks down. Okay, so people work on understanding and evaluating the
quality of various media independently or jointly. I will call them inter-stream and intra-stream;
intra-stream means within the stream itself. There's a video stream; what is the quality requirement
there? What is the spatial frame size? What frames per second do I need? Is it [indiscernible],
NTSC, PAL, or QCIF? Those are just names for different qualities of video, just video.
But inter-stream would mean video and audio together. Audio is coded using, let's say, the MP3 codec. These are
names of codecs; in case anything is unclear, please feel free to interrupt me at any time. Okay, so I need to model this data.
So what modeling methods do we use? I need models for spatio-temporal synchronization, content
representation, quality attributes, access, and so many other features; I have just come up with a list. What
are the models available out there? Language-based models such as SGML, XML now; object-oriented models;
graphical models, which I have worked on. So this is an example of a browsing graph, a dynamic browsing
graph. Let's look at a certain time frame in that browsing graph.
I see that there is audio in there. There is text and video interposed, with certain synchronization
requirements. Of course there is the bandwidth profile, so I'm just looking at what kind of bandwidth
they require. It's a varying bandwidth profile. Whenever I have video along with audio, that is the
highest bandwidth capacity that I would require for a network to efficiently carry it from one point
to the other. That is what I'm looking at, right. I need to know this so that I can efficiently allocate
resources in real time to this, rather than saying, okay, this is the top capacity that I
require, fine, let me reserve that capacity. Fine, I can do it for one connection. I can do it for a hundred
connections maybe, but what about a million connections? My network would say, no more, and that's
where I get denial of service, or I need to apply access control. Okay, so many users logged in, sorry,
service is busy, come back and view this video another time. We all know what happens then. We, of
course, never go back and watch that video again.
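As a rough illustration of why the time-varying profile matters, here is a minimal Python sketch comparing peak reservation against tracking the profile. The object intervals and rates are hypothetical illustration values, not figures from the talk.

```python
# Allocating the peak of a time-varying bandwidth profile wastes capacity
# that dynamic allocation can reclaim. Hypothetical presentation objects:

objects = [  # (start_s, end_s, bandwidth_kbps) for each media object
    (0, 30, 64),     # audio narration, active throughout
    (5, 20, 1500),   # video clip, only active mid-presentation
    (0, 30, 16),     # text/overlay stream
]

def profile(objects, horizon_s):
    """Capacity required at each second: the sum over active objects."""
    return [sum(bw for s, e, bw in objects if s <= t < e)
            for t in range(horizon_s)]

gamma = profile(objects, 30)
peak = max(gamma)              # static allocation reserves this at all times
avg = sum(gamma) / len(gamma)  # dynamic allocation tracks the profile
print(f"peak {peak} kbps vs average {avg:.0f} kbps "
      f"({100 * (1 - avg / peak):.0f}% reclaimable)")
```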
Okay, so content at the right time: if I don't get it when I require it, it doesn't serve my purpose. Okay, so
I have worked on a model called the Petri net model, and I call it the Object Composition Petri Net. I
was part of a group that worked with this model. It modeled the experience of a
multimedia presentation, okay. Then, how do we provision it for networks? So we have various attributes,
like this is a video, what quality, what codec. But that is going to get fine-grained, so I'll stop
there. So how do I use this? I use it in a database schema. I use it for access control, for security, for
storage management, IO scheduling, and an end-to-end synchronization protocol for my networks.
Okay, so what do I do? I have this metadata, so I've modeled it now, right. I need it in my network,
for my OS, for my database, for my security. I adjust the quality of service of each of those components.
Usually quality of service is associated with networks: what they can give you, what is the delay, latency,
capacity, bandwidth, burstiness that a network can handle, right. But I expand it, for the sake of
conversation, and talk about the OS, databases, and security, right.
Then I say, okay, now I can do the resource management. Because I know the
profile of my data, I know its [indiscernible], its reliability requirements, its synchronization requirements,
now I can play with those, thanks. Okay, so this is an architecture that we talk about. This is the
network part, and we're talking about enabling multimedia communications. How do we enable it?
So this is, let's say, the network layer, the routing and allocation, the synchronization layer, the
configuration management, information, location, identification, and the application layer. Of course,
this is just an abstraction for us, trying to understand the various functionality that is required in
applications of today. For example, let's pick a typical application that everyone is familiar with.
Skype: where does what happen? At the application layer level, most of it, right. Skype is an
application that establishes a connection; it looks for a supernode, then figures out
[indiscernible] versus NAT translations and uses [indiscernible] to do that. Then it goes through a firewall,
establishes a connection, selects a codec, and at the other end there are various presence indicators and
things like that. So all those things, and the OS, the database, the storage, in some distributed fashion,
provide that for that application, okay.
So basically, how do I determine the resources now? Because I now have the model, and I need to map
resources on top of that model, or using that model, and provide that. So at the top I'll introduce
another term called Quality of Presentation, where I specify the synchronization, the frame rate, the
buffer size that is required, as well as my presentation requirements. What the service
provider, the network, can provide is called the quality of service. So I need a first mapping
there. Then I need to talk about end-to-end resource allocation. I say end-to-end because the internet,
our packet network, is a hop-by-hop network; the basic architecture traditionally used to say
that all data is treated equally. You give me the data, I will try my best effort to take it to the destination.
Of course there is a networking architecture, a layered architecture, that does that for us, right. Now we
have this multimedia data, and network architects like myself who work in the multimedia
domain say that not all data is equal. Some has higher priority; other data, if I lose it, fine, I can still
manage. So we talk about those things, and I have worked in the area of how do you manage graceful
degradation in a resource-constrained environment. Let's say you wanted thirty frames per second video
with a certain NTSC resolution, okay. But you say, okay, I can't have it because I don't have enough
resources for it. Can I settle for a twenty-four frames per second video? Yes, maybe I can, right. So
there has to be this kind of negotiation, re-negotiation, that can happen. But if I lose a packet in the
middle, and it is MPEG-2 encoded or MPEG-4 encoded, [indiscernible] encoded, I can throw
other names here also.
If I lose some data, how is it going to impact my group of pictures, my slice, layer, or my macro-blocks, or
my motion vectors? Those are the structures used inside H.264 encoding. So I need to have an
understanding of that. Not all data is equal there, right? Some data, if you lose it, you lose a group of
pictures, you lose a whole frame. For other data, if I lose the same comparable fraction, I just lose a
macro-block, and I see a black spot in the middle of a frame that is played out for one thirtieth of a
second. So the perceived quality impact of that loss goes from one end of the spectrum to the other, right.
So I have to figure that out, right. We need to do it intelligently. So all data is equal; not really, okay.
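Here is a minimal Python sketch of that "not all data is equal" idea: under a byte budget, drop the least-important coded units first. The priority ordering (I-frame data before motion vectors before isolated macro-blocks) mirrors the intuition in the talk; the exact units, sizes, and budget are hypothetical.

```python
# Priority-based graceful degradation: keep what matters most to decoding.

from dataclasses import dataclass

@dataclass
class Unit:
    name: str
    size: int      # bytes
    priority: int  # lower number = more important to decoding

units = [
    Unit("I-frame slice", 4000, 0),      # losing this loses a group of pictures
    Unit("motion vectors", 600, 1),
    Unit("P-frame macro-block", 300, 2),
    Unit("B-frame macro-block", 300, 3), # losing this: one black spot, 1/30 s
]

def degrade(units: list[Unit], budget: int) -> list[Unit]:
    """Keep the most important units that fit within the byte budget."""
    kept, used = [], 0
    for u in sorted(units, key=lambda u: u.priority):
        if used + u.size <= budget:
            kept.append(u)
            used += u.size
    return kept

survivors = degrade(units, budget=4900)
print([u.name for u in survivors])  # least-important macro-block dropped, GOP kept
```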
So we can exploit or leverage that information and design certain protocols, so that in a
resource-constrained environment, when we have to degrade our performance or playout, we know how
to do it in the manner least likely to impact the application. So that's an area of our research
also, along with scheduling and finally synchronization. The thing that I'll go into in a little
bit more detail is end-to-end resource allocation. I've worked on the other parts also, but maybe for another
time. After this I'll go into the security aspect of things, which is, in my opinion, one of the most
paramount or most impactful things that we have to do today. We have to convey multimedia
information reliably and securely, okay.
So I believe that is an area of research, especially for multimedia. How do we leverage our
knowledge of encryption and cryptographic techniques and apply them efficiently to multimedia traffic?
Okay, so I'll investigate that part also and present a little bit of my work to you there. Okay, so let's
think of a black box approach. This is a source, that's a destination, and in between we have a network,
of course hop-by-hop, or with resource reservation techniques like RSVP, or
an MPLS-provisioned virtual path that provides a certain degree of bandwidth, a bounded
latency, and a bounded jitter. That is our quality of service. So the possible quality of presentation
parameters used to formulate the end-to-end resource allocation problem are jitter, delay, synchronization,
reliability, bandwidth, right. Those are the parameters that we use.
So if I abstract the communication: I have multiple servers in the cloud somewhere, right; I have a
network. Then I have various paths that the traffic could take; it could be the same path for the entire
flow, like in MPLS, or it could be a path per packet, like the internet abstraction, okay. So we have worked
on problems like that, and there are other objective functions, but for one objective
function we say, let's talk about reliability. How do we reliably take this information and make it
available to the destination, end-to-end? So let's look at capacity, for
example; it will require a certain bandwidth, and we have looked at the bandwidth profile, the time
varying bandwidth profile, and ultimately we need to dynamically allocate our resources to match that,
okay.
So these are certain parameters: we have N objects that need to be transported, and N could be large. If
it is larger, then it's a bigger, computationally harder problem to solve. We have the bandwidth
requirement for each object, as specified in the model, and we have the reliability
requirement of each of those objects. So the total capacity requirement at a given instant in the
presentation, let's say gamma, is the sum of the requirements of the simultaneous
objects that we need to transport; simple, okay. If gamma is greater than the available capacity, right,
then a certain degradation has to take place, right. How do we do it gracefully? So theta i is the
percentage of the data that has to be dropped from each object, okay. How do we determine theta i for
each component? Not all data is equal.
But let's simplify the problem for the scope of this talk. Let's say we use
an objective function of fairness across all objects. Let's say all objects will be
[indiscernible] objects; let's not get into the issue of disparate objects. We say that, in
accordance with their reliability requirements, we want to distribute the penalty across them. We formulate
that as a non-linear programming problem, an NLP. We have ways and means to solve an NLP problem,
so we say okay. Let's talk about the network and figure out the path. Then there are the CPU
cycles that need to be allocated as resources. So the network, the CPU, the storage part; and, let's
say, theta i superscript l,k is the amount of penalty that is going to be imposed at each link along the
end-to-end connection, okay.
So we can formulate this as an NLP problem with these objectives, where we want to minimize the
mean-square penalty, okay, or distribute it fairly across all the objects, okay. So that's an objective
function, and we can change it depending on our requirements, right. So this is the minimization
problem that is set up, and this is a flow chart of all of this.
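For readers who want the formulation spelled out, here is a reconstructed sketch in LaTeX of the problem as described. The notation, and in particular the reliability weighting in the objective, is an assumption pieced together from the surrounding description, not the exact published form.

```latex
% Sketch of the graceful-degradation NLP: N objects, object i needs
% bandwidth b_i with reliability requirement r_i; theta_i is the fraction
% of object i that is dropped; C is the available capacity.
\begin{align*}
\Gamma &= \sum_{i=1}^{N} b_i
  && \text{total capacity required at this instant} \\
\min_{\theta}\ & \frac{1}{N}\sum_{i=1}^{N}\left(\frac{\theta_i}{r_i}\right)^{2}
  && \text{mean-square penalty, weighted by reliability} \\
\text{s.t.}\ & \sum_{i=1}^{N}(1-\theta_i)\,b_i \le C
  && \text{degraded traffic fits the available capacity} \\
& 0 \le \theta_i \le 1 \quad \forall i
\end{align*}
```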
So: allocate IO capacity by solving the NLP, the IO NLP, okay. Then the link NLP, and then see if
there is an alternate path that needs to be selected, or if one is available. If all of the
non-linear problems that we formulated before are satisfied, then we accept the connection; so this is
like admission control. Saying, okay, for this connection, which has the traffic profile that we
have modeled, can we provision it? Okay, we do it, and then we go to the next decision point. So
what are the decision points? The points in the time-varying requirement that we have for capacity,
because that is what we are looking at right now, the bandwidth, okay.
So these are our transition points, the points at which our requirements change. That is how we proceed
with this. At these points we have to solve an NLP problem, but then we can make a tradeoff,
because we don't want to be solving the NLP problem at each of these transition points,
right. So we can do a coarse-grained resource allocation, and it would go something like this: if
there is a significant change in the bandwidth, then we solve the NLP problem and allocate resources
accordingly, okay. So this is one of the aspects that I have looked at.
I've simplified the problem just so that I can tell you what all is required in resource management, a part
of it. Then comes synchronization; then come other factors as well, in making a cohesive application
that involves multi-modal input.
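Here is a minimal Python sketch of that coarse-grained tradeoff: re-solve the expensive NLP only when the required bandwidth changes by more than a threshold, instead of at every transition point. The solve_nlp stand-in and the 20% threshold are illustrative assumptions, not values from the talk.

```python
# Coarse-grained resource re-allocation: only re-solve on significant change.

def solve_nlp(required_kbps: float) -> float:
    """Placeholder for the expensive non-linear program; returns an allocation."""
    return required_kbps  # pretend the optimizer allocates exactly what's asked

def coarse_grained(transitions: list[float], threshold: float = 0.20) -> int:
    allocation = solve_nlp(transitions[0])
    solves = 1
    for required in transitions[1:]:
        if abs(required - allocation) / allocation > threshold:
            allocation = solve_nlp(required)  # significant change: re-solve
            solves += 1
        # else: keep the current allocation and absorb the small change
    return solves

# Ten transition points, but only the big jumps trigger a re-solve.
points = [1000, 1050, 980, 1600, 1580, 900, 940, 1500, 1490, 1510]
print(coarse_grained(points), "NLP solves instead of", len(points))
```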
Okay, so before the second part of this talk, should I give you an opportunity to ask
questions? Are there any remote people? Can they call in and talk?
>> Arjmand Samuel: They can…
>> Shahab Baqai: Okay…
>> Arjmand Samuel: Yeah they can send in…
>> Shahab Baqai: Formulate a question to you…
>> Arjmand Samuel: Then I can also [inaudible].
>> Shahab Baqai: How are we doing for time Arjmand?
>> Arjmand Samuel: [inaudible]
>> Shahab Baqai: Okay, okay, no questions; maybe I simplified it too much, or everything is real clear.
The fact is that all of us have thought about this, and all of us have been impacted by some sort of degradation,
quality constraints in the network, while using certain applications. I believe that all of us have either
watched a movie on Netflix or used conferencing software; or I don't know if you guys remember, the
first adaptive protocol came out with RealAudio. There was this RealAudio thing, but I don't see it
anymore, things like that.
What used to happen there was that, when you requested a connection, it would say calculating buffer. So
it was dynamically allocating resources. It tried to send out probes and estimate the current network
performance, and that's a challenge in itself. How do you estimate the current network
bandwidth and project it for the next allocation? There are various factors that can dynamically impact
that prediction. So it used to calculate the buffer and start the presentation, but the
network's congestion or characteristics would change. Then it would hold the presentation
and say recalculating buffer, okay. Now the applications don't report it to you, but they still try to do the
same thing, and try to do it adaptively, in a manner so that any degradation is hidden from the consumer,
okay. So that's a challenge in itself. How do you conceal errors without degrading the resolution,
the quality, the playout, the color, or whatever my quality of presentation criteria are?
Okay, so now I'll switch tracks and talk about encryption. What I will talk about is selective
encryption. All of us know cryptography, all of us know about keys and key exchange; I'm assuming
that here because I'm talking to a very enlightened audience, I believe. And all of these
things we use on a daily basis. We establish secure connections using TLS, Transport Layer Security,
or IPSec. We know about authentication mechanisms like NTLM, Kerberos, those
kinds of things.
So let's talk about security attacks on the internet. You'll notice a shift in the presentation style, because
this is a presentation that I prepared and reused from another conference. So let's say we have
this internet and there are various applications; some of them are out there that try to eavesdrop,
or obtain certain information which is not directly
addressed to or available for them. Okay, but it's available out there, it's available in the network, and
they want to get this information. This is possible because initially, when the network design was made,
we didn't think about security that much. But slowly and gradually we came to realize that this is
an important concern.
So things can no longer be done in the open. We need security because there are critical applications out
there. Our business transactions are out there, our bank accounts, our credit card information; on an
individual scale, and on a larger scale, sensitive information that we don't want everybody to access, okay.
So what we do is use various techniques to secure these at different levels, okay. These have
become the de facto standard; like I was saying, I'm assuming that everybody has used them at some
point or other.
What about multimedia support? I deal with multimedia, so I say, okay, we need to secure everything;
how do we do it there? Well, multimedia, as we saw, and as I wanted to convey to you,
has different requirements for different media types, but they are definitely more constrained
than for, let's say, traditional textual data. There is a timing constraint, there is a reliability constraint;
there is certain information that, if you lose it, the impact on quality is very profound. So how do you
address that? There is significant overhead to it. If I apply traditional security protocols
to multimedia traffic and multimedia streams, let's say a one-hour-long stream, there is, let's say, about a
thirty percent computational overhead, which is a top-of-the-head figure for traditional
encryption methodology; thirty percent overhead over a one-hour period is a significant overhead.
Can we do better? Can we minimize that in terms of computation, in terms of network footprint, in
terms of latency? We would like to be able to do that, okay. So we know that there is a high
overhead. There is the bandwidth constraint that we have. There are constraints on latency and jitter
that we have. But the security protocols that were available out there impose all of these, and they impact
the quality of the multimedia presentation.
So we would like to do something about it, okay. We can talk about faster hardware, proprietary
protocols, or selective encryption. I will concentrate on selective encryption: basically, intelligently
secure some data in a way that gives me privacy equivalent to what I could have obtained by applying
traditional ciphers to the entire stream, okay. I would like to do that in a manner that minimizes the
CPU and network overhead, and the latency incurred while doing encryption, okay.
So this is what I will talk about here. To me it seems like the best option, because of my unwavering
belief that not all data is equal, especially in multimedia traffic, okay. Some data is more critical to the
decoding process, to the playout process, than other types of data, okay. From the deciphering
point of view, when I want to decode something, if there is some data in the clear which does not give
much away, I can treat it like noise, or I can evaluate whether I can treat it like noise, okay.
So let's look at a typical VPN: where does the overhead come from? Let's say this is a private
network that needs to communicate over the internet in some secure manner. There is the usual data link
layer, the network layer, and on top of it the transport layer; of course the application layer sits on
top, okay. The virtual networking interface which enables us to establish a VLAN type of
configuration, or a VPN configuration, sits alongside this. Basically you have a control plane for it.
We need to do a key exchange, so that information goes out there, public and private keys, etcetera,
depending on what sort of algorithms I am going to follow. Then there is the data plane, where the
information is compressed, encrypted, and authenticated; all those tasks are performed in the
data plane. From the private network, data comes in and is rerouted through that path, rather
than going directly from the network layer to the transport layer. The
VPN daemon performs the compression, encryption, and authentication mechanisms, and then the data goes
through the TCP stack, or the transport stack, the network stack, to the internet, out there.
Okay, so we want to make this more efficient for multimedia traffic. Why multimedia traffic? Because
multimedia traffic is isochronous: there's a stream that requires high
bandwidth, low latency, low jitter, bounded jitter, types of communication for an extended period of time,
right. So we want to do that, and we observe that multimedia documents exist in
compressed form, because you want to make the representation more efficient, or storage more efficient,
or transport more efficient, okay. These are examples of the codecs that are there: MPEG, Ogg Vorbis, Theora, H.264. All of these are compression standards or codecs, compression and
decompression.
Okay, they have a property. What does compression do? It reduces the redundancy in the
data: I concentrate the important information into a smaller footprint, in storage or on the network,
and I devise a way of encoding and decoding it reliably, okay. I can talk about lossless or
lossy, but all compression techniques want to do that. So, for example, the Ogg standard,
which is the [indiscernible] standard, and Vorbis-coded files have their specific decoding information only in
the first hundred bytes. The rest is compressed information; if only
that is available, without the headers, I will not be able to decode it effectively. But we want to discover
how effectively. If I just hide the first hundred bytes and let the other bits go in the clear, what happens
then? This is what we have investigated, and I will use the G.729 codec, which is a widely used codec in
audio communications, as a representative example here.
So if there are errors in the chunks, from attackers, or imagining that those chunks are destroyed,
they may make the decoding process impossible. We want to evaluate how impossible. What is the data
in each of the compression standards that is so critical that without it we will not be able to decode
the information, or the information is so degraded that I will not be able to
attach any interpretation to that data? So this is what I want to evaluate
there.
So there's a difference between the degradation, or deterioration, from hiding some of the
information, versus the quality of presentation that I was talking about in the first part of this
talk. There I was saying that, okay, if certain degradation has to occur, I try to hide
it or minimize it and still present the data. Maybe I see a noisy frame, but I can still make out: is it a
house, is it a car, is this a person, is it day, or is it night, right? From a security perspective I want to
introduce sufficient distortion that removes all contextual interpretation from the data. From an
encrypted stream I should not be able to get any information; not just a deteriorated or noisy image,
I want to hide the data completely, right.
So that's my target. So when I talk about quality there, it's from a different perspective. It's the same data
being hidden or lost. But from a quality perspective I'm trying to minimize the
degradation; from a security perspective I'm trying to hide the interpretation of the data without the
necessary key or decoding process, okay. So I want to make the decoding
impossible without my key, okay. The success of these schemes depends on the amount of data that
needs to be secured. If it is substantially smaller, then I will get certain gains out of it: the processing
cycles that need to be allocated for the encryption and the subsequent decryption, the
latency, and the network footprint, the overhead of all this, would be less, okay.
I need to prove that the privacy is preserved. That's the hard part; provable security is the hard part,
okay. So what I will do is say, okay, let me simulate certain attacks and see what
information I can get out of them. We'll see that; I'll try to keep it concise and not bore you with
too many details. So the objectives are that this security, this sort of encryption or data hiding, should
be effective, efficient, practical, economical, and privacy should be ensured over all of this. Performance
should not be compromised, okay. Design, architecture, and coding requirements should not involve major
functional changes, because if I have to implement major functional changes, who will adopt it, unless
there is some significant gain out of it? So it should be standards compliant in some way, okay. And
cheaper than the cost of the thing being secured: if the cost of securing the information, the
resources that I allocate, are so significant, then we end up in a "who cares what I say" paradigm. Well,
of course I care, and you care, what I say.
Okay, anyway, so let's talk about voice communication, because a representative example would be from
the audio compression standards, the G.729 codec. PSTN calls are inherently secure; PSTN is the Public
Switched Telephone Network, for those of us who don't know what it stands for. It's a connection-oriented
infrastructure. A physical connection is established and all the connections are owned by the Telco. If
anybody snoops on that, it's the Telco itself that can monitor calls and things like that. So let us assume
that that's not happening.
But when we talk about Voice over IP, or over-IP communications, packet-switched communication over
a public infrastructure like the internet, then I want to secure the voice and video calls over it. Of
course I can secure the entire stream using the very robust encryption standards that are out there, like
RSA or other algorithms. But what I have tried to establish is the need for scalable
partial encryption, okay.
So let's look at a typical call establishment in a SIP scenario. SIP, the Session Initiation
Protocol, is one of the most widely used, open, call establishment procedures
out there. A user wants to establish a call through the public network to another user, okay. It
sends a SIP request; depending on whether the VoIP server is acting as a proxy or just facilitates
presence, monitoring, and things like that, the request goes there, goes to the user. The user translation
takes place through SIP; there is a phone number associated with each user that can be dialed, and
the server knows about it. The user accepts the call, some codec negotiations take place using SDP,
the Session Description Protocol, the session is established, and then the payload data flows. Very
simplified, very simple, but for the purposes of our understanding this is what happens there, okay.
So my interest right now is to secure this part. Of course, who is talking to whom, and when are they
talking? That is an important consideration as well; I want to hide that too. But let's leave it for
another time and keep things simple: payload only, okay.
So let's look at one of the most widely used codecs, G.729. Why is it the most widely used codec? It has
about a six to ten kilobits per second network footprint, low complexity, and has real time
implementations available out there. Of course they do demand a certain royalty, and it
has toll quality voice. For telephonic conversation this is fine; it is not a high
fidelity codec, but for voice over IP communication it is the most widely used codec out there, okay.
Of course there are wideband codecs as well as narrowband codecs out there too. But I
decided to use this as an example. We have worked with other codecs as well; while summing
up this talk I'll mention what's out there and what the next steps are.
So what this codec does is take ten-millisecond audio frames; ten milliseconds
corresponds to eighty samples, okay. So that's the sampling. If I look at the
bit allocation in the G.729 stream, bits zero to sixty-four, what are the
parameters being encoded over there? So the first step is to identify the important
parameters that are required to decode this, to reconstruct an audio signal, to decompress an audio bit
stream that is compressed using this codec, okay. So that is the first step.
There were two approaches proposed for encrypting this in a partial manner. Servetti and
De Martin identified that if we secure forty-five
percent of this data, it is equivalent to full encryption. Compared to encrypting the whole bit stream,
forty-five percent, yes, that's a fifty-five percent gain, and that's good, okay. They also said, okay, if we secure thirty
percent, it removes the intelligibility of the speech. I can still probably make out that the
speaker is a male or a female, or the pitch, or when there has been a pause in the speech. But I
cannot understand what is being spoken. Sometimes there's just this requirement, that all I need to
do is hide the intelligible speech, okay; we're not worried about the pauses and
other information that can be extracted from a speech signal. So thirty percent, not bad, okay.
And they identified, for each bit of the packet, the sixty-four bits in the ten
millisecond samples, what mean opinion score (MOS) value it corresponds to. This is how you measure
the quality of a voice call, okay. They determined, okay, if I remove this bit there is this much impact,
and if I encrypt that bit then what happens, okay?
Another work was done by Hae Yang; they have a patented codec out there that has five
levels of selective encryption. They evaluate what they hide and how much
encryption you do with it. So we said, okay, can we use their work? So: two levels
of protection from Servetti and De Martin, five levels of protection there; actually those two
levels correspond to class three and class two protection, or equivalent privacy, that way. Can we do
better? Our result, what we did, was to encrypt only five percent of
the content rather than the forty-five percent, the thirty percent, and the various others. We based it
on direct hash-mapped shuffling, and the shuffling key changes after every eight G.729 frames, okay.
So using this, and changing the encrypted hash table under some secure mechanism, enables us to
encrypt only five percent of the content while getting, we claim, privacy equivalent to
full encryption. So to us that is a significant impact.
Okay, so I'll talk about the important fields and unimportant fields that we have experimented
with and identified. This is just a listing; I can go back and talk about each of these. They
correspond to the pitch, the PCM samples, and the tone. So we said that these
fields are important, these fields are unimportant, okay. If we look at it, all the fields on this
side are unimportant, and these are the very important and medium-important fields that we have over here.
So what's our encryption technique here? We buffer these frames, collecting a number
of frames so that we have sufficient data; otherwise a brute-force attack would work. If I had only eight bits to
encrypt, there are only two hundred fifty-six combinations, okay. So I need a sufficient payload size, but I
should not be buffering so many frames that the latency they correspond to gets too large. Remember, I have a lip
sync requirement, so my latency cannot exceed eighty milliseconds. So I have to
decide how many frames to buffer so that it doesn't impact the quality of the application, the voice
quality of the application. Then: extract critical and non-critical data out of it, the fields we have talked about
and identified. We do the hash table lookup and shuffling so as to diffuse the information
over the entire stream so that it becomes unintelligible, okay.
So we get the G.729 audio stream; this is the streaming stack. These are the eight frames, eighty bits each,
that we group together; this is our batch encryption. So six hundred and
forty bits are given to the splitter, which splits the critical and non-critical fields, okay. We take the critical fields
from there, okay, and we do a hash table shuffling; basically we give them to standard AES encryption,
okay. Then we reassemble the bit stream to be standards compliant, okay. So that is the first step, the
second step, the third, and the fourth step of phase one, okay.
In the next phase we take that through: the critical fields are now AES encrypted, the rest of them are in
the clear, okay. The stream is reassembled, those eight frames are reassembled into the six hundred forty bits,
to conform to the standard, okay. Then we run a hash-based shuffler, okay, with a certain key that is
extracted from the non-critical bits, okay. Of course the key changes, because those non-critical bits
we're going to treat like noise; even if they are available out there they will not contribute
to the deciphering or the intelligibility of the bit stream, okay. These are the shuffled bits, which are
given to the assembler, and that output is transported, okay. So phases one, two, three of the
second phase, okay, and it is given to the transport layer, okay.
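Here is a minimal Python sketch of that two-phase pipeline. The critical bit positions, the toy stand-in for AES, and the key-derivation details are all illustrative assumptions; the real system AES-encrypts the identified critical fields and drives the shuffler with a key extracted from the bits sent in the clear.

```python
# Sketch of selective encryption over a batch of eight G.729 frames.

import hashlib
import random

FRAME_BITS = 80               # one 10 ms G.729 frame
GROUP = 8                     # frames per batch: 8 x 10 ms = 80 ms lip-sync budget
CRITICAL = set(range(0, 32))  # hypothetical critical bit positions per frame

def split(frame: list[int]):
    crit = [b for i, b in enumerate(frame) if i in CRITICAL]
    rest = [b for i, b in enumerate(frame) if i not in CRITICAL]
    return crit, rest

def toy_aes(bits: list[int], key: bytes) -> list[int]:
    """Stand-in for AES: XOR with a hash-derived keystream (NOT secure)."""
    stream = hashlib.sha256(key).digest()
    return [b ^ ((stream[i // 8] >> (i % 8)) & 1) for i, b in enumerate(bits)]

def shuffle(bits: list[int], key: bytes) -> list[int]:
    """Hash-keyed permutation diffusing the batch; the key comes from the
    non-critical bits, so it changes with every batch of eight frames."""
    order = list(range(len(bits)))
    random.Random(hashlib.sha256(key).digest()).shuffle(order)
    return [bits[i] for i in order]

def encrypt_batch(frames: list[list[int]], aes_key: bytes) -> list[int]:
    # Phase 1: split each frame, protect the critical fields, reassemble.
    assembled, noise_bits = [], []
    for frame in frames:
        crit, rest = split(frame)
        assembled += toy_aes(crit, aes_key) + rest
        noise_bits += rest
    # Phase 2: derive the shuffle key from the in-the-clear bits and diffuse.
    shuffle_key = bytes(noise_bits[:32])
    return shuffle(assembled, shuffle_key)

frames = [[random.randint(0, 1) for _ in range(FRAME_BITS)] for _ in range(GROUP)]
cipher_bits = encrypt_batch(frames, aes_key=b"session-key")
assert len(cipher_bits) == FRAME_BITS * GROUP  # still a standards-sized batch
```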
So decoding each set of eight frames requires the following information: a 256-bit
key that is extracted from the unimportant bits, which are transported in the clear, okay; the 304-byte-long
entry of the hash table; and the decryption of the thirty-two bits of selectively encrypted
partial data in the important fields, the AES encryption; I need to decrypt those. Okay, and I use a standard
cipher to secure that, which means that the strength of the scheme now depends
on that cipher and on the extraction of the 256-bit key.
Okay, and we said, okay, how do we verify that this is provable security? So we set up a test bed of
three servers, one located in Washington DC, one in Chicago, and one in Lahore. The
voice over IP implementation used Asterisk, and we established a long bit stream through it.
On the three legs we monitored the selectively encrypted stream, and we tried to simulate the
insider attack, the packet sniffer, the man-in-the-middle attack, and we did a
brute-force analysis on it. The key mapped to the hash table changes every eight G.729 frames. The
shuffling patterns stored in the hash table would require exploring 304 factorial
permutations, a very large number, okay. The attacker would require on the order of two to the power
fourteen hundred operations to perform brute force, right. So against a brute-force attack this is equal
to, rather better than, full encryption.
But of course we can do full encryption using the same technique and claim the same results, okay.
So this is what we did with partial security, or selective encryption. But we didn't just stop
there, because, as I demonstrated for the G.729 codec, certain important fields are there in the bit stream.
What about the G.726 codec? What about Speex? What about other audio codecs? What about video?
So for each codec we would have a selective encryption scheme, like a library of algorithms. Just like a
user goes and negotiates a codec, okay, my capabilities are G.729, G.726, G.721,
[indiscernible], what are yours? Okay, we agree on a certain codec, or we transcode, or we have a
gatekeeper somewhere in the middle that translates one codec to the other, okay.
So we negotiate that; do we then also negotiate a security scheme associated with it and
communicate that? But that is sending out too much information. So our next step was to investigate
and explore the commonalities: can we find certain information which is critical to
the decoding process and common to all codecs? Yes, we did. What is that? Entropy encoding:
all of the compression standards at some point or other employ entropy encoders, like
arithmetic coders, [indiscernible] coders, and those kinds of things.
So securing the decoding process of that, or diffusing the information available to the decoding
process, gives us coverage across the board, and there is some work that we have published
on that as well. Not just that; what does SRTP do? It secures the Real-time Transport Protocol
stream. In multimedia communication, either RTP or Ogg or some such wrapper stream is used to give us
the synchronization, the bandwidth utilization, the quality of experience; RTP along with RTCP, okay. So
with SRTP we can secure the payload. But can we do partial SRTP? The answer is yes: using the
same approach we identify the fields that are common to all RTP streams, assuming either
RTP or Ogg. So we have reduced the variations that we need to secure, and we give a holistic security
mechanism which is more efficient and is easily mapped to the heterogeneous devices out
there. Because, from handheld devices to large devices, we have all of these with
different computational and network constraints, and we would like to map all these algorithms to them, right.
The final advantage that we get is less power consumption, because less computational
overhead means less power overhead. That is very critical in certain applications. Okay, so just winding up:
I gave you two samples of my work in multimedia communications. In multimedia communication I
do other work also; in network resource management I posed a very simple problem and presented it to
you. Hopefully you all understood it and appreciated it a little bit. I do scheduling for
synchronized delivery, source-based scheduling, and what happens in the network, in the
next generation networks which have intelligent processing and storage mechanisms,
like the CCN content-centric networks that are one of the initiatives by NSF to
develop next generation networks.
How do these protocols, this architecture, this enabling of multimedia, translate to networks
like that? What are the resource management constraints and requirements? How can we leverage this
on the next generation networks? So security and graceful degradation; those are the
different areas that I investigate in multimedia communication. Recently, on my sabbatical, I am exploring
semantics, attaching meaning to video and audio, like activity classification and recognition, using
motion-based cues to index and retrieve, or classify, activities in a video sequence, especially
applied to surveillance videos.
With that I wind up my talk. Thank you, and [indiscernible], that's my language. Any questions? If
there are any questions please go ahead.
>>: Question, so you were mentioning encrypting certain bits. So I assume those bits exist in sort
of every packet or a group of packets…
>> Shahab Baqai: Yes, yes.
>>: Because if you do packet-level selective encryption I would assume that any cryptanalysis
would expose at least some content…
>> Shahab Baqai: Right, right, so we have done cryptanalysis too, I just didn't show the
results. There is linear cryptanalysis and differential cryptanalysis. So what is the
statistical information that you can get out of that? That is why the key shuffling
needs to happen, to diffuse the pattern. Good ciphers are the ones where, if I take the same
stream multiple times, I will get a different diffused stream, right. That defeats differential
cryptanalysis; we have to show that, otherwise one can extract the pattern. As I said,
these are long duration streams; if they last thirty minutes, then
twenty-five minutes in I could extract the pattern and decode the twenty-sixth minute of the stream.
But because we shuffle the key, the results are not the same pattern every time, because the key
shuffles, okay.
>>: That leads to my next question, which is, what happens with packet loss then?
>> Shahab Baqai: Okay…
>>: And you'll start losing…
>> Shahab Baqai: Right, so if you lose a packet, which is the collection of eight frames, how do you
recover from it? This is something that we haven't investigated too far, but you
lose all eight samples because they're grouped together and ciphered together. So if you lose the
important part, with packet loss, bursty traffic characteristics or fading channels, that is
something that we will be investigating further. Because we have done graceful degradation and talked
about bursty packet loss or Rayleigh fading channels and those things: if you lose a packet, how do you
recover from it? You spread your information, or you shuffle your information, so that successive
packet losses do not impact it, or the degradation is graceful. So that's work that I didn't present here.
But we haven't applied it to encryption yet, so we will, thank you.
>>: [inaudible] whole packet [inaudible] end to end?
>> Shahab Baqai: Right, all eight samples.
>>: Eight samples.
>> Shahab Baqai: Right, because then you'll decode that one and then the next one, the next batch. But,
like, I want to be able to do better, and can I do better? The answer is yes, I can do better by shuffling
some of the headers in with the data. But right now we wanted to be standards compliant.
Anybody looking at that stream would see a G.729 encoded stream, but will not be able to decode it,
okay.
>>: I have a question about the selection of critical and non-critical data within the stream. My intuition is
that it is more related to the codec that you're…
>> Shahab Baqai: Yes.
>>: But do you have any suggestion of what kind of data becomes critical and what kind of…
>> Shahab Baqai: Okay, okay, so for example in audio: the PCM samples, the pitch; and this would be
different for Speex and G.729. For video it is entirely different; for example, in video, because I do
more work in video and my collaborators do work in audio: if you hide the DC level in a macro-block,
because most of the energy is concentrated there; codec specific, so for H.264 codecs, MPEG-4 Part
10, the macro-block's DC level is critical to the decoding of the macro-block, right.
But if you hide only that, you can still get some edges, and we humans are very good at interpretation. So we
can tell it is a tree or a texture, or a house, and those kinds of things.
And the work that we have done on video codecs revolves around playing with the entropy fields or
the RTP fields, so that you're not able to decrypt the payload in an RTP stream, okay: hiding
the RTP headers, or scrambling those headers in a way that makes you lose synchronization and
recoverability of that stream, but in a manner that the intended recipient can recover from.
Of course there are all those other factors that I didn't cover here. How do you do the key exchange
without compromising the key, or how do you exchange the hash table without compromising the key?
Here we are assuming that we are using standard public-private key encryption to do that, right.
We do it once and we go to the next step, okay. So there are details in all of this that I am happy to
share, and I can refer you to our publications in case you are interested.
>> Arjmand Samuel: Anything else? Let us thank the speaker. Thank you very much.
[applause]
>> Shahab Baqai: Thank you very much for taking the time.