>> K. Shriraghav: It's a great pleasure to have Sharad Mehrotra here
from UC Irvine. He's here the whole of this week. If people want to
meet him one-on-one, we have time for that. So cloud security is a
popular topic, but Sharad is ahead of the curve. He thought about this
problem ten years back. He won an award at SIGMOD, and he's here today
to tell us more about his recent work on security.
>> Sharad Mehrotra: All right. So first things first, let me lower
this down so I can see. All right. So what I thought I'd do is give
you a sense of where we are going in terms of the project and the
research in this direction. But before I do that, a small aside -- and
I should say I have over 90 slides or so, if I really want to go far,
but we'll do a very small subset of it. I don't know if everybody here
knows UCI that well or not. But in any case, here's a recent picture
of our group, which is the ISG group. Dave is intimately familiar with
this group in the sense that he visits us every year. He's one of our
constant visitors. There should be a badge for him now, or a room
especially for him. Okay. So there are three interesting things about
this picture. One of them is that it is about five years old. If you
come back to UCI now, the faculty is the same, but the students are
pretty much different. There are two other interesting aspects; see if
you can identify them. One person is Photoshopped in. He really
wanted to be in the picture, couldn't be there at the time, so he was
Photoshopped in. You can tell from the size of the head. If you look
carefully, it's proportionately not correct. The other interesting
part of the picture is that the guy in whose lab Photoshop was built is
also in the picture. Who is he? When Ramesh was at Michigan,
Photoshop came out as a product of his lab itself. So those are the
three interesting things. All right. These are the faculty we have in
the ISG group.
Mike Carey, everybody knows him here more or less. Ramesh works in
multimedia. I work in data quality, privacy, security, and lots of
general-purpose database topics. Chen Li works in, and has done a lot
of work on, data integration and Web search. He started a company in
the middle of all of this and is finally back again. Nalini works in
distributed computing and middleware technology. And Dmitri works very
closely with me, especially on databases and data quality. That's the
group.
I wanted to give you a sense, before I start on today's talk, of what
we are doing and how this group comes together. This slide took a long
time to make; that's why I show it. There's a lot jumbled around here,
but that's kind of the big data stack as we see it at this stage, and
most of the work that we do, as different faculty, can essentially be
cast into this particular framework. So, for example, AsterixDB, as
many of you have heard -- this is Mike's project. He likes to say one
size doesn't fit all, and what he's hoping to achieve with this is at
least one size fits most: coming up with the next generation of
essentially a big data framework, which is AsterixDB. The other is
Project Sherlock, which I won't talk about today. This is one of my
projects, where we're trying to look at data quality challenges in the
context of large data and big data. We just started to launch this new
effort in that direction, and I'm hoping to meet some of
[indiscernible] people who are in his group, hopefully a bit more
interested in that topic as well. All right. The next large project,
which is a collaboration between LD and myself, is on essentially
adaptive data acquisition. Traditionally, if you look at standard
databases, data acquisition is something we normally do not worry too
much about. We assume the data comes in, and we start working on query
processing, query optimization, and all that stuff. We do a wonderful
job after all the data is in. What this project is trying to look at
is: can you make acquisition one of the central concerns inside the
database itself, and what will the changes be if you treat data
acquisition as a fundamentally important component of data management
itself? In particular, if your data is spread across diverse
sources -- sensors, Web sources, whatever it might be -- and you have
bandwidth constraints such that you can only access some parts of the
data, how do you go about doing that appropriately? That's the larger
picture of what that project is trying to do. I'll talk a lot about
Radicle, which is the cloud security project I want to talk about
today, so I'll skip past that. And there are a lot of vertical efforts
in this data stack picture which are looking specifically at things
like IoT. There's a bunch of us working towards essentially designing
systems for Internet of Things kinds of applications.
And also there's a lot of interest, especially in the case of Ramesh,
in social media: the full pipeline of storage, modeling, and
representation, all the way to social media applications. There's a
lot of effort in that direction as well. Okay. So that hopefully
gives you a sense of the things that are happening in the group. The
project I'm going to talk about today is the Radicle project. The idea
here is that we are trying to explore data processing frameworks which
are essentially meant for the cloud environment, and they try to
exploit this concept of hybrid cloud, which I'll explain a little bit
more about as well. In particular, they try to exploit partitioned
computation as well as encryption technologies to be able to provide
you with a secure data processing environment. The fundamental idea is
to play with all kinds of trade-offs, such as the generality of
applications, the confidentiality risks associated with data
processing, and the usability of the system. Hopefully by the time we
are done with the talk, we'll see snippets of examples of how these can
be traded off in different settings. Okay. So back to the
introduction of what the project is about.
So if you look at cloud computing, the public cloud has more or less
emerged as the new home for data. This slide is more about personal
data. If I think of ourselves as end users, we'll use Gmail, Google
Docs, calendar applications, whatever it is; mostly now everything is
on the cloud itself. So most of us essentially use Gmail or some
variant or other, which means the data resides in the cloud itself.
The more interesting slide is from the enterprise perspective, because
from a personal perspective the cloud makes a lot of sense. The
interesting thing, to me at least, was this survey from Forrester, a
report done in 2012, with results from 2012 and projections from then
onward. But I think it's come to pass more or less. Here, basically,
what Forrester did is they asked a large number of IT managers and
decision makers -- about 2,200 such people in major companies -- about
their plans for cloud adoption. They were looking at it from the
perspective of infrastructure as a service versus platform as a service
versus software as a service. If you look at it, the expectation at
least for software as a service is around 60 percent in 2014, and even
the least adopted of these is around as high as 40 percent or so. So
enterprises are interested in actually using the cloud, and to me that
was an interesting thing: even for enterprises, well-established
enterprises, it makes a lot of sense. If you think of U.S. companies
versus Asia-Pacific, the only difference, as far as Forrester was
concerned, was that Asia-Pacific was about 12 months behind. The
projection of the figures was also similar, except they were following
a track around 12 months behind. Actually, I recently came back from
China, and what I'm told by those guys is that's actually changed.
They're no longer behind; they're pretty much at the same pace.
Here's another similar slide from another source: if you look at the
expectation, very soon from now, when companies invest money in their
IT infrastructure, a very large portion of that is going to go into the
cloud, essentially. Okay. So the question of why public cloud is
valid, though for this group there's not much reason to emphasize it.
But let me cover it for the sake of completeness -- maybe there's
somebody listening remotely as well. So the cloud offers lots of
advantages. One of the main things is the utility model: you pay for
only what you use. At the start of a business you don't have any
infrastructure cost at all, which is a big advantage. It's a similar
kind of thing to rent versus buy, or lease versus buy. Leasing is
often an easy thing: you don't need start-up money; you can lease
resources. Another major advantage is elasticity. If your demand goes
up and the need goes up, you can get more resources -- a potentially
limitless set of resources available on the other side -- and you can
scale down if you so desire. You don't have to manage your own systems
at all. This is a very important aspect as well, and probably, from an
individual person's cloud adoption perspective, one of the most
important advantages: I don't have to manage my own resources at all.
Then there's cost optimization because of economies of scale.
And hidden inside all these advantages, to some extent, is a subliminal
message about what the challenge is. The challenge in terms of cloud
adoption is basically loss of control. Maybe calling it your only
worry is an exaggeration, but one of the main worries is loss of
control. If you're using the cloud model, your data, your
applications, and your computation are now running outside of your
control, basically in the cloud provider's control. There are many
factors that lead to this loss of control. The cloud is a shared
resource; that's its advantage, that's where the cloud comes from. So
now your applications are running at the same time as other people's
applications, which you might or might not know, might or might not
necessarily trust. And even if that's not the case, from a
confidentiality perspective the cloud environment is susceptible: in
some sense the risk increases, because hackers can hack into the cloud
the way they would go after a major bank for the money. Even though
the cloud may have a significant security perimeter around it, the
chances of people trying to attack it increase as well. The more
dangerous part is insider attacks. There's also the jurisdiction
issue: the cloud might sit anywhere, outside the jurisdictional
boundaries of where you actually are running your business. That's a
major issue as well. And finally, there's the issue that Snowden made
very popular, or visible, to all of us: the cloud provider, based on
subpoenas, can actually be pushed or forced into sharing information
and data, possibly to the detriment of the data owner himself or
herself. So there's always this danger that the data is not
necessarily in your control. It's not in your world, not under lock
and key. Now it's with somebody else, and somebody else can do
whatever they want with it.
This loss of control has implications for almost every aspect of system
design. It affects availability: there are example cases where data
stored somewhere in the cloud was not available, and that caused
disruption to businesses. There are examples on the integrity side as
well -- yesterday we were talking about integrity -- examples of loss
of integrity leading to problems. The part that I'll focus on, that
we're focused on in Radicle, is largely security, privacy, and
confidentiality. This is touted as one of the major concerns by most
respondents regarding cloud adoption: they're worried about security
and the confidentiality of the data. So the question that comes up is,
okay, security or confidentiality might be compromised, but the key
question is: whose responsibility is security? Is it the
responsibility of the person who owns the data, the responsibility of
the person who is running the cloud, or a joint responsibility?
To some degree, the answer is visible. This is a slide I think I
borrowed from one of you guys' presentations. It's written in the
policy, the advice in this particular case from AWS. The highlighted
yellow part clearly states what Amazon is telling you: if you want to
do something sensitive with the data, make sure you encrypt it, or
figure out security yourself. Cloud providers are not necessarily
willing to take on this responsibility of protecting the data, and for
good reason, because it's very difficult to protect. Here was a slide
which, when I first looked at it, I thought was interesting. They
asked people from companies who are IT administrators and decision
makers: are you aware that security is your responsibility? The
answers could be yes or no -- "I don't care" being the third answer,
but I guess they pushed for yes or no. Here's the interesting
question. If you have not seen this result before, how many of the
people do you think actually said yes, and how many said no? What
would you expect?
>>: Most people said yes.
>> Sharad Mehrotra: So that -- he's been around this too long. He's
absolutely right. I was surprised by this. I would have thought most
people would say no to this question. But people have a realization
that security is their own responsibility. So they're not counting on
the cloud; they're willing to use the cloud despite the fact that the
cloud does not offer security -- which is okay from an individual
perspective, but to me this was surprising from an enterprise
perspective. Enterprises are okay with this. So even though there's
awareness, tools today lack the power to enable users to protect the
data. There's a need for protection; they realize that. This is a big
barrier to adoption of the cloud, and yet there's no technology capable
of supporting something like this, the issue of confidentiality of
data. And given that this is not the cloud's responsibility, the tools
now have to empower the end users to do something like this. So what's
the answer? Well, one answer which is
straightforward is encryption. You encrypt the sensitive data before
you upload it to the cloud. If you do that, there are at least two
models that come to mind right away. The first model: I'll encrypt the
data, store it in the cloud, and when I need the data, I'll get it back
to the client, decrypt it appropriately, and do my computation -- using
the cloud essentially as storage. The other model: you encrypt the
data into the cloud, and whenever you have an application, you try to
do the computation in the cloud itself as well; then you get the
results back and decrypt them. Here you're using the cloud for
computational purposes as well. In the first model you're only using
the cloud as storage, with limited utility of the cloud itself, so we
should probably strike it out -- I don't want only a secure disk; I
want something more than that. So the answer is the second approach,
which is what we all have been struggling towards, and with this
approach you can at least utilize some of the power of the cloud from a
computational perspective. There has been, in the last 15 years, a
significant amount of research on how to enable encryption in such a
way that you can compute on the cloud side as well, in the encrypted
domain itself. What I'm going to do, before we go into the Radicle
solution, which is what we're talking largely about, is take a slight
detour of around five to ten minutes to give you my view of where the
work on encrypted computation has been and where it's headed. Okay.
You can go to sleep, because some of this is borrowed from you guys.
To some degree it's
borrowed from your slides. But I think it's reasonable to set this up
before we go, why we're doing what we're doing in the context of
Radicle. Okay. So to some degree, the first set of works that started
this whole area of computing in the encrypted domain in more recent
times -- the area has been around for a long time in reality, but the
one that kind of revived it, with a slightly more modern view of this
broad problem, was the work on searchable encryption by Song [phonetic]
and others, which appeared in S&P 2000. The idea was to be able to
store documents on the server side in an encrypted representation and
be able to do keyword searches over these documents. The idea was very
powerful. What they did: for every word in the document, they would
generate a random string and hide a trapdoor for that particular word
inside the random string. That would be the representation of the
encrypted data -- essentially randomized strings with some trapdoors.
When you want to query, you send the trapdoor, and the encryption
scheme has enough power that you can check whether the random string
you have actually contains the trapdoor you're searching for. If you
can do that, then you have an easy way of testing whether or not a word
corresponds to the word you're searching for, and you can retrieve
documents. That was the first one built; it supported essentially
keyword searches over documents. If you think about it, essentially
every word has to be checked. If there are N documents and each
document contains, let's say, D words, the complexity is N times D
trapdoor checks. It's not indexable or efficient. People in the
encryption community went around trying to figure out how to make it
more efficient. The first thing they did was use Bloom filters to get
rid of the dependence on document size, making it linear in the number
of documents. But from an indexing perspective, that's still not very
good. Another set of people -- [indiscernible] and gang -- said maybe
we can get help from the client: construct appropriate indexes, in
particular inverted lists, and use an oblivious traversal of the lists,
so you can actually do better than that and get it in sublinear time.
But every particular technique that has come out since that starting
paper has had strengths and weaknesses associated with it.
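The trapdoor idea just described can be sketched in a few lines. This
is a minimal illustrative sketch, not the actual Song et al.
construction: an HMAC-based token stands in for the trapdoor, and the
stored (randomness, check-value) pair stands in for the randomized
ciphertext. The names and the keyed-hash design here are my
assumptions for illustration.

```python
# Toy sketch of the trapdoor idea behind searchable encryption
# (illustrative only -- NOT the actual Song et al. scheme).
import hmac, hashlib, os

KEY = os.urandom(32)  # client-side secret key

def trapdoor(word: str) -> bytes:
    # Deterministic per-word token derived with a keyed PRF.
    return hmac.new(KEY, word.encode(), hashlib.sha256).digest()

def encrypt_word(word: str) -> tuple:
    # Store a random string plus a check value binding the
    # trapdoor to that randomness; without the trapdoor the
    # pair looks like random noise.
    r = os.urandom(16)
    check = hmac.new(trapdoor(word), r, hashlib.sha256).digest()
    return (r, check)

def matches(cell: tuple, td: bytes) -> bool:
    # Server-side test: recompute the check value from the
    # submitted trapdoor and the stored randomness.
    r, check = cell
    return hmac.compare_digest(
        hmac.new(td, r, hashlib.sha256).digest(), check)

# Encrypt a 2-document collection word by word.
docs = [["secure", "cloud"], ["cloud", "storage"]]
enc = [[encrypt_word(w) for w in doc] for doc in docs]

# Search: the server scans every cell -- the O(N * D) cost
# mentioned in the talk.
td = trapdoor("cloud")
hits = [i for i, doc in enumerate(enc)
        if any(matches(cell, td) for cell in doc)]
print(hits)  # both documents contain "cloud"
```

Note how the search has to touch every word of every document, which
is exactly the N-times-D scan that the later Bloom-filter and
inverted-list work tried to avoid.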
The other piece of work, which looked at the problem from a SQL
perspective, was work I was also a part of, which took this whole
concept from keyword retrieval to SQL retrieval. The idea is
straightforward. We have relations. The way we represent a relation
is: I identify fields that are searchable -- let's say I want to have
queries on age and salary. For each searchable field I create a cipher
index, encrypted appropriately following an appropriate encryption
technique, and this is stored alongside the actual encrypted data
itself. Now, when a query comes in, you exploit this cipher index to
evaluate as much of the query as you possibly can over the cipher
representation. When you hit a boundary of the encryption and cannot
do anything more, you push the computation back to the client, which
decrypts and does the rest of the query processing on the client side.
We did work in that direction. So essentially there are two
fundamental ideas here in terms of how to do encrypted search. One was
to exploit as much cryptography as you possibly can, pushing as much of
the work as possible into the encrypted domain. The second was
partitioned computation, where you can continue the work on the client
side as well. Now, the question is: what can you do in the encrypted
domain? Lots of different things. If you have deterministic
encryption, for example, you can essentially do point queries quite
easily, do joins, and so on and so forth. So I'll skip this. Each
technique we end up discussing has caveats and weaknesses. For
example, with deterministic encryption, if I know the distribution of
the data, I can pretty much guess the data itself. It's not exactly
fully secure in that sense. And another innovation that came around
at the time was OPE, or order-preserving encryption. The idea -- this
is a borrowed slide from you guys -- the main concept is: if in your
plaintext X is less than Y, then the encryption of X is less than the
encryption of Y. You enforce that. How do you do this? There are
hundreds of different techniques people have developed for achieving
order-preserving encryption. It enables you to do range searches quite
effectively, but the problem is obvious: the first thing the adversary
learns is the order between things. In particular, if the adversary
knows the domain and the possible values, it pretty much knows
everything at that stage. My favorite example: if I order-encrypt,
let's say, grades of people -- A, B, C, D, E, F, and that's it -- and
the adversary knows A is more than B, B is more than C, C is more than
D, then I might as well not encrypt, because once the adversary gets
hold of it, the grades cannot be anything other than A, B, C, D, E, F.
It gives it away completely. Now, this problem has been addressed --
this is the thing we talked a little bit about yesterday as well -- by
the idea of modular encryption. There are many ways of doing modular
encryption. The problem with plain OPE in this case is that the
highest ciphertext value corresponds to the highest plaintext value,
the second highest to the second highest, and so on. So the starting
position of everything is known: it's not just the order but even the
starting position that is revealed. You can overcome that problem by
mapping this to a modular domain, using modular arithmetic. One of the
ideas was: imagine the original domain is one, two, three, and so
forth, and choose any OPE technique to map it to some other ordered
domain. Then, in this modular representation, you take, let's say,
one, and instead of representing it as OPE of 1, you add an offset, and
this offset is secret. Say the offset is 2. Then one will be
represented not as OPE of 1 but as OPE of 3, and so on. When you reach
N, you wrap around. Right? The interesting part here is that in this
scheme comparisons still work: given X and Y, you can test whether X is
less than Y or not. But at the same time the starting position is
completely hidden, hidden in this secret parameter J: unless the
adversary knows J, you cannot figure it out. Yes?
>>: [indiscernible].
>> Sharad Mehrotra: Say that again?
>>: [indiscernible].
>> Sharad Mehrotra: Because when I do the mapping, if I query the
range 2 to 4, I'll map 2 to 2 plus 2, which is 4, so OP of 4; and if
the query was up to 4, it would correspond to OP of 6, because 4 plus
2 is 6. OP of 6.
>>: But the model that you're [indiscernible].
>> Sharad Mehrotra: You'll wrap around. Get more and you'll wrap
around.
>>: [indiscernible].
>> Sharad Mehrotra: The capacity is not big. So strictly speaking,
this is not an order-preserving representation, but there is enough
room left that you can wrap around. A query might start from N minus 2
and go up to 2, for example. So you could do that. Okay. But again,
is this secure? Well, first, the security of this is no better than
the security of OPE itself, which has its own set of problems. But
what about the starting position -- is it secure from the
starting-position perspective? Can the adversary figure out the
starting position? The answer is yes and no. From the ciphertext
alone, it's 100 percent secure, because the starting position cannot be
detected by the adversary in this technique at all. But on the other
hand, if I allow queries, then from the query pattern you can actually,
with a reasonable attack, figure out what the value J is going to be.
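The wrap-around scheme just described can be sketched as follows. This
is a minimal illustrative sketch in which a toy strictly increasing
function stands in for a real OPE scheme (an assumption); the secret
offset J hides the starting position, while range queries are
translated into ciphertext bounds, possibly split in two by the wrap.

```python
# Toy sketch of modular order-preserving encryption (MOPE).
# A real deployment would use an actual OPE scheme; here a toy
# strictly increasing function stands in for it.
N = 100          # domain size: plaintexts are 0 .. N-1
J = 2            # secret offset hiding the starting position

def toy_ope(x: int) -> int:
    # Stand-in for any strictly increasing (order-preserving) map.
    return 3 * x + 7

def mope_encrypt(x: int) -> int:
    # Shift by the secret offset modulo N, then apply OPE.
    return toy_ope((x + J) % N)

def mope_range_query(lo: int, hi: int) -> tuple:
    # Translate a plaintext range [lo, hi] into ciphertext bounds.
    # If the shifted range wraps past N, it splits into two pieces,
    # which the server handles as two scans.
    a, b = (lo + J) % N, (hi + J) % N
    if a <= b:
        return ((toy_ope(a), toy_ope(b)),)
    return ((toy_ope(a), toy_ope(N - 1)), (toy_ope(0), toy_ope(b)))

# The example from the talk: querying 2..4 with offset J = 2
# becomes ciphertext bounds OPE(4)..OPE(6).
print(mope_range_query(2, 4))    # → ((19, 25),)

# A range near the top of the shifted domain wraps and splits
# into two ciphertext ranges.
print(mope_range_query(96, 99))
```

As in the discussion above, the order within the rotated domain is
preserved, but an adversary seeing only ciphertexts cannot tell where
the domain starts without knowing J.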
>>: [indiscernible] question. Going back to the example. If you
grade on a curve and I know you give 15 percent --
>> Sharad Mehrotra: Yes -- so, such attacks; forget attacks of that
kind for a minute. Let's assume the benign situation where the bins
are all equal and there are no extreme situations. It's a bit more
secure than not doing it at all; let's put it this way. If you forget
about attacks of that kind, modularity prevents the obvious attack on
OPE -- highest value here corresponds to highest value there. That's
no longer true, because the highest value depends on where J is.
>>: Both are insecure.
>> Sharad Mehrotra: Both are insecure, of course, yes.
>>: Even with order preserving, if the same value is always encrypted
to the same ciphertext, why is it any harder than the simple ciphers we
all learn to break as kids?
>> Sharad Mehrotra: It is not -- I think the point here is that,
absolutely, OPE is an insecure technique.
I'll offer you the next one, which is something we did; let's see if
you buy that one. It's true that it's also not fully secure, but I
think it's a bit more secure. The problem is we've not formally proved
it; it appears more secure. Let's have a look at it. So what were we
doing? We did not do OPE. What we did in that piece of work was the
concept of bucketization, and bucketization very much follows the
database principles of histogramming. The idea was that you look at
the domain -- in this case the domain is salary -- and break salary
into a bunch of buckets. With each of these buckets I associate
essentially a deterministic encryption of the bucket ID. So the first
bucket is 32 to 50K; I'll now have a deterministic encryption of one,
or basically, whatever the bucket ID is, it corresponds to a
deterministic encryption of that bucket ID. This is what we would
store. The advantage of this scheme is the following. The way the
processing happens: since you bucketized the domain, you cannot exactly
check the query at all. So if I want to evaluate, let's say, a range
query or a point query on any particular value, I find the
corresponding bucket and, since it's deterministically encrypted, I'll
have to go retrieve the whole bucket. Once you retrieve the whole
bucket, it's the client's work to filter the bucket out. Now, you can
look at this from the advantage and disadvantage perspective. The
positive of this was that it is very general: you can actually do
almost all of SQL, including parts of aggregation, using this simple
idea, by storing appropriate counts and so on. Very general -- do
joins, do point queries and range queries and so on. It is efficient
because it's fundamentally indexable, so you pretty much don't have to
change the database processing in that sense; you could still use the
optimizer. Another advantage is that it added sliding-scale security.
If you want complete security, declare one bucket -- this is kind of
silly, but it's secure -- so there is a sliding scale of security. The
negative was the overheads: you have to do post-processing in this
case and run part of the query on the private side as well. And
depending on how you bucketize, the ciphertext will reveal some
information, because there's some value in knowing, hey, these two
values are close to each other, that they belong to the same bucket;
that gives some information away as well. So there were advantages and
disadvantages, but it's tunable. And the key question of security
depends on how you form the buckets -- however the buckets are
generated. So we could do a mathematical analysis of the same thing:
the larger the span of the bucket, typically, the more security you'll
have. So one security metric was span -- larger is better. Bucket
size: larger is better. Frequency distribution in the bucket: the more
uniform it is, the more information has been hidden away, so uniformity
was an important measure as well. The cost metric was how many false
positives you actually generate: given a query, what's the size of the
false-positive set? And the key issue which I thought was interesting
here was that it actually provided us with a mechanism for improving
security by adding bounded randomization. I'll not talk too much about
it, but I'll give you an example of it.
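The bucketization scheme just described can be sketched as follows.
This is an illustrative sketch, not the paper's implementation: an HMAC
over the bucket ID stands in for deterministic encryption, and
plaintext rows stand in for the encrypted tuples (both assumptions; a
real system would encrypt the tuples too).

```python
# Toy sketch of bucketization for querying encrypted data.
import hmac, hashlib

KEY = b"client-secret"
BUCKETS = [(32_000, 50_000), (50_001, 75_000), (75_001, 120_000)]

def bucket_id(salary: int) -> int:
    for i, (lo, hi) in enumerate(BUCKETS):
        if lo <= salary <= hi:
            return i
    raise ValueError("salary outside domain")

def etag(bid: int) -> bytes:
    # Deterministic "encryption" of the bucket ID.
    return hmac.new(KEY, str(bid).encode(), hashlib.sha256).digest()

# Client side: tag each row with its encrypted bucket ID.
rows = [("alice", 40_000), ("bob", 60_000), ("carol", 48_000)]
server_store = [(etag(bucket_id(s)), (name, s)) for name, s in rows]

# Server side: a point query for salary = 48,000 becomes an
# equality match on the encrypted bucket tag -- the server
# returns the WHOLE bucket, false positives included.
q = etag(bucket_id(48_000))
candidates = [row for tag, row in server_store if tag == q]

# Client side: filter out the false positives.
result = [row for row in candidates if row[1] == 48_000]
print(candidates)  # [('alice', 40000), ('carol', 48000)]
print(result)      # [('carol', 48000)]
```

The false positive ("alice") retrieved along with the answer is the
cost metric mentioned above, and the bucket boundaries are the knob
behind the sliding-scale security: one bucket means full security but
no server-side pruning.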
>>: When you're talking about buckets, you said more uniform is
better. Is there an issue with the amount of things that are in the
bucket, or can you vary bucket size to give you an appearance of
uniformity?
>> Sharad Mehrotra: You can vary the bucket size. You want that to
prevent statistical attacks. In other words, the skew of a
distribution can give away the bucket ID.
>>: In some sense push it out, make all the buckets the same size.
Then you might hide that.
>> Sharad Mehrotra: Yeah, so that's what we'll do. We'll make the
buckets equal size. The interesting part of this entire scheme to
me -- which is still not fully explored; we've not done a good job
exploring it appropriately -- is that we could add a certain bounded
amount of randomness to the process. There are hundreds of different
ways randomness can be added to improve security, but one of the ways
we did try, in a paper we wrote, was the following. If I look at a
bucket -- let's say this is bucket one -- I'm going to toss a coin and
throw the objects in this bucket either into its own bucket
representation or into some randomly identified bucket. This mapping
of which buckets the content of a bucket can diffuse into is fixed, and
it's a secret. So objects in bucket one may reside in bucket one, or
they could actually go to bucket four or bucket three or whatever it
is. Essentially you're adding a little bit of randomness to the data
itself, diffusing the data into different buckets. So if I finally
look at what bucket four looks like, it could have data from
anywhere -- not quite anywhere; the mapping is secret -- but at least
from a large part of the data space. And the same is true for all the
different buckets.

What this does is increase security for us. Okay. Obviously it's not
free, because now we'll have to do a broader search: if your query
resolves to bucket one, you have to go to all the buckets where that
data could reside. Right? But on the other hand, it's giving me a
practical approach to adding more security. In fact, if you completely
diffuse the bucket to everywhere, then it's back to square one: there's
no prunability left in the buckets at all. But I can control the
amount of randomness. This technique of adding randomness in the
context of partially secure techniques is an interesting direction
which is relatively underexplored, the way I see it. There aren't too
many examples of this, but it's a wonderful direction to move forward
in.
All right. So, sorry, I took a little more than five minutes, but here
is where we are: we wrote a couple of papers, basically high-level
papers, identifying the different techniques for searchable encryption
and so on. Forgetting the details, the most interesting aspect was
that there is no final silver bullet that solves everybody's problems.
You can evaluate these different techniques from different
perspectives: the generality of queries, the confidentiality you get,
whether the technique requires work on the cloud versus the client
side, the efficiency of the technique, how much it depends on the
trustworthiness of the infrastructure itself. And all techniques fall
either as points or as sets of points in this particular space that we
laid out. So you have a large number of solutions out there, but none
of them is complete; they explore different tradeoffs among
generality, security, and efficiency. And this same point is made in
your tutorial -- this is the slide from your tutorial, which
identifies the same thing. Okay. Now, this has not stopped people
from building. The problem is so important that even though there's
no silver bullet, no exact solution to this stuff, people are building
systems already based on the technology that's already out there and
available to you. There are many examples of it, including the work
going on at Microsoft in this group -- CryptDB, Cipherbase, and so on
and so forth. And, by the way, besides these I've had a chance to
look at the systems at SAP and at NIST -- NIST's is an implementation
of our work; SAP's is more like CryptDB, an implementation of CryptDB
-- and Lync is influenced by CryptDB as well. There's a lot of work
in this area that has taken off; people have built systems and
explored different options.
Okay. The key issue from my perspective is the following. If I look
at the modern trend, most of the systems that have been designed
essentially offer more security by giving up functionality. So
there's been a tradeoff between security and functionality; that's
where the ballgame has been, to a large degree. And many challenges
remain, even if you look just at encrypted data management -- using
encrypted data representations for security. First, obviously,
there's no technique that's a silver bullet, no technique with
complete security, and that leaves a natural question: if I have two
different techniques, three different ways to do the same thing, which
is better? Which is more secure? And this is not an easy question to
answer. You have to go into the depths of somehow modeling the risk.
How much information is given away by deterministic encryption versus
OPE versus some of the other searchable encryption schemes? How do
you measure something like that? That's a key question, and it's
open. Second -- I mentioned this already -- current systems trade
functionality against security. I think the more interesting question
from a system development perspective is not that one. Functionality
you cannot compromise on: either you want it or you don't. If you
don't want it, don't buy the system; that's okay. The question is, if
I want the functionality, the real tradeoff should be between the
efficiency of implementing or realizing that functionality and
security. And the third thing is that in most current environments
where people use these systems, we should never ignore the power of
private machines -- either secure hardware or the machines you may
have on your client side itself. So you don't want to solve the
problem as if the data sits completely in the cloud, in a completely
untrusted environment. There is actually trusted infrastructure
available, which you should be able to exploit to your benefit. So
the question is: can data and computation be partitioned to exploit
this secure execution environment you might get?
>>: The secure environment, would it be the client --
>> Sharad Mehrotra: The system could be both, either way.
>>: General vision itself?
>> Sharad Mehrotra: It could be both. We'll see the differences come up.
>>: Okay.
>> Sharad Mehrotra: Okay. All right. So that vision was never fully
realized in a proper system. That's one of the problems: we didn't
build the system that we should have. Okay. So the Radicle project is
all about trying to do the same thing. It is a redo of what we did in
DAS, essentially, but with an eye on two things. The first is: can I
partition to achieve security? And the second is: can I now at least
formalize what the risks are and exploit them -- bound the amount of
risk a particular execution, a particular system, has? It's not meant
to replace encryption; it's to complement it, rather. If you can make
good progress on homomorphic encryption, make it practical, and do
some of the specialized encryption stuff, that's great; this is meant
to complement it. If you can completely solve the homomorphic
encryption problem, fine, we'll back off and do nothing -- you've
solved it. But I don't think it's solvable in the next few years to
come, 10 years to come, 50 years to come. So I think we're okay.
>>: Solvable by one --
>> Sharad Mehrotra: Maybe. But [indiscernible]. All right. So in
Radicle -- this vision of risks, of controlling risks using
partitioned computation -- we've built many example systems of this
kind. One of them is called CloudProtect. And here the idea is to
build a middleware that sits between your Web browser and the
application; it's meant for providing secure access to applications.
So there's DropBox, there's Gmail, whatever you have on the other
side, and there is this browser through which you access that
information. It's a proxy-based architecture: it sits between these
and the rest of the bad world and selectively encrypts data for you.
It plays a game of trying to balance things, because when you encrypt
the data, some of the functionalities which you could get in Google,
for example, are no longer available to you. Now, if you want that
functionality, what do you have to do? You have to get the data back,
decrypt it, implement the functionality yourself, and send the result
back. English translation: you can do that, but you have to bring the
data back, and it interferes with the usability of the system; the
latency of the operation gets long. So what CloudProtect does is sit
quietly on the side, looking at the log of what you're doing, and
automatically adjust what should be encrypted and what not, to strike
a good balance between usability and risk of exposure.
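The adaptive selective-encryption loop just described can be sketched as follows (a toy sketch: the class name, the base64 stand-in for real encryption, and the usage threshold are all illustrative assumptions, not CloudProtect's actual design):

```python
import base64

class SelectiveEncryptionProxy:
    # Toy sketch of a proxy that sits between the browser and an
    # untrusted web service.  Fields start out encrypted; fields whose
    # server-side functionality (search, sort) the user keeps invoking
    # are adjusted toward plaintext to preserve usability.  base64 is
    # only a stand-in for real encryption.
    def __init__(self, fields):
        self.policy = {f: "encrypt" for f in fields}  # start conservative
        self.usage = {f: 0 for f in fields}

    def outgoing(self, record):
        # rewrite the request before it reaches the untrusted service
        return {f: (base64.b64encode(v.encode()).decode()
                    if self.policy[f] == "encrypt" else v)
                for f, v in record.items()}

    def log_server_side_use(self, field):
        # the user invoked functionality that needs server-side plaintext;
        # after enough uses (threshold is an assumption), expose the field
        self.usage[field] += 1
        if self.usage[field] > 3:
            self.policy[field] = "plain"  # trade exposure for usability
```

The point of the sketch is the feedback loop: the policy is not static, it shifts per field as the access log reveals which functionality the user actually needs.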
>>: So what about risk in other dimensions? For instance, you could
lose your key, in which case you've lost your data. Which is not
relevant --
>> Sharad Mehrotra: Sure. In this case the assumption has been that
the key is safeguarded on a local machine. The proxy sits somewhere
and it guards your key. You could push the entire proxy out -- in
reality we never achieved that. It's very light, so it can run on a
mobile machine as well, and it should be able to store all its data in
the cloud, which should be doable, I think. But we didn't do that.
>>: Can you imagine this being, say, part of an ODBC framework, where
all database accesses go through the ODBC client side, and you do a
little encryption there?
>>: This is calendar, right? The client has the calendar interface on
the database; it doesn't see SQL or anything.
>> Sharad Mehrotra: There's no logical -- I agree with that. For our
current version the answer would be no. And this is not just
calendar: CloudProtect has been used with Google Calendar, with
DropBox, with Box, with Picasa, lots of different services. They were
built early on; at some stage all of this was working. It's a pretty
general architecture, so there's no logical reason why we cannot even
do data processing and SQL-esque kinds of things with this.
>>: The client, you have the JavaScript and the proxy needs to
implement the same interface as the server.
>> Sharad Mehrotra: Yes. This is under the assumption that
interaction goes through HTTP requests via the proxy. The proxy looks
at each HTTP request and modifies it based on the encryption
mechanism. The proxy maintains full knowledge of how the data is
represented on the client side and on the server side, and in which
particular fashion -- some representation of that.
>>: But it needs to understand the semantics of the interface, and
that's why the ODBC approach will not work -- because it needs to
search, it needs to do some rewriting.
>> Sharad Mehrotra: I see your point. In this case it does understand
each form that is out there. So it understands the semantics of the
form -- what goes into the form for each of the HTTP requests, the
corresponding HTTP requests; it has to understand that. The second
system, which I'll also talk about -- we can talk about it separately
later -- is Hybridizer. The idea is simple: I want to run SQL queries
on hybrid infrastructure. Part of the machines are on the private
side, part on the cloud side. The goal was, given the workload you've
got, to figure out how we should partition data and partition
computation. And we made an assumption here: the unit of partitioning
is a query itself. A query executes on the private side or the public
side, not both; allowing both would complicate things. It tries to
figure out the best way to partition data, and that led to an
optimization problem; that's the Hybridizer framework we built. The
one I will talk about more is SEMROD, which is a secure MapReduce
technique.
And the idea here is -- in a sense all of this is about SQL, and we
wanted to go one level lower first, for two reasons. One, we said,
okay, I know how to play the game at the SQL level to some degree; let
me see if I can build an infrastructure where I don't have to worry
about SQL at all. Let me work at the MapReduce level and figure out
how to run MR in a reasonable fashion in the hybrid cloud environment.
Then you can take it to Hive, convert to MR, and it will run.
Basically this will be the lowest possible level at which you can
exploit the secure processing itself. So I'm going to talk a little
bit about SEMROD today.
>>: Can I ask about the SQL connection? So you could use your
functionality as a way of partitioning the data -- you could also
have some reasonable partitioning between local and secure cloud
based, perhaps, on the sensitivity of the data, whatever. And treat
the problem as, let's say, a job of partitioning the query in some
way.
>> Sharad Mehrotra: So the partitioning is completely based on
sensitivity. You know what's sensitive and what's not, and you know
which query accesses what data; you have the workload available to
you. What you're trying to figure out is not just how much sensitive
data resides with you on the private side. It's a risk-based model:
if some small amount of data is sensitive but used very often, the
system will push that sensitive data to the public side, incurring a
risk. The system allows you to automatically adjust the amount of
risk you are willing to take. Now, it doesn't answer the harder
question -- like, if I do OPE or deterministic encryption, how much
risk is there -- because that we don't know. The minute I expose data
that is sensitive, I count that as risk. Right now the model of risk
is 0 or 1; whether deterministically encrypted data should count as
.5 or .6 or .7 is a different issue altogether.
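The risk-based placement just described can be sketched as a greedy heuristic under a risk budget (the data structures and the exposure measure are assumptions for illustration, not the actual Hybridizer optimizer):

```python
def plan_placement(datasets, workload, risk_budget):
    # datasets: {name: (size, sensitive_fraction)}
    # workload: list of (frequency, set_of_datasets_accessed)
    # A dataset may be pushed to the public side, incurring a risk equal
    # to its sensitive volume, as long as total risk stays in budget.
    # Heavily used datasets are considered first, mirroring the idea
    # that small-but-hot sensitive data may be worth exposing.
    benefit = {d: 0.0 for d in datasets}
    for freq, accessed in workload:
        for d in accessed:
            benefit[d] += freq

    placement, risk = {}, 0.0
    for d in sorted(datasets, key=lambda d: -benefit[d]):
        size, sens_frac = datasets[d]
        exposure = size * sens_frac       # risk incurred if made public
        if risk + exposure <= risk_budget:
            placement[d] = "public"
            risk += exposure
        else:
            placement[d] = "private"
    return placement, risk
```

With a risk budget of zero this degenerates to the simple rule "anything sensitive stays private"; raising the budget trades bounded exposure for more public-side execution.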
>>: You had a notion of partitioning, so can I think of Hybridizer as
doing the partitioning using risk as a mechanism -- a new way of --
>> Sharad Mehrotra: Yes, you can look at it that way. In this case
the partitioning is at the workload level. You take a whole workload
and you first come up with a data partitioning suited to that
particular workload. Then, when the actual execution happens and the
queries come, you partition them accordingly. Similar -- the only
addition being that the risk is factored in as well. So on to SEMROD,
new work which we completed.
Okay. So here's a quick sense of the hybrid cloud. The private side
is secure because it's in your control -- rather than secure, I should
say in your control; it may not necessarily be secure, but it's in
your control. The public side, you'll all agree, is efficient, cheap,
scalable, elastic -- all the nice properties we get from the public
cloud environment. And the hybrid cloud is a seamless integration of
both of these things: you can run the application on both sides,
public and private. And the goal is highlighted in the Beckman
report. Maybe it's because we were there -- I participated in that
thing; maybe that's how the line made it in. But anyway, it's in the
Beckman report: it's an opportunity to achieve secure and efficient
computation in a cloud environment.
>>: [indiscernible].
>> Sharad Mehrotra: So the way they prefer it, I'm not sure in what
sense they're --
>>: Funny word.
>> Sharad Mehrotra: They prefer it.
>>: Right. Prefer it.
>> Sharad Mehrotra: But I'll make -- I have two things to say about
that.
>>: I thought maybe you had.
>> Sharad Mehrotra: I don't have the numbers, but I have one thing.
We were pushed to do the experiment to make it more accepted. In the
Bay Area, for nearness to the data, there are major datacenters and so
on where this stuff runs, and providers are laying out, let's say,
fiber, to the extent that you have very good access to the public
cloud over a large area. Now, one of the things they're after, one of
the reasons companies do this, is: people will use hybrid clouds, and
you'll quickly see that having fast connectivity to the public cloud
infrastructure is -- not quite a requirement, you can do reasonably
without it -- but it will certainly make this feasible in a very big
way. I have a feeling that once that happens, people will be using a
lot more of this.
>>:
The hardware may be there.
>> Sharad Mehrotra: Yes, I'm not sure how much of that is actually
being used yet. So if we solve the problem, then maybe people will
use it, yes.
>>: If you have a hybrid cloud, who ends up responsible when something
bad happens? Will there be finger pointing between the private and
public people?
>> Sharad Mehrotra: Good question. I'm not sure. I think with a
hybrid cloud the responsibility of running it belongs to you. You can
get virtual machines, connect those virtual machines in a particular
datacenter to your infrastructure, and you're running it yourself as a
company. So probably the responsibility will be yours. Now, if the
virtual machines do not give you the SLA they're guaranteed to give
you, maybe you can go after the public cloud. Okay. So those are the
goals. What we set out to do is the following. We wanted security --
and here I'm going to drop risks altogether; it will be zero risk,
fully secure. That was our starting point: no leakage about any
sensitive information on the public cloud. And we wanted to use the
public cloud. One easy way of achieving the first goal is to run
everything on the private side and you're done, but that's not what we
want; we want to run on the public cloud itself. We want to limit the
burden on the end user: if someone has an MR program, I don't want
them to reprogram anything at all; it should run as is, without any
changes. We want it to be generic, and most things compile down to MR
-- at least in one version of the world they compile down to MR, so
that's okay. We're, by the way, planning to do the same thing with
Spark; we're exploring that. So hopefully if you don't use MR on
Hadoop, if you use Spark, that's fine too -- in the near future,
maybe. Okay.
And finally, the main question is the following; this is where the
rubber hits the road. You don't want the cost of security to be too
high; you want to be practical. The question is what that actually
means. So the first requirement is: compared to, let's say, the
obvious solution of running everything on the private side, this
should be significantly faster. Okay. Otherwise, no go, completely.
And the second thing -- I'm not too concerned about it, but it's
important to say anyway -- is that it should not be much worse than
running without security. If I run, let's say, an MR job on Hadoop in
this hybrid cloud environment without worrying about security,
whatever performance I get, I should not be a large factor or many
orders of magnitude away from that performance. I want to be
relatively close to the native Hadoop implementation as well. This is
going to be very difficult to meet.
>>: Security -- is it no leakage?
>> Sharad Mehrotra: No leakage whatsoever. But I'm not using
encryption. So we'll just play around with how things are shipped
back and forth; you'll see. You could call this efficiency. So
before we go on -- since we're talking about security, we have to
define the attack model. The attack model in this case is very much
what we normally use in the cloud environment: honest but curious, a
passive attack model, and we assume that this guy on the public side
does not alter data, does not alter results, and so on. What can he
see to attack you? Anything that happens on the public side is
visible to the bad guy, to the adversary, obviously. Not just that:
any interaction between the private and public side is fully visible
to the attacker as well. So the adversary has full knowledge of all
the traffic -- if you're shuffling data, the transmission of data back
and forth is fully visible to the adversary as well. On the other
hand, what you do on the private side, on your own machines, is not
visible to him. Now, you can attack that point. You can say, hey,
depending on timing, maybe he's expecting to see data X transmitted to
him, and whether it took five minutes or five seconds versus six
seconds is visible to the adversary. True. We're not secure against
that attack, against that adversary, so far.
Then we need a sensitivity model. The question is about protecting
data that's sensitive -- so what is sensitive data? Sensitive data is
basically data that you do not want leaked. What I'm going to do is
put caveats here -- maybe we'll kill the animation -- put it here. To
define what is sensitive, it's probably easier to define what is not
sensitive. And you'll quickly see that I've scoped out inference
attacks in this kind of environment, which is, by the way, exactly how
database systems also work in reality. So, for example, if you have
sensitive data, any data which leads to inference about sensitive data
-- any correlation attack of that kind -- I'm not going to allow.
Because if some data can reveal information about sensitive data, you
had better call that data sensitive as well. So there's this
partitioning available to you: this is sensitive and that's not
sensitive. In other words, even if the adversary completely sees all
the nonsensitive data, he still can't learn anything about the
sensitive data. Privacy works in a different fashion altogether; I'm
not concerned about that part. This is pure security of sensitive
data.
>>: Is this --
>> Sharad Mehrotra: So you as a user define it. You may specify that,
let's say, salaries that are too high, or the fact that a person has
been fired, or some sensitive number, are sensitive.
>>: Challenge: the set of movies you've seen, [indiscernible] -- you
don't even know up front that that's potentially sensitive --
>> Sharad Mehrotra: That's what I'm saying. I'm limiting myself to
applications -- more toward a SQL security setting, where you have
defined the sensitivity based on predicates or whatever it is. It's
not addressing the inference challenge, because if you go in that
direction, there's no end to it; then you're solving a differential
privacy problem, which I'm not going into at all. There's a large
class of practical cases which fall under this as a sensitivity model.
>>: Sensitivity [indiscernible].
>> Sharad Mehrotra: Not necessarily. It could be file by file; it
doesn't really matter. The example we'll do is with the MR framework.
What's a record in the MR framework, necessarily? A lot of times
there's a record; sometimes there's not. It will not matter. Okay.
I'm going to make some further assumptions, and this is in some sense
limiting ourselves further. One assumption I'm going to make: if you
compute some function on sensitive data -- the input is sensitive --
then the output is sensitive. Now, you can argue that's not correct,
and actually it is not, because lots of times the computed function's
output is not invertible; you can't recover the input. This is a very
conservative assumption, and we won't try to relax the
conservativeness. Anything less conservative will only help us; it
will not harm us.
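The conservative rule just stated -- sensitive input implies sensitive output, even when the function is not invertible -- can be sketched as simple taint propagation (an illustrative sketch, not SEMROD's implementation):

```python
class Tainted:
    # A value carrying a sensitivity flag.
    def __init__(self, value, sensitive=False):
        self.value = value
        self.sensitive = sensitive

def apply_fn(fn, *args):
    # The output inherits sensitivity from ANY sensitive input,
    # regardless of whether fn actually leaks that input.
    return Tainted(fn(*(a.value for a in args)),
                   sensitive=any(a.sensitive for a in args))
```

Under this rule a sum of one sensitive and one public value is sensitive, even though the sum alone may not reveal either operand; that over-approximation is what keeps the analysis safe.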
All right. So we're going to do MR with sensitive data. In this case
it's record oriented, though it could be column oriented; it doesn't
really matter. In this example, let's say the dataset consists of
name, disease, treatment dates, and so on. And I, for better or
worse, assume that if somebody has got cancer, that record is
sensitive. Okay. Now imagine running an MR job. What the MR job is
trying to do is something very simple: it's creating a list of persons
-- the name of each person along with a list of, let's say, the
diseases the person has. So Chris: flu; James: flu; Jean: acne,
cancer; and so on. If I look at this, the records with cancer are the
sensitive records; the others are not sensitive. The reducer function
runs. It generates output about Jane and Zach, and that output is
sensitive because the input for Jane and Zach is sensitive.
>>: Aren't you contradicting yourself? You have a reduce function
that has a red input and you show a black output.
>> Sharad Mehrotra: So this reduce function -- think of it as working
key by key. This one is working on this key; this key is sensitive,
that key is not. Okay. So first, it's a full MR system. Users
specify, using predicates and so on, which files are sensitive and
which are not. The first question is how data should be distributed
in the hybrid cloud. From the HDFS perspective it's straightforward:
you look at the data, and there's a master who decides the placement
of data. Sensitive data goes on the private side and nonsensitive
data gets shipped off to the public side; not a big deal to do this.
You cannot expose any sensitive data anyway. All right.
Now here's the question. Let's say we start running: we've got
sensitive data on the private side, nonsensitive on the public side,
and we start running this MR job. Clearly the mapper running over
sensitive data is taking sensitive input, so obviously it will have to
run on the private side; this mapper cannot run on the public side.
Anyway, we would not run it on the public side, because you want the
mapper to run close to the data. So that's okay. Let's go down to
the reducer. In this case there are two partitions, and so there are
two reducers. This reducer, which is dealing with the cancer records,
the sensitive records -- where can it run? Can it run on the public
side? Of course not: its input is sensitive, so it has to run on the
private side. What about the other guy? He runs on the top two
records -- no records from here and none from here -- so it's running
on only nonsensitive data. Correct. Can that run on the public side?
>>: I would say no; the pattern of which reducer runs where leaks
information.
>> Sharad Mehrotra: Absolutely. In fact, it cannot, because there's a
key inference attack possibility. Specifically, what happens if you
run this on the public side? The adversary would know, if nothing
else, that James does not have cancer -- because if James had cancer,
there's no way this reducer would be allowed to run on the public
side. The adversary learns that James does not have cancer. And
since James's and Matt's records get sent here, the probability that
Jean or Matt -- one of those guys -- has cancer increases. If there's
background knowledge that X people in the data have cancer, the
probability increases. So it has to run on the public side -- sorry,
on the private side. So effectively what we did is we went around and
said, okay, only the first map operation can run on the public side,
on the nonsensitive data, and everything else has to run on the
private side.
>>: It occurs to me, how does that stay secure? It seems to me you're
making some assumption about the adversary's knowledge. For example,
if the adversary knew that there is a record about Zach, and it is not
visible on the public side when you are mapping -- the adversary is
looking for it, he knows there's a record on Zach, and it did not
appear in the mapping on the public side -- let's say he's able to
infer.
>> Sharad Mehrotra: Remember the assumption. I'll cut you short and
answer it right away. Remember what I said: there is a partition into
sensitive and nonsensitive. Knowledge of what is not sensitive does
not give away anything about sensitive information at all. So
basically, even if he got all the records which are not sensitive, he
would never be able to infer anything about Zach. The knowledge that
Zach is in the data is itself sensitive information.
>>: Is that an assumption about the adversary's knowledge, or is that --
>> Sharad Mehrotra: I'm sorry, it is -- you can treat it either way,
but to me it's a definition of what is sensitive and what's not
sensitive. I'm not allowing inference attacks, because then I'm
screwed; there's nothing we can do at that stage. We'd have to go
into differential privacy and so on, which is going to complicate
matters completely. If I look at mandatory access control work in
databases -- and in fact most of the work in data security -- this is
the model of security it has: here are records that are sensitive and
here are records that are not. You can expose nonsensitive data
anytime you want; sensitive data should not be exposed. If there's an
inference possibility between the two, all bets are off: you made a
mistake; you should not have marked it nonsensitive to begin with.
>>: Which is fine. But the second part, where you reason it has to
run on the private side -- that seems to rely on inference anyway.
>> Sharad Mehrotra: That's a different kind of inference. The problem
is the predicate I have associated -- the sensitivity is defined by
the predicate. So the fact that James's record is not here, that
James's record gets reduced over there, gives me additional
information: there's no other James record out there.
>>: You could reduce twice: one reduction on the public side and then
a second reduction on the private side.
>> Sharad Mehrotra: There's only --
>>: Two reducers.
>> Sharad Mehrotra: So you'll see in a minute how we treat the
problem. All I'm saying is this is what happens if I blatantly run MR.
>>: Run MR --
>> Sharad Mehrotra: That's all I'm saying. We'll see how we can do
this; we'll fix it in a minute. But the point is a fine one: in this
current architecture, the only thing I can run on the public side is
this guy, the first map. That's it. Okay. And in fact there's a
paper -- there's another group working in parallel with us, and they
have a paper called Siri, a CPS paper from last week, and that's
exactly what it does: they run the first map on the public side, and
the reducing and everything else happens on the private side. Okay.
And it makes sense: if a job is very map-heavy and a lot of the data
is not sensitive, then this does make sense completely. The thing is,
we wanted it a bit different, because if we look at database
workloads, a lot of them are reduce-heavy; you are not getting any
benefit from the public machines for most of the work we end up doing.
We wanted to do it a little bit differently. So what did we do? We
did something like taint analysis. Effectively, when you do the
mapping over the sensitive data, I'm going to figure out which keys
are getting dirty -- which are sensitive. So in this case, are Zach
and Jane the sensitive keys? Yes.
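The key-sensitivity analysis just described can be sketched as follows (a hypothetical helper; the record format and map-function shape are assumptions for illustration):

```python
def find_sensitive_keys(map_fn, records):
    # Run the map function over the sensitive records held on the
    # private side and collect every key they emit.  Any such key is
    # "dirty": its reduce group mixes in sensitive input, so its true
    # output may only be produced on the private side.
    keys = set()
    for rec in records:
        if rec["sensitive"]:
            for key, _value in map_fn(rec):
                keys.add(key)
    return keys
```

In the running example, mapping name/disease records marks Jane's key as sensitive because her cancer record emits it, while Matt's key stays clean.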
>>: To put it in standard database query optimization terms, isn't it
true that all you need to do is consider plans where data flows only
from public to private? That's it.
>> Sharad Mehrotra: In the previous plan also, data flowed only from
public to private. Here also -- well, yes. In fact, there was no
data sent from the private to the public side. So it's still a
problem.
>>: An example of a secure plan?
>> Sharad Mehrotra: Not secure. It's insecure unless the reduce is on
the private side; it's not secure otherwise. But your intuition -- if
you hold on for a minute, it will be very clear; your intuition is
almost there, but not fully. Let me say one thing and it will be
clear, hopefully. Here's the idea: I'll keep track of which keys are
sensitive, which is these two in this case. Now the time comes for
reducing. What I want to achieve is the following -- to cut a long
story short, I want to make sure that the behavior and information
exchange visible to the public side is completely independent of what
is sensitive and what's not sensitive. So from the observation
perspective, the execution I get is identical to the execution as if
there were no sensitivity whatsoever. If that's the case, I'm
observationally equivalent and hence secure.
>>: If I know that Jane and Zach exist and I look at the public cloud,
don't I --
>> Sharad Mehrotra: That's the question, too. That's been scoped out
by the assumption we've made.
>>: [indiscernible].
>> Sharad Mehrotra: Yeah, that's the case in the U.S. market.
>>: If you are worried about attacks like that, you mark all the
records as sensitive.
>>: Then you can't do anything in the cloud.
>>: It's not an easy problem.
>> Sharad Mehrotra: That's abstracted away.
>>: That problem is abstracted away in the definition of sensitive.
>> Sharad Mehrotra: Yes. All right. So how will I achieve that?
Think again of reducer two, to which Jane's and Matt's records are
being shuffled. I'm going to replicate reducer two's action on both
sides -- on both the private and the public side. In particular --
and this will always be the case -- the public side will always
shuffle its map output both to the public side and to the private
side. So when the public reducer two gets Jane's and Matt's records,
it will generate records. Notice the difference between Jane and
Matt. Matt is okay: his group generates flu/cold from whatever
records came in. Jane produces just acne here, because her cancer
record never left the private side. And the same records come to the
private side as well, so the private reducer gets Jane's and Matt's
records along with the records from the private mappers. This guy has
access to the list of which keys are sensitive. So when it gets
Matt's records -- Matt's key is not sensitive -- it does nothing with
them; that group is already handled on the public side. When it gets
Jane's records, it sees that Jane is a sensitive key, so it reduces
that group right here, with the full input. So most of the reduce
work, if most data is not sensitive, is actually being done by reducer
two on the public side; the portion of the data for which the key is
sensitive is replicated and done by reducer two right here.
>>: You have to ship all the data. That's the whole cost.
>> Sharad Mehrotra: Yes. Hold on to that for five minutes more; we'll be there. In fact, that is the fundamental question, and you're absolutely right. But I'll show you that the cost is actually better. So this is what I will do. Okay. Now, once I have done this, this guy will have reduced and produced an incorrect answer, which is Jane acne. And there's correspondingly the right one here. It's not rocket science to figure out which of these two is clean and which is dirty. So there's a final filtering step which will get these records together, throw the acne record out, and keep the other one. The logic of it is a bit more complex, but it's doable.
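To make the single-job scheme concrete, here is a toy sketch (illustrative only, not the actual SEMROD code; the record format and helper names are my assumptions). The public side sees only non-sensitive input records, so its reduce output for a sensitive key like Jane is incomplete and dirty; the private side, which receives all shuffled records, replicates the reduce work for the sensitive keys; and the final filter lets the private results override the dirty public ones.

```python
# Toy sketch of replicate-and-filter for one MR job (not SEMROD itself).

def reduce_diagnosis(key, values):
    """Toy reduce: join the observed symptoms into one diagnosis string."""
    return "/".join(sorted(values))

def run_reduce(groups):
    return {k: reduce_diagnosis(k, v) for k, v in groups.items()}

def semrod_job(public_groups, all_groups, sensitive_keys):
    # Public reducer works on whatever it has, sensitive keys included,
    # so sensitive keys come out wrong (it never saw the private records).
    public_out = run_reduce(public_groups)
    # Private reducer replicates the work, but only for sensitive keys.
    private_out = run_reduce(
        {k: v for k, v in all_groups.items() if k in sensitive_keys})
    # Final filter: private results override the dirty public ones.
    return {**public_out, **private_out}

# Matt is non-sensitive; Jane is sensitive, and her "flu" record is private.
public_groups = {"Matt": ["flu", "cold"], "Jane": ["acne"]}
all_groups = {"Matt": ["flu", "cold"], "Jane": ["acne", "flu"]}
result = semrod_job(public_groups, all_groups, {"Jane"})
# result["Matt"] == "cold/flu"; result["Jane"] == "acne/flu"
```

The point of the sketch is the division of labor: when the sensitive fraction is small, almost all reduce work happens in `public_out`, and the private side only redoes the sensitive groups.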
Okay. That's okay for one MR job. Now, the question is what happens if there are multiple MR jobs, a sequence of MR jobs. That gets a bit more complex. The reason it gets complex is: in the most naive implementation, what would I do? I have the right answer here. Now, forget the cost Don was mentioning; let's not worry about that for a second. I would ship the corresponding work from here to the public side. But going back to the intuition: if I ship anything from the private to the public side, I'm screwed. It will always give away information. If you look at the records after the filter, all the output is sensitive in our model. We cannot ship things back. Which then tells me that if I have a multi-stage MR job, the only thing I can do is continue processing here on these wrong records. But if I continue processing on wrong records, I have to have logic built in so that, as the processing moves up into the second MR job, the knowledge of what is tainted and what has gone wrong is available to me. And there are lots of design questions about how to build that logic in. The one we ended up choosing, after a few attempts, was very simple. This reducer, when it looks at the data coming from the public side, not only does the right thing, which it has to do anyway, it also replicates the wrong work done by the previous guy. It will generate Jane acne as well. Okay. So Jane's record here is the right one, this is the wrong one, and this is also a wrong one. This wrong one will be used to cancel that wrong one. Now this input goes through the next layer of MapReduce jobs. Okay. So I don't know how I'm doing on time; I might be over already.
>>: The other --
>> Sharad Mehrotra: All right. So believe me, with proper reprocessing and marking of the keys, you maintain this information about what is dirty and what is not. Basically you continue the processing on the public side as if everything is okay, and you maintain enough state that when the records are merged, which is done at the very end, you will have thrown away the wrong records and kept the right ones. As for the details: if I skip them I can still answer Don's question, so I'll skip this technically more detailed slide on how to maintain that information. Before we answer that question, the first question is security. From the perspective of security, this execution is observationally identical to one in which there was no sensitive data at all, so the adversary doesn't learn anything. You can set up the standard game-based proof, and we were able to prove it. That's not a big deal. Okay.
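The multi-stage bookkeeping just described can be sketched as follows (again a toy model, not the SEMROD implementation; the data layout is my assumption). The public side cannot mark its own outputs as dirty, since that knowledge would itself reveal which keys are sensitive, so the private side replicates the wrong lineage and uses it at the final merge to cancel the matching public records.

```python
# Toy sketch of the final merge that cancels tainted public records.
from collections import Counter

def final_merge(public_out, private_wrong, private_right):
    # The private side replicated the wrong work, so `private_wrong`
    # tells us exactly which public records are tainted. Use multiset
    # cancellation in case the same record legitimately appears twice.
    cancel = Counter(private_wrong)
    keep = []
    for rec in public_out:
        if cancel[rec] > 0:
            cancel[rec] -= 1  # known-wrong record: cancel it
        else:
            keep.append(rec)  # clean public record: keep it
    return keep + private_right

public_out = [("Matt", "cold/flu"), ("Jane", "acne")]  # "acne" is wrong
private_wrong = [("Jane", "acne")]      # private replica of the wrong record
private_right = [("Jane", "acne/flu")]  # the correct private result
out = final_merge(public_out, private_wrong, private_right)
# out == [("Matt", "cold/flu"), ("Jane", "acne/flu")]
```

Between stages, both the wrong and right lineages are simply processed forward; only at the very end does this merge discard the tainted results.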
If I go back to the design goals: security, yes, check, because we proved it. Public side usage, yes, because our maps and reduces all work on the public side as well; we're able to fully use the public side. Limited burden on the end user: most of the logic we're talking about is implementable within the MR framework itself, so the user doesn't do anything at all. The only thing needed is a mechanism for marking what's sensitive and what's not, which is something you need for any system of this kind. So it can be done without any burden on the user. And it's generic to the MR framework. The key question, the fundamental question, is whether it's any good: you're shuffling data from the public to the private machines. Is the overhead too high to be practical? Again, there are two tests for us; one is compared to what. Compared to all-private, or compared to native MR execution? Let's look at where the overheads are. First, we're doing key generation. It turns out that overhead is not very high for the number of jobs. We were initially thinking of representing these key sets in Bloom filters and other efficient set structures. It turns out it doesn't matter: across all our implementations and tests, that overhead is never too high, and if it did turn out to be high, we have techniques galore for making the set-membership checking efficient. So that's not a really big overhead. The extra incorrect processing and filtering we have to do, yes, that's overhead. But again, we're designing this for the setting in which most data is not sensitive and only the sensitive data has to be pruned out. So if that percentage is small, this is not too much overhead either. This is the killer: you're overshuffling. Now, to be fair to us, think about what I'm comparing against. Other solutions like CEDEK [phonetic] have the same shuffle from public to private, and compared to CEDEK we're better. The question is that compared to nonsecure Hadoop we're shuffling more, and we'll be shuffling over a wide-area network, assuming the public and private sides are connected by a slow network. So that is a significant overhead. Obviously, if the network is as fast as a LAN, we'll be okay; we'll start competing with Hadoop at that stage. Okay. So let's do a quick analysis. And I'm
not going to go into the detailed analysis, but more want to point out the parameters one has to consider. The analysis is easy because you can figure out the additional overhead of the MR jobs. Compare, let's say, all-private versus semi-private, which is what our system is. The assumptions: this is the initial data size, this is the intermediate data size, this is the map speed per byte of initial data, this is the reduce speed, and so on. So the amount of time a system takes to run the job will be, for example, the data size times the reduce speed divided by the number of machines. You can figure that out, and you can figure out the shuffling cost as well. So these are the costs, and I'll skip past the math; believe me, it's hopefully okay. So let me go to the comparison. This is the expected cost of doing it all-private, and this is the expected cost of doing it in SEMROD. If this one is more than that one, I'm doing better. So I can pose that as an inequality right here and analyze its parameters. The first observation: if beta and beta star are low, that means the machines are slow. If you've got slow machines on your public and private side, good for you: SEMROD is going to work well, because you're basically increasing the left-hand side. If you've got powerful private machines, then don't worry about the cloud anyway; this isn't meant for that case. Next, the closer you are to LAN speeds, the better for you: the smaller theta is, the smaller the right-hand side, and the better for you as well. Okay. Oh, one more thing: the smaller the number of private machines, the better for you again. The main question is lambda, the ratio of public machines to private machines: what happens with that ratio? It turns out that's a little more tricky, and I'm going to cut through the math and just tell you what it is. If you think about it, you have a certain amount of work; the expectation is that any sensitive data will be processed on the private side and any nonsensitive data on the public side. So if alpha is the sensitivity parameter and you have one private machine, you should have basically (1 - alpha)/alpha machines on the other side; then it's completely balanced, a load-balanced situation. It turns out in this equation that that is actually the optimal number of public machines to get. If you have fewer public machines than that, then increasing lambda, the ratio of public to private machines, up to that point gives you a better chance of satisfying the inequality. So it's good to scale up to that point. Beyond that point, the equation shows it's independent: the public-to-private ratio will not matter. Which is not unexpected, because when you add more public machines, the remaining data is sensitive and will never go to the public side anyway. So effectively what you do after this analysis is figure out the important parameters of the equation, which are the public-to-private ratio lambda, the percentage of sensitive data, and the speed of LAN versus WAN. Let me show you some results.
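Before the results, the parameter analysis above can be sketched numerically. This is a back-of-the-envelope model, not the slide's formulas (which aren't reproduced in the transcript), so the parameter names and cost expressions are my assumptions: alpha is the fraction of sensitive data, beta is processing speed per byte, theta is the per-byte WAN shuffle cost, and the load balances when one private machine is paired with (1 - alpha)/alpha public machines.

```python
# Illustrative cost model for the all-private vs. SEMROD trade-off.

def optimal_lambda(alpha):
    """Public-to-private machine ratio that balances the load."""
    assert 0 < alpha < 1
    return (1 - alpha) / alpha

def all_private_cost(data, beta, n_private):
    """Time to process all `data` bytes on n_private machines at speed beta."""
    return data * beta / n_private

def semrod_cost(data, alpha, beta, theta, n_private, lam):
    """Sketch: the private side handles the sensitive fraction, the public
    side handles the rest, plus a WAN shuffle term with per-byte cost
    theta. Stage overlap is ignored for simplicity."""
    n_public = lam * n_private
    private_time = alpha * data * beta / n_private
    public_time = (1 - alpha) * data * beta / n_public
    shuffle_time = data * theta  # everything crosses the network once
    return max(private_time, public_time) + shuffle_time

# With 10% sensitive data, the balanced ratio is 9 public per private
# machine; with a shuffle 10x cheaper than processing, SEMROD wins easily.
lam = optimal_lambda(0.10)  # -> 9.0
```

Plugging in numbers shows the behavior described in the talk: raising lambda helps only until the private side's sensitive work becomes the bottleneck, after which extra public machines are idle.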
And the key question is, given these parameters, does SEMROD do better under realistic assumptions? So we did experiments to figure out how things work across that space. The first set of experiments was done on a small cluster at UCI itself, with some nodes designated private and some public: the same cluster divided into two parts. To mimic inter-datacenter performance, we added delays to simulate a wide-area network, and we varied the ratio from LAN and WAN being the same all the way to the LAN being 100 times faster than the WAN, just to get a sense of how things go. We also experimented in a realistic setting: we got machines at UCSD as the public machines, ours as the private machines, and ran across the wide area as well. And we ran this over a variety of benchmarks, from TPC-H to HiBench and so on, all the way from PageRank algorithms to sorting, TeraSort, all that stuff. Let me show you a couple of the results. This is at different sensitivity ratios; the Y-axis here is speedup with regard to all-private; the red is CEDEK with the map optimization, and the other is SEMROD. When the sensitivity is very high, 50 percent of the data sensitive, the two are similar; otherwise we have a significant advantage over CEDEK. Okay. And this is for the multi-level jobs. CEDEK doesn't benefit much from those jobs, because only the maps can be done on the public side; we have an advantage compared to all-private. And this is amortized and averaged over all the different jobs we ran. Let me show you one more result. This is for the inter-cluster network; this is the part, Don, you were asking about. The axis here captures the ratio from LAN equal to WAN speed all the way to the LAN being 100 times faster. And the ratio of machines is 1 to 17, meaning one private and 17 public, or 1 to 5, meaning one private and five public machines. If you look at the performance, and this is speedup with regard to all-private: even when the WAN is pretty slow, and 100 times slower is very, very slow, we are still getting some performance improvement over running it all on private. If the WAN speeds are better, we obviously get much better performance.
>>: There's a performance aspect, but there's also a cost, a monetary aspect. In all public clouds, LAN communication is free and WAN communication is really expensive.
>> Sharad Mehrotra: Point well taken.
>>: The ratio is infinite in the current cost models.
>> Sharad Mehrotra: So this has not considered monetary cost, which is an important parameter. Absolutely right.
>>: It's not going to one, it's going to 0. You're going to 0, right.
>> Sharad Mehrotra: Sure. This has not considered the cost model. Absolutely, there should be a cost aspect to it. Okay. And these are the results -- actually, the next one is slightly better. So this is a more detailed result over different jobs, and what it's showing is the relative speedup with regard to all-private. Sorry, I should have gone to the previous slide. There's the other question of how this compares to Hadoop, and we can be pretty bad at times. This is an example: this is CEDEK, up here is SEMROD, and if you ran this on Hadoop on the mixed cluster you'd get a speedup of up to six times, while we're getting only up to two times. So sometimes, depending on the job, native Hadoop will do much better compared to secure Hadoop.
>>: [indiscernible] analytical model doesn't capture this. Why didn't it say more than the [indiscernible]. So you have to start some jobs locally, transferring large datasets. So that will --
>> Sharad Mehrotra: It will affect things, yes. So at the end of the day, I'm hoping that the solution to this resides with something that you guys are doing. If there were a secure component of the basic cluster in the cloud itself, that would be so much more ideal, because then I'd never have to worry about what he's saying. I'd always have zero extra cost beyond paying for whatever that extra security is. So yes.
>>: So should I read this graph, the Hadoop line, as being the all-public line?
>> Sharad Mehrotra: No, the Hadoop line is not all-public. The Hadoop line is private plus public but with no security. So forget about security; just run it as if the machines are all yours.
>>: So then, one of the points you started with was that it can't be too bad compared to any one of those cases. Where is the all-public line?
>> Sharad Mehrotra: We compared to the all-private line, not the all-public line. I see, I see. Okay, I see what your point is, but I think what will happen is: for all-public, if I've got, let's say, five private machines and 15 public machines, would you want to compare to 20 machines or to 15 public machines?
>>: Either way.
>> Sharad Mehrotra: I see your point, but we haven't done that experiment.
>>: Why is the all-public better than the all-private? Because --
>> Sharad Mehrotra: Many more machines, faster machines. So right now our assumption has been, in this setup, in the real test, that the machines on the public side were much faster. They were UCSD machines, and ours is a small little cluster, so it's much slower. So that point is also valid. But in the simulated experiments they were machines of the same power. So we could have done the experiment; we didn't. We compared to all-private and did not compare to all-public. But presumably all-public will be around Hadoop itself. It will not be too much more; maybe it will be, because it depends on the LAN speeds and so on. It could even be better than that. It's possible. Okay. So where are we going with this? Well, what happens in the execution is that initially some of the data is sensitive, and as you compute, more becomes sensitive: the sensitivity spreads. So there is always a point, and this is a query optimization problem, where shifting the computation back to the private side is better. How to do that we didn't fix; we should. Basically we should do something smarter here. So there are lots of small things one can do. The most interesting to me is that this entire model is very suitable, I would think, not so much for MapReduce but for Spark or AsterixDB and so on. The reason is simple: in a general workflow system you can maintain state and reuse it. In Hadoop it's a silly problem: you have to store things to disk, you basically lose state, and you can't do partial computation. There's a lot more to do in that kind of setting. The other thing is, I started by saying I'd give a talk on risk, and I didn't talk about risk at all; this is zero risk. So there's a natural extension of this work to incorporate risk management and so on. We did initial experiments, and basically the results showed it's not an easy problem. There's a lot more work to be done in that area.
>>: Formulating -- okay. The performance.
>> Sharad Mehrotra: True. Okay. So my last slide, and sorry for taking such a long time. Basically, if you go toward the cloud you have loss of control; we all kind of agree on that. It leads to privacy and security concerns, and we have focused on security and encryption, which is great, wonderful. But I think there's a lot of power in not treating encryption as the only approach to secure processing. Using the trusted hardware available to you, whether it be a client machine or something on the server side itself, is a very useful direction, and that was the point: we didn't do trusted hardware, we did a trusted client; there are differences, but it's along the same direction. I think that's the right approach, at least worthy of exploration. We have done a lot of work on the risk modeling and so on, but it's not really mature; our risk model is very simple, basically zero risk if you don't expose data and one if you do. A lot more work remains to be done in the risk modeling.
>>: When you say that as part of this you could use secure hardware to improve it, that's not completely obvious, because I think one key assumption is that your data needs to be stored in the public side, not the private, and even the initial data of who has what disease is hidden. But in any architecture where the metadata itself resides in the cloud, then again other things can go wrong.
>> Sharad Mehrotra: All I'm saying is I think mapping this into
working appropriately on secure hardware is not a straightforward
thing. It's not trivial. It's not trivial at all.
>>: Even data storage is a problem.
>> Sharad Mehrotra: I agree with you completely. How you use secure hardware is going to be interesting in practice; it's not a done deal. But I think it's an interesting direction, and it will overcome some of the problems Don is defining; the main question is whether it's the approach one should explore. As for what we have done: I talked today about SEMROD. If I get a chance, let me talk to some of you about the work we've done in Hypervisor; I'd love to get your feedback. And CloudProtect -- I know Don has done something similar as well, one of the things he did when he was at ETH; it's very related because I read that paper of yours. So we've done work in that direction. I'll stop at this point. Sorry it took a little longer; I'm not going to go into other things right now.
>>: [indiscernible].
[laughter]
[applause]
Download