
>> Ari Juels: Well, we're all familiar with the proposition of the cloud. It presents
us with a nice shiny abstraction, an IT resource in its purity. And the wonderful
thing about abstraction of course is that it lends itself to conceptual simplicity and
to administrative simplicity.
But there's also a dark side to abstraction. It sweeps a lot under the rug. So while
that may be the view of the cloud of a happy programmer, the cryptographer's view
of the cloud tends to be rather different. It looks something like this.
Now, cryptographers of course are known for their extreme pessimism. And it is
also worth considering less hostile adversaries. You'll see in the latter part of my
talk that I do that. And perhaps also look at an auditor's perspective. An
auditor's perspective is somewhat more benign but still very challenging.
If you walk into a datacenter of the cloud or anything else, you'll see something
like this. These are actual photos pulled by a colleague here of real data centers.
So it's with this motivation in mind that we've crafted the RSA Labs research
program in the interest of developing something we call remote cloud security
checkups.
The idea of the general framework is to restore transparency to the cloud, to kind
of drill a little hole in that abstraction to enable testing of security and compliance
properties but to assume minimal trust in the cloud. The framework is something
like this. We have a cloud, we have an auditor acting on behalf of a tenant or the
tenant acting directly on her behalf and she wants to know if the cloud has a
particular security property. I'm going to talk about several of them today.
So in good cryptographic parlance she sends a challenge to the cloud and she
receives a response and she verifies it. Pretty much everything I'm going to talk
about today follows this general format.
I'm going to talk in particular, as I mentioned, about cloud storage. And cloud
storage can get very naughty indeed, because the ability to encrypt files and
integrity-protect them means that you can move them around with a fair amount of
dynamism. So it's not inconceivable that something like the following would arise.
It's a little over the top, but it gives you a sense of the types of problems we're
concerned about.
Alice wants to upload her wedding photos to the cloud, back them up. They're
important to her. They have strong sentimental value. So she sends them to
some big company in Seattle, a cloud storage provider. Well, what do big
companies do? Big companies of course outsource to smaller companies. This
company sends the file to an agile little company in Bangalore called Mini File. Well,
Mini File is not just agile. It's clever. And what do they do? They make use of
unused storage capacity on consumer hard drives. So they send the thing to
Bob.
Well, of course Bob has no idea that it's there. Bob also happens to be an avid
user of peer-to-peer networks. And by coincidence Alice is also an avid user of
peer-to-peer networks. Well, one day Alice's machine crashes, she contacts
MegArchive and you know the rest of the story. As I said, this is a little over the
top. But it gives you a sense of the way risk can be passed around in cloud
storage in particular in unpredictable ways.
There are other reasons why Alice's file might suffer. For instance, MegArchive
might keep Alice's file around but move it to a slow disk array. So the file's there,
but if she wants to retrieve it, it's going to be a hopeless proposition. Or
MegArchive might use cheap storage, crummy drives that degrade over time. So
her file is subject to bit rot.
Or if they're really trying to cut corners, they might actually throw away a portion
of Alice's file. I mean, how often do users actually retrieve their backups? If one
or two users look for a backup and it's not there, MegArchive apologizes, sends a
check, and everything is hunky-dory.
So the question I'm going to address in this first portion of the talk is how can
Alice be sure that she's able to retrieve her file in its entirety? And this is
motivated by a couple of papers listed here.
The technique we use is something called proof of retrievability. Alice would like
to be sure that her file is retrievable. She may also want to ensure compliance
with A Service-Level Agreement to make sure she can retrieve her file in a
certain amount of time. I'll talk about that a little later.
Well, there's a simple approach to this of course. Alice can just download the
file. This is not good, though, because it's resource intensive, right? It's a recipe
for congestion.
So I'm going to build up this scheme incrementally, give you a flavor of how it
works. I'm not going to delve into all the details. And we'll start with spot
checking. Here's what we might do.
Before Alice uploads her file to the cloud, she rolls some dice and pulls down a
few random fragments of the file.
Now, to make sure her file is there, what does she do? She picks one of these
fragments, pulls down the corresponding fragment from the cloud, and makes sure
that the two match. If something has gone entirely wrong, if there's a big chunk of
the file missing, she's going to notice this. Well, this is nice because now she doesn't
have to download the whole file. If there's a lot of the file missing, she can detect
it. But there are still some drawbacks to this basic approach. She's got to store
these chunks of the file. And she can't detect small erasures or corruptions; you
know, creeping bit rot, for instance, is going to slip past her here.
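To make the spot-checking step concrete, here is a minimal sketch in Python; the fragment size, the number of locally stored samples, and the helper names are illustrative choices, not parameters from the talk.

```python
import random

FRAGMENT_SIZE = 4096

def fragments(data):
    return [data[i:i + FRAGMENT_SIZE] for i in range(0, len(data), FRAGMENT_SIZE)]

def prepare(data, num_samples=8):
    # Before upload, Alice keeps a few randomly chosen fragments locally.
    frags = fragments(data)
    positions = random.sample(range(len(frags)), num_samples)
    return {i: frags[i] for i in positions}

def spot_check(local_samples, fetch_fragment):
    # fetch_fragment(i) asks the cloud for fragment i of the stored file.
    i = random.choice(list(local_samples))
    return fetch_fragment(i) == local_samples[i]
```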
So here's a refinement. We use a Message Authentication Code. What Alice does
now is, in the preparatory stage, she generates a key, a symmetric key, and she
MACs each of the fragments of the file, and she just stores the MACs, the
appropriate integrity protection, in the cloud.
Well, now, to check whether her file's there, she doesn't have to keep a
fragment locally; she just pulls down the MAC, pulls down the fragment she's
looking for, and uses her locally stored key to check. This is better than the
previous solution. She doesn't have to store any of the file. But we're still not
quite there yet.
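Here is a small sketch of that refinement, assuming HMAC-SHA256 as the MAC; only the key stays with Alice, while the MAC values are stored in the cloud alongside the fragments.

```python
import hashlib
import hmac

FRAGMENT_SIZE = 4096

def frag_mac(key, index, fragment):
    return hmac.new(key, index.to_bytes(8, "big") + fragment, hashlib.sha256).digest()

def prepare(key, data):
    # Alice MACs each fragment before upload; the MACs go to the cloud too.
    frags = [data[i:i + FRAGMENT_SIZE] for i in range(0, len(data), FRAGMENT_SIZE)]
    return [frag_mac(key, i, f) for i, f in enumerate(frags)]

def check(key, index, cloud_fragment, cloud_mac):
    # Later she pulls down one fragment plus its MAC and verifies with her key.
    return hmac.compare_digest(frag_mac(key, index, cloud_fragment), cloud_mac)
```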
Alice can't detect small erasures or corruptions. So we need a further
refinement. This further refinement is an error correcting code. As you recall,
what an error correcting code basically does is append some parity bits, some
error correcting information, to a file such that if a piece of the file goes missing or
gets corrupted, the resulting string can be decoded to retrieve the
original file if the corruption is not too heavy.
So if we error correct Alice's file, now the only way to lose the file is for a
big error to occur. We've basically amplified the errors in the stored file. And this is
very nice because it means, if we add parity bits to the scheme and we also
check the correctness of the parity bits, that Alice doesn't have to store any of
F; she just stores a key and she can detect corruption in F with high probability.
So here are some sample parameters we might use: an error correcting code that
can tolerate up to ten percent corruption. Alice might check 30 positions, and if the
file is irretrievable she'll detect the fact with very high probability, and more checks
can boost that arbitrarily close to 100 percent.
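The arithmetic behind those sample parameters is a one-liner; the ten percent corruption rate and the 30 checks are the figures quoted above.

```python
corruption_rate = 0.10   # the code tolerates anything below this
checks = 30
miss = (1 - corruption_rate) ** checks   # chance all 30 probes dodge the damage
print(1 - miss)                          # about 0.96; more checks push this toward 1
```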
There are lots of other twists and complications. As I said, I'm going to skip over
them. I'll just mention one or two here. You know, as the problem -- suppose
that Alice sends challenges to the cloud, gets responses. They're all correct.
Then she decides to retrieve her file and the server says no. What does she do?
That's a problem. I'll talk about that a little later. And somebody clever may have
noticed that actually PORs don't solve the original problem I described. If Alice
challenges the cloud, her challenge will circle around to her own machine and
then come back and actually it will look like the file is there.
So this raises an interesting point, namely that the speed of a response in a POR
is also significant. It tells us how quickly the file can be retrieved. So if there are
many, many levels of indirection like this, that's something Alice can pick up.
So PORs are useful for checking quality of service for that reason.
Well, I come from an industrial lab, so I also have to give you a business plan
here. Where is it the case that a quickly detected failure, a missing file, is
tolerable? Where is it not catastrophic? The answer is in backup. If Alice has
backed her photo album -- her wedding album up to the cloud and it goes
missing, she's still got a local copy. So it's not catastrophic. And then as I
suggested, we can also use PORs to detect the speed with which files can be
retrieved. And that's a useful property as well.
Well, there is that problem, though, right? Alice can send challenges and
responses and maybe one day the response that comes back is not correct.
Right? Maybe there's a little accident in the datacenter. What is she going to do
in that case?
This is motivation for another project. We know how to deal with situations like
this. Namely we can distribute files across multiple devices. This is the principle
underlying RAID. We can store file blocks on a few servers, store some parity
information on another server, and now if something goes wrong we can take the
parity blocks and use our error correcting or erasure code and reconstruct the
original file.
So why not do that with cloud providers, right? Why not just sort of do RAID
striping across cloud providers? The challenge here is that the situation is
not so nice as it is with RAID. When you get an error in a RAID configuration,
you assume that drive failures are benign, that the drive will let you know. But
cloud providers may not tell you. They may lose data and not inform you until it's
too late. And if this happens to occur on two providers, you're in serious trouble.
Or if there's a virus propagating through multiple systems, you can lose your file
irretrievably.
This motivates a project we call High-Availability and Integrity Layer, HAIL, for cloud
storage. We try in this scheme to protect against a fairly strong adversary,
a mobile adversary. A mobile adversary moves from device to
device or cloud to cloud, actually corrupting as it goes and not notifying anyone of
the fact that it's infected the system.
This type of adversary models, for instance, the situation I described where there
are system failures that a user doesn't get informed about, or there's a virus
propagating through the cloud that corrupts files. And RAID, as I said, is not
designed for this kind of adversary. It's designed for limited, readily detectable
failures in devices you use, basically the benign case.
This mobile model is very strong, not the strongest achievable, but one that we
think has real practical utility. The idea in HAIL is the following: In cryptography,
the usual approach to a mobile adversary is a proactive one. You assume that
the adversary is going to be there for some period of time, so you tear
everything down, destroy the adversary, build it up again, and everything is
clean. You get a fresh start. But that can be awfully expensive, right, pulling down
a file, redistributing it and so on and so forth.
So a cheaper possibility is a reactive approach. We detect corruptions to files or
missing bits and then we remediate. And that's the way that HAIL works here.
So we try to seek out the adversary and knock it out of the system, kind of play a
game of whack-a-mole here. And PORs are useful for providing detection of the
presence of the adversary, if it's corrupted files.
Here's how this game works in a nutshell. We take our file. We carve it up into
pieces with some message blocks, some parity blocks. And as in a RAID
configuration, we also add a couple of parity servers. And now Alice, when she
wants to know if her file is intact, picks a row at random, challenges it, and
basically does what's done in a POR. She can also challenge multiple rows and
aggregate. But I won't get into that.
So now everything looks great. We've got RAID-type redundancy so we can
tolerate up to T provider failures. And use of the POR allows us to bound the
number of failures, to bound the data corruption at T, if we parameterize the
game correctly.
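A rough sketch of that row layout, assuming for simplicity a single extra server holding XOR parity; the actual scheme uses a stronger erasure code, embedded check values, and aggregation, so this only conveys the shape of the challenge.

```python
from functools import reduce

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def disperse(file_blocks, num_primary):
    # Each primary "server" stores one column of the row layout; the extra
    # server stores the XOR parity of each row.
    servers = [[] for _ in range(num_primary + 1)]
    for start in range(0, len(file_blocks), num_primary):
        row = file_blocks[start:start + num_primary]
        row += [bytes(len(row[0]))] * (num_primary - len(row))  # pad a short last row
        for j, block in enumerate(row):
            servers[j].append(block)
        servers[-1].append(reduce(xor_blocks, row))
    return servers

def challenge_row(servers, row_index):
    # Alice picks a row at random and checks that the pieces are consistent,
    # much as she would check a POR response.
    row = [s[row_index] for s in servers[:-1]]
    return reduce(xor_blocks, row) == servers[-1][row_index]
```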
So it looks like we're done. And we are basically done, but there's a nice
optimization here that I just want to point out. This row has a useful structure.
The parity information essentially duplicates to some degree the message blocks.
And so the parity information could be used to check the integrity of a row. In the
original POR, you remember, you had to store these explicit check values.
In HAIL they're basically already there. So Alice can probe a whole row. And the
redundancy here allows her to check its correctness.
Now, we have to do something a little bit more than this. But it turns out if you
layer a PRF on top of an error correcting code in the right way, basically you
encrypt the blocks, you get effectively a MAC. And we get it almost for free here.
We're just layering it on top of data that we would normally use in our RAID
configuration. So now if the adversary corrupts something, we can easily detect
it.
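One way to picture the "free MAC" idea is to mask each parity block with a PRF of its position under Alice's key; this is only a sketch of the flavor, not HAIL's actual construction, which layers the PRF and the code more carefully.

```python
import hashlib
import hmac

def prf(key, row, col, length):
    # Expand HMAC-SHA256 into a pseudorandom pad of the needed length.
    out, counter = b"", 0
    while len(out) < length:
        out += hmac.new(key, f"{row}:{col}:{counter}".encode(), hashlib.sha256).digest()
        counter += 1
    return out[:length]

def mask(key, row, col, parity_block):
    pad = prf(key, row, col, len(parity_block))
    return bytes(x ^ y for x, y in zip(parity_block, pad))

# The masked parity is what gets stored. A server that silently corrupts data
# cannot produce parity that unmasks to something consistent with the code,
# so the corruption shows up when Alice unmasks and checks a challenged row.
```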
So briefly what HAIL gives us is an abstraction, the high integrity file system. It's
robust to a Byzantine mobile adversary. The cost in terms of storage and the
server overhead is modest because we get these free MACs I mentioned. And
we use PORs so we don't have to pull down whole files. And we can aggregate
and so on and so forth.
So again, let me give you the business plan. This one's a little more exciting, I
think. Now, RAID itself actually led to a new business model. And, in fact, it was
responsible for the birth of my company. RSA belongs to a larger company called
EMC that builds storage arrays. And they competed with IBM at a time when IBM
was building these monolithic high-reliability drives. And EMC came in with this
fleeter system, the RAID system, and managed to dislodge IBM from important
parts of the market.
It would be nice if we could do the same thing with HAIL. You know, maybe we
could put EMC out of business. Suppose we fuse together cheap cloud
providers, the equivalent of low quality drives, to get a high quality abstraction.
So for instance, Amazon charges something like 15 cents per gigabyte per month
for storage. There are some providers that you've never heard of, for good
reason, that charge only two cents per gigabyte per month. They're probably less
trustworthy, but we could probably get five of these and it would still cost us less
than using Amazon S3. And HAIL's useful for this type of approach.
Okay. The final third of the talk: I'm going to describe another challenge in the
cloud. I've been describing techniques for verifying logical properties of a file.
It's also helpful to verify the physical layer. And the physical layer is exactly the
thing that gets abstracted away in the cloud. Amazon, for instance, claims to
store three distinct copies of a file for resilience. Can they prove it? Is there
some way to build a challenge-response scheme to verify that fact? Well,
obviously proof of retrievability isn't going to do the trick. Even downloading isn't,
right? Alice can download her file once, then twice, then three times. That
doesn't mean that it's stored three times; it could be stored once and just
delivered multiple times. So it's not obvious how Amazon would prove that it's
got three copies of your file. You could encrypt your file under different keys, but
that gets messy. You've got a key management problem. And the
challenge here is particularly cute because of the use of virtualization. So for
instance the cloud provider may say your file can survive two disk crashes. We
stripe across five drives. Well, that's great. But if those drives get virtualized, it's
conceivable that, in fact, they're physically co-located, and in that case it could be
that one drive failure results in loss of the file.
This motivates very recent work, how to tell if your cloud files are vulnerable to
drive crashes. So we would like to prove that a file can actually survive two disk
crashes. When you think about it, though, it's not obvious how you do this, right?
How do you prove that certain bits are sitting on certain pieces of hardware? How
do you achieve a logical-to-physical binding?
Well, I hope you're waiting with bated breath, but I'm not going to tell you the
answer. I'm going to hint at the answer with a
little story and then let you figure out the solution. Here's the story. There's
a fraternity, Eata Pizza Pi, that orders regularly from its favorite shop, Cheapskate
Pizza. And generally things go well. They call up, they say we want six pizzas,
and the pizzas arrive and everything is great.
Well, one day during the big game they call up Cheapskate Pizza, they ask for
their six pizzas, and Cheapskate says oh, sorry, the oven broke. Can't deliver. Now,
this was the big game, okay. So Eata Pizza Pi was really ticked off, and they
threatened to send their business elsewhere. Cheapskate Pizza said okay, okay,
we're going to add an oven, so we've got redundancy. We're going to give you extra
capacity here. And that way if an oven fails on the day of the big game, you'll still
be okay, we can still make your pizzas. So how can Eata Pizza Pi verify, without
having to go to the premises, which would be a breach of hygiene rules and
so on and so forth, how can they verify that the two ovens are there?
Well, we could do the following. Suppose that Eata Pizza Pi knows that an
oven bakes one pizza at a time, serially, and takes 10 minutes, and they know that
the Cheapskate truck takes 15 minutes to get across town. Well, then what they
do is the following. They order six pizzas, start the stopwatch, stop the stopwatch
when the pizzas arrive, and they make sure it took 45 minutes. Why 45
minutes? Well, three pizzas were baked in each oven. That's 30 minutes of
baking time. And then 15 minutes to get across town. And this tells them
something about the oven capacity.
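The arithmetic in the story, for the record (the two ovens, 10-minute bake, and 15-minute drive are the numbers from the story itself):

```python
ovens = 2
pizzas = 6
bake_minutes = 10       # one pizza per oven at a time
delivery_minutes = 15
rounds = pizzas // ovens                         # 3 rounds of parallel baking
print(rounds * bake_minutes + delivery_minutes)  # 45; with one oven it's 75
```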
So I will leave it to you to translate this technique's principle to storage
arrays.
So I may have some time for questions or you can ask my colleague Kevin.
Kevin is going to be giving a talk on implementation of these various schemes
later today. So he can give you parameters and other details. But I'm happy to
entertain questions if we have got time.
>>: [inaudible] different drive. I really want it to be in different towns and
[inaudible] different [inaudible].
>> Ari Juels: Right. Yes.
>>: In different [inaudible] zones.
>> Ari Juels: Yes, I agree with that. That's an important problem as well. And,
in fact, geolocation or localization is also important because of jurisdictional
issues. So you can use things like time-of-flight techniques, somewhat similar,
actually, to resolve those problems; they are important.
>>: Yeah, which are Cheapskate Pizza [inaudible] similar contacts to the
question. They could be that Dominos opened up across the street and they
have an agreement that when their pizza oven goes down, they can go over and
use Dominos because Dominos is the new business and it's growing and they're
not using very much of their capacity.
[brief talking over].
>>: But it doesn't prove that if the power transformer that serves that block breaks
down, that they have the redundancy --
>> Ari Juels: Well, actually it does, in the sense that -- so the question is if, you
know, if Cheapskate is making use of Dominos to fulfill orders, do we have a
problem? When you translate the analogy into the storage setting, what you get
is a guarantee that your files are stored not just locally but in the annex, if you will,
the equivalent of Dominos. So it's okay. It means that Dominos is actually
engaged in storing your file and they help provide the resilience you want. So
the analogy is not perfect in that regard.
>>: [inaudible].
>> Ari Juels: Right. And that can be layered on top. But that's an important but
essentially orthogonal question.
>>: [inaudible] translation [inaudible] measurements of vacancies of time
responses. Unless you're seeking really close data center and [inaudible]
experiment with the measurements [inaudible] so many moving parts [inaudible].
>> Ari Juels: Well, I've glossed over a lot of the details. Kevin is actually going
to talk about that very issue. We've done experiments, and of course the
important thing is not the total latency but the variance, because you can always
subtract out some base latency. So we actually did some experiments to
determine what the variance was. And we can compensate for variance using
the appropriate statistical techniques. That's a very good point. And there are
actually a number of twists here that I think will be very interesting to hear about
from Kevin.
>>: So it seems like -- so all this work is assuming an uncooperative
cloud provider. Have you guys thought about relaxing that assumption a little
bit, thinking about cloud providers that are willing to cooperate to a certain
degree and how that changes the trust models and what types of things you can
and cannot do?
>> Ari Juels: That's a great question. Particularly as I was talking about weaker
security models earlier on. Actually the last protocol I described, which we
call RAFT, relies on kind of subtle adversarial modeling. It relies on an economic
principle. We're testing to be sure -- we're using the bandwidth caps on drives,
essentially, as a way to ensure that a file is accessible. If the cloud provider
decides that it's not going to use hard drives but it's going to use SSDs, solid
state storage, which is much faster, it can cheat.
But that's going to cost it a lot more. And in some sense you would be happy if it
cheated because then, hey, your file would be on a solid state drive.
>>: [inaudible] reparameterize what your protocols are doing to account for the
faster hard drives?
>> Ari Juels: You can. But you need an accurate commitment on the part of the
cloud provider, because if the cloud provider says it's using enterprise-class
drives and then it bumps up its storage quality to SSD, you would be in trouble.
But then you would have this economic argument acting in your favor. SSD is a
lot more expensive than enterprise-class drives. So it doesn't make sense to cheat
in that way.
So actually the modeling here, you know, doesn't have the full robustness of
a traditional cryptographic one.
>>: So maybe I'm just curious for [inaudible] I know you only [inaudible] so what
kind of assumptions if you prove the security of the scheme, do we need some
kind of non standard complexity leveraging assumptions to argue that you cannot
[inaudible] like for this file application. I'm just curious what kind of assumptions
are used in your proofs.
>> Ari Juels: So we make some simplifying assumptions in the paper that
basically get us away from certain cryptographic complexities. But actually in the
appendix to the paper we explore some cryptographic primitives that are useful in
this context, and I can talk to you about those afterwards. So there is actually
interesting crypto at play here. For practical purposes you probably wouldn't
need it. But -- sorry. I --
>>: So [inaudible].
>> Ari Juels: Right.
>>: [inaudible].
>> Ari Juels: Yeah. So that process -- Ron is mentioning the fact that if multiple
users are storing the same file, the cloud can just store one copy. That process
carries the ugly name of deduplication. As you may know, deduplication makes it
all the more important to verify the robustness of the file. Because if there are 10
copies, if 10 different users happen to have the same copy of the file,
you have resilience built in. If you're compressing those down to one copy, then
you really need to be sure that that one copy is well protected.
>>: [inaudible] to make sure that [inaudible].
>> Ari Juels: That's right. Actually I'm not sure about that. If the user -- it
depends on how the deduplication works. It can get messy in the sense the files
get carved up. But if the file's not carved up, then all I want to know as the user
is that there's one copy of my file that's resilient to drive failures. I don't care that
other people happen to have the same copy. And I know what the copy looks
like. So I'm probably okay in that setting. As I said, dedup, you know, is much
messier than that. But at least, you know, in that simple scenario you're probably
right.
>>: [inaudible].
>> Ari Juels: So, as I said, that's actually, you know, in some sense a motivation
for this type of work. All kinds of strange stuff goes on in the cloud. And dedup
is just one example. You've got virtualization messing things up. There are so
many layers that -- I mean, this is the justification for adopting the framework we
do, the challenge-response framework that treats the cloud as a black box.
Because if you delve in and try to do the audit, you're going to deal with that house
cat's field day of wires that I showed you in one of the slides.
So, you know, it's exactly this type of issue that underpins the research.
>>: Okay. One last question.
>>: So I like the [inaudible] measuring a black box. Would you consider actually
adding an interface, like having the storage sign messages and sufficient
[inaudible] signatures?
>> Ari Juels: That's a great question. The question is should we be outfitting the
cloud with special-purpose interfaces to support this kind of thing. Actually, to
some extent we have to. I mentioned for instance that it's possible for the tenant
to aggregate some challenges. That means the cloud has to have some facility for
doing the aggregation. And that means some extra protocols. And that's
something that we're working on. I don't know if you're going to talk about that,
Kevin, but that is important. They do need special-purpose support to some
extent. You can do a very lightweight POR where you just challenge individual
blocks, probably with existing infrastructure. But to get something stronger and
nicer, you have to have the special interface.
>>: Okay. Let's thank the speaker.
>> Ari Juels: Thank you.
[applause].
>> Charalampos Papamanthou: Okay. My name is Charalampos
Papamanthou, and I'm a PhD student at Brown University. And today I'm going
to talk about part of my PhD thesis, which has been done in collaboration with
my advisor, Roberto Tamassia at Brown University, and Nikos Triandopoulos
from RSA Labs.
And I'm going to talk about efficient verification of outsourced data and
computations.
Okay. So let's see what's the motivation of my talk. So as we know, the cloud is
becoming a big thing. You have many online services that store our data online:
Amazon, Google Docs, Microsoft Azure. However, we put our files online but
we don't have any guarantees of integrity. Okay? As of now.
So for example Amazon says explicitly that we have no liability to you for any
unauthorized access, use, corruption, deletion or loss of any of your content or
applications. So we can't really guarantee what's going to
happen with the files after you put them online. So it's a big problem to be able to
check that the data that you're receiving, after you have uploaded it, is in the
exact same state as when it was first uploaded to the online servers.
Okay. So another scenario that involves some kind of outsourced computation
and a need for verification of queries that are run by untrusted servers is a three-party
scenario where you have a trusted source that produces some data, okay,
so say it's New York Stock Exchange, you trust New York Stock Exchange, but
the actual stock price is not retrieved by querying New York Stock Exchange, it's
being retrieved by querying some other untrusted server to which New York
Stock Exchange has delegated the computation and the answering of the
queries.
But you only trust New York Stock Exchange and you don't trust the intermediate
servers that are going to answer your queries. So what
you need to do is, when, say, the Wall Street Journal gives you something back, you
want to be able to verify that this is correct based on the trust you have in the
initial source of the data. Okay. So this is like a three-party model that captures
the idea that, you know, you want to be able to verify queries that are answered by
untrusted parties. Okay.
And to formalize this specific model, the model of authenticated
data structures was introduced, where we have a three-party model, okay: so we
have a source that produces the data, and this is the entity that we really trust;
the data is outsourced to an untrusted server, okay, so we trust the source, we don't
trust the server; and the client contacts the server, sends a query, and retrieves a
proof and an answer from the server. And he has to be able to verify the
proof and the answer based on the trust he has in the source.
And the security property here we want to guarantee is that the computationally
bounded server that we don't trust should not be able to come up with a valid
proof for an incorrect answer. So basically a server should not be able to
[inaudible] about a different state of the data than what was originally created by
the source, and the source is the only entity we trust.
So this is the three-party model that [inaudible]. So there have been a lot of
instantiations of this model with different cryptographic primitives. And every
time you want to have a solution for a specific data structure, you are interested
in specific complexity measures. So it's important to be secure and to be
efficient. Because you can do many inefficient things.
So in terms of time, we're talking about a dynamic data structure. Basically here
we have a data structure in general. We're interested in updating data structures,
so we're interested in the update time of the source and the update time of the server,
because when the data is updated here it has to be updated there, too; it has to
be consistent. The query time of the server is how much time it takes the
server to query the data, and the verification time is how much time it takes the
client to verify a specific answer.
And these are the time complexities that we're interested in for all of these three-party
models. In terms of space, this is the update information
that is sent from the source to the server when there is an update
issued on this data set, so that it can be replicated to the server's data set. There is
the proof size, the size of the proof that is sent from the server to the client,
and the client space, the space the client needs to hold locally in order to have --
>>: [inaudible] talking about general --
>> Charalampos Papamanthou: So this is a generic model, and you can view it like
you have any data structure involving the source and a server. And so you can do
the query as a data structure query, basically.
>>: So [inaudible] problem is [inaudible] and is a source [inaudible] I guess it's
not that. I was going to say the source [inaudible] but then you have to --
>> Charalampos Papamanthou: You don't contact the source. You just -- so
basically the source delegates the computation to others. So
there are of course very efficient solutions for that problem. And for the case of the
dictionary, you have an authenticated dictionary that's created as follows:
Basically you have the source that has X1 to XN, some data collection. This
collection is replicated by the server. So what it does -- how do we solve
this problem?
So the query here is: does X belong to the specific set that has been
outsourced, okay. So we have a dictionary data structure. So how do we solve
that? We build a Merkle tree using a collision-resistant hash function on top of the set.
Okay. We build that here. We build that there. So the proof for an element is a
logarithmic-size path of hash digests, so basically we get this proof, and the source sends
the signed digest of that Merkle tree. So the client has a proof from the untrusted
server and a signed digest of the set from the source. And he can use these two
things to verify that his query has been answered correctly, and that his X
indeed belongs to the set or not.
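A minimal sketch of that construction in Python, assuming a power-of-two number of leaves and SHA-256 as the collision-resistant hash; the source would sign the root digest, and the client runs verify against it.

```python
import hashlib

def H(data):
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    # leaves: list of byte strings, length a power of two for this sketch
    levels = [[H(x) for x in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([H(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels                       # levels[-1][0] is the digest the source signs

def prove(levels, index):
    # Collect the sibling digest at every level, bottom up.
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof

def verify(leaf, proof, signed_root):
    digest = H(leaf)
    for sibling, sibling_is_left in proof:
        digest = H(sibling + digest) if sibling_is_left else H(digest + sibling)
    return digest == signed_root
```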
All right. So this is a hash-based solution. So now if you use a good hash
function that has not been attacked yet, you're okay with security. But let's see
what we're doing in terms of efficiency. So you have N elements. What you get
is basically logarithmic query time and logarithmic update time, because
every time you change your collection you have to update the tree. The proof is
logarithmic size and the verification time is logarithmic.
So everything here is logarithmic. This is the [inaudible] that was introduced in
the past. And authenticated. It's a randomized solution of the same thing that
has [inaudible].
>>: [inaudible] change one data.
>> Charalampos Papamanthou: No, but if you do an update on a [inaudible] you
can just update the logarithmic-size path from the element to the root. So that's
why everything is logarithmic here. So basically, okay, so we have this solution
that works pretty well in practice. This work has been used. So let's see now
how using different cryptographic primitives can really influence the complexity
of the authenticated data structure solution we're introducing. Okay.
So let's see how -- what is the impact? So hashing so far has been used
extensively. And there have been some lower bounds about how good you can
get. [inaudible] proved that if you use a generic collision-resistant hash function like
SHA-2, where you don't get to have any access to how the algorithm really works,
where you don't have any algebraic structure of the algorithm, you cannot do
better than log N. So basically there is this lower bound. But this is with SHA-2
and this kind of function where the hash is basically a black box. So what
I am exploring is whether we can push these limits further down by
using algebraic primitives like accumulators, lattices and pairings.
And we always want security to be based on well-accepted assumptions. Okay.
So let's see what our contribution is. So basically with the hash-based solution we have
logarithmic cost for source update, server update, server query and size of proof.
We have a solution that achieves some constant-size costs but has an almost-linear
cost here. So basically most of the costs, such as the proof size, are constant,
but you have an almost-linear cost, N to the epsilon, say square root of N,
which is almost linear. And with lattices we get the first solution that
pushes one complexity measure to constant while maintaining the other
complexity measures logarithmic.
Okay. So I'm going to go through three instantiations for some of the data
structures, not in the same depth for all three of them. So the first solution
uses an accumulator to construct an authenticated data structure that
has constant proof size but kind of expensive updates.
So basically what is the RSA accumulator? A crash course. You have a set of
primes, and you have an RSA modulus N and a base G that belongs to QR_N.
Then you can represent this set in a collision-resistant way by raising the base G
to the product of the set of primes. Okay? And you can have a witness for X
basically by omitting X from the exponent and giving away all the other things.
Okay? So this is a witness for X.
And the security of this is: look, you cannot really compute a witness for an X that
doesn't belong to the set unless you break the strong RSA
assumption. Okay? And this is how you verify it.
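A toy version of those three operations, using a tiny hard-coded modulus purely for illustration (a real instantiation uses a proper RSA modulus, a generator of QR_N, and a mapping of elements to primes):

```python
N = 3233   # toy modulus (61 * 53); illustration only, not secure
g = 5      # toy base standing in for a generator of QR_N

def accumulate(primes):
    acc = g
    for p in primes:
        acc = pow(acc, p, N)
    return acc                     # g raised to the product of all the primes

def witness(primes, x):
    return accumulate([p for p in primes if p != x])   # x left out of the exponent

def verify(acc, wit, x):
    return pow(wit, x, N) == acc

members = [3, 7, 11, 13]
acc = accumulate(members)
assert verify(acc, witness(members, 7), 7)
```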
So in this setting you can use the RSA accumulator to [inaudible] that
has constant proof size, because basically what you have here is a witness
that does not depend on the size of the set that is being accumulated, okay?
So this is like a straightforward application of the accumulator function. But what
we did is we took that and we put that accumulator on top of a certain
tree, which we call the accumulation tree. So if you have your data elements, X1
to XN, you build a tree that has constant depth, okay? It has
one over epsilon depth, which makes the internal degree of the nodes big;
it makes the degree of the nodes N to the epsilon. Okay?
So now what you do, you're applying the RSA accumulation function not only on
the elements but also recursively, at every level of the tree. It's something like
applying SHA-2 on a binary tree. Here we squeeze the tree down, we increase
the degree of the nodes, and we apply the RSA accumulation function. Okay?
So if we do that and we put that framework in the authenticated data structure
setting, basically what we do is the following. We have our data elements here, say
these elements are primes, and we compute the accumulation function for this
node. But the result is not a prime. So in order to feed that to the next level, you
have to turn these values into primes by using a special function. And then we
do that recursively and we end up with an RSA digest of the whole set. And this is
what this thing describes.
I'm not going to go into this whole solution, but if you apply the
solution in the three-party model that I showed before, you get this complexity.
So basically you have constant complexities for most of the complexity
measures, but you have an expensive kind of server update. Well, for epsilon
equal to one-half, you have square root of N log N; I mean, it's better than linear,
but it's still N to the epsilon, so it's close to linear. And you have
two instantiations of the solution, one with an expensive update
and one with an expensive query. So you can trade off queries and
updates, basically.
So here we went from logarithmic to constant, but we increased the update
time or the query time respectively by using the RSA accumulator. Okay.
So let's see now how you can apply lattices in the authenticated framework to get
some -- yes?
>>: [inaudible].
>> Charalampos Papamanthou: Yes.
>>: How do you do that? I mean so self -- I [inaudible].
>> Charalampos Papamanthou: So basically you have the source that you always
trust. Basically you have the freshest digest coming from the source.
>>: Okay. You --
>> Charalampos Papamanthou: Because you trust it. So basically you trust that
he's going to do the deletion correctly. So basically whatever you get from the
source, which is the party you trust, is the right thing, the digest. Okay. So --
>>: [inaudible].
>> Charalampos Papamanthou: You're right. Yes.
>>: [inaudible].
>> Charalampos Papamanthou: You're right. So basically the source
[inaudible] a signed digest to the server, and the server just forwards the
signed digest to you. Yes.
>>: [inaudible].
>> Charalampos Papamanthou: Yes. There's a timestamp in the signature
[inaudible] to make sure that you always get the [inaudible].
>>: [inaudible].
>> Charalampos Papamanthou: Yes.
>>: [inaudible] server just says no or does he have to provide the proof?
>> Charalampos Papamanthou: No. He can prove both: that the element is
in the set and that the element is not in the set.
So basically proofs of membership and non-membership.
>>: [inaudible].
>> Charalampos Papamanthou: Yes. So non-membership, you can do it two
ways. There's an expensive way where you use crypto in a more specific way
-- in a more elaborate way, but it's [inaudible]. And you can also reduce
non-membership queries to membership queries by accumulating, instead of
individual elements, pairs of elements. So basically if you are proving the
existence of a pair of elements that are successive, you are proving the
non-existence of every element that falls into that interval. And this is how you
basically -- this is a way to do --
>>: This is expensive -- sorry. Isn't [inaudible].
>> Charalampos Papamanthou: Yes. So basically if I ask for an element and
the server says this element is not in my collection, okay, what he's going
to do to prove that to me is give a proof of membership of two
successive elements of the set. Okay?
So basically we reduce the proof of non-membership to a proof of membership.
And we know that everything that is accumulated is a pair of
successive elements of our collection. And we make sure
that this is maintained while we're doing updates. Because if you lose that
property, you lose that reduction from non-membership to
membership.
>>: There is no kind of privacy here [inaudible].
>> Charalampos Papamanthou: No.
>>: [inaudible].
>> Charalampos Papamanthou: We don't care about privacy at all here.
Correctness and verifiability are all we care about.
So let's see how we use lattices in order to get some better complexities here.
So what's the motivation for using lattices to solve this problem? As we
saw before, if we use [inaudible] you get logarithmic complexities. If you use a
straightforward implementation of the RSA accumulator or bilinear-map
accumulators, you get constant complexities with linear queries here. By using
the accumulation tree and these accumulator solutions we are able to get this N
to the epsilon here. This is the solution we just described. Before, there was
another solution by Goodrich and Tamassia where most of the measures were N to
the epsilon; only the proof size was constant.
Okay. So this is what we know so far. So with lattices what happens is the
following. I like to call it homomorphic integrity checking. So basically with
exponentiation functions, if you have a digest of a set and a digest of another set,
if you just play with the digests you can't really get anything meaningful, because
if you multiply them you get something that has the sum in the exponent, okay? So
it's not like a digest of the union of the sets or something like this.
Well, if you have pairings, by using this nice property the exponents get
multiplied after the application of the pairing, so what you get is this property
but only one time, because you can't really apply pairings a second time,
okay?
But the lattice-based hash functions are of the form of a multiplication of a matrix
with a vector, so basically if you add them, you really get something meaningful,
but you have to make sure while you're doing these things that the input doesn't
grow too much, because if the input grows too much, then you have a problem,
everything [inaudible].
So let's see how we can use this lattice-based hash function in a data structure
framework in order to get something better. So we're going to use a
generalization of a function that was proposed by Micciancio and others. So
suppose K is the security parameter and Q is a K-bit modulus. Okay. The
function is as follows. So basically we start with a random matrix M which has
entries in Z_Q. Okay. The dimension of the matrix is K times 2K squared. Okay.
Now, if we take a vector that has dimension 2K squared times 1 and small
entries, okay, and we multiply it with the matrix, we end up in Z_Q. Right? So we
start in Z_Q, we end in Z_Q. So this is basically an application of a hash function.
Okay?
So what's the property of this function? If the entries are small, you can't really
find two vectors of small entries that have the same output here, unless you
approximately solve a very difficult lattice problem to within a polynomial factor, which
is supposed to be very difficult. Okay. So we're using the worst-case
instance of GAPSVP gamma, where gamma is a polynomial factor. And you
can't really solve that. You can't approximate GAPSVP to within a polynomial
factor. At least not with what we know so far.
Okay. So basically we have this hash function here that takes this vector and
outputs something. Okay. So let's see how we can extend this hash function to
operate on two inputs. Okay? So we have this matrix. The first input is going to be
a 2K squared times one vector, but the actual data are going to be the last entries
of the vector, okay? And the first part is going to be just zeros, okay?
And the second input of my hash function is going to be the same, a 2K squared
times one vector, but we have the data in the first part of the vector and the other
entries are just zero. So now we have extended the
hash function to two inputs, in the sense that you can't really find two different
pairs of vectors of this format that create a collision, right? It's just simple
[inaudible]. But it is important to have the entries bounded by delta, where delta is
polynomial, okay?
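A sketch of that two-input function with toy dimensions, assuming numpy and a small modulus; the inputs are assumed to already be small-entry vectors (for instance radix-2 digit vectors, as described below), padded into disjoint halves of one long vector.

```python
import numpy as np

k, q = 4, 257                            # toy security parameter and modulus
width = 2 * k * k                        # vectors of dimension 2k^2
rng = np.random.default_rng(0)
M = rng.integers(0, q, size=(k, width))  # public random matrix over Z_q

def hash_two(left, right):
    # left and right must be short vectors with small entries; they occupy
    # disjoint halves of x, so a collision would give a short vector that M
    # maps to zero mod q.
    x = np.zeros(width, dtype=np.int64)
    x[: len(left)] = np.asarray(left)
    x[width // 2 : width // 2 + len(right)] = np.asarray(right)
    return (M @ x) % q                   # the output lands back in Z_q^k
```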
So let's see what we're doing here. We have the Merkle tree, okay? So we
know how to use SHA-2 on top of a Merkle tree and to do things there. Okay.
So now we're using this lattice-based hash function, but it has the same interface.
It seems that it's going to do exactly the same thing.
Well, not really. It has a nice property. And we're going to use this
function to show that we can do something more efficiently than with
SHA-2, because we know the special structure of the function and we exploit it in
a certain way. Okay? So we're going to describe here the hierarchical
computation. We have our data at the leaves, which are vectors of small entries.
Okay. This is our actual data. Okay. And the digest of a
node is going to be the application of the hash function to the two inputs. Okay?
>>: [inaudible] a lot of zeros, right?
>> Charalampos Papamanthou: Yes. Yes. They have already been prepared in
the proper format I showed before. So basically DT is the application of the hash
function to its two inputs, and so on and so forth. So DS is the hash function
applied to its two inputs. So the output of the hash function lands in Z_Q. So
basically we end up having vectors in Z_Q. But this is not what we really want. In
order to apply the computation recursively, you need to make sure that the inputs
have small entries, okay?
So how can you do that? Well, you have this problem. So say DT and DS are
31 and 27. These are like big things in Z_Q. Well, I want to make them small,
somehow have a small representation of them so we can input them at the next
level. So what we're going to do, we're going to define Z of something to be the
radix-2 representation of this. So basically it's another representation of the
same thing, but it has more entries. Radix 2 is basically a binary representation, but
besides 0 and 1 I'm also using bigger numbers, like
two, three, and other things. So basically I represent this number in Z_Q with this
representation, with a vector of small entries.
>>: [inaudible].
>> Charalampos Papamanthou: Yes. Yes. I call it radix 2 because it has
two to the 0 times something, plus two to the 1 times something, so the
base of our number system is two. That's why I call it radix 2.
>>: [inaudible].
>> Charalampos Papamanthou: No, no, no. It's base two, but the multipliers,
the coefficients, can be other than 0 and 1, okay? It's base two though. It's
base two.
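The flavor of that representation, sketched with plain binary digits (in the scheme the digits are allowed to grow somewhat beyond 0 and 1 so that representations can be added):

```python
def radix2(value, num_digits):
    # Plain bits here; the construction tolerates digits a bit larger than 1.
    return [(value >> i) & 1 for i in range(num_digits)]

def from_radix2(digits):
    return sum(d << i for i, d in enumerate(digits))

assert from_radix2(radix2(27, 8)) == 27
```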
So basically I do this transformation, and I also do the same thing for 27. But I
have to make sure that I prepare them for the next level. So I'm going to pad them
with zeros, basically by multiplying with some matrix there, and I'm going to pad this
with zeros.
So basically I now have this representation of a digest, a representation that
is ready to be input to the next level, okay? So I do this, and so the digest
of the root node, the actual collision-resistant representation of the set we're
looking at, is the application of the hash function to these things, which if you
expand this thing, you get exactly this equation, because this is the case.
Okay.
So we have expressed the digest in terms of lattice-based things. Okay. So far
what we have done is really nothing. Basically, if we were
using SHA-2, it would be the same thing. I mean, you're not really winning
anything, except that you're changing the hash function to impress people, I
guess. But that is not really the case. I mean, there has to be something that
you're going to win. Okay. And this is efficiency, basically.
So it's really important to see that this G function that turns things into small
entries has this nice homomorphic property. So basically if you start breaking up
things by applying this function here, you're going to end up expressing the
digest of the whole set as an appropriate function of the data at the leaves. So
basically you're expressing the collision-resistant representation of the whole set
as an appropriate function of the data at the leaves. Okay? Which means that you
have this. So basically the digest of the whole set can be expressed with this nice
algebraic expression. And now it is really easy to see that you can update things
easily. So basically, when you change the value of an
element, you can just subtract the old contribution and add the new one. So you
don't have to do a logarithmic-size computation, you just do a constant-size
operation to update the whole thing. Which before was not possible.
Now, an important assumption here is that for each index of the table we assume
that we have a constant number of possible values. For example, for a boolean
table, every index of the table can take zero or one. And why is that important?
Because we don't want to blow up the space. So basically what we're going to
do is we're going to precompute these functions for all indexes and all values, but L
is constant, so asymptotically it is the same space. So if you have a state of the
table which is, say, 0, 1, and 2, your digest is basically this expression.
Now, if you want to update, you just remove this and you add this. Okay? And
so on and so forth. So updates really take constant time. It's a constant-time
operation, whereas updates before were taking logarithmic time. So
basically we're using lattices to go from logarithmic computation to constant
computation without really exploding the other complexity measures. So we're still
keeping the proof size logarithmic. So basically we have a Merkle tree that
keeps most of the complexity measures logarithmic, but we have an efficient
update, asymptotically of course.
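A sketch of that constant-time update, assuming the per-index, per-value contributions have already been precomputed as vectors over Z_q (the precomputation just described); the names and dimensions here are illustrative.

```python
import numpy as np

q, k = 257, 4                       # toy modulus and output dimension
rng = np.random.default_rng(1)
# T[i][v] is the (precomputed) contribution of value v sitting at index i.
T = rng.integers(0, q, size=(8, 3, k))

def digest(table):
    d = np.zeros(k, dtype=np.int64)
    for i, v in enumerate(table):
        d = (d + T[i][v]) % q
    return d

def update(d, i, old_value, new_value):
    # Constant work: subtract the old contribution, add the new one.
    return (d - T[i][old_value] + T[i][new_value]) % q

table = [0, 1, 2, 0, 1, 2, 0, 1]
d = digest(table)
d = update(d, 3, table[3], 2)
table[3] = 2
assert np.array_equal(d, digest(table))
```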
So this is what describes our result. And after that, it is an open problem to have a
scheme that really does the opposite. So basically now with this thing you still
have a logarithmic proof size. But how can you have logarithmic updates and
queries and a constant proof size? This is something very difficult and it has
not been achieved before. Triandopoulos and Damgard [phonetic] have been
working on that, and they seem to have something
in the works. But it's not clear yet that you can do it.
And okay. So finally I would like to say something about another problem. So
basically what I have talked about so far was about verifying outsourced data; now
I'm going to talk about verifying specific functionalities, computations. Okay? In an
efficient way, of course. So let's see. Inverted index. I'm sure everybody knows
the inverted index.
So an inverted index is a nice little [inaudible] that allows you to answer keyword
searches, for example. Okay? So if you have this -- say this is a collection of
your Web pages, these are your documents, and if you want to search for
documents that contain the keywords Seattle, Microsoft, and pairing, what you
need to do is to go through these lists and compute the intersection of the
lists. Okay.
So this works well when you don't have any security requirements,
but now I want to consider the authenticated case. So basically we want
to authenticate the intersection that this [inaudible] after we send the query to a
server. Okay. And it is important basically to be able to do this verification of the
intersection in a way that you don't have to access all the data. So basically
the verification of the intersection should be proportional to the size
of the intersection, and the communication, the size of the proof, should be
proportional to the size of the intersection. It's basically the same kind of
guarantee as the property that [inaudible] presented before: we want the time it
takes to verify and the communication to be proportional to the size of the output
of the function, and here that function is intersection. Okay?
So there are [inaudible] solutions to do that: you can just download the elements,
verify the validity of the elements, and run the intersection locally, but you don't
want to do that. Or you can precompute all the possible queries. But here you
have many of them; it's a big polynomial in M, where M is the number
of keywords. And then you sign the answers and you just give back
the answers. So the first is time inefficient and bandwidth inefficient, and the second
is space inefficient. So you don't want to do these things.
So what we did is we have a solution where we use an accumulator function for
all of the lists of the keywords. So basically this is a bilinear accumulator here,
and this is a [inaudible] representation of everything that belongs to a certain
keyword. Okay? And we build an accumulation tree, the same tree I showed in
the first part of my talk, on top of these keywords. Okay. So why does this help?
Okay. So let's see how we answer an intersection query.
So we have this authenticated data structure on the inverted index. And the
query is: I want documents that contain the keywords Seattle, Microsoft, and pairing,
okay? So which are these documents? If you go through the lists you're going to
see that documents 7 and 10 are contained in Seattle, Microsoft, and pairing. Okay?
So the answer is 7 and 10. Very small answer. It has nothing to do with the size
of the lists. Okay.
So let's see how you verify it in a way that is proportional to this size. Okay?
So the proof for that is going to be the following. So basically you go and take
what remains and you compute a witness for this query. So basically the witness
for each list is going to be what remains in the exponent, okay? Some
polynomial here in the exponent of G. Same here and same here. So
basically you send these witnesses. The size of each witness
is constant, okay? And how many of them? It's the number of sets in your
query; if you have T sets it's order of T. Okay. So you have G to the A(s), G to
the B(s), G to the C(s). So what you can do with these witnesses is prove that
the claimed intersection indeed belongs to each of the sets, okay? But this is not
enough. Because you also need to prove that there's not anything else in
common in these exponents. Okay? Because you want to prove completeness
too. This is how intersection works.
Okay. So how do you do that? You need to provide other polynomials
that have this property: basically, if you multiply them with the
witness polynomials and sum up, you get one. This means that the GCD of the
polynomials is one and therefore there's no common factor here. Okay. But
these polynomials might be big. So if you send these polynomials over you
might defeat the purpose, because basically they might be as big as the lists.
So what do you do? You send the exponentiations of these polynomials.
Basically you make them very small. And then you use the pairings to do the
verification. Because as the client you basically need to compute a
multiplication of two things and then add them all together. So basically you
compute a pairing with this, a pairing with this, a pairing with this, and then you
multiply the outputs of the three maps and use that to check this relation. And so
basically, with communication that is proportional to the size
of the intersection and verification that is proportional to the size of the
intersection, you are able to verify a very important query, which is the
intersection. And by using similar techniques we show how we are able to verify
union in the same -- in a [inaudible] way, and subset queries [inaudible], and also
time-stamped keyword searches. So basically the documents that are contained in
the keyword and also have another dimension, you know, that also were sent -- the
e-mails that were sent from this time period to that time period.
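To pin the two checks down in symbols, here is my own summary of their structure in the usual characteristic-polynomial notation (the actual scheme has additional details and randomization):

```latex
% Notation (introduced here): S_1,...,S_t are the queried sets, I the claimed
% intersection, and each set is represented by acc(S_j) = g^{P_j(s)} with
% P_j(s) = \prod_{x \in S_j} (x + s), where s is the secret of the scheme.

% Subset check: I is contained in every S_j, shown with witnesses
% W_j = g^{P_j(s)/P_I(s)}:
e\!\left(W_j,\; g^{P_I(s)}\right) \;=\; e\!\left(\mathrm{acc}(S_j),\; g\right)
\quad \text{for each } j.

% Completeness check: the quotient polynomials P_j/P_I share no common root,
% certified by polynomials q_j with \sum_j q_j \cdot (P_j/P_I) = 1, checked in
% the exponent via the pairing:
\prod_{j} e\!\left(g^{\,q_j(s)},\; W_j\right) \;=\; e(g, g).
```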
And yes, and this is my last slide. So basically we're able to do T plus delta,
where delta is the size of the intersection and T is the number of sets.
So now this solution as I described it is static, but we also show how to do it
dynamically, so basically how you can insert documents into keywords and remove
documents from keywords. Yes?
>>: So I have a question. So I'm probably missing something here, so you're
proving this polynomial there are [inaudible].
>> Charalampos Papamanthou: Yes.
>>: And by knowing this [inaudible] but the pure [inaudible] so you -- you don't
prove knowledge of them necessarily, right? So you are [inaudible] so how do
you prove the [inaudible]. Are you using some [inaudible].
>> Charalampos Papamanthou: To prove -- yes. The proof is based on the
[inaudible] assumption. So basically you don't send these polynomials --
>>: [inaudible].
>> Charalampos Papamanthou: So you send G to the something. First of all,
you can also [inaudible] because you don't know S; S is the [inaudible] of the
scheme, okay.
So basically you know only the coefficients of the polynomials, and you know G to
the S, G to the S squared. So by putting these together, you can compute the
exponentiation of this. Okay.
The security goes as follows. It says that we don't care what this is. The server is
going to come up with three values, okay? Alpha, beta, and gamma. And if the
verification equations go through, then you start working on them, [inaudible]
doing some bilinear map computations, and you are able to solve the [inaudible] in
the output group of the bilinear map. So basically this [inaudible].
>>: Yeah. I was just wondering about the fact that you don't really know this
polynomial --
>> Charalampos Papamanthou: No. No. Yes.
[brief talking over].
>> Charalampos Papamanthou: The security proof needs only three [inaudible]
basically and then, you know.
>>: [inaudible].
>> Charalampos Papamanthou: Yes. Thank you.
[applause].
>>: Questions?
>> Charalampos Papamanthou: Yes?
>>: [inaudible].
>> Charalampos Papamanthou: Yes.
>>: [inaudible].
>> Charalampos Papamanthou: Okay.
>>: So the only difference here is that you [inaudible] so in his [inaudible].
>> Charalampos Papamanthou: Yes.
>>: [inaudible]. Give something to the server and then there's [inaudible].
>> Charalampos Papamanthou: Yes.
>>: All right. But in your case, a client doesn't have [inaudible].
>> Charalampos Papamanthou: Exactly.
>>: [inaudible].
>> Charalampos Papamanthou: So it's public verifiability. One other important
thing is updates. Because -- so basically if you consider the [inaudible] problem, the
data that I have uploaded to the server goes into the description of the function
F. Because my query here is basically any subset of the indices of the sets;
I'm [inaudible] querying any possible intersection of
sets. Okay? So basically if I do an update and use a generic scheme, I will have
to regenerate everything, because I have to regenerate the key. The function is
changing, basically, completely. So basically I have to preprocess all my things
again.
So this is -- I think this is one difference; the public verifiability is one thing. And
the other thing is the fact that if you fail once you don't have to discontinue --
yes, if you fail once, you can just continue -- I mean, because we don't -- and of
course we don't have any [inaudible]; I mean, it's only correctness. It has the
same flavor as the [inaudible] solution, which basically has the property that you
need verification that takes no longer than reading the answer, and communication
that is only as big as the answer. So this is very important. And there are many
problems that you might consider, like the [inaudible] problem, how you can verify
a shortest path in time proportional to the size of the path, basically.
>>: [inaudible] when this work [inaudible].
>> Charalampos Papamanthou: Yes. So basically the first work, with the
accumulation tree, appeared in CCS, but the lattice and the intersection work is in
the works, basically. They are both in submission. So [inaudible].
>>: Okay. Let's thank the speaker.
>> Charalampos Papamanthou: Thank you.
[applause]