>>: All right. So let's start the next session. We've got two talks coming up. First is
Kevin Bowers. He's going to be talking about cloudy without a chance of data
loss -- almost misspoke on that -- implementation. So I guess we'll be seeing lots
of source code through this.
>> Kevin Bowers: And actually no source code, but sort of follow-on work to Ari's
talk this morning. He made everything look nice and rosy and all happy. And the
problem is when you get to implementing some of it, there's a few more
challenges than you might expect.
So considering where we are, I'm going to skip right past the motivation part and
jump right into proofs of retrievability. So sorry for the context switch. We
just did PDPs. Now we're jumping back to PORs. So try to bear with me.
So first a little setup here. This is sort of how we think about the world when we design
PORs and HAIL and even, in part, RAFT. You've got a user here, Alice.
She's got a collection of photos. And she's going to just line those up and make
one massive file out of them.
And for certain applications we might split this file into, you know, unique -- or
evenly sized pieces. So Ari talked about a proof of retrievability where I can now
take this file, I do some encoding on it and I store it in the cloud. But what we
failed to mention was how do you do that encoding piece, what does that look
like?
And so I'm going to very quickly walk through what that encoding looks like. So
this is an adversarial encoding. And the reason we have to worry about the
encoding here is we're going to take this file that we want to last forever, and
we're going to give it to the people we trust the least. So this file needs to be
resilient to anything they can try to do to it.
So the first thing we have to do is we actually have to permute the file, we have
to rearrange it. And the reason for this is because the files are so big, we can't
compute error correcting information over the entire thing in one pass. So we're
going to have to do striping. We're going to have to break it up into pieces and
do an error correcting code over that and then take the next piece.
But if I just did straight pieces, it's easy to identify where the piece boundaries are
and utilize that structure to break the file. So I actually have to permute the file
before I do my error correcting. Now I can do my error correcting over what is
essentially a randomly rearranged file.
I can now put the file back in the original order. And this is nice because now
when I go to download it, it's sitting there in the order that I expect. But the error
correcting information is computed over a random ordering.
And now to make it more complicated I have to actually rearrange and encrypt
that error correcting information. And this is again to hide the structure from the
people who are going to be storing this file so they can't utilize it.
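A minimal sketch of that encoding order, assuming a keyed permutation `prp` over block indices, a systematic error-correcting encoder `ecc_parity` that returns only parity blocks, and a symmetric cipher `encrypt` (all hypothetical names, not the actual implementation):

```python
def por_encode(blocks, keys, stripe_size):
    """Sketch of the adversarial encoding order: permute, stripe, hide parity."""
    n = len(blocks)

    # 1. View the file through a pseudorandom permutation so that stripe
    #    boundaries don't correspond to contiguous runs of the real file.
    permuted = [blocks[prp(keys["perm"], i, n)] for i in range(n)]

    # 2. Compute error-correcting parity over each stripe of the permuted view.
    parity = []
    for s in range(0, n, stripe_size):
        parity.extend(ecc_parity(permuted[s:s + stripe_size]))

    # 3. The data blocks themselves stay in (or return to) their original
    #    order, so downloads need no post-processing.

    # 4. Permute and encrypt the parity so its structure is hidden too.
    m = len(parity)
    hidden_parity = [encrypt(keys["enc"], parity[prp(keys["parity_perm"], j, m)])
                     for j in range(m)]

    # The precomputed, encrypted challenge responses are omitted here.
    return blocks + hidden_parity
```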
And then we did our precomputation and encryption of responses. So we take a
selection of blocks, compute an aggregation of those. That gets stored at the
server and those get encrypted as well. So that's -- question?
>>: The original papers -- you know, any error correcting code [inaudible] -- so
you're saying that it's incorrect, that we need to have this random permutation?
I'm sorry. I just missed the point why you need to rearrange --
>> Kevin Bowers: So we can use any error correcting code. The problem is
they're all essentially linear codes. And so there's structure that's easily pulled
out of those. And so if I just broke the file up into pieces and did my correction,
then I could easily map the piece here and its error correcting information out on
the end. And I could just knock out both of those and lose the whole file.
Essentially what we're worried about is the adversary, the person we're going to
store this file with, being able to take out very small pieces of the file and yet
render the whole thing unrecoverable.
>>: [inaudible].
>> Kevin Bowers: We can delete a large piece of the file, but that we can detect.
So that's the design of a POR: any large corruption is detectable, but tiny
corruptions are not.
>>: So [inaudible].
>> Kevin Bowers: That's basically impossible. There's no error correcting code
that can do that anywhere close to the [inaudible].
>>: [inaudible].
>> Kevin Bowers: So we've got to do it in small pieces. And now we've got to
hide where those pieces lie. Thank you.
>>: You want to do essentially [inaudible] and I assume that you can only do
error correction on small blocks, how to do an encoding of a large file?
>> Kevin Bowers: Exactly. Without letting that structure out in the open.
So the first challenge this raises is okay, so what is this permutation? How do I
rearrange a file? Well, it's a pseudorandom permutation over some medium-size
domain, for whatever that means. And specifically the domain is the number of
blocks in our file. So how big is our file?
That also implies that we need a variable length PRP here. The problem is the
smallest block ciphers that we know of that we could build this out of run over
64-bit values. So if we translate that to, you know, each byte is a block, we're
talking about a 16 exabyte file. So we're going to have to do something to shrink
this block cipher down to the domain that we're looking for.
There are a couple of techniques for doing this. The first is known as ordinal
sorting. So take your numbers, pass them through your PRP, get some set of
outputs, and then sort those outputs. So what happens here is I have taken my
numbers, I've generated their outputs, I sort the outputs into their rank order, and
then what happens here is: one maps to three through my PRP, but that becomes
the second element in my ordered set, so my mapping is from one to two. And
you can follow that through for the others.
So this last column here is the PRP that I could use over a smaller
domain. Another option is known as cycle walking. So now you pass your
numbers or your input through a PRP and you see whether or not they land back
in the domain. And if they don't, you do it again.
>>: [inaudible] something smaller than 64 bits?
>> Kevin Bowers: Yes. So 64 bits gives you like 16 exabytes. We're looking at
files that are giga to terabytes probably.
>>: So you're trying to find a PRP on a particular --
>> Kevin Bowers: So we're trying to find a PRP [inaudible].
>>: What is K here I guess is the question.
>> Kevin Bowers: Exactly. So K is something like two to the 30th, two to the -- maybe two to the 40th.
>>: Okay.
>> Kevin Bowers: And then, again, this changes depending on what file we're
encoding. Yeah. So it's got to be variable. And so here basically if your
mapping doesn't land in the domain you're looking for, you just map it again until you
get something that does.
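To make the two constructions concrete, here are hedged sketches of both, where `big_prp` stands in for a block cipher over a large (say 64-bit) domain; the name and signature are illustrative assumptions:

```python
def ordinal_sorting_perm(key, n):
    # Encrypt 0..n-1 under the big cipher, then replace each output by its
    # rank among the sorted outputs; the ranks form a permutation of 0..n-1.
    # Needs all n outputs in memory, which is why it stops scaling around 2^25.
    outputs = [big_prp(key, i) for i in range(n)]
    rank = {v: r for r, v in enumerate(sorted(outputs))}
    return [rank[v] for v in outputs]          # table mapping i -> pi(i)

def cycle_walk(key, x, n):
    # Re-encrypt until the value lands inside the target domain [0, n).
    # Only practical when n is not much smaller than the cipher's own domain.
    y = big_prp(key, x)
    while y >= n:
        y = big_prp(key, y)
    return y
```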
So ordinal sorting is feasible up to about two to the 25th. Beyond that memory
requirements just -- memory and space just get too expensive to do. And cycle
walking -- so this sort of builds up from zero. Cycle walking sort of works down
from the original size. And it's only feasible to N over 500. So you got your 64
bits. You can sort of think about what that would look like. Basically gets you
down to a 32 petabyte file.
And unfortunately there's a pretty big gap between the two. And most of the files
that we would be thinking about probably land in that gap.
So the solution turns out to be a Feistel construction, and this is very similar to
the work on format-preserving encryption. They use a very similar construction to do
format-preserving encryption. Excuse me.
So it will work for any domain that's a power of two. And then you can use the cycle
walking to get you down to the exact domain that you're looking for.
There is a minimal loss in security but in the way our PRP actually gets used
there's no attack vector for that weakness. So getting less than perfect security
is acceptable.
This is what a Feistel construction looks like, for those of you who are interested.
Basically you take -- you take your input, you split it into two, and you do a bunch
of recombinations and shifts. Explained through -- look, I did get you some
pseudo code.
And what we found is that we can get good enough security bounds using a six
round implementation. So six splits, recombinations, permutes. Each
section here is one round. So if you can do six of those, you get good enough
security.
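Since the slide's pseudocode isn't in the transcript, here is an illustrative sketch of a six-round balanced Feistel PRP with HMAC-SHA256 as the round function, combined with cycle walking to hit an exact domain; treat it as an example of the construction, not the production code:

```python
import hmac, hashlib

def feistel_prp(key, x, bits, rounds=6):
    # Balanced Feistel network over a 2**bits domain (bits kept even so the
    # two halves match). key is bytes; the round function is HMAC-SHA256.
    half = bits // 2
    mask = (1 << half) - 1
    left, right = x >> half, x & mask
    for rnd in range(rounds):
        digest = hmac.new(key, f"{rnd}:{right}".encode(), hashlib.sha256).digest()
        round_out = int.from_bytes(digest, "big") & mask
        left, right = right, left ^ round_out
    return (left << half) | right

def small_domain_prp(key, x, n):
    # Feistel over the next even power of two, then cycle-walk into [0, n).
    bits = max(2, (n - 1).bit_length())
    if bits % 2:
        bits += 1
    y = feistel_prp(key, x, bits)
    while y >= n:
        y = feistel_prp(key, y, bits)
    return y
```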
>>: [inaudible].
>> Kevin Bowers: So, yeah, so there is a reduction based on the number of
rounds to the indistinguishability, and you can get acceptable bounds for our application
with six. In fact, the format-preserving encryption work uses even fewer rounds
because they have a slightly weaker model. They actually redefine the security
primitives because they are not comparing to PRPs.
>>: [inaudible].
>> Kevin Bowers: Yeah. Work by Luby and follow-on work by
Rogaway, I believe. It's been a while.
>>: So the six round you're [inaudible].
>> Kevin Bowers: Thank you.
>>: And the [inaudible] and also [inaudible] and myself.
>> Kevin Bowers: Yes. Thank you.
So the second challenge with doing POR encoding is that we're using very large
files that aren't going to fit in memory. And the rotational drives that they're
sitting on are terrible at random access. So I can't go pull out pieces of the file
and do my encoding. So I'm going to have to do my encoding incrementally, I'm
going to have to read my data sequentially.
So if these are the pieces I need to compute one error correcting symbol, what I
do -- I can't do that because only a third of the file fits in memory. So I pull that
third into memory using sequential reading from the disk, and I take some
intermediate value. In this case, I can just take that one block.
As I read more of the file into memory, I pull in the blocks that I need from there
as well, and now I've got some intermediate value. This is some running
computation that's going to let me eventually compute the error correcting symbol
that I need.
And lastly, I pull in the last one and complete the computation, and then this can
be appended. And I've actually got to do that for every single error correcting
symbol all in one single pass through the file. Because otherwise the efficiency
-- or the performance -- suffers so badly that it's not worth doing any of this.
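A minimal sketch of that one-pass computation, where `symbols_needing(i)` says which parity symbols block i contributes to (determined by the permutation) and `ecc_update` folds a block into a running value; both are assumed helpers:

```python
def one_pass_parity(read_chunks, num_parity_symbols):
    # One running intermediate value per parity symbol; these accumulators
    # are the "temporary information" that eventually outgrows RAM for
    # files beyond roughly 12 gigabytes.
    acc = [None] * num_parity_symbols
    block_index = 0
    for chunk in read_chunks():            # strictly sequential reads, no seeking
        for block in chunk:
            for s in symbols_needing(block_index):
                acc[s] = ecc_update(acc[s], block, block_index)
            block_index += 1
    return acc                             # finished parity, appended to the file
```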
So, in fact, you get to the point where you run out of temporary storage space on
the server or wherever you're doing your encoding at about 12 gigabytes. So if
you have a 12 gigabyte file you've got to keep so much temporary information to
be able to compute those that you're still running out of memory on some
standard desktop machine with maybe two gigs of RAM.
So that becomes a limit. And then you've got to think about doing things like
splitting up your file into 12 gigabyte chunks or whatever. So there's certainly a
challenge that we faced in implementing a proof of retrievability.
Moving on to HAIL. Quick little reminder of what this looks like. You've got your
file. You've got your encoding. You split that across your three providers and
then you compute some extra information for your additional providers.
And I already talked about this this morning in a bit less detail. The concern here
again is that the order you do this stuff matters, matters drastically. Because
again, we've got to read these off of disk. So the first three blocks actually all
show up on the same disk in sequential order. So it's not too bad to compute
what's going to go on providers 4 and 5. And, in fact, I'm going to do that all the
way down.
And then I'm actually going to write that out to disk for all of them and read in
each file separately to compute these. If I were to do it in the opposite order, I
would incur all sorts of random access to disk and basically it becomes
impossible to do in any sort of realistic time. This as it is takes a couple of
minutes for a several gigabyte file. And to do it in the other order would be on
the order of days.
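A rough sketch of that encoding order, assuming `dispersal_parity(row)` produces the extra providers' blocks for one row and `append_server_parity(path)` computes the within-file code for one provider; both names are illustrative:

```python
def hail_encode(read_rows, provider_paths):
    # Pass 1: read the original file sequentially. Each row holds the blocks
    # destined for the first providers, so the extra providers' blocks can be
    # computed as we go and everything written out in order.
    files = [open(p, "wb") for p in provider_paths]
    for row in read_rows():
        full_row = row + dispersal_parity(row)
        for f, block in zip(files, full_row):
            f.write(block)
    for f in files:
        f.close()

    # Pass 2: read each provider's file back sequentially, one at a time, to
    # append its own error-correcting code. Doing this per row in pass 1
    # would mean random access and days of work instead of minutes.
    for p in provider_paths:
        append_server_parity(p)
```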
So now let's jump lastly to RAFT. And this is sort of our most recent work. And
what you'll notice is that a lot of the issues we've come across are all based on this
very limited device. We're using rotational drives. That's the standard today.
Solid state is becoming more and more popular, but large data centers use
rotational drives. They have a seek delay when reading or writing that's on the
order of about four milliseconds. They have limited throughput, though it's not
bad. And they're good at sequential access. So if things are read in order, they
do a good job of that. They are absolutely terrible at random access.
And all of these limitations are sort of covered up by several layers of caching.
So there's caching that happens on the drive itself, there's caching that happens
in the operating system. And these actually lead to a challenge in RAFT. So for
example, if you were to pull 50 random blocks of varying sizes from a rotational
drive, you would see times to do that that basically follow these curves.
And what's happening here is there's some caching effects that basically make
everything up to 64 kilobytes look exactly the same. So the caching on the drive
and in the operating system and anywhere else that caching is happening sort of
limit the design choices that we can make. So whether we choose a 512 byte
block or 64 kilobyte block isn't going to matter. And that affects how you can do
the testing for RAFT.
So let's jump into the testing for RAFT. Unlike Ari, I'm not going to give you a
little happy fraternity and pizza example. This is actually going to be how RAFT
actually works. So for those of you who didn't get the analogy this morning, here
are the details.
So again we do some encoding on the file. And we upload this to our cloud
provider. The cloud provider is then going to store that file across a number of
drives. And the idea is each of these drives are independent devices. I can
address them independently, I can get data back from them independently. So I
can specify one block on each and read those back in one unit time, whatever
that unit time is.
And, in fact, I can do this as many times as I want. And I can do it
non-interactively. So if I take the output that I got back, if I take those blocks, I
hash them all together, and use that hash to create some indices for my next
block, I can do this in some sort of lock step protocol. So each round I'm going to
pull one block from each disk, compute the next block I need to go get, pull those
blocks, again in one time step, and sort of continue this for as many time steps
as I need.
And I might have extra time steps here because I need to account for network
variability or other sources of variability. We'll get into those in just a minute.
The final one, the final set of blocks can then be returned to our user. Alice can
check that those are correct. And the other thing she checks, obviously, is the
time. Did they come back quickly enough? And this time's important because
remember each read takes one unit of time. If we have fewer drives
than we're promised -- so now we're only using three drives, whereas we were
promised four -- and we ask for four blocks, there's got to be a collision on at least
one of the drives. So one of the drives is going to have to do a double read.
And so we can do the first three reads all in one access, one time step, but that
second drive, the drive where the collision happened, is going to take an
additional time step. And again, we can magnify this by running this protocol in
lock step over a number of rounds.
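A hedged sketch of that lock-step challenge from the client's side; `fetch_blocks` and `verify_blocks` are assumed RPC/verification helpers, and the index derivation is illustrative:

```python
import hashlib, time

def derive_indices(seed, num_drives, blocks_per_drive):
    # Expand a seed (bytes) into one block index per claimed drive.
    out = []
    for d in range(num_drives):
        h = hashlib.sha256(seed + d.to_bytes(4, "big")).digest()
        out.append(int.from_bytes(h, "big") % blocks_per_drive)
    return out

def raft_challenge(num_drives, blocks_per_drive, rounds, seed, time_budget):
    indices = derive_indices(seed, num_drives, blocks_per_drive)
    start = time.monotonic()
    for _ in range(rounds):
        queried = indices
        blocks = fetch_blocks(queried)     # one block per claimed drive per step
        digest = hashlib.sha256(b"".join(blocks)).digest()
        indices = derive_indices(digest, num_drives, blocks_per_drive)
    elapsed = time.monotonic() - start

    # Alice checks both content and timing: a provider using fewer drives than
    # promised must double up reads somewhere in every round, and over enough
    # rounds that shows up as a response arriving past the time budget.
    return verify_blocks(queried, blocks) and elapsed <= time_budget
```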
So obviously now it's taken twice as long to get these back. And even if there's
significant variability in the network, Alice can get these back, they'll be correct,
but the time will be unacceptable. It will have taken too long even regardless of
network variability. Because we can have this extended protocol that goes
through numerous rounds.
So let's talk about sources of variability, because those are going to play a big
role here. One source of variability is the read time from disk. So I said disk
seek time was around four milliseconds. Well, so then you add
in the time to actually read the data from the disk, and you get a distribution that
looks something like this.
If the disk head happens to be sitting at the right spot, you can do it very, very
quickly. If, you know, your average case you're somewhere around six and a
half, seven milliseconds, maybe you have to wait for a whole disk rotation after
you've moved your disk head, in which case you end up here. And this graph
actually continues to go out because there can be misreads and rereads and
data that got remapped on the drive to other locations in the drive. So there are
a whole host of factors that make this not a perfect picture.
>>: [inaudible].
>> Kevin Bowers: No. This is with caching. So this is -- yeah, this is exactly the
system you would expect to be running on and so testing as close as we can to
real world.
Well, so why is that a problem? Here we are again back in our three drive model
where the server has promised us that our data is being striped across four
drives. Well, with caching it's possible that they all read at different rates. And,
in fact, I can do a double read before any of -- before one of my single reads
finishes. And that becomes problematic because now the server can cheat:
maybe he's got two desktop class drives and one enterprise class drive, and the
odds of him being able to do that double read -- and do it without being detected --
increase significantly.
So we've done some tests over drives and then sort of simulated that out to a full
scale system. And what we found is that you have to increase the number of
steps obviously to detect the cheating. And we sort of get to the point at 17
drives -- or 17 steps, rather -- where the adversary's chance of fooling you becomes
essentially negligible. So what's happening here is what I've graphed is the
difference between the best adversarial response time over 500 runs and the
worst expected case for a correct server over 500 runs.
So there's a distribution for correct and a distribution for incorrect, and I'm
graphing their two closest points, the min and the max.
And what you see is that at the point where you get up to 50 steps in your protocol,
there's an 87 millisecond separation. If we look at that point in a little more detail,
this is the actual distribution you get over 500 runs: your honest server using
five drives will give you this distribution and your cheating server will give you
that. So at 50 steps, you have an 87 millisecond gap that you can play with.
Which is good because there's other sources of variability, like network latency.
So these are round trip pings from Boston to Santa Clara. Most of them fall in a
very, very small window. But there are others that you have to account for. And
there are a couple of ways you can think about doing that.
If you end up in an interval with very high ping times,
you can just rerun the test at a later point. Or you can test a
nearby server: say you know that your data is being stored in Seattle by Amazon,
so you ping Microsoft to get some sense of what that cross-country
latency is going to be and use that. Or you can just design enough steps into
your protocol that you can overcome anything that you would happen to see.
And obviously this is subject to how far the data is travelling. So Boston to
Santa Clara is basically a cross-country trip. If you do, you know, something within the
northeast, you get a very similar graph, but everything's scaled down. And
if you do Boston to China, you get again very similar data, but expanded. So
you're looking at 700 or 800 milliseconds round trip Boston to China, and in the
northeast you're under 100 almost always. So that's all something you have to
design in to your protocol.
So jumping to the summary. PORs, as I already talked about this morning, let us
efficiently test whether our data is being actually stored. To do that you've got to
have a PRP that works on any size domain. And building those for the size of
files we looked at was nontrivial. Also, that encoding must be done sequentially.
The very first version tried to do random access from disk, and that didn't work at
all.
HAIL then lets us build our own fault tolerance into the system so we can take
a bunch of cheap storage providers and cobble them together into something
that gives us fault tolerance. But again, you've got to pay attention to how you do
the encoding if you're going to do that. This is being done on a client machine
that has limited computational power, limited RAM and so the order in which you
do that encoding can actually make an enormous difference.
And then finally RAFT is really designed to allow us to test the data as being
stored with fault tolerance. If your cloud provider is promising you fault tolerance,
how do you actually test that? And by fault tolerance in this case we actually
mean resiliency to drive failures.
However, drive caching -- hard drive caching -- restricts our design space. The
seek times to read data off of a drive and the variability you see in those enforce
the necessity of a lock step protocol that can run multiple rounds. And then you
also have to account for the network variability between you and whoever you're
testing.
And then publications we've put out on this stuff. Questions?
>>: Can you do it as far as [inaudible] the fault tolerance is [inaudible] and you
want to code it [inaudible].
>> Kevin Bowers: So in a POR we're essentially doing a single pass read over
the file, so, yeah, if you can keep up with the encoding then yeah, you could do it
on a stream. HAIL actually uses a two pass encoding so you would have to store
some temporary information to be able to do that. And the RAFT -- actually I
hadn't thought about whether you could do it in a streaming case with RAFT. Probably. It's --
the encoding is very similar to a POR. It's more in the actual design of the test
that you have to account for variability. So that you could probably do in a
streaming model as well.
>>: So how [inaudible] how hard would it be for the provider to actually add
some noise themselves to the signal, basically? [inaudible] times and various
locations. I mean, can you account for that or are you basically [inaudible].
>> Kevin Bowers: So that -- that is very much inside the threat model: that
the adversary will add noise, will delay responses in order to gain an
advantage. Basically, I assume you're talking mostly about RAFT here?
>>: Yes.
>> Kevin Bowers: Yeah. So the primary assumption there is that we know what
class of drives the storage provider is using. So there's a big difference between
desktop class drives, enterprise class drives and solid state drives. In terms of
their response characteristics. But within a class, the response times are close
enough that you can do the tests, assuming that you are actually getting the
hardware that you are expecting. And you sort of have to go back to an
economical motivation to assume that.
So if they're going to promise you desktop class drives, you're going to pay
desktop class rates, and they, thus, can't afford to actually have enterprise class
drives. And same going from enterprise to solid state.
So if you assume, based on how much you're paying, that you know what sort of
drives your data are being stored on, then you can do this regardless of whether
or not they're adding noise because we have a good characterization for how
those drives operate. Yeah?
>>: So you said you permute the data with a PRP and --
>> Kevin Bowers: Uh-huh.
>>: Okay. I don't see why it needs pseudorandomness at all
for the permutation. Because you encrypt this -- you permute it, you
[inaudible] -- there's nothing about the permutation that leaks. So I don't
see why [inaudible].
>> Kevin Bowers: So the reason you have to permute before you do --
>>: I just don't understand why you need computational hardness in [inaudible].
>> Kevin Bowers: Right. So exactly. So that's why -- that's why we can go
down to a Feistel construction and still achieve the bounds that we're looking for.
So you're right. We don't -- we don't need strict, you know PRP guarantees. But
we need to be close enough that we can -- we can claim, with overwhelming
probability, the best an adversary can do ends up being random corruptions.
And so ->>: [inaudible] PRP but the bounds [inaudible] I don't see [inaudible] primitive at
that point. Are you permuting? [inaudible].
>>: [inaudible] combinatorial [inaudible].
>>: [inaudible] because you permute back, right?
>> Kevin Bowers: Yeah. Actually --
>>: [inaudible].
>> Kevin Bowers: Yeah. It's been a couple years. I don't remember the
motivation or the proof we have behind it, but certainly something we should talk
about offline. That's an interesting thought. Because, yeah, if we could avoid
that cost, it makes all sorts of things easier.
>>: So [inaudible] can you modify RAFT or use the [inaudible] RAFT like
counting of hard drives that the cloud is servicing somehow? Or alternatively like
fingerprinting hard drives and having detailed measurements on particular hard
drives like timing?
>> Kevin Bowers: So fingerprinting would be harder. I'm not sure it's impossible.
It's not something we've looked at. In terms of --
>>: [inaudible].
>> Kevin Bowers: Yes. In terms of counting, that should be possible. You
should be able to design the protocol in such a way that the counting becomes
feasible. It's not something we've done but certainly an interesting thought.
>>: Do you have a sense if providers are concerned about these types of things
like people being able to count their resources?
>> Kevin Bowers: So the people -- this is all my sense on this, but my
assumption would be that if you're going to advertise multiple copies then you
would be less concerned about someone actually trying to count them. And, you
know, if that's something you don't advertise, then it does become an issue. It --
you know, especially if we get back to the deduplication question. If I've got a
file that's been deduped down to one hard drive, I'm not going to advertise that to
the world. So if I have somebody trying to count the number of drives I'm using,
that could raise some flags.
>>: Let's thank the speaker again.
[applause].
>>: Okay. So the last talk of the day, we've got Tom Roeder talking about
dynamic symmetric searchable encryption.
>> Tom Roeder: Thanks. So I work in the eXtreme Computing Group here. And
I like to say that our job is to take the bleeding edge and make it cutting edge.
So that's the goal here on searchable symmetric encryption. There have been a
lot of searchable symmetric encryption protocols out there, including Seny's and
others, and we came to this with the idea that we try and make basically the
product teams at Microsoft use good crypto and do interesting things with them.
So we're trying to do things that people are going to use and you're going to be
able to use.
Before I start, I should thank my collaborators. The theory and algorithms here
were jointly done with Seny. And in terms of practical design and implementation,
because we've implemented this, it's Mira Belenkiy, Karen Easterbrook, Larry
Joy, Kevin Kane, Brian LaMacchia, and Jason Mackay. As usual with systems-like
projects, we had lots of people here.
So you've all seen this story about cloud backup, and I don't really need to talk
about this. And you've even seen from Seny the motivation that I really wanted
to give you, which is searchable symmetric encryption. But I'll say it again. So
the basic idea in searchable symmetric encryption is you're storing your files
in the cloud and you want to be able to search over them -- some terms from
those files, not necessarily the words in a document but maybe the
artist if this is an MP3 file, the composer, something that you want to search on in
these documents.
And the real problem that's out there is that in a real file system or in a real e-mail
system, this -- these documents change all the time. I'm modifying my files and
the thing that I've uploaded is no longer correct, so I need some way to update
the index that I uploaded to the cloud. And the implementations when we started
looking at this don't support that. There's no -- some of them there's just no
update supported at all, and in others there are efficient update operations but
they're not efficient in terms of their storage. So for instance you have to store
state whose size is linear in the total number of documents times the
total number of unique words across all documents.
So we're going to talk about two different schemes with efficient update and
efficient both in storage and in terms of computation on both ends. And there's
two ways we can do this. One is we can update one word at a time. So if you
say, remove all instances of the word "the" from a document, then you need to
remove that from the index. And we can do just that, that one word.
Another version which may seem naively not useful at all compared to this first
version is: suppose I modify a document, then maybe I'm just going to change --
delete the old document and upload all the terms for the new document. And I'll
explain in a minute why that might be a valuable way to see this problem.
So the basic overview of what we're going to do is look at these protocols. And
that's probably going to take most of our time. And if we end up having time, I'll
give you a brief outline of the security proofs and then talk about the
implementation that we have and some performance numbers.
So as a more explicit reminder of the encrypted search problem and how I'm
going to treat it, the user has some collection of documents where each
document D here is what I'm going to call a document identifier. And this isn't
just the name of the document. You might want to keep more information than
that. I mean, you might want to keep a little say 140 character twitter summary of
the document's content, maybe some unique identifier, a timestamp.
The point is this is going to be a fixed size representation of the document so that
when you encrypt it, it doesn't leak any information about the document at all.
And then the goal, in what we're going to produce -- and these are all for the
document-based one, which I'm going to focus on and describe in the
protocols -- is that search on some word will return to the user a set of encrypted
document identifiers. The user can then decrypt them, read their summaries and
decide which one they want.
This is slightly different than what Seny presented where you get the documents
back. This is more like Bing search, right? I search something, I get back a list
of links and then I click and actually go down the things that I'm looking for.
Then there's add -- you want to add a document with a word set -- delete a document. And then
extend index. This is something that will become obvious why this is important in
a second, but our indexes are padded to a particular fixed size, to hide some of
this information about the number of documents, number of unique words. And
so you need some way to extend that once you actually run out of space.
So the client's going to generate some of this information, send some tokens
from these algorithms over and then get some results back. This is the basic
paradigm.
Now, we're basing this on one of the schemes from the 2006 CCS paper,
Curtmola, Kamara and Ostrovsky. And the main idea in our slightly modified
version of the scheme is that each word gets mapped to a token. I should say --
I'm going to say pseudorandom function sometimes here, and I don't really mean
it; being a practical cryptographer, I always depend on the kindness of random
oracles. And so this is going to be some HMAC-SHA256, really, that I'm going to
be dealing with. And tokens then map to some initial position in an encrypted
array. And each array -- each index -- each list entry here points to another position.
So one assumption that we're going to definitely depend on in all these protocols
is a stronger assumption than is normal, which is we're going to assume an
honest but curious server. So we're going to assume the server is going to follow
the protocol, and we're going to assume that they're just trying to learn as much
information as possible. Maybe there's some economic argument you can make
about why they don't want to be caught. And maybe you can apply some of
these protocols we heard about today in parallel to solve that other problem.
So here's a modified version of the Curtmola, et al, scheme. There's an index
and there are list entries. And the index then is -- has a key that's the output of
this function F which is really HMAC-SHA256 on some key on the word. And it
points to then a start position of a list of documents that contain that word. And
each list entry at a high level is the encryption of the next pointer for the next
element in the list under some key that depends on that word.
And then it also has a second component which is the encryption of the
document identifier under a key known only to the user who generated this. So
how do you do search in this context? Well, the user has a word and they have
these keys, KC, KB, KG. They can then -- with some key derivation function
under key KG -- derive a key KW for that word, construct a token (F_KC(w), F_KB(w),
KW), and send this over to the server, who uses F_KC(w) to look up in the
index the encryption of the -- or really a masked version of the -- start position,
which it can then unmask to point to the first element in the list, and use the
decryption function to walk the list.
And it gets back these encrypted document identifiers. This doesn't actually reveal
anything about the documents; it simply reveals a count of the number of
documents. These are sent back to the user, who can then decrypt them. So this is really
a modified version of that scheme.
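A sketch of the server-side search, with `index` mapping F_KC(w) to a masked start position and `entries[pos]` holding (r, masked next pointer, encrypted document identifier) as laid out above; the helpers and the NULL sentinel are illustrative assumptions:

```python
import hmac, hashlib

NULL = b"\xff" * 8          # assumed end-of-list sentinel

def F(key, msg):
    # The PRF in the talk: HMAC-SHA256 over bytes.
    return hmac.new(key, msg, hashlib.sha256).digest()

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def server_search(index, entries, token):
    lookup_key, start_mask, KW = token                 # (F(KC,w), F(KB,w), KW)
    if lookup_key not in index:
        return []
    pos = xor(index[lookup_key], start_mask)           # unmask the start position
    results = []
    while pos != NULL:
        r, masked_next, enc_doc_id = entries[int.from_bytes(pos, "big")]
        results.append(enc_doc_id)                     # still encrypted for the user
        pos = xor(masked_next, F(KW, r)[:8])           # unmask the next pointer
    return results
```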
Now, the question is how do we do update? Well, the problem is we need to
change these list over time. You've changed a document, so we need to change
the list of -- we need to add that document say if we've added a word to that
document, we need to add it to one of the lists. But all the pointers in our lists
are encrypted. And we don't really want to go giving away KW, the key to that
list, every time we want to change it, otherwise we would be revealing every
document, or at least the count of documents for that word.
So instead, we're going to depend on this honest but curious assumption and
choose a particular implementation for our encryption scheme, which is this
simple mask of a stream cipher effectively. So you have a random value R and
then the pointers to the previous and the next values, masked by F sub-KW
of R. Right? So suppose you want to remove X from this list of pointers.
>>: [inaudible].
>> Tom Roeder: That's meant to just be grouping. There's no inner product,
inner product, nothing. This is just brackets.
>>: [inaudible].
>> Tom Roeder: Yeah. So suppose a server can get P and N. And we'll say it
also gets X, so it knows which one to delete. It's really easy to change this. All right. It
can take X XOR N, and X XOR P, and XOR them against those pieces; since
XOR is commutative, it changes the pointers under the encryption and now you've pulled
out X. So it's a simple trick.
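In code, the patch is just an XOR of a delta into the masked field; the four-field entry layout here is illustrative:

```python
def patch_next_pointer(entry, old_ptr, new_ptr):
    # entry = (r, masked_prev, masked_next, enc_doc_id), with the pointers
    # masked by F(KW, r). The server is handed old_ptr XOR new_ptr (e.g. X XOR N),
    # never KW itself, so XORing the delta in swaps X for N under the mask
    # without revealing anything else about the list.
    r, masked_prev, masked_next, enc_doc_id = entry
    delta = bytes(a ^ b for a, b in zip(old_ptr, new_ptr))
    patched = bytes(a ^ b for a, b in zip(masked_next, delta))
    return (r, masked_prev, patched, enc_doc_id)
```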
How the server gets its hands on P and N is the real question.
So to do that, we're going to add a new data structure. And it's basically going to
be a mirror image of the data structure we already had. So our old data
structure mapped words to the documents that contain those words. The new
data structure is going to map documents to the list of words in that document.
And it's going to contain for each word the pointer to the position of that word in
its list and the pointers to the previous and next entries in the list and a little bit of
bookkeeping information that you can see in this overly complex specification.
So let's look at it in a picture. So here's the original list that we saw. Here's our
new data structure. This orange bar is the deletion list entry for this entry in the
list, which we'll see in a second. And so, yeah, it points to X and it
points to the previous value and the next value. It also, of course, is a list, so it
points to the other words in this document.
And finally, notice that this deletion entry contains pointers to previous and next
for each word entry. So we have to patch the deletion list entries when we patch
the entries in the other list too. So additionally this deletion list entry also
contains the deletion -- a pointer to the deletion list entry for N and a pointer to
the deletion list entry for P. So we can patch those at the same time. But all the
patching is done according to the simple XOR trick. Okay?
So there's still something missing. Which is we need some way to be able to
keep track of the empty space in this padded index. In the original scheme you
generate these lists and then you just pad the list with randomness. And the
problem is if you need to add something to the list, you need to know which
space is unused so you can get that space and add it. So the simple and
perhaps obvious solution is to encrypt the list of entries, of free entries, encrypt
the freelist in this index.
But we can't just use the simple encryption that we used before, right?
Remember that for words we encrypted them with a key that depended on that
word. So if we did that for the freelist, and we just had a single key for the entire
freelist, then we would give that to the server when it needed some information,
and it could decrypt the entire freelist and learn all that information that we were
trying to hide, as soon as we added a single document, which would defeat the
whole purpose. So instead we have a unique token that maps to the freelist.
This is uniquified so it's not really exactly that, and so you can still search on the
real word freelist.
It's some list, and each of the main entries also points to a unique deletion list
entry. Note that there's a one-to-one mapping between used main entries and
used deletion list entries. Because each deletion list entry is set up for one main
list entry. So you can also set up a one-to-one map for the free entries, and you
can just keep track of them with a single list.
So instead of keeping a single key, we just use some more masks so that this
entry contains a pointer to the next entry. It also contains a pointer to the current
deletion entry, and it's masked by the output of our HMAC on a new key with its
position in the freelist.
So that means that the user can dole out just enough freelist tokens to give
space as needed for add -- and this doesn't actually -- for add and delete
operations. So this doesn't asymptotically change the size of the tokens sent over;
there was already a constant amount of information that needed to be sent for
each word that was changed. We're simply adding an extra bit of information
that needs to be sent, which is the -- this masked value.
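A sketch of how those freelist tokens might be consumed on the server (reusing `xor` and `F` from the earlier sketch), where entry i of the freelist is masked by F(KF, i); the field layout and the split of the mask are illustrative assumptions:

```python
def pop_free_slot(freelist, head_position, mask):
    # mask = F(KF, head_position), supplied by the user, who tracks the
    # current freelist count; one token unmasks exactly one slot.
    masked_next, masked_del_ptr = freelist[head_position]
    next_head = xor(masked_next, mask[:8])      # position of the next free entry
    del_slot = xor(masked_del_ptr, mask[8:16])  # its paired deletion-list entry
    return next_head, del_slot
```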
Now, you -- there's one thing that does change significantly because of this,
though, which is that the user now needs to keep track of the current count of the
freelist because otherwise it won't be able to produce the correct tokens to
decrypt the freelist. So this makes the protocol stateful in some sense.
You can still have other users search and give them their key, but if anyone
wants to add, they have to coordinate, and they have to keep track of this bit of
information.
So given all of that, let's see how to add a document. Here's the full data
structure with main and deletion. Our deletion index and our main index. And
the user generates a set of tokens as written at the top here. So as doc token
that is what's going to be added to the deletion index. And it also gives freelist
tokens so that you can find the freelist and move it so you can get the top entry.
And then it provides some templates that are the filled out -- partially filled out
main and deletion list entries. And it-whoops, provides a word token for each
word so you can find the list, server can find the list and then does the patching
trick to put the word on -- the document on the front of that list for that word.
And deletion is similar. So the deletion tone is somewhat simpler. All you need
is the document token, the dock key and some freelist tokens, right? So there is
the document token that lets you find the deletion list entry and the main list entry
is then for that entry which are part of some word. And they have their own
corresponding deletion list entries.
And then you perform patching. So given this entry, you have -- the server has
enough information to patch around it and to patch the deletion list entries and
patch it out and turn it into a new freelist entry at the top of the freelist. So these
are all very efficient operations. This is mostly computing a couple symmetric
operations and doing some XORs.
So finally index extension, which is trivial, given that you filled up the list or you
want to add more elements to the list as the user, you can encrypt a set of freelist
entries and permute them into an array that's just a block and send it over to the
server and then tell the server okay, here's the new top of the freelist, and here's
the bottom of the freelist so the server can patch it back in to the old top, and you
end up with an identically looking array. Okay. So that's the document based
scheme.
The word based scheme is almost identical. So I only need one slide to say it
given what we've said here. The only difference is since you're changing one
word at a time, you don't need to keep track of lists of words for a given
document. Instead you map in the index just -- you keep the key as a key on the
document identifier and the word concatenated together and then keep again
track of the X, P, N values. And now the index keeps, instead, the positions in the
list for the other deletion list entries. And you can perform exactly the same
algorithms given this structure.
So why would you use one of these over the other in actual, practical terms?
Well, what does word-based update buy you? The nice thing about word-based
update is the cost of your tokens is linear in the number of words that you're
changing whereas the document-based scheme, the only way to change a
document is to delete the entire document and then add a new document with all
the terms in it.
So this may seem useless but the value of the document based scheme is that
you don't have to keep state on the user's machine except of course for the small
piece of state for the freelist. So if you're actually implementing the scheme,
when you do doc -- word based update you have to keep track of the diffs every
time you modify a file and so there's a lot more work to be done on the user's
side. So actually in our implementation we currently use document based
update.
Okay. So I'm going to briefly touch on the outline -- lots of time? So I'll briefly
touch on the outline of the security proofs. The basic idea I think was
presented by Seny, which is you have this fairly large number of algorithms:
key generation and index generation, each of the trapdoor algorithms --
Trap S for search, Trap A for add, Trap D for delete -- which generate the tokens,
and then search and extend index. And the adversary has to
distinguish between two scenarios.
So one in which it's interacting with the SSC protocol and another in which
there's a simulator and it's interacting with this random oracle that the simulator
actually programs.
So I'm only going to show you how to write the simulator. It's really
straightforward. The simulator doesn't have any information. So as usual the
simulator is just going to make almost everything up. And the only trick is how
the simulator programs the random oracle. And the one thing I didn't draw in this
diagram is that there's some sort of leakage. So Seny already talked about the
query and the access pattern. The query pattern is you're leaking some unique
information about the things that you're searching. And the access pattern is
really which documents that are retrieved. Our algorithm, of course, leaks
somewhat more. So it leaks this update pattern, which is it leaks unique
identifiers for every word that you add, for each document.
And those up -- those identifiers can be correlated to the query patterns. You
can tell that you've added something you've searched for before. And it leaks
some information about the freelist, so it leaks when you reach the bottom, it
leaks the tail of the freelist. But given this, it's straightforward to do the
simulation. So index generation and index expansion the simulator just
generates large random arrays and keeps track of them.
The search is the same as in the original paper, but I'll say it briefly here. So
what gets leaked in our version of the search protocol is the number of results
that you get returned. Because we actually return just an encrypt -- set of
encryptions for the document identifiers. So what the simulator does is if it's
seen the search before, fine, it's easy, otherwise it chooses random entries in the
array and chooses a random index entry and then produces for its token masks
that will first unmask the index entry to point to the starting point of the
random entries -- the list entries it chose. And then it programs the random oracle
like this.
So suppose it wants to end up pointing X at a next value N. Well, it programs the
random oracle so that if you query it on KW, which is a random value it generated, and
R, it returns the right mask -- the stored field XORed with (P, N) -- and then when the
adversary queries the random oracle for these values it ends up decrypting just
to the pointers that are necessary.
Similarly add and delete are straightforward. Add chooses random values to add
and doesn't need to program the random oracle in any way. It just chooses masks
correctly. And deletion is -- given the unique IDs of the deleted words so it again
gets to choose some random key, key sub-D, and programs the random oracle
and that decrypts the deletion list correctly to point to the random values it chose.
So this is -- this is -- these are straightforward if somewhat complex given the
data structures. Okay. So we've implemented this. This is an implementation in
C++ on an Intel Xeon x64 under Windows Server 2008 R2. And the test here -- so some
of these results are highly dependent on the particular distribution of words that
you have, because there's some overhead in, say, adding a new list versus
adding words to that list, or adding a new document as opposed to adding words to
that document.
So the way we did this is something that we hope is reasonable with respect to
normal distributions in documents, which is what I'm going to call a double Zipf.
So you have a distribution of 10,000 documents according to Zipf and 500,000 unique
terms. And the way you generate your pairs is you draw a document from your
zipf distribution and then you just draw words until you add a new word that that
document hasn't had before.
So what this is basically doing is writing the document on the fly according to
the zipf distribution. And we end up with 250,000 unique document word pairs
for this task. So we did it 50 times. The plus-minus values are the sample standard
deviations.
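A sketch of that "double Zipf" generator; the Zipf exponent is an assumption, since the talk doesn't give one:

```python
import numpy as np

def double_zipf_pairs(num_docs=10_000, num_words=500_000,
                      num_pairs=250_000, exponent=1.2, seed=0):
    rng = np.random.default_rng(seed)
    seen, pairs = {}, []
    while len(pairs) < num_pairs:
        d = int(rng.zipf(exponent)) % num_docs        # draw a document
        words = seen.setdefault(d, set())
        while True:                                   # draw words until a new one
            w = int(rng.zipf(exponent)) % num_words
            if w not in words:
                words.add(w)
                pairs.append((d, w))
                break
    return pairs
```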
So in this case, index generation per unique document-word pair is 88
microseconds per pair. So this is something you can sit and wait for to finish.
And if you're running it on a server in the middle of the night, it would hardly be
noticeable.
Search per document. This is searching on the largest document in the zipf
distribution is 3.2 microseconds per document on the server. Both add and delete
have a dual cost because you have to generate these tokens. So add
has more cost on the client side because you're generating templates that get
inserted. So it's 38 microseconds on the client and 2.1 microseconds on the
server and then delete is the flip because deletion most of the work is done on
the server side.
But these numbers basically show that it's easily efficient enough to be used in
practice, and we have an implementation built over Windows. So this is a
straightforward obvious implementation perhaps but there's a backup service that
runs. And then it has a key manager and some indexer that hooks into the
Windows indexing service so that anything that can be indexed by Windows just
gets the terms out of that indexer and passes it into SSE.
And then these -- it generates tokens to send to a remote foreign server. Maybe
it's in the cloud, maybe it's some server say run by someone else in your family,
someone somewhere across the continent. And it just performs this backup.
And every time a file changes the Windows indexing service picks up those
changes and passes it to the SSE algorithm, which can generate the tokens to
get sent over to do the update. And so searches can be performed again across this
connection.
I mentioned the Curtmola et al scheme. This other scheme that we're closely
related to here is Sedghi, et al, which has an update mechanism, it's the only
other SSE scheme I know that has an update mechanism. But their update
mechanism basically has a bit pattern for each of the documents that's under an
XOR encryption. So that's what allows them to do the update as in our scheme.
But since they're using this bit pattern for the documents, they have to have a bit
for every document that is stored on the server. And so it's significantly less
efficient in terms of the actual storage costs.
So I've shown you a couple of dynamic SSE algorithms. Practical
implementation. And they support the add and delete operations that you
actually need. And I've shown two different versions of them, which you can then
trade off, performance versus leakage. Thanks.
[applause].
>>: Questions?
>> Tom Roeder: Time to go.
>>: I have a question. So you have this honest but curious assumption, right?
>> Tom Roeder: Right.
>>: There's no way to tell the [inaudible] deviation from protocol in [inaudible].
>> Tom Roeder: Right. Basically if you have a server they can give you the
wrong results, they can -- what they can't do is [inaudible].
>>: [inaudible] confidentiality, they can -- I mean, how easy is it to break the
scheme [inaudible].
>> Tom Roeder: I don't think -- yeah. They can't break the confidentiality I don't
-- I don't have a proof for that.
>>: Okay.
>> Tom Roeder: But I don't think that a [inaudible] server can break the
confidentiality, given that everything is masked with random values under keys it doesn't
know, and the masks are unique because of the randomness chosen for each of
them.
>>: So in terms of confidentiality, what assumption would be [inaudible].
>> Tom Roeder: So actually it may be that malicious server can learn more
about the distributions and get more leakage from your system. So for instance if
they can send you messages, if you're encrypting an e-mail, it can send you an
e-mail that says cryptography, right, and so if you index this naively then you're
going to learn some information about maybe this is cryptography. Right? This
particular piece.
>>: [inaudible].
>> Tom Roeder: Sure. Yeah.
>>: [inaudible] you don't necessarily know whether [inaudible] which is
detectable. So that's not [inaudible].
>> Tom Roeder: Well, it's --
>>: [inaudible].
>> Tom Roeder: You can check availability -- I mean, you can test that you're
getting true documents back, right, because you [inaudible] encryption scheme,
but you can't tell that they're the documents for that word. Right? Because he
could just return an arbitrary document.
>>: Right. But you can test that by keeping [inaudible].
>>: [inaudible]. You can test for if you decrypt your answer, you can see that
your keyword -- document you don't always [inaudible].
>> Tom Roeder: That's the other problem, yeah. That's the other side of the
availability problem.
>>: [inaudible] documents first, right? And then you decrypt it.
>>: Any other?
>>: [inaudible] right?
>> Tom Roeder: So when an entry is deleted, its space is returned to the freelist.
So I don't keep extra information for things that I've deleted.
>>: Okay. So you don't keep any pointer [inaudible].
>> Tom Roeder: No. This space for the thing that got deleted is then returned to
the freelist to be used for other things. So that just recycles the space.
>>: Okay.
>> Tom Roeder: Yes?
>>: [inaudible].
>> Tom Roeder: Right.
>>: [inaudible].
>> Tom Roeder: Well, so, yeah. I mean, for the particular -- I mean, you can
work it out, right? But the number -- so you can see it takes, what would it be,
like 250 seconds approximately, a little bit less, 200 seconds for the encryption of
the particular example that I had generating the index. And then the add and
delete are basically unnoticeable, right, even with the zipf distribution you have
some thousands of files, but they only take, you know, three microseconds each
on the server. So it's just not a noticeable amount of time to do those operations.
>>: [inaudible].
>> Tom Roeder: I mean, for add and delete you'd have to
be up in the hundreds of millions of entries. At that point your network
bandwidth is really what's going to kill you; sending the tokens across is going
to be much, much more expensive than the actual computation. Yeah.
>>: Any other questions? Great. Let's thank all the speakers today again.
[applause]