>>: All right. So let's start the next session. We've got two talks coming up. First is Kevin Bowers. He's going to be talking about cloudy without the chance of data loss -- almost misspoke on that -- and implementation. So I guess we'll be seeing lots of source code through this.

>> Kevin Bowers: And actually no source code, but sort of follow-on work to Ari's talk this morning. He made everything look nice and rosy and all happy. And the problem is when you get to implementing some of it, there are a few more challenges than you might expect. So considering where we are, I'm going to skip right past the motivation part and jump right into proofs of retrievability. So sorry for the context switch. We just did PDPs. Now we're jumping back to PORs. So try to bear with me.

So first a little setup here. This is sort of how we think about the world when we design PORs and HAIL and even some parts of RAFT. You've got a user here, Alice. She's got a collection of photos. And she's going to just line those up and make one massive file out of them. And for certain applications we might split this file into, you know, unique -- or evenly sized pieces. So Ari talked about a proof of retrievability where I can now take this file, I do some encoding on it and I store it in the cloud. But what we failed to mention was how do you do that encoding piece, what does that look like? And so I'm going to very quickly walk through what that encoding looks like.

So this is an adversarial encoding. And the reason we have to worry about the encoding here is we're going to take this file that we want to last forever, and we're going to give it to the people we trust the least. So this file needs to be resilient to anything they can try to do to it. So the first thing we have to do is we actually have to permute the file, we have to rearrange it. And the reason for this is because the files are so big, we can't compute error correcting information over the entire thing in one pass. So we're going to have to do striping. We're going to have to break it up into pieces and do an error correcting code over one piece and then take the next piece. But if I just did straight pieces, it's easy to identify where the piece boundaries are and utilize that structure to break the file. So I actually have to permute the file before I do my error correcting. Now I can do my error correcting over what is essentially a randomly rearranged file. I can now put the file back in the original order. And this is nice because now when I go to download it, it's sitting there in the order that I expect. But the error correcting information is computed over a random ordering. And now to make it more complicated I have to actually rearrange and encrypt that error correcting information. And this is again to hide the structure from the people who are going to be storing this file so they can't utilize it. And then we do our precomputation and encryption of responses. So we take a selection of blocks and compute an aggregation of those. That gets stored at the server, and those get encrypted as well. So that's -- question?

>>: The original papers -- you know, any error correcting code you [inaudible] -- so you're saying that that's incorrect, that we need to have this random permutation? I'm sorry, I just missed the point of why you need to rearrange --
>> Kevin Bowers: So we can use any error correcting code. The problem is they're all essentially linear codes. And so there's structure that's easily pulled out of those.
And so if I just broke the file up into pieces and did my error correction, then I could easily map a piece here to its error correcting information out on the end. And I could just knock out both of those and lose the whole file. Essentially what we're worried about is the adversary, the person we're going to store this file with, being able to take out very small pieces of the file and yet render the whole thing unrecoverable.

>>: [inaudible].
>> Kevin Bowers: We can delete a large piece of the file, but that we can detect. So that's the design of a POR. Any large corruption is detectable. But tiny corruptions are not.
>>: So [inaudible].
>> Kevin Bowers: That's basically impossible. There's no error correcting code that can do that anywhere close to the [inaudible].
>>: [inaudible].
>> Kevin Bowers: So we've got to do it in small pieces. And now we've got to hide where those pieces lie. Thank you.
>>: You want to do essentially [inaudible] and I assume that you can only do error correction on small blocks, how to do an encoding of a large file?
>> Kevin Bowers: Exactly. Without letting that structure out in the open.

So the first challenge this raises is, okay, so what is this permutation? How do I rearrange a file? Well, it's a pseudorandom permutation over some medium-size domain, whatever that means. And specifically the domain is the number of blocks in our file. So how big is our file? That also implies that we need a variable length PRP here. The problem is the smallest block ciphers that we know of that we could build this out of run over 64-bit values. So if we translate that to, you know, each byte is a block, we're talking about a 16 exabyte file. So we're going to have to do something to shrink this block cipher down to the domain that we're looking for.

There are a couple of techniques for doing this. The first is known as ordinal sorting. So take your numbers, pass them through your PRP, get some set of outputs, and then sort those outputs. So what happens here is I have taken my numbers, I've generated their outputs, I sort the outputs into their rank order, and then what happens here is one maps to three through my PRP, but that becomes the second element in my ordered set. So my mapping is from one to two. And you can follow that through for the others. So this last column here, this is the PRP that I could use over a smaller domain. Another option is known as cycle walking. So now you pass your number, your input, through the PRP and you see whether or not it lands back in the domain. And if it doesn't, you do it again.

>>: [inaudible] something smaller than 64 bits?
>> Kevin Bowers: Yes. So 64 bits gives you like 16 exabytes. We're looking at files that are giga- to terabytes probably.
>>: So you're trying to find a PRP on a particular --
>> Kevin Bowers: So we're trying to find a PRP [inaudible].
>>: What is K here, I guess, is the question.
>> Kevin Bowers: Exactly. So K is something like two to the 30th, maybe two to the 40th.
>>: Okay.
>> Kevin Bowers: And then, again, this changes depending on what file we're encoding. Yeah. So it's got to be variable. And so here, basically, if your mapping doesn't land in the domain you're looking for, just map it again until you get something that does.

So ordinal sorting is feasible up to about two to the 25th. Beyond that the memory requirements just get too expensive. And cycle walking -- so ordinal sorting sort of builds up from zero, and cycle walking sort of works down from the original size.
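A minimal sketch of those two domain-shrinking tricks, with a toy invertible 64-bit mixing function standing in for a real block cipher; the function, key handling, and names here are illustrative, not the construction actually used in the implementation:

```python
MASK64 = (1 << 64) - 1

def toy_prp64(key: int, x: int) -> int:
    # Toy stand-in for a 64-bit block cipher: an invertible mix of an odd
    # multiplier, a key XOR, and a rotation. Not secure -- it's only here so
    # that the two tricks below are well defined and terminate.
    x = (x * 0x9E3779B97F4A7C15) & MASK64    # odd multiplier, invertible mod 2^64
    x ^= key & MASK64
    return ((x << 23) | (x >> 41)) & MASK64  # 64-bit rotation, invertible

def ordinal_sorting_perm(key: int, n: int) -> list[int]:
    # Pass 0..n-1 through the big PRP, sort the outputs, and use each input's
    # rank as its image. This needs all n outputs in memory at once, which is
    # why it only scales to around 2^25 blocks.
    outputs = sorted((toy_prp64(key, i), i) for i in range(n))
    perm = [0] * n
    for rank, (_, i) in enumerate(outputs):
        perm[i] = rank
    return perm  # perm[i] is where block i ends up

def cycle_walk(key: int, x: int, n: int) -> int:
    # Apply the big PRP repeatedly until the value lands back inside [0, n).
    # The restriction of a permutation to its cycles is still a permutation,
    # but the expected number of steps is roughly 2^64 / n, which is why this
    # is only practical when n is close to the cipher's native domain.
    y = toy_prp64(key, x)
    while y >= n:
        y = toy_prp64(key, y)
    return y
```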
And it's only feasible down to about N over 500. So you've got your 64 bits; you can sort of think about what that would look like. It basically gets you down to a 32 petabyte file. And unfortunately there's a pretty big gap between the two. And most of the files that we would be thinking about probably land in that gap.

So the solution turns out to be a Feistel construction, and this is very similar to the work on format-preserving encryption. They use a very similar construction to do format-preserving encryption. Excuse me. So it will work for any domain that's a power of two, and then you can use cycle walking to get you down to the exact domain that you're looking for. There is a minimal loss in security, but in the way our PRP actually gets used there's no attack vector for that weakness. So getting less than perfect security is acceptable. This is what a Feistel construction looks like, for those of you who are interested. Basically you take your input, you split it into two, and you do a bunch of recombinations and shifts. And look, I did get you some pseudocode. And what we found is that we can get good enough security bounds using a six round implementation. So six splits, recombinations, permutes. Each section here is one round. So if you can do six of those, you get good enough security.

>>: [inaudible].
>> Kevin Bowers: So, yeah, there is a reduction based on the number of rounds to the indistinguishability bound, and you can get something acceptable for our application with six. In fact, the format preserving encryption uses even fewer rounds because they have a slightly weaker model. They actually redefine the security primitives because they are not comparing to PRPs.
>>: [inaudible].
>> Kevin Bowers: Yeah. Work by Luby and follow-on work by Rogaway, I believe. It's been a while.
>>: So the six round you're [inaudible].
>> Kevin Bowers: Thank you.
>>: And the [inaudible] and also [inaudible] and myself.
>> Kevin Bowers: Yes. Thank you.

So the second challenge with doing POR encoding is that we're using very large files that aren't going to fit in memory. And the rotational drives that they're sitting on are terrible at random access. So I can't go pull out pieces of the file and do my encoding. So I'm going to have to do my encoding incrementally; I'm going to have to read my data sequentially. So if these are the pieces I need to compute one error correcting symbol, I can't just grab them, because only a third of the file fits in memory. So I pull that third into memory using sequential reads from the disk, and I take some intermediate value. In this case, I can just take that one block. As I read more of the file into memory, I pull in the blocks that I need from there as well, and now I've got some intermediate value. This is some running computation that's going to let me eventually compute the error correcting symbol that I need. And lastly, I pull in the last one and complete the computation, and then this can be appended. And I've actually got to do that for every single error correcting symbol, all in one single pass through the file, because otherwise the performance suffers so badly that it's not worth doing any of this. So, in fact, you get to the point where you run out of temporary storage space on the server, or wherever you're doing your encoding, at about 12 gigabytes.
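A minimal sketch of that single-pass accumulation, with plain XOR parity standing in for the real error correcting code and a caller-supplied mapping from block index to stripe (which in the real encoding would come from the permutation); the block size and everything else here is illustrative:

```python
def encode_single_pass(path: str, block_size: int, num_stripes: int,
                       stripe_of_block) -> list[bytes]:
    # One sequential read over the file. Keep one small running accumulator
    # per error correcting symbol (one per stripe here) and fold each block
    # into its stripe's accumulator as it streams past. XOR parity is only a
    # stand-in for the real code; stripe_of_block(i) hides the stripe
    # boundaries, e.g. by using the permutation described earlier.
    parity = [bytes(block_size) for _ in range(num_stripes)]
    with open(path, "rb") as f:
        index = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            block = block.ljust(block_size, b"\x00")  # pad the final block
            s = stripe_of_block(index)
            parity[s] = bytes(a ^ b for a, b in zip(parity[s], block))
            index += 1
    # The parity symbols then get rearranged, encrypted, and appended. The
    # temporary state is num_stripes * block_size, and since the number of
    # stripes grows with the file, so does the memory the encoder needs.
    return parity
```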
So if you have a 12 gigabyte file, you've got to keep so much temporary information to be able to compute those symbols that you're still running out of memory on a standard desktop machine with maybe two gigs of RAM. So that becomes a limit. And then you've got to think about doing things like splitting up your file into 12 gigabyte chunks or whatever. So that's certainly a challenge that we faced in implementing a proof of retrievability.

Moving on to HAIL. A quick little reminder of what this looks like. You've got your file. You've got your encoding. You split that across your three providers and then you compute some extra information for your additional providers. And I already talked about this this morning in a bit less detail. The concern here again is that the order you do this stuff in matters, matters drastically. Because again, we've got to read these off of disk. So the first three blocks actually all show up on the same disk in sequential order. So it's not too bad to compute what's going to go on providers 4 and 5. And, in fact, I'm going to do that all the way down. And then I'm actually going to write that out to disk for all of them and read in each file separately to compute these. If I were to do it in the opposite order, I would incur all sorts of random accesses to disk, and basically it becomes impossible to do in any sort of realistic time. This as it is takes a couple of minutes for a several gigabyte file. And to do it in the other order would be on the order of days.

So now let's jump lastly to RAFT. And this is sort of our most recent work. And what you'll notice is that a lot of the issues we've come across are all based on this very limited device. We're using rotational drives. That's the standard today. Solid state is becoming more and more popular, but large data centers use rotational drives. They have a seek delay when reading or writing that's on the order of about four milliseconds. They have limited throughput, though it's not bad. And they're good at sequential access. So if things are read in order, they do a good job of that. They are absolutely terrible at random access. And all of these limitations are sort of covered up by several layers of caching. So there's caching that happens on the drive itself, there's caching that happens in the operating system. And these actually lead to a challenge in RAFT. So for example, if you were to pull 50 random blocks of varying sizes from a rotational drive, you would see times that basically follow these curves. And what's happening here is there are some caching effects that basically make everything up to 64 kilobytes look exactly the same. So the caching on the drive and in the operating system and anywhere else that caching is happening sort of limits the design choices that we can make. So whether we choose a 512 byte block or a 64 kilobyte block isn't going to matter. And that affects how you can do the testing for RAFT.

So let's jump into the testing for RAFT. Unlike Ari, I'm not going to give you a happy little fraternity and pizza example. This is actually going to be how RAFT actually works. So for those of you who didn't get the analogy this morning, here are the details. So again we do some encoding on the file. And we upload this to our cloud provider. The cloud provider is then going to store that file across a number of drives. And the idea is that each of these drives is an independent device. I can address them independently, I can get data back from them independently.
So I can specify one block on each and read those back in one unit time, whatever that unit time is. And, in fact, I can do this as many times as I want. And I can do it non-interactively. So if I take the output that I got back, if I take those blocks, I hash them all together, and use that hash to create some indices for my next blocks, I can do this in some sort of lock-step protocol. So each round I'm going to pull one block from each disk, compute the next blocks I need to go get, pull those blocks, again in one time step, and sort of continue this for as many time steps as I need. And I might have extra time steps here because I need to account for network variability or other sources of variability. We'll get into those in just a minute. The final one, the final set of blocks, can then be returned to our user. Alice can check that those are correct. And the other thing she checks, obviously, is the time. Did they come back quickly enough?

And this time is important because, remember, each read takes a time of one. If we have fewer drives than we're promised -- so now we're only using three drives, whereas we were promised four -- if we ask for four blocks, there's got to be a collision on at least one of the drives. So one of the drives is going to have to do a double read. And so we can do the first three reads all in one access, one time step, but that second read on the drive where the collision happened is going to take an additional time step. And again, we can magnify this by running this protocol in lock step over a number of rounds. So obviously now it's taken twice as long to get these back. And even if there's significant variability in the network, Alice can get these back, they'll be correct, but the time will be unacceptable. It will have taken too long regardless of network variability, because we can have this extended protocol that goes through numerous rounds.

So let's talk about sources of variability, because those are going to play a big role here. One source of variability is the read time from disk. So I said disk seek time was around four milliseconds. Well, then you add in the time to actually read the data from the disk, and you get a distribution that looks something like this. If the disk head happens to be sitting at the right spot, you can do it very, very quickly. In your average case you're somewhere around six and a half, seven milliseconds. Maybe you have to wait for a whole disk rotation after you've moved your disk head, in which case you end up here. And this graph actually continues to go out, because there can be misreads and rereads and data that got remapped to other locations on the drive. So there is a whole host of factors that make this not a perfect picture.

>>: [inaudible].
>> Kevin Bowers: No. This is with caching. So this is -- yeah, this is exactly the system you would expect to be running on, and so we're testing as close as we can to the real world.

Well, so why is that a problem? Here we are again back in our three drive model where the server has promised us that our data is being striped across four drives. Well, with caching it's possible that they all read at different rates. And, in fact, I can do a double read before one of my single reads finishes. And that becomes problematic because now the server can cheat: maybe he's got two desktop class drives and one enterprise class drive, and the odds of him being able to do that double read increase significantly.
And do it without being detected. So we've done some tests over drives and then sort of simulated that out to a full scale system. And what we found is that you have to increase the number of steps, obviously, to detect the cheating. And we sort of get to the point at 17 drives -- or 17 steps, rather -- where the adversary's chance of fooling you becomes essentially negligible. So what I've graphed here is the difference between the best adversarial response time over 500 runs and the worst expected case for a correct server over 500 runs. So there's a distribution for correct and a distribution for incorrect, and I'm graphing their two closest points, the min and the max. And what you see is that at the point where you get up to 50 steps in your protocol, there's an 87 millisecond separation. If we look at that point in a little more detail, this is the actual distribution you get over 500 runs: your honest server using five drives will give you this distribution, and your cheating server will give you that. So at 50 steps, you have an 87 millisecond gap that you can play with. Which is good, because there are other sources of variability, like network latency.

So these are round trip pings from Boston to Santa Clara. Most of them fall in a very, very small window. But there are others that you have to account for. And there are a couple of ways you can think about doing that. If you end up in an interval with very low -- or very high ping times, rather -- you can just rerun the test at a later point. Or, you know, you can test a nearby server -- say you know that your data is being stored in Seattle by Amazon, so you ping Microsoft to get some sense of what that cross-country latency is going to be and use that. Or you can just design enough steps into your protocol that you can overcome anything that you would happen to see. And obviously this is subject to how far the data is travelling. So Boston to Santa Clara is basically a cross-country trip. If you do it within the northeast, you get a very similar graph, but everything's scaled down. And if you do Boston to China, you get again very similar data, but expanded. So you're looking at 700 or 800 milliseconds round trip from Boston to China, and in the northeast you're under 100 almost always. So that's all something you have to design into your protocol.

So jumping to the summary. PORs, as I already talked about this morning, let us efficiently test whether our data is actually being stored. To do that you've got to have a PRP that works on any size domain, and building those for the size of files we looked at was nontrivial. Also, that encoding must be done sequentially. The very first version tried to do random access from disk, and that didn't work at all. HAIL then lets us build our own fault tolerance into the system, so we can take a bunch of cheap storage providers and cobble them together into something that gives us fault tolerance. But again, you've got to pay attention to how you do the encoding if you're going to do that. This is being done on a client machine that has limited computational power and limited RAM, and so the order in which you do that encoding can actually make an enormous difference. And then finally, RAFT is really designed to allow us to test that the data is being stored with fault tolerance. If your cloud provider is promising you fault tolerance, how do you actually test that? And by fault tolerance in this case we actually mean resiliency to drive failures.
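A minimal sketch of the verifier's side of that lock-step test: derive each round's block indices from a hash of the previous round's responses so rounds can't be precomputed, time the whole run, and compare against a budget calibrated from the honest-server distribution. The hashing rule, the fetch_round callback, and the time budget are illustrative placeholders, not the protocol's actual parameters.

```python
import hashlib
import time

def next_indices(prev_blocks: list[bytes], num_drives: int,
                 blocks_per_drive: int, round_no: int) -> list[int]:
    # One block index per claimed drive, derived from the previous responses.
    seed = hashlib.sha256(round_no.to_bytes(4, "big") + b"".join(prev_blocks)).digest()
    indices = []
    for d in range(num_drives):
        h = hashlib.sha256(seed + d.to_bytes(4, "big")).digest()
        indices.append(int.from_bytes(h[:8], "big") % blocks_per_drive)
    return indices

def run_raft_challenge(fetch_round, challenge: bytes, num_drives: int,
                       blocks_per_drive: int, rounds: int,
                       time_budget_s: float) -> bool:
    # fetch_round(round_no, indices) asks the server for one block per claimed
    # drive and returns them; checking that the returned blocks are actually
    # correct is a separate step not shown here.
    start = time.monotonic()
    prev = [challenge]
    for r in range(rounds):
        indices = next_indices(prev, num_drives, blocks_per_drive, r)
        prev = fetch_round(r, indices)
    elapsed = time.monotonic() - start
    # A server striping over fewer drives than claimed must double-read on
    # some drive almost every round, and over enough rounds that pushes it
    # past any budget chosen from the honest timing distribution.
    return elapsed <= time_budget_s
```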
However, drive caching restricts our design space. The seek times to read data off of a drive, and the variability you see in those, force the need for a lock-step protocol that can run multiple rounds. And then you also have to account for the network variability between you and whoever you're testing. And then these are the publications we've put out on this stuff. Questions?

>>: Can you do it as far as [inaudible] the fault tolerance is [inaudible] and you want to code it [inaudible]?
>> Kevin Bowers: So in a POR we're essentially doing a single pass read over the file, so, yeah, if you can keep up with the encoding then, yeah, you could do it on a stream. HAIL actually uses a two pass encoding, so you would have to store some temporary information to be able to do that. And RAFT -- actually I hadn't thought about whether you could do it in a streaming case with RAFT. Probably. The encoding is very similar to a POR. It's more in the actual design of the test that you have to account for variability. So that you could probably do in a streaming model as well.

>>: So how [inaudible] how hard would it be for the provider to actually add some noise themselves to the signal, basically? [inaudible] times and various locations. I mean, can you account for that or are you basically [inaudible]?
>> Kevin Bowers: So that -- that is very much inside the threat model, that the adversary will add noise, will delay responses in order to gain an advantage. Basically, I assume you're talking mostly about RAFT here?
>>: Yes.
>> Kevin Bowers: Yeah. So the primary assumption there is that we know what class of drives the storage provider is using. So there's a big difference between desktop class drives, enterprise class drives and solid state drives in terms of their response characteristics. But within a class, the response times are close enough that you can do the tests, assuming that you are actually getting the hardware that you are expecting. And you sort of have to go back to an economic motivation to assume that. So if they're going to promise you desktop class drives, you're going to pay desktop class rates, and they thus can't afford to actually have enterprise class drives. And the same going from enterprise to solid state. So if you assume, based on how much you're paying, that you know what sort of drives your data are being stored on, then you can do this regardless of whether or not they're adding noise, because we have a good characterization of how those drives operate. Yeah?

>>: So you said you permute the data with a PRP and --
>> Kevin Bowers: Uh-huh.
>>: Okay. I don't see why it needs pseudorandomness at all for the permutation. Because you encrypt this -- you permute it, you [inaudible], there's nothing about the permutation that leaks. So I don't see why [inaudible].
>> Kevin Bowers: So the reason you have to permute before you do --
>>: I just don't understand why you need computational hardness in [inaudible].
>> Kevin Bowers: Right. So exactly. So that's why -- that's why we can go down to a Feistel construction and still achieve the bounds that we're looking for. So you're right. We don't -- we don't need strict, you know, PRP guarantees. But we need to be close enough that we can -- we can claim, with overwhelming probability, that the best an adversary can do ends up being random corruptions. And so --
>>: [inaudible] PRP but the bounds [inaudible] I don't see [inaudible] primitive at that point. Are you permuting? [inaudible].
>>: [inaudible] combinatorial [inaudible].
>>: [inaudible] because you permute back, right?
>> Kevin Bowers: Yeah. Actually --
>>: [inaudible].
>> Kevin Bowers: Yeah. It's been a couple of years. I don't remember the motivation or the proof we have behind it, but it's certainly something we should talk about offline. That's an interesting thought. Because, yeah, if we could avoid that cost, it makes all sorts of things easier.

>>: So [inaudible] can you modify RAFT or use [inaudible] RAFT for something like counting the hard drives that the cloud is using somehow? Or alternatively, like fingerprinting hard drives and having detailed measurements on particular hard drives, like timing?
>> Kevin Bowers: So fingerprinting would be harder. I'm not sure it's impossible. It's not something we've looked at. In terms of --
>>: [inaudible].
>> Kevin Bowers: Yes. In terms of counting, that should be possible. You should be able to design the protocol in such a way that the counting becomes feasible. It's not something we've done, but it's certainly an interesting thought.
>>: Do you have a sense of whether providers are concerned about these types of things, like people being able to count their resources?
>> Kevin Bowers: So this is all just my sense on this, but my assumption would be that if you're going to advertise multiple copies, then you would be less concerned about someone actually trying to count them. And, you know, if that's something you don't advertise, then it does become an issue. You know, especially when we get back to the deduplication question. If I've got a file that's been deduped down to one hard drive, I'm not going to advertise that to the world. So if I have somebody trying to count the number of drives I'm using, that could raise some flags.
>>: Let's thank the speaker again. [applause].

>>: Okay. So for the last talk of the day, we've got Tom Roeder talking about dynamic symmetric searchable encryption.

>> Tom Roeder: Thanks. So I work in the eXtreme Computing Group here. And I like to say that our job is to take the bleeding edge and make it cutting edge. So that's the goal here with searchable symmetric encryption. There have been a lot of searchable symmetric encryption protocols out there, including Seny's and others, and we came to this with the idea that we try to make, basically, the product teams at Microsoft use good crypto and do interesting things with it. So we're trying to do things that people are going to use and that you're going to be able to use. Before I start, I should thank my collaborators. The theory and algorithms here were jointly done with Seny. And in terms of practical design and implementation, because we've implemented this, it's Mira Belenkiy, Karen Easterbrook, Larry Joy, Kevin Kane, Brian LaMacchia, and Jason Mackay. As usual with systems-like projects, we had lots of people here.

So you've all seen this story about cloud backup, and I don't really need to talk about this. And you've even seen from Seny the motivation that I really wanted to give you, which is searchable symmetric encryption. But I'll say it again. So the basic idea in searchable symmetric encryption is you're storing your files in the cloud and you want to be able to search over them -- search on some terms from those files, not necessarily the words in a document, but maybe the artist if this is an MP3 file, the composer, something that you want to search on in these documents.
And the real problem that's out there is that in a real file system or in a real e-mail system, these documents change all the time. I'm modifying my files, and the thing that I've uploaded is no longer correct, so I need some way to update the index that I uploaded to the cloud. And the implementations, when we started looking at this, don't support that. In some of them there's just no update supported at all, and in others there are efficient update operations, but they're not efficient in terms of their storage. So for instance you have to store an index whose size is linear in the total number of documents times the total number of unique words across all documents. So we're going to talk about two different schemes with efficient update -- efficient both in storage and in terms of computation on both ends.

And there are two ways we can do this. One is we can update one word at a time. So if you, say, remove all instances of the word "the" from a document, then you need to remove that from the index. And we can do just that, that one word. Another version, which may naively seem not useful at all compared to this first version, is suppose I modify a document, then maybe I'm just going to delete the old document and upload all the terms for the new document. And I'll explain in a minute why that might be a valuable way to see this problem.

So the basic overview of what we're going to do is look at these protocols, and that's probably going to take most of our time. And if we end up having time, I'll give you a brief outline of the security proofs and then talk about the implementation that we have and some performance numbers.

So as a more explicit reminder of the encrypted search problem and how I'm going to treat it, the user has some collection of documents where each document D here is what I'm going to call a document identifier. And this isn't just the name of the document. You might want to keep more information than that. I mean, you might want to keep a little, say, 140 character twitter summary of the document's content, maybe some unique identifier, a timestamp. The point is this is going to be a fixed size representation of the document so that when you encrypt it, it doesn't leak any information about the document at all.

And then the goal in what we're going to produce -- and these are all for the document-based one, which I'm going to focus on when I describe the protocols -- is that search on some word will return to the user a set of encrypted document identifiers. The user can then decrypt them, read their summaries and decide which ones they want. This is slightly different than what Seny presented, where you get the documents back. This is more like Bing search, right? I search something, I get back a list of links, and then I click and actually go down to the thing that I'm looking for. Then there's add -- you want to add a document with a word set -- delete a document, and then extend index. It will become obvious in a second why this is important, but our indexes are padded to a particular fixed size, to hide some of this information about the number of documents and the number of unique words. And so you need some way to extend that once you actually run out of space. So the client is going to generate some of this information, send some tokens from these algorithms over, and then get some results back. This is the basic paradigm.

Now, we're basing this on one of the schemes from the 2006 CCS paper, Curtmola, Kamara and Ostrovsky.
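To make the shape of those operations concrete before getting into the data structures, here is a hypothetical sketch of the client-side interface they imply; the class, method names, and opaque token types are invented for illustration and are not the actual API of the implementation described later:

```python
class SSEClient:
    """Hypothetical sketch of the client-side operations: the client holds the
    keys, produces tokens, and the server only ever sees tokens and ciphertexts."""

    def search(self, word: str) -> bytes:
        # Token letting the server find and walk the word's list of
        # encrypted document identifiers (which only this client can decrypt).
        ...

    def add(self, doc_id: bytes, words: set[str]) -> bytes:
        # Tokens letting the server insert one entry per word for this document.
        ...

    def delete(self, doc_id: bytes) -> bytes:
        # Tokens letting the server patch the document's entries out of the index.
        ...

    def extend_index(self, extra_entries: int) -> bytes:
        # An encrypted block of fresh padded entries, since the index is padded
        # to a fixed size and eventually runs out of room.
        ...

    def decrypt_results(self, encrypted_ids: list[bytes]) -> list[bytes]:
        # Decrypt the returned document identifiers locally.
        ...
```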
And the main idea in our slightly modified version of the scheme is that each word gets mapped to a token. I should say -- I'm going to say pseudorandom function sometimes here, and I don't really mean it, being a practical cryptographer. I always depend on the kindness of random oracles. And so this is really going to be HMAC-SHA256 that I'm going to be dealing with. And tokens then map to some initial position in an encrypted array. And each array -- each list entry here points to another position.

So one assumption that we're going to definitely depend on in all these protocols is a stronger assumption than is normal, which is that we're going to assume an honest but curious server. So we're going to assume the server is going to follow the protocol, and we're going to assume that they're just trying to learn as much information as possible. Maybe there's some economic argument you can make about why they don't want to be caught. And maybe you can apply some of the protocols we heard about today in parallel to solve that other problem.

So here's a modified version of the Curtmola, et al, scheme. There's an index and there are list entries. And the index then has a key that's the output of this function F, which is really HMAC-SHA256 with some key, on the word. And it points to the start position of a list of the documents that contain that word. And each list entry, at a high level, is the encryption of the pointer to the next element in the list under some key that depends on that word. And then it also has a second component, which is the encryption of the document identifier under a key known only to the user who generated this.

So how do you do search in this context? Well, the user has a word and they have these keys, KC, KB, KG, and they can then derive -- from some key derivation function with key KG -- a key KW for that word. And then they construct a token, F_KC(W), F_KB(W), and KW, and they can send this over to the server, who uses F_KC(W) to look up in the index the encryption of -- or really a masked version of -- the start position, which it can then unmask to point to the first element in the list, and use the decryption function to walk the list and get back these encrypted document identifiers. This didn't actually reveal anything about the documents; it simply reveals a count of the number of documents. These are sent back to the user, who can then decrypt them. So this is really a modified version of that scheme.

Now, the question is how do we do update? Well, the problem is we need to change these lists over time. You've changed a document, so we need to change the lists -- say if we've added a word to that document, we need to add that document to one of the lists. But all the pointers in our lists are encrypted. And we don't really want to go giving away KW, the key to that list, every time we want to change it; otherwise we would be revealing every document, or at least the count of documents, for that word. So instead, we're going to depend on this honest but curious assumption and choose a particular implementation for our encryption scheme, which is a simple mask -- a stream cipher, effectively. So you have a random value R, and then the pointers for the previous value and the next value, masked by F sub-KW of R. Right? So suppose you want to remove X from this list of pointers.

>>: [inaudible].
>> Tom Roeder: That's meant to just be grouping. There's no inner product or anything. These are just brackets.
>>: [inaudible].
>> Tom Roeder: Yeah.
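A minimal sketch of those masked list entries and the server-side walk, before getting to the removal trick; HMAC-SHA256 is the PRF, as in the talk, but the field layout, pointer width, and end-of-list convention are illustrative:

```python
import hashlib
import hmac
import os

PTR = 8  # illustrative pointer width in bytes; position 0 marks end of list

def prf(key: bytes, msg: bytes) -> bytes:
    # The PRF from the talk: HMAC-SHA256.
    return hmac.new(key, msg, hashlib.sha256).digest()

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def make_entry(kw: bytes, prev: int, nxt: int, enc_doc_id: bytes) -> bytes:
    # One list entry: a fresh random R, the (prev, next) pointers masked by
    # F_KW(R), and the document identifier encrypted under the user's own key.
    r = os.urandom(16)
    ptrs = prev.to_bytes(PTR, "big") + nxt.to_bytes(PTR, "big")
    return r + xor(ptrs, prf(kw, r)[:2 * PTR]) + enc_doc_id

def walk_list(array: dict[int, bytes], kw: bytes, start: int) -> list[bytes]:
    # Server-side search: unmask each entry's pointers with F_KW(R), follow
    # the 'next' pointer, and collect the encrypted document identifiers.
    # Nothing about the documents is revealed beyond how many there are.
    results, pos = [], start
    while pos != 0:
        entry = array[pos]
        r, masked, enc_id = entry[:16], entry[16:16 + 2 * PTR], entry[16 + 2 * PTR:]
        ptrs = xor(masked, prf(kw, r)[:2 * PTR])
        results.append(enc_id)
        pos = int.from_bytes(ptrs[PTR:], "big")
    return results
```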
So suppose the server can get P and N -- and, we'll say, X, so it knows which one to delete. It's really easy to change this. It can take X XOR N, and X XOR P, and XOR them against those pieces, and since XOR is commutative, that changes the pointers under the encryption, and now you've pulled out X. So it's a simple trick. How the server gets its hands on P and N is the real question.

So to do that, we're going to add a new data structure. And it's basically going to be a mirror image of the data structure we already had. So our old data structure mapped words to the documents that contain those words. The new data structure is going to map documents to the list of words in that document. And it's going to contain, for each word, the pointer to the position of that word's entry in its list, the pointers to the previous and next entries in that list, and a little bit of bookkeeping information that you can see in this overly complex specification.

So let's look at it in a picture. So here's the original list that we saw. Here's our new data structure. This orange bar is the deletion list entry for this entry in the list, which we'll see in a second. And so, yeah, it points to X, and it points to the previous value and the next value. It also, of course, is a list, so it points to the other words in this document. And finally, notice that this deletion entry contains pointers to previous and next for each word entry. So we have to patch the deletion list entries when we patch the entries in the other list too. So additionally, this deletion list entry also contains a pointer to the deletion list entry for N and a pointer to the deletion list entry for P, so we can patch those at the same time. But all the patching is done according to the simple XOR trick. Okay?

So there's still something missing. Which is we need some way to be able to keep track of the empty space in this padded index. In the original scheme you generate these lists and then you just pad the list with randomness. And the problem is if you need to add something to the list, you need to know which space is unused so you can get that space and add it. So the simple and perhaps obvious solution is to encrypt the list of free entries -- the freelist -- in this index. But we can't just use the simple encryption that we used before, right? Remember that for words we encrypted them with a key that depended on that word. So if we did that for the freelist, and we just had a single key for the entire freelist, then we would give that to the server when it needed some information, and it could decrypt the entire freelist and learn all that information that we were trying to hide, as soon as we added a single document, which would defeat the whole purpose.

So instead we have a unique token that maps to the freelist. This is uniquified, so it's not really exactly that, and so you can still search on the real word "freelist". It's some list, and each of the main entries also points to a unique deletion list entry. Note that there's a one-to-one mapping between used main entries and used deletion list entries, because each deletion list entry is set up for one main list entry. So you can also set up a one-to-one map for the free entries, and you can just keep track of them with a single list. So instead of keeping a single key, we just use some more masks, so that each freelist entry contains a pointer to the next free entry.
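Continuing the earlier sketch (reusing its xor helper and entry layout), here is how the server applies the patching trick once it has been handed P, N, and X; it never needs the list key, because the patch is XORed straight into the masked pointer fields:

```python
def patch_out(array: dict[int, bytes], x: int, prev: int, nxt: int) -> None:
    # Remove entry x from the encrypted doubly linked list without any keys.
    # prev's masked 'next' field currently hides x; XORing (x ^ nxt) into it
    # makes it decrypt to nxt instead, and likewise for nxt's 'prev' field.
    patch_next = (x ^ nxt).to_bytes(PTR, "big")
    patch_prev = (x ^ prev).to_bytes(PTR, "big")
    if prev != 0:
        e = array[prev]
        nxt_field = e[16 + PTR:16 + 2 * PTR]
        array[prev] = e[:16 + PTR] + xor(nxt_field, patch_next) + e[16 + 2 * PTR:]
    if nxt != 0:
        e = array[nxt]
        prev_field = e[16:16 + PTR]
        array[nxt] = e[:16] + xor(prev_field, patch_prev) + e[16 + PTR:]
    # The deletion index entries get patched the same way, and x's slot can
    # then be recycled onto the freelist.
```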
Each freelist entry also contains a pointer to the corresponding deletion entry, and it's masked by the output of our HMAC, on a new key, of its position in the freelist. So that means that the user can dole out just enough freelist tokens to give space as needed for add and delete operations. So this doesn't asymptotically change the size of the tokens sent over; there was already a constant amount of information that needed to be sent for each word that was changed. We're simply adding an extra bit of information that needs to be sent, which is this masked value. Now, there's one thing that does change significantly because of this, though, which is that the user now needs to keep track of the current count of the freelist, because otherwise it won't be able to produce the correct tokens to decrypt the freelist. So this makes the protocol stateful in some sense. You can still have other users search and give them their key, but if anyone wants to add, they have to coordinate, and they have to keep track of this bit of information.

So given all of that, let's see how to add a document. Here's the full data structure with main and deletion -- our deletion index and our main index. And the user generates a set of tokens as written at the top here. So there's a doc token -- that is what's going to be added to the deletion index. And it also gives freelist tokens so that the server can find the freelist and walk it to get the top entry. And then it provides some templates that are the partially filled out main and deletion list entries. And it -- whoops -- provides a word token for each word so the server can find the list and then do the patching trick to put the document on the front of that list for that word.

And deletion is similar. The deletion token set is somewhat simpler. All you need is the document token, the doc key and some freelist tokens, right? So there's the document token that lets you find the deletion list entries and then the main list entries for that document, each of which is part of the list for some word. And they have their own corresponding deletion list entries. And then you perform patching. So given this entry, the server has enough information to patch around the deletion list entries, patch it out, and turn it into a new freelist entry at the top of the freelist. So these are all very efficient operations. This is mostly computing a couple of symmetric operations and doing some XORs.

So finally, index extension, which is trivial. Given that you've filled up the list, or you want to add more elements to the list as the user, you can encrypt a set of freelist entries, permute them into an array that's just a block, send it over to the server, and then tell the server, okay, here's the new top of the freelist, and here's the bottom of the freelist so the server can patch it back into the old top, and you end up with an identically looking array.

Okay. So that's the document based scheme. The word based scheme is almost identical, so I only need one slide to say it given what we've said here. The only difference is, since you're changing one word at a time, you don't need to keep track of lists of words for a given document. Instead, in the index you just keep the key as a function of the document identifier and the word concatenated together, and then again keep track of the X, P, and N values. And now you use the index keys instead of the positions in the list for the other deletion list entries.
And you can perform exactly the same algorithms given this structure. So why would you use one of these over the other, in actual practical terms? Well, what does word based update buy you? The nice thing about word-based update is that the cost of your tokens is linear in the number of words that you're changing, whereas in the document-based scheme, the only way to change a document is to delete the entire document and then add a new document with all the terms in it. So this may seem useless, but the value of the document based scheme is that you don't have to keep state on the user's machine, except of course for that small piece of state, the freelist count. So if you're actually implementing the scheme, when you do word based update you have to keep track of the diffs every time you modify a file, and so there's a lot more work to be done on the user's side. So actually, in our implementation, we currently use document based update.

Okay. So I'm going to briefly touch on the outline -- lots of time? So I'll briefly touch on the outline of the security proofs. The basic idea, I think, was presented by Seny, which is that you have this fairly large number of algorithms: key generation and index generation, and then each of the trap versions generates a token. So TrapS for search, TrapA for add, TrapD for delete, and then search and extend index. And the adversary has to distinguish between two scenarios. So one in which it's interacting with the SSE protocol, and another in which there's a simulator and it's interacting with this random oracle that the simulator actually programs. So I'm only going to show you how to write the simulator. It's really straightforward. The simulator doesn't have any information. So as usual the simulator is just going to make almost everything up. And the only trick is how the simulator programs the random oracle.

And the one thing I didn't draw in this diagram is that there's some sort of leakage. So Seny already talked about the query pattern and the access pattern. The query pattern is that you're leaking some unique information about the things that you're searching. And the access pattern is really which documents are retrieved. Our algorithm, of course, leaks somewhat more. So it leaks this update pattern, which is that it leaks unique identifiers for every word that you add, for each document. And those identifiers can be correlated to the query patterns: you can tell that you've added something you've searched for before. And it leaks some information about the freelist, so when you reach the bottom, it leaks the tail of the freelist.

But given this, it's straightforward to do the simulation. So for index generation and index extension, the simulator just generates large random arrays and keeps track of them. The search is the same as in the original paper, but I'll say it briefly here. So what gets leaked in our version of the search protocol is the number of results that you get returned, because we actually return just a set of encryptions of the document identifiers. So what the simulator does is, if it's seen the search before, fine, it's easy; otherwise it chooses random list entries in the array and a random index entry, and then produces, for its token, masks that will first unmask the index entry to point to the starting point of the random list entries it chose. And then it programs the random oracle like this. So suppose it wants to end up pointing X at a next value N.
Well, it programs the random oracle so that if you query it on KW -- which is a random value it generated -- and R, then you get back the value that unmasks to exactly the pointers it needs, and so when the adversary queries the random oracle for these values, it ends up decrypting to just the pointers that are necessary. Similarly, add and delete are straightforward. Add chooses random values to add and doesn't need to program the random oracle in any way; it just chooses the masks correctly. And deletion is given the unique IDs of the deleted words, so it again gets to choose some random key, K sub-D, and programs the random oracle so that it decrypts the deletion list correctly to point to the random values it chose. So these are straightforward, if somewhat complex given the data structures.

Okay. So we've implemented this. This is an implementation in C++ on an Intel Xeon x64 under Windows Server 2008 R2. And the tests here -- so some of these results are highly dependent on the particular distribution of words that you have, because there's some overhead in, say, adding a new list versus adding words to that list, or adding a new document as opposed to adding words to that document. So the way we did this is something that we hope is reasonable with respect to normal distributions of documents, which is what I'm going to call a double Zipf. So you have a distribution of 10,000 documents according to Zipf and 500,000 unique terms. And the way you generate your pairs is you draw a document from your Zipf distribution and then you just draw words until you add a new word that that document hasn't had before. So what this is basically doing is writing the documents on the fly according to the Zipf distribution. And we end up with 250,000 unique document-word pairs for this task. We did it 50 times; the plus-or-minus values are the sample standard deviations.

So in this case, index generation is 88 microseconds per unique document-word pair. So this is something you can sit and wait for to finish. And if you're running it on a server in the middle of the night, it would hardly be noticeable. Search -- this is searching on the largest document in the Zipf distribution -- is 3.2 microseconds per document on the server. Add and delete both have costs on each side because you have to generate these tokens. So add has more cost on the client side because you're generating the templates that get inserted: it's 38 microseconds on the client and 2.1 microseconds on the server. And then delete is the flip, because for deletion most of the work is done on the server side. But these numbers basically show that it's easily efficient enough to be used in practice, and we have an implementation built over Windows.

So this is a straightforward, obvious implementation perhaps, but there's a backup service that runs. And then it has a key manager and an indexer that hooks into the Windows indexing service, so that for anything that can be indexed by Windows, we just get the terms out of that indexer and pass them into SSE. And then it generates tokens to send to a remote server. Maybe it's in the cloud, maybe it's some server, say, run by someone else in your family, someone somewhere across the continent. And it just performs this backup. And every time a file changes, the Windows indexing service picks up those changes and passes them to the SSE algorithm, which can generate the tokens that get sent over to do the update. And so searches can be performed again across this connection.
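For concreteness, a small sketch of the "double Zipf" pair generation used for those measurements; the document and term counts and the target of 250,000 pairs come from the talk, while the Zipf exponent and the sampler itself are assumptions:

```python
import itertools
import random

def zipf_sampler(n: int, s: float = 1.0):
    # Bounded Zipf over ranks 1..n with weight 1/rank**s; s is an assumption,
    # since the talk only says the documents and terms are Zipf-distributed.
    cum = list(itertools.accumulate(1.0 / (rank ** s) for rank in range(1, n + 1)))
    return lambda: random.choices(range(n), cum_weights=cum, k=1)[0]

def double_zipf_pairs(num_docs: int = 10_000, num_terms: int = 500_000,
                      target_pairs: int = 250_000) -> set[tuple[int, int]]:
    # Draw a document from one Zipf distribution, then keep drawing words from
    # another until this document gets a word it hasn't had before --
    # effectively writing the documents on the fly.
    draw_doc, draw_word = zipf_sampler(num_docs), zipf_sampler(num_terms)
    pairs: set[tuple[int, int]] = set()
    while len(pairs) < target_pairs:
        doc = draw_doc()
        while True:
            pair = (doc, draw_word())
            if pair not in pairs:
                pairs.add(pair)
                break
    return pairs
```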
I mentioned the Curtmola et al scheme. The other scheme that we're closely related to here is Sedghi, et al, which has an update mechanism; it's the only other SSE scheme I know of that has an update mechanism. But their update mechanism basically has a bit pattern for each of the documents that's under an XOR encryption. So that's what allows them to do the update, as in our scheme. But since they're using this bit pattern for the documents, they have to have a bit for every document that is stored on the server. And so it's significantly less efficient in terms of the actual storage costs.

So I've shown you a couple of dynamic SSE algorithms and a practical implementation. And they support the add and delete operations that you actually need. And I've shown two different versions of them, which you can then trade off, performance versus leakage. Thanks. [applause].

>>: Questions?
>> Tom Roeder: Time to go.
>>: I have a question. So you have this honest but curious assumption, right?
>> Tom Roeder: Right.
>>: There's no way to tell the [inaudible] deviation from protocol in [inaudible].
>> Tom Roeder: Right. Basically if you have a server, they can give you the wrong results, they can -- what they can't do is [inaudible].
>>: [inaudible] confidentiality, they can -- I mean, how easy is it to break the scheme [inaudible]?
>> Tom Roeder: I don't think -- yeah. They can't break the confidentiality, I don't -- I don't have a proof for that.
>>: Okay.
>> Tom Roeder: But I don't think that a [inaudible] server can break the confidentiality, given that everything is masked with random values under keys that it doesn't know, and the values are unique because of the randomness for each of the chosen masks.
>>: So in terms of confidentiality, what assumption would be [inaudible]?
>> Tom Roeder: So actually it may be that a malicious server can learn more about the distributions and get more leakage from your system. So for instance if they can send you messages -- if you're encrypting e-mail, they can send you an e-mail that says cryptography, right, and so if you index this naively then they're going to learn some information about -- maybe this is cryptography, right? This particular piece.
>>: [inaudible].
>> Tom Roeder: Sure. Yeah.
>>: [inaudible] you don't necessarily know whether [inaudible] which is detectable. So that's not [inaudible].
>> Tom Roeder: Well, it's --
>>: [inaudible].
>> Tom Roeder: You can check availability -- I mean, you can test that you're getting true documents back, right, because you [inaudible] encryption scheme, but you can't tell that they're the documents for that word. Right? Because he could just return an arbitrary document.
>>: Right. But you can test that by keeping [inaudible].
>>: [inaudible]. You can test, if you decrypt your answer, you can see that your keyword -- document you don't always [inaudible].
>> Tom Roeder: That's the other problem, yeah. That's the other side of the availability problem.
>>: [inaudible] documents first, right? And then you decrypt it.
>>: Any other?
>>: [inaudible] right?
>> Tom Roeder: So when an entry is deleted, its space is returned to the freelist. So I don't keep extra information for things that I've deleted.
>>: Okay. So you don't keep any pointer [inaudible].
>> Tom Roeder: No. The space for the thing that got deleted is then returned to the freelist to be used for other things. So that just recycles the space.
>>: Okay.
>> Tom Roeder: Yes?
>>: [inaudible].
>> Tom Roeder: Right.
>>: [inaudible].
>> Tom Roeder: Well, so, yeah.
I mean, for the particular -- I mean, you can work it out, right? But the numbers -- so you can see it takes, what would it be, like 250 seconds approximately -- a little bit less, 200 seconds -- for the encryption of the particular example that I had, generating the index. And then the add and delete are basically unnoticeable, right? Even with the Zipf distribution you have some thousands of files, but they only take, you know, three microseconds each on the server. So it's just not a noticeable amount of time to do those operations.

>>: [inaudible].
>> Tom Roeder: I mean, for add and delete you'd have to be up in the hundreds of millions of entries. At that point your network bandwidth is really what's going to kill you; sending the tokens across is going to be much, much more expensive than the actual computation. Yeah.
>>: Any other questions? Great. Let's thank all the speakers today again. [applause]