>> Melissa Chase: Today we're happy to have Hovav Shacham visiting us from UCSD, and he's going to tell us about outsourced storage and proofs of retrievability.
>> Hovav Shacham: Thanks. And so this is joint work with Brent Waters, who's now at
UT Austin. And the paper that I'll talk about mostly appeared at Asiacrypt just a little while ago, but I'm going to try to focus less on just what we have in the paper and more on explaining proofs of retrievability and why we think they're interesting, and kind of the motivation for what we did in the paper.
So the setting here is what seems to be a pretty common one, which is where clients want to outsource the storage of their files and put them somewhere -- I guess these days the buzz word is "in the clouds," but some storage provider is going to keep the files for them. And this really frankly makes a lot of economic sense because storing a file is easy. I can buy a hard disk. I can put my file on the hard disk. And my file is now on the hard disk. What is hard is retrieving the file later.
Actual long-term reliable digital storage is not really a solved problem by any means, and actually doing it well requires infrastructure that scales quite nicely, so you get economies of scale, and it makes sense for me not to worry about storing my terabyte of data on my own. I should give it to somebody else. But then that creates an economic situation where I need to be sure that the company that's storing the data for me really is actually storing the data -- that they haven't just taken my money and gone to Bermuda with it and left me high and dry.
And that's not really just the client's problem, it's also the server's problem. It's not just the people who are buying the storage, it's also the people who are selling the storage, because actually doing the storage right costs more than not doing it right. And so if the client has no information about how the files are being stored and whether they're being faithfully stored, then all she can use as information is the price signal, and of course the price signal means that the people who are not actually storing the file well are going to be cheaper -- you can argue this in terms of a market for lemons, where the good providers, the conscientious providers, will be driven out.
So what you really want from both the client's perspective and the honest server's perspective is a mechanism that allows clients to verify that their files are being stored. If
I can spot check, if I can efficiently check that my file is being stored, then I will know to continue to pay the somewhat higher prices that my storage provider is charging me.
And this is actually a problem that's interesting to real systems people. I guess maybe it's kind of a cynical observation, but in crypto, a lot of the times we start with a systems problem and then we abstract it away into some sort of crypto thing and then we solve it in a way that systems people don't actually care about.
And I'm as guilty of that as anybody else, but my evidence for stating that this is something that systems people care about is that you can go to systems papers and they've actually tried to come up with protocols that do this. In other words, this is something that systems people care enough about to try to solve on their own. And unfortunately, systems people are not cryptographers -- or maybe fortunately, because otherwise nothing would get built -- but that means that their protocols are suboptimal, and this is really where we try to improve.
So I'll give you just a few of these protocols, kind of to set the scene. I'm not actually endorsing any of these protocols, and some of them I actually have problems with, but this is more just to establish that, a, people care about this, and, b, this is the notation that we're going to use. So the idea is that we've got a prover, who corresponds to the server, and a verifier, who corresponds to the client, and the prover is storing the file M. The things in parentheses are kind of the knowledge that the two parties have.
And this is the protocol that the parties run.
So the simplest possible protocol is one in which the prover simply sends the file over to the client and the client checks that the file's correct. For example, by computing a hash of it. And I think it's pretty easy for us to convince ourselves that if the prover has sent over the file then the prover must know the file. And so this protocol really proves that the prover knows the file, which is what we want. So this is a good protocol. The down side, of course, is that the prover has to send over this ridiculous, potentially ridiculously long file and it's just not practical at all.
Now the first thing that you might try to do is you might try to say that the prover will just send over the hash of the file and the client will verify that the hash is correct. But that of course doesn't work. Why doesn't it work? Anybody?
>> Question: They'd just cache the hash.
>> Hovav Shacham: Yeah. So you just, instead of storing the file, store the hash of the file. Okay, so there's this paper from the USENIX Annual Technical Conference in 2007 where some folks at UT Austin came up with a protocol that's essentially a fix of the protocol that we just rejected, where instead of having the server send the hash of the file, it sends over the hash of the file together with some challenge. And the
advantage here is that we think that this is something that the server can't do without actually storing the whole file. The disadvantage is that the client now needs to store the file, so this is only appropriate in certain kinds of applications. But this particular paper is my example for where systems people really care about this.
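To make that fix concrete, here's a minimal sketch of the challenge-hash audit in Python; SHA-256 and the nonce size are illustrative choices, not details from the talk or from that paper:

    import hashlib, os

    def make_challenge():
        # a fresh random nonce per audit, so precomputed answers don't help
        return os.urandom(16)

    def prover_respond(challenge, file_bytes):
        # folding the challenge into the hash forces a full read of the file
        return hashlib.sha256(challenge + file_bytes).digest()

    def verifier_check(challenge, response, file_bytes):
        # the verifier recomputes the expected answer, which is why the
        # client must itself keep a copy of the file in this scheme
        return response == hashlib.sha256(challenge + file_bytes).digest()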
And you can get more sophisticated. This is a nice crypto protocol from a conference around 2003, where the server will prove that it knows M by computing g to the M modulo N, where N is an RSA modulus. And the idea here is that the server needs to compute g to the M by actually knowing M, because if it knew anything smaller than M that also computed the same thing, then the difference between those two would be a multiple of φ(N) and would therefore reveal the factorization, and therefore this is secure in some sense.
And the nice thing here is, of course, the client knows the factorization, so instead of storing the whole file it can actually just store M mod φ(N). So this is a really clever protocol. And I could list more protocols that all have kind of different properties and different tradeoffs. And the real question is how we can evaluate something like this.
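Before moving on, here's a toy Python version of that last protocol; the primes are tiny made-up values just to show the arithmetic, not anything a real deployment would use:

    # Client setup: secret primes, public modulus N, generator g.
    p, q = 1000003, 1000033           # toy primes, kept secret by the client
    N = p * q
    phi = (p - 1) * (q - 1)           # computable only with the factorization
    g = 2
    M = 123456789123456789123456789   # the file, read as one big integer

    # Server: proves knowledge of M; per the argument above, a smaller
    # exponent that agreed with M here would reveal a multiple of phi(N).
    proof = pow(g, M, N)

    # Client: stores only the short value M mod phi(N), never M itself.
    assert proof == pow(g, M % phi, N)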
And broadly speaking I'm going to claim that there are two classes of criteria. One is the systems criteria. And one is the crypto criteria. So on the systems side, we really care about how this protocol interacts with the rest of the system that we build and what sort of properties it enables. So of course we care about efficiency, and efficiency is in terms of the storage overhead on the server, meaning how much more than just the file does the server need to store; storage overhead on the client, meaning does the client actually need to store the file or maybe just a key; and computation on both the client and the server, as part of setting up the storage and as part of running the protocol itself -- and of course that also depends on the number of blocks that you read off disk.
If you have a terabyte file and you have to read the whole thing and do even very light processing on top of each block, that's tons of work. You really want to have just a few reads from disks because disks are slow and of course the communication between the parties is a parameter.
And then there's additional criteria that are maybe a little bit less obvious. So one of them is how many times you can run the protocol over and over. If there's only a few times that you can run before you run out of challenges or before the server can answer any challenge without storing the file any longer then that's less attractive than being able to run the protocol arbitrarily many times. You also don't want to keep state, because the moment you keep state on the client it becomes hard to distribute the verification because the different verifiers have to communicate between them and that gets us to
the last point, which is that really you'd like anybody to be able to verify. For example, in a public archive storing some sort of important document, it shouldn't just be the Library of Congress, which put it there, that can verify that it's there; any interested citizen should be able to run the protocol, and that means that you shouldn't need some sort of private key that only the file's owner possesses.
And you could go on like this, but this is kind of the flavor of the systems criteria.
>> Question: Is this verifier stateful?
>> Hovav Shacham: Yes. So the protocols that I showed you before were just a single interaction. Right? Between the -- now you might imagine that I want to check once a day that my file's being stored. And one thing that those diagrams didn't make clear is whether I need to update any kind of a state locally on the verifier side between runs of the protocol. For example, if I have a list of challenges and I can only use each challenge once, then I will need to remember which challenges I've used so that I don't reuse one because maybe it's trivial for the server to answer if I reuse a challenge. Whereas, if I can just randomly pick a challenge from some sort of space and send it over then that is kind of better because I won't have to coordinate.
So these are all the systems criteria. And they're pretty well known. And then the crypto criterion is maybe a little bit subtle -- or it's some kind of superposition of subtle and so obvious that why would you even say it -- which is that these protocols are only worthwhile if they prove that the server is storing the file. If the server can run this protocol and it's not storing the file, then what's the point of the protocol? So the only time that a prover passes the protocol should be if the prover is storing the file. And in crypto, the only way we know to say that the prover is storing the file is if we can pull the file out of the prover.
So that means that any adversary that successfully passes the verification with some non-negligible probability epsilon can be used in a black-box way to retrieve the file. In other words, the only way you pass the test is by having the file. And the way we know that you have the file is because we get it out of you. And this really looks a lot like the definition of a proof of knowledge.
And this insight is due, kind of semi-independently, to two of the three papers that we follow up on most closely, which are the Naor-Rothblum kind of theoretical view of related problems from FOCS, and then the Juels-Kaliski paper, which was one of the two papers at CCS last year that kind of spawned the modern interest in this problem. They both had this insight and they both gave security models; our security model tries to simplify their models and get at the essentials, so we think that the security model is right, but maybe not everybody agrees. And the idea here is you've got some sort of key generation routine that outputs a secret key, or maybe a key pair if we're going to have public keys, and that key is used for storing a file M. The output of the store routine is M*, which is the file encoded in whatever way you need to encode it. And that goes on a server. And then a short tag. And that tag is maybe just the name of the file, some way of disambiguating that this tag goes with this encoded file and some other tag goes with some other encoded file.
And once we've encoded the file, we can engage in the proof of storage protocol, which is just this funny protocol notation for the protocol like the ones we showed before, which is that the prover and the verifier both get to see the tag. The verifier additionally gets the secret key. And the prover gets the message, the encoded file. And they talk amongst themselves and at some point the verifier is either convinced or not convinced that the file is being stored.
And if we want public verifiability -- if we want the Library of Congress case, where any citizen can check that the file is being stored -- then key gen will have to output a key pair of public key and secret key, and the secret key will no longer be necessary for verification.
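As a sketch, that syntax might be written down as the following interface; the method names here are illustrative, not the paper's notation:

    class ProofOfRetrievability:
        # Syntax only; each concrete scheme fills these in differently.

        def keygen(self):
            # returns (pk, sk); pk is unused for privately verifiable schemes
            raise NotImplementedError

        def store(self, sk, file_bytes):
            # returns (encoded_file, tag): the encoded file M* goes to the
            # server, and the short tag disambiguates which file is meant
            raise NotImplementedError

        def prove(self, pk, tag, encoded_file, challenge):
            # run by the server on the stored, encoded file
            raise NotImplementedError

        def verify(self, key, tag, challenge, response):
            # key is sk for private verification, pk for public verification
            raise NotImplementedError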
>> Question: (Inaudible) -- the secret key?
>> Hovav Shacham: It depends on the scheme. In some cases no, but in the schemes that we care about, the answer is going to be yes. Okay?
So that is just the protocols that are involved. So this right now, this is the honest prover.
This is just what the honest prover would behave like in this protocol. And the claim is that there is no such thing as a dishonest prover that is not storing the file but can pass the verification check. And the way that we prove that is by playing a game with the adversary and showing that no adversary could win that game. And in this game the challenger generates some secret key, and then the adversary gets to interact with the challenger. It gets to pick files. It gets to have the files stored, and then it gets to learn all the information about them. And then it gets to interact with the challenger as the prover: the challenger, acting as verifier, talks to the adversary and afterwards tells it, hey, you passed that time, or hey, you failed.
And this is kind of just a set-up phase. And in the end the adversary outputs what it claims is a cheating prover. So it outputs a Turing machine description of a cheating prover P′ for a tag T that was one of the files that it stored -- files that it never stored, it doesn't make sense to talk about. So this is just one of the storage queries, and it claims that this prover will convince the verifier that it's storing the file, but actually won't be storing the file at all. The adversary claims that that's happening. We claim that's impossible because the scheme is secure, and the security guarantee that we get is that a scheme is secure exactly if there exists an extractor algorithm that, given the cheating prover, can pull out the file whenever the cheating prover is actually a successful prover.
So, in other words, any time the cheating prover convinces the verifier that it's storing the challenge file with some non-negligible probability epsilon, we can pull out the file. Does that make sense?
>> Question: (Inaudible) --
>> Hovav Shacham: Let's see. So they -- for one thing they allow state, because their solutions actually are stateful, whereas ours are stateless. So that's one place where our definition is just kind of a simplification. And if I remember correctly, they don't allow some of the oracle queries that we do. So -- I believe, and I should check because I'm not totally 100% sure, but I think they didn't allow this interaction; I'm not sure. If I remember correctly our definition is strictly stronger than theirs, other than this business with state. But you can prove all their schemes secure under our definition, and so in the end it's just kind of definitional nonsense.
>> Question: Going back to that slide.
>> Hovav Shacham: Yes.
>> Question: Maybe the previous one, but the challenger generates SK. So is it obvious that -- well, I mean, this is your security model (inaudible) -- is it obvious that to solve this problem in general it has to be a public key, secret key setup? Like, I mean, the first solution that you gave, where the challenger stores the list of challenges, which is cumbersome, you know, so that's not such a good solution -- but is it possible that there's a solution that isn't based on a public key, secret key pair?
>> Hovav Shacham: So I think that -- so I think the verifier needs some amount of information that is not known to the prover, because we want the verifier to keep less state than the file itself.
>> Question: That information you could call that --
>> Hovav Shacham: If there's some small amount of information that suffices on the verifier side in order to interact with the prover and be convinced that its responses are correct, then the prover -- if that were public then the prover could store that instead of the file, I think. I'm not 100% sure.
>> Question: But a verifier that's public is by definition stateless, right?
>> Hovav Shacham: A verifier that's --
>> Question: If I go to the Library of Congress and the individual --
>> Hovav Shacham: Right.
>> Question: -- I don't have any state.
>> Hovav Shacham: You don't have, right, yeah. But -- yeah. So you can certainly run the verifier with just a public key. Okay. So what I'm saying is that somebody somewhere has to have some secret knowledge that's not available to the prover, because if nobody anywhere in the system had secret knowledge not available to the prover, then the prover could, I think, pass the verification test knowing only what the verifier knows, which we want to be much smaller than the size of the file.
But you don't need the verifier to have a secret key. The secret key can be part of the setup phase and it can be maybe needed also in the extraction phase, but everybody else can just use a public key.
>> Question: And then also, what is this "protocol on T_i" interface?
>> Hovav Shacham: Yeah. So, okay. So this proof of storage protocol happens on some particular file. And the file's named by a tag, because there might be many files being stored by a single storer on a particular server, and so we need to disambiguate between them. In particular the adversary gets to store arbitrarily many files. I guess it stores them itself, but it can say, here is a big file in my 23rd file storage query; give me the tag that corresponds to this file and then the encoded version.
And then some point it can say, okay, so remember my 23rd file, that file had tag T23.
Now I want to see what it would look like if you challenged me on storing that file. And now it gets to interact with the challenger. It gets to play the role of prover for T23, the
challenger gets to play the role of verifier for T23 and in the end the challenger sends over, hey, you passed.
So I hope that makes sense. There's kind of state that's kept by the challenger that corresponds to each of these queries, and then this names one of the queries that happened before.
Okay. And one thing that we insist on that's a little bit unusual is that this should work for any epsilon whatsoever, so long as epsilon is non-negligible. And the schemes we come up with give you an extraction time that's polynomial in 1 over epsilon, essentially. And what that means is that if you have a server that passes the test one millionth of the time -- meaning that we can send in a bunch of verification queries and once in a million it gets lucky and actually sends us something that convinces us -- we still want to be able to pull out the file. Some of the other constructions really only give you security guarantees up to, say, epsilon equal to one-half.
Now, a lot of the time you can argue, well, in a systems setting, if the storage server that I gave my file to only answers a quarter of the time and is right, and the other three quarters of the time it either doesn't answer or it gives me the wrong answer, then I'm going to know something's up.
And that's probably true in most system settings, but we think that since we can actually construct a system that is secure up to arbitrary epsilon, it's better to give the strongest crypto guarantees that you can, and then, depending on what sort of actual properties the system that you build around this needs, it can do whatever it wants. It doesn't have to worry about whether the security guarantee that it provides breaks down as soon as it allows intermittent network connectivity where maybe only one-third of the messages go through.
On the other hand, in our actual construction, if you're willing to put a bound on epsilon, say you don't care about epsilon smaller than one in a thousand, then the parameters that we derive in our scheme actually get better. So you can decide how paranoid you are and build this scheme accordingly.
Okay. So all of this didn't actually propose a solution for how we're going to build a scheme that satisfies this definition. It just said, this is what we want to achieve. And getting to that solution is going to require kind of a fundamental insight. This is really the big insight in Naor-Rothblum and in their part of this story. And the point here is that, as I said before, one of the important systems criteria is that if you read a whole bunch of blocks in the file, you're going to be slow. If you've got a terabyte file and you
have to do even very little processing on every single block as part of this protocol interaction, that's no good.
So we really want, probabilistically, to check only a few blocks. We want to say, name 80 blocks at random in the file, and we want to read only those, because reading 80 blocks from disk is much faster than reading an entire gigabyte file or an entire terabyte file. So that's great. We want to do that. But the problem is that the moment we do that, if the server cheats just a tiny bit, we probably won't catch it.
So if you think about a file with a million blocks, where we query 80 blocks at random: suppose that the server just changed or deleted one of the blocks, and suppose we have a way, if we ask about that block, to tell that the server is cheating, because we can tell that the response is not what it should be. But if we don't ask about that block, we're none the wiser. So the probability that we'll actually be able to catch the server in this case is tiny. It's well under a percent -- well under a hundredth of a percent, I guess.
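The arithmetic behind that claim, as a quick check with the numbers from the talk (a million-block file, an 80-block query, one block deleted):

    n, c = 10**6, 80                 # blocks in the file, blocks per query
    p_catch = 1 - (1 - 1/n) ** c     # chance some queried block is the bad one
    print(p_catch)                   # ~8.0e-05: under a hundredth of a percent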
So that really is no good, the security guarantee that we get there is just really no good.
That --
>> Question: Like, just the economics would mean there would be very little incentive for a storage provider to delete such a small percentage, right?
>> Hovav Shacham: Well --
>> Question: Say -- I don't know if you can quantify this, but say they can gain something by deleting a quarter of the blocks; it doesn't need to be one in a million.
>> Hovav Shacham: In some cases you might need your whole file when you go back to get it.
>> Question: (Inaudible) -- want your whole file back, but I'm just saying, what incentive would a storer have to cheat in that manner, to delete only one of the million blocks?
>> Hovav Shacham: So, I mean -- so in part it could delete substantially more than one in a million and we still won't -- a query won't catch it, but --
>> Question: Yeah. Couldn't you just look at the economics of it in terms of the cost for them to store things and at what point they would actually want to have -- there would be any incentive for them to delete a small amount?
>> Hovav Shacham: So I think -- yeah, so that's one of the answers. If they don't know that they lost a bit of the file, then the fact is your file is gone, and they didn't actually mean for it to be gone.
>> Question: Expansion in the erasure code --
>> Hovav Shacham: Oh, oh, sorry, we'll get to that in a second. The other thing is that, yeah, it's true that an economically motivated adversary kind of probably won't bother erasing just a little bit of the file, because it doesn't really matter to it. But you might imagine a scenario in which the adversary is behaving more maliciously. And the nice thing is that cryptographically we can actually provide security even against Byzantine adversaries.
But, yeah, if all you're worried about is just erasure -- kind of, the server is going to forget a few of the blocks -- then you can give a solution that doesn't go through this. That is kind of the main idea in the Ateniese et al. paper, also at CCS last year, where they do proofs of data possession based on kind of this idea.
But if we imagine that we're worried about this kind of erasure, then we're not going to be able to catch the server when it erases just a little bit of the file. However, if the server erases a lot of the file, we absolutely will catch it: if it erases half the file, then the probability that a single random block that we pick is in the half that it knows is one-half, and if we pick 80 blocks independently at random, then the probability that all of them are in the half that it knows is one-half to the 80, and so the probability that at least one of the blocks is not is one minus that, which is basically 100%.
And so if we could somehow move from the world where we need to worry about a tiny bit of erasure into the world where we need to worry only about a very large amount of erasure, then we're better off. And the Naor-Rothblum observation is that we can just take the file and encode it using an erasure code, which is this magical thing that we get out of coding theory where any half of the encoded file (or some other parameter) suffices for recovering the file. Then we could test, and the moment the server passes the test, we know that it must be storing substantially more than half the encoded file, because otherwise it would never have passed the test, and that means that the amount it's storing suffices for us to run the erasure decoding.
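And the same quick check for the other regime: if the server holds only half the encoded file, 80 independent random queries all land in the known half only with probability one-half to the 80:

    p_miss = 0.5 ** 80               # every sampled block is one the server kept
    p_catch = 1 - p_miss             # indistinguishable from 1.0 in floating point
    print(p_catch)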
Now, to answer your question about rates: in the talk I'm simplifying by assuming that the rate is always one-half, so that imposes an expansion factor of 2x. However, you can play with the parameters -- first of all, that is total overkill, because this failure probability is just a vanishingly small number here. And you can also play with the parameter where you test more blocks; then this probability stays big even as the expansion factor becomes smaller.
>> Question: If you use (inaudible) coding in data, right?
>> Hovav Shacham: Uh-huh.
>> Question: And let's say sort of (inaudible), right, so less original simply require you to test for say about half of encoding (inaudible). Right? So what do you gain in terms of a server (inaudible) so as to work that's basically the same. Right?
>> Hovav Shacham: Um --
>> Question: (Inaudible) something like (inaudible), for example, like it will double your data and you need about half of the doubling, which is exactly what we started out with.
>> Hovav Shacham: So, but -- so, okay. The server clearly has to store a bunch more: if we're going to use a rate of one-half, now the server has to store two terabytes where before it only stored one.
>> Question: So in order for the proof of retrievability to work, what it has to store is basically equivalent to your original size.
>> Hovav Shacham: But you -- no, you can just test 80 blocks at random.
>> Question: You're saying that there are (inaudible) codes where the number of blocks you need in order to decode the data is less than the original number?
>> Hovav Shacham: In order to pull out the -- so I'm going to separate verification from extraction. In order to pull out the file, I will in fact have to read at least a terabyte of the file -- I mean, even just information-theoretically you couldn't hope to do better -- and then actually apply the erasure decoding procedure.
>> Question: (Inaudible) --
>> Hovav Shacham: The only thing that erasure codes buy me is that at verification time I can just check -- assuming, and I haven't explained how, that I had a way, when I saw a block, to tell that it was the right one -- then I could just test 80 blocks, see that the server responds with the right block for each of those 80 blocks, and I would be convinced that it must be storing at least half the file.
And then later, if I somehow got hold of its disk and took the file off, then I would have to sit there and run the Tornado decoding routine. And that would still take a long time.
>> Question: Going to ask a systems question. So if you add redundancy there, can you place it in a way so that it sits on separate spindles in storage, so that it also serves as the redundancy that I have to keep at the storage level --
>> Hovav Shacham: Well, so, okay, so...
>> Question: Like (inaudible) for instance. I'm keeping two copies anyway, so I don't want to keep four copies.
>> Hovav Shacham: Um, you may -- depending on the code that you use, you may be able to integrate this with any kind of encoding you actually have in your system. I haven't thought about that, but certainly if what you are doing is compatible, or can be made compatible, with this, then that would be great, right, because you are sort of already paying the cost.
So the answer is I'm not sure, but that would be very nice. So, putting together everything that we've kind of laboriously constructed, what we get is a trivial solution. The idea is that we're going to apply our erasure code, whatever it is, to the file. That's going to give us M*, and then we're going to authenticate each block of the file. We'll just have a MAC key that is going to be secret, that only we know. And along with every block m_i of the file, we're going to store σ_i, which will authenticate the pair (i, m_i). And that effectively is a proof of the statement "m_i is the block in the file at index i." And so now the prover has to store all these (m_i, σ_i) pairs.
So on top of the 2x expansion that I've just prescribed in the erasure code, now you also have to store all these authenticators. And then the verifier -- I mean, you couldn't ask for more. The verifier just stores the MAC key here. And it just says, okay, give me these 80 blocks with their σ_i's, and it gets those 80 blocks along with the σ_i's, and it just checks that the MACs verify for each one.
And the claim is that this is secure exactly because of the argument that we made before: you only pass if every single one of these verifies, so in order to correctly respond with all of these, the server must be storing at least half the file, which means that once we pull those blocks out of it, we can apply the decoding routine of the erasure code.
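Here's a minimal sketch of that trivial solution, with HMAC-SHA256 standing in for the MAC and toy sizes throughout; the erasure-coding step is assumed to have already produced the blocks:

    import hmac, hashlib, secrets, random

    key = secrets.token_bytes(32)                              # verifier's MAC key
    blocks = [secrets.token_bytes(64) for _ in range(1000)]    # encoded file M*

    def authenticator(i, block):
        # sigma_i binds the block to its index: "m_i is the block at index i"
        return hmac.new(key, i.to_bytes(8, "big") + block, hashlib.sha256).digest()

    server = [(m, authenticator(i, m)) for i, m in enumerate(blocks)]

    # One audit: name 80 random indexes, get (block, MAC) pairs back, check each.
    challenge = random.sample(range(len(blocks)), 80)
    response = [(i,) + server[i] for i in challenge]
    assert all(hmac.compare_digest(s, authenticator(i, m)) for i, m, s in response)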
Okay. So this is implicit in Naor-Rothblum. And we're going to do better, and the way we're going to do better is by reducing the communication complexity of the server's response. Because the server still had to send 80 blocks and 80 MACs, and that is kind of a bummer. So we think we can do better.
So the way that we're going to try to do better is -- let's just kind of wish the problem away. Instead of having the server send all of the m_i's and all of the σ_i's individually, let's assume that they come from some sort of group or field where addition is defined, and then let's just send the sums. Right? Instead of sending 80 blocks, it now sends the equivalent of just one block. And that's great. We saved a lot of space. But there's a problem here, which is?
>> Question: (Inaudible) --
>> Hovav Shacham: Yep. Yeah. So MACs are these kind of funny cryptographic objects. You can compute a MAC on an input, but now not only do we not have the MACs themselves, we have the sum of them; we also don't even have the inputs. So this kind of improvement that I claimed you could have isn't working. It turns out that if we build the authenticators so they're not just general MACs, but specific authenticators with a specific structure, then we can actually do this. And in fact this is what we think of as the big contribution in the third paper that I mentioned that we follow up on, which is the Ateniese et al. paper at CCS: you can do this nice RSA-based homomorphic authenticator where not only can you authenticate the sum of the m_i's using, in this case, the product rather than the sum of the σ_i's, you can even do more, and you can authenticate a linear combination.
So you can have these weights ν_i, and you can authenticate Σ_i ν_i·m_i using Π_i σ_i^{ν_i}. So that's really pretty cool, and in fact you can plug these authenticators into this protocol that we showed here and you can get rid of these question marks. You now have a verification procedure and you've saved communication, and that's great. Although there is a problem with this particular solution, which we'll get to in a second.
So having said all this, with 15 minutes to go, I'm now ready to tell you what we actually did in our paper, which is that we described two new kinds of homomorphic authenticators to go with the RSA ones that Ateniese et al. showed. One of them is based on BLS signatures, and is sort of a nicer version of the RSA ones that is just more space efficient. The other one is based on PRFs and gives you an even more extremely efficient system in the case where you're only worried about private verification.
And the second contribution we make, which we actually think is the more important one, is that for the first time we give a proof of security against a completely arbitrary adversary for the schemes that we described. Both of the papers at CCS made restrictions on what the adversary could do in order to get a proof. We don't make any restrictions whatsoever, as a result of which our proof is actually kind of painful and complicated, but
I'll give some of the insights in the proof in a bit.
So let me show you a little bit about our authenticators. This is the one based on PRFs. We're just assuming that we're working in some field K whose elements can be 80 bits long, so it can be GF(2^80), or Z_p for an appropriate prime -- whatever is convenient to work with. And we assume that we have a PRF that maps into that field. And now the key that you need to store to authenticate messages is going to be k for the PRF, and a field element α. And the authenticator for the statement that message m_i exists at index i is just σ_i = α·m_i + f_k(i).
That's it. And the aggregation routine for building the authenticator for a linear combination of messages is just taking the same linear combination of the authenticators, and it so happens, if you just work out the arithmetic, that this all verifies. And α and k are secret.
>> Question: Are the alpha and k consistent?
>> Hovav Shacham: Yes. Yeah. Sorry. This is a problem: this font came from TeX and this font came from my presentation software. But that's the same symbol. And you can just verify. What you check is: you are given μ, which is the linear combination of the m_i's, and you check that σ equals α times μ plus the sum of the ν_i times the appropriate PRF values f_k(i).
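A two-block numeric check of that identity, with toy values standing in for the PRF outputs:

    p = 101                          # toy field
    alpha = 7                        # the secret field element
    f1, f2 = 13, 42                  # stand-ins for f_k(1), f_k(2)
    m1, m2 = 5, 9                    # file blocks
    s1 = (alpha * m1 + f1) % p       # sigma_1
    s2 = (alpha * m2 + f2) % p       # sigma_2

    nu1, nu2 = 3, 8                  # query weights
    mu = (nu1 * m1 + nu2 * m2) % p           # aggregated blocks
    sigma = (nu1 * s1 + nu2 * s2) % p        # aggregated authenticators
    assert sigma == (alpha * mu + nu1 * f1 + nu2 * f2) % p   # verification check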
And clearly you can only do that check if you know α and k, so this only allows private verification. The BLS authenticator looks actually quite a bit like this. Your secret is going to be x, and the authenticator for block i is going to be σ_i = (H(i)·u^{m_i})^x, where H is the BLS hash. But the nice thing here is that the verification doesn't require knowing x; it just requires knowing g^x, which is what we call v down here.
And I apologize that the v's look a little bit like the ν's. This one requires 160-bit elements for all of the authenticators, so it is nicer than the RSA one -- not quite as compact as the PRF one -- but it allows public verification.
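For reference, writing the check out (a reconstruction consistent with what's described here, so take the exact form with a grain of salt): with v = g^x, μ = Σ_i ν_i·m_i, and the aggregated σ = Π_i σ_i^{ν_i}, the public verification is the pairing equation

    e(σ, g) = e( Π_i H(i)^{ν_i} · u^μ , v )

which holds because σ = (Π_i H(i)^{ν_i} · u^μ)^x.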
Sorry?
>> Question: You just run (inaudible) --
>> Hovav Shacham: Yeah, yeah. Sorry. This is basically using the proof of the BLS signature, so yeah, this only works with random oracles. We have some ways, in kind of limited settings, where you can remove random oracles, but then you need very large public parameters, so it's not very appealing.
>> Question: This version is a lot less efficient than the PRF version?
>> Hovav Shacham: I suspect, yeah, I suspect that it's going to be. For authentication it's not that bad, right? Because you're only computing exponentiations to known group elements and so you can build these aggressive look-up tables. So that's probably going to be fast and verification is only going to require one pairing for the whole operation, so it's going to be slower, but I suppose it's not too bad.
So, okay. So I said that you could take the Ateniese et al. authenticator based on RSA and plug it into that improved solution try 1 protocol that we talked about before, the one that had the sums, and you would get a scheme that would work -- not necessarily be secure, we'll talk about that in a second, but would work. You can similarly plug in either the PRF authenticator that I showed on the previous slide, or this one. They would also go into that protocol; they would work.
But I'm going to do something different. Instead of just computing the sums of the messages and the sums of the authenticators, I'm going to let the power that I just derived from these authenticators go to my head, and I'm going to compute these linear combinations. So I'm going to add the ν_i's for no reason whatsoever, except that I can, because, I don't know, it's just cool and I like to draw nus or something. And now the ν_i's have to come from somewhere, so unfortunately we have to increase the size of the query by including the ν_i's along with the indexes that we're querying about.
And again, with this you could plug in any of the three authenticators that we have, and the scheme would still work. But it does more than work -- it is actually secure, and this is the scheme that we proved is secure.
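Putting the pieces together, here is a runnable sketch of this final, privately verifiable scheme with the PRF authenticators; HMAC stands in for the PRF, the field is a Mersenne prime, and all sizes are toys:

    import hmac, hashlib, secrets, random

    P = 2**89 - 1                            # a Mersenne prime; our field K
    k = secrets.token_bytes(32)              # PRF key (secret)
    alpha = secrets.randbelow(P)             # field element (secret)

    def f(i):
        # PRF into the field, instantiated with HMAC for illustration
        d = hmac.new(k, i.to_bytes(8, "big"), hashlib.sha256).digest()
        return int.from_bytes(d, "big") % P

    m = [secrets.randbelow(P) for _ in range(1000)]            # blocks of M*
    sigma = [(alpha * m[i] + f(i)) % P for i in range(1000)]   # authenticators

    # Verifier: 80 random indexes, each with a random weight nu_i.
    query = [(i, secrets.randbelow(P)) for i in random.sample(range(1000), 80)]

    # Prover: one block-sized mu and one aggregated sigma, not 80 of each.
    mu = sum(nu * m[i] for i, nu in query) % P
    sg = sum(nu * sigma[i] for i, nu in query) % P

    # Verifier: this check needs alpha and k, hence private verification.
    assert sg == (alpha * mu + sum(nu * f(i) for i, nu in query)) % P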
And the proof of security proceeds in three phases, and we think it's nicely modularized, so we kind of like this proof -- I don't know if anybody else does. The first part is the only part that uses cryptographic techniques, and in fact, for the three different constructions that we have based on the three different authenticators, only the first part changes. The other parts are shared amongst all the constructions.
So the first part -- I'm glad that Brent isn't here, because he hates this word -- I call the straightening, in the sense of a straitjacket. And the idea is that the crypto, the authenticator, means that the adversary simply can't cheat by sending us -- remember, it doesn't actually send us the blocks, it sends us this (μ, σ) pair, and all we know about it is that it passes the verification check; what we hope, but don't know, is that it was in fact computed like this. So the straightening step says that if the authenticators are secure, then the only way the adversary can convince us is in fact by sending over μ's that are computed properly.
So once we've demonstrated that, we no longer need to worry about any of the authentication business, and we can just talk about an adversary that we query on different linear combinations, and that either responds by giving us the sum Σ ν_i·m_i, or refuses -- because the case where it tries to cheat us is ruled out. And this is the part that uses the crypto.
And once we've done that, we're done with the crypto, and we can talk about combinatorics. This is the extraction phase, where we argue that against an adversary that behaves like that and responds properly at least an epsilon fraction of the time, we can always extract half the blocks with all but negligible probability.
This is the hardest part of the proof.
And then the third part of the proof is trivial. Once we've extracted half the blocks, we can apply the erasure decoding routine, and out comes the real file. So I won't talk about the proofs for parts one or three; I'm just going to focus on part two.
And as a warm-up, let's give the part two proof for the simple solution. Remember, the simple solution was the one that didn't use any homomorphic anything; we just asked for the blocks back and we got those blocks. So the way this is going to work is that the extractor is going to keep track of the knowledge that it has. As soon as it asks about block 70 and it gets block 70 back, it knows that that's the right block, and that means that it now knows block 70. And that's one of the blocks in the knowledge that it is building up towards knowing half the blocks.
So this is just what it's going to keep doing: it's going to keep coming up with a random set of indexes I and asking the cheating prover to operate on I. The cheating prover, with probability epsilon, gives a response that says, here are the blocks that correspond to this index set; in that case we know that the response is okay, and we can add those indexes to the set of indexes that we know about.
Sorry. You had a question?
>> Question: No.
>> Hovav Shacham: And we can just keep doing this until the number of indexes that we know about is more than half the number of indexes, at which point we have enough information to apply the decoding routine.
So I claim that this extractor works. And the way it works is, again: the response is not some sort of special symbol -- the response is helpful to us -- with probability epsilon, because that's the definition of the cheating prover. And whenever the response is helpful to us, we can start talking about these 80 indexes that we got.
Are they indexes that we already know about?
Now, the indexes that we know about are in the set S. So whenever the indexes that we queried about are not all in the set S, S increases. So every one of these steps increases our knowledge by at least 1, and we just need to get to N over 2 and we're done. But the claim is that if S is small, then the case where I is already in S -- in other words, where these indexes are all things that we already knew, so they don't help us -- is vanishingly uncommon. And the reason is that there are fewer than half the indexes in S, and I was chosen completely at random, so the probability that this holds is just one-half to the 80 or less, and so this case doesn't happen.
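That warm-up extractor, sketched against a stand-in prover that answers only an epsilon fraction of queries; everything here is illustrative:

    import random

    n, c, eps = 1000, 80, 0.001             # blocks, query size, success rate
    true_blocks = list(range(n))            # toy stand-ins for the file blocks

    def cheating_prover(index_set):
        # answers correctly with probability eps, otherwise refuses
        # (in the real scheme, authenticators let us reject wrong answers)
        if random.random() < eps:
            return [true_blocks[i] for i in index_set]
        return None

    known = {}                              # S: the extractor's growing knowledge
    while len(known) <= n // 2:
        I = random.sample(range(n), c)      # fresh random index set each time
        resp = cheating_prover(I)
        if resp is not None:                # verified OK: all c blocks learned
            known.update(zip(I, resp))

    print(len(known), "of", n, "blocks extracted; enough to erasure-decode")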
Does that make sense? Everybody's with me -- okay. So what we need to do now is expand this analysis to the improved solutions that use homomorphism, where the responses that we get from the cheating prover are not necessarily immediately helpful to us in this way. So, okay, I'm going to need a tiny bit of notation. The notation's going to simplify a few things.
So the query Q was a set of indexes and, for each index, a corresponding weight ν_i. Instead of that, we're going to think of the query as just a vector: Q is Σ ν_i·u_i, where the u_i are just the usual basis vectors for the space K^n. And if we think about it this way, then the response is going to be simply Q·M, where M is the vector of block values. And then there's also the sigma business, but that's taken care of in the straightening step.
Okay. So all I've done is recast what the cheating prover sends us in this more convenient-to-work-with way. Okay. So let's try to extract from that first try of the improved solution, the one that didn't go mad with power, that didn't have the ν_i's, that only had the sum of the messages.
And so what we can do here is basically recast the extractor from the warm-up into this vector notation. What we'll learn is a subspace, which I'm going to denote by D; that's a subspace of the whole message space K^n. And we'll just keep going until the number of dimensions in the subspace is more than N over 2. Then we will have learned more than N over 2 of the dimensions.
So we'll choose a random query Q of the appropriate shape. We'll run the cheating prover on that query. If the response is not ⊥ -- if it doesn't refuse -- then we increase the subspace by also having learned this vector: this vector is now in the subspace that is our knowledge. And we can give a variant of the same analysis to prove that we can grow our knowledge so that we learn more than N over 2 dimensions in the original message space.
And that's all fine and good, except that learning more than N over 2 dimensions of our message space actually doesn't help us one bit to extract. And not only is the proof not going to go through, the scheme is not secure, and no proof can go through. And this is the attack that you can come up with. The idea is that the attacker is going to forget about one index out of all the indexes. So it's going to forget index i*, and it simply will not store m_{i*}; it will simply forget it and not have it in its memory. And instead of storing m_i for each i not equal to i*, it will store at each index either m_i + m_{i*} or m_i - m_{i*}. Okay.
So none of the blocks that it will actually store will be the correct block value. They'll all be perturbed, either plus or minus this thing that it forgot. Okay.
So now, why did it do this? My claim is that for a non-negligible fraction of the time, it will actually still be able to answer queries correctly. The queries, in this improved solution try 1 case, are just a set of indexes over which it needs to compute the sum of the m_i's. Now, it doesn't know the m_i's, but it knows these perturbed m_i's, and so it can just compute that sum. And that sum is going to be Σ (m_i + a_i·m_{i*}), which is the correct answer plus m_{i*} times the sum of the a_i's. But the a_i's are plus or minus one, and each of them is independently random, so it will compute the right answer provided that the sum of the a_i's equals 0. And that's basically the central binomial coefficient, so it's actually pretty common: for the parameters that we care about, these 80-block queries, it happens almost 9% of the time.
So in other words, this is a case where, for epsilon equal to, say, one-tenth, the adversary can actually convince us that it knows the file. And yet the adversary simply does not know the file, because we can express its knowledge as a dimension N-1 subspace -- this is why I said the dimension N over 2 subspace is useless -- a dimension N-1 subspace where, for each of the blocks other than m_{i*}, it knows that block plus or minus the thing it forgot. And the fact is that you simply can't extract any of the standard basis vectors of the full space out of this knowledge. It's not there.
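The ~9% figure is just the central binomial coefficient, which is a one-line check:

    from math import comb
    p_pass = comb(80, 40) / 2**80    # chance that 80 random +/-1 signs sum to zero
    print(p_pass)                    # ~0.0889, the "almost 9%" from the talk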
>> Question: (Inaudible) -- which sign it took in each place?
>> Hovav Shacham: Um, it sort of doesn't matter, because it will also store the authenticators for those blocks, and so we'll be able to reject when it gives us the wrong answer. So it just depends on whether it's worried about being detected. It has two choices. One is it can store the signs, and then it knows when the answer it gives is going to be wrong, and it just doesn't bother responding those times. The other is it can not store them, and then it can respond, and we'll throw away that answer anyway because we will know it's wrong. So it sort of depends on what the usage model is.
But the point is that against this adversary there is no extractor possible. There is no way I can learn the value of even one block against this adversary, and yet it can actually succeed in this verification protocol almost a tenth of the time. In fact, if you go to some of the previous papers, their performance measurements are based on exactly this protocol that we just showed is insecure.
And this explains why -- I mean, besides the fact that I got mad with linear-combination power -- we went to this improved solution try 2, where we had the linear coefficients, these weights ν_i: because this first try was hopeless. We really needed them to make the security go through, and with them we can make the security go through.
So now that we've kind of argued that having a subspace of knowledge, where we know the dimension of that space and we make that dimension large, is hopeless -- because we can get up to dimension N-1 and it doesn't buy us anything -- we need another measure, which is: how many blocks can we actually pull out? In other words, in a subspace D, how many of the usual basis vectors can we learn from that subspace? We'll call that the number of free dimensions. And once the free dimension is at N over 2, we can clearly pull out half the file, because we can just take our knowledge, do the linear algebra, and pull out those blocks. And now we have N over 2 of the blocks.
So, and again -- sorry. This analysis simply will not work on the first try, because you can have this adversary where the free dimension of the subspace that you learn is never more than 0. Okay. So the revised analysis needs to handle a special case. There are now not two cases, which is what we kind of had before, but three.
So the first case is that Q is not in our knowledge, and in this case our knowledge increases and we're happy. The second is that the set of indexes I that we query about -- forget the linear coefficients -- is a set of indexes for which we already know all the blocks, and that doesn't help us; but this, again, we can argue happens only with negligible probability when the number of free dimensions is small, is less than half.
And so the third case that we need to care about -- and this is exactly the case that trips up the analysis for the other improved solution -- is when the query is in the subspace that we know, but the indexes are not indexes that we actually know the corresponding file blocks for.
And here is where we need a little bit more analysis, and the claim is, effectively, that if the space spanned by the basis vectors of the index set I intersects D in fewer than 80 dimensions, then effectively you're drawing from at most 79 dimensions inside an 80-dimensional space, and the probability that you get Q in D ∩ span(I) is at most 1 over the size of K, where K is the space that the ν_i parameters are drawn from. And so long as the ν_i's come from an exponentially large space, this case doesn't happen, and this is exactly what we need to make the analysis go through. And you can view the first try at the improved solution, where the coefficients were always one, as drawing from a much smaller space here, which is why the analysis didn't go through.
Okay. Let me see -- I guess I have a bit of time, so I can explain this a little more. In other words, in this particular query that we're thinking about, we've got 80 indexes that are named, and each of those has a weight. So we can take those 80 indexes and use the weights and build a vector; that's this query Q. And this query Q is clearly a random vector in the space spanned by the basis vectors of I.
And what we're worried about is that Q is in D. In other words, we know, say, m_1 + 3·m_3 even though we don't know m_1 and m_3. That's the case that we care about. And we're going to rule that out.
And the way we're going to rule it out is this: if we don't know the individual blocks in the set I, then the dimension of our knowledge restricted to those indexes had better be less than 80, because if it actually were 80, then we would know all the blocks, which would contradict the assumption that we had before. And that means the problem we can think of here is: what's the probability that if we pick a random vector from an 80-dimensional space, it will fall in a subspace of dimension at most 79? And so long as each entry in the vector comes from a sufficiently large collection -- which is the size of this set K -- that probability is small.
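A Monte Carlo sanity check of that bound in the smallest interesting case: a random vector in K^2 lands in a fixed 1-dimensional subspace with probability about 1/|K| (toy field size, for illustration):

    import random

    K = 257                          # toy field size
    trials = 200_000
    hits = 0
    for _ in range(trials):
        v = (random.randrange(K), random.randrange(K))
        if v[0] == v[1]:             # v lies on the line spanned by (1, 1)
            hits += 1
    print(hits / trials, "vs 1/K =", 1 / K)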
So, wrapping up: what we gave in the paper, and what I tried to talk about in this talk, is that we have homomorphic authenticators from PRFs and from BLS, and also, via Ateniese et al., from RSA. If we take those authenticators and we plug them into this improved solution try number 2 protocol that had the weights, then we get a proof of retrievability scheme that actually comes with a complete proof of security -- and security proofs actually are important, because that try number 1 scheme is insecure.
And the nice thing about this scheme is that we get a compact response -- and, in the random oracle model, a compact query as well -- and really the only downside is that otherwise the query is kind of large, because we had to send all the indexes and all the weights that correspond to all those indexes. But there's actually a paper that will appear at TCC this year, by (inaudible) and Daniel Wichs, that solves that.
So with that, I guess I'm over time, so I'll stop and take any questions you have.
(applause)
>> Question: So not to play devil's advocate here, but so you gave an example of how you could fail if you tried to use your try number one --
>> Hovav Shacham: Uh-huh.
>> Question: -- scheme. Let's say the BLS authenticators with the combined response just the sum. But there again, you could argue, just like my first objection in the beginning, that there would be very little incentive for a storer to just forget one block, and in this case it would actually have to do it intentionally in order to still succeed with some probability, right? It would have to add it or subtract it to all the other blocks.
>> Hovav Shacham: Right.
>> Question: So just purely from a practical point of view, I'm just wondering how serious the failure of the proof is, you know, using this counterexample -- whether in practice you might not want to use that scheme anyway, given that, for example, you don't have to send those ν_i's for the query, so there's a little bit less communication and things like that. How serious is that flaw in the proof?
>> Hovav Shacham: So -- so for the BLS case it's a little bit nicer, because if you're already in the random oracle model -- which you are, because you need to model the hash that way -- then the query can actually be just an 80-bit string that you send into the random oracle, and out comes what would have been sent by the verifier. So the query is actually just 80 bits in this case.
I agree that the attack that I just described is not an attack that would be carried out by an adversary that was concerned about saving storage, rather than just being mean for meanness' sake. And I think if that's what you care about, then that scheme might be a good choice, provided that it came attached with a proof of security in some model that captured that intuition. What I'm more worried about is that in some of the literature that we were kind of following up on, there were schemes claiming to be secure in models that were kind of as general as ours that were in fact not secure. I mean, like this one.
All right. So this scheme simply can't be proved secure in the most general model that we described. You might be able to prove it secure in a more restricted adversarial setting, and that might be all you care about, and that might be fine. But we think that if you're going to make the claim that you have provided a secure proof-of-retrievability scheme in the Juels-Kaliski definition, or equivalently in our definition, then that scheme had better really be secure against arbitrary adversarial behavior.
>> Question: Do you think you might be able to concoct a worse, you know, behavior of an adversary, where they could still succeed? So say they want to, you know, forget a tenth of the blocks or something like that, and then your algebra --
>> Hovav Shacham: It's an open question. We kind of tried to think about this, and we don't have a better attack, but we also don't have a proof that there isn't one. I mean, this parameter that we derived, it's like 8.9%; it's not really clear why that should be the optimal cheating probability. So that might be an interesting question to look at.
>> Question: I also wondered whether the introduction of your ν_i's, you know, your multipliers, whether that could be motivated by some kind of network coding application? I don't see the advantage immediately of trying to use network coding here, because with just one storer and one retriever, you know, you're going to need a certain minimum amount of information to reconstruct the file anyway. But I wonder, if you stick it in a system with maybe multiple storers -- like, even from a systems angle, if you were retrieving some blocks from some servers and other blocks from other servers.
>> Hovav Shacham: It is possible.
>> Question: Dispersal, yeah -- but then network coding might give you the advantage of not having to specifically retrieve certain blocks from certain places.
>> Hovav Shacham: I think it's a good question. We kind of tried to look at this, and we came to the same conclusion you just did: we didn't see any natural way to tie this in. But there is a paper by Brent together with, I believe, Jonathan Katz, and Dan and Mike Hamburg, where they use very, very similar authenticators to do network coding.
So it seems likely that if there were some sort of reason to combine this that these kind of two things would dovetail quite naturally.
>> Question: (Inaudible) -- your data and so (inaudible) change it, so sort of a way that you know of to sort of (inaudible) --
>> Hovav Shacham: Yeah. So --
>> Question: (Inaudible) --
>> Hovav Shacham: So it is sort of an inherent problem, I think, because the whole erasure-code promise is that the adversary would have to change a lot of the file in order to cause your decoding -- I mean, once you MAC things -- in order to cause your decoding to come up with even a small change in its value. So in a sense, in a scheme that has erasure coding, it seems to me -- and I might be missing something, because erasure codes are not really my specialty -- but it seems to me that you can't possibly have extremely efficient updates; in a sense you kind of have to touch a lot of the file and you can't help it.
Now, there might be a way where, for specific kinds of updates that you wanted to do, you might be able to have that interact with a particular erasure code to limit the amount of work that you have to do. It might still be a lot, but not quite as much as generically. The worry is that if you only affect a small amount of the file, then the server might be able to roll back to the previous version without being detected.
The -- I don't know that anybody has really made any progress on taking the definition that looks like the full definition that we gave and integrating dynamic updates into it.
People have been mostly working on kind of the weaker definition, for provable data possession, which is kind of the (inaudible) line of work, and there you don't require any erasure codes. And there are people who actually have some very nice constructions that do dynamic updates, tracking pretty sophisticated data structures on the server.
>> Question: Okay. So security definition is the same. The difference is that in your case (inaudible) case the extractor is required to sort of -- with the extractor algorithm it's using your (inaudible) in order to extract, but the security definition is --
>> Hovav Shacham: Are you saying to the (inaudible)
>> Question: So the (inaudible) wasn't formal, but since you formalized the solution
(inaudible) --
>> Hovav Shacham: I think -- I'm not sure that I agree with that. I think the Ateniese et al. -- I mean, those guys seem to talk a lot about only wanting to prove that at least 70% of the file is still there. So they don't -- they never will have an extractor for 100% of the file, because they don't actually provide that guarantee, because you would kind of need erasure codes to provide it.
>> Question: Dealing with problems (inaudible), then (inaudible) --
>> Hovav Shacham: Yes. Okay. Yes. Yeah, okay, so I was kind of implicitly already in the mindset of probabilistic sampling.
>> Question: (Inaudible) definition only (inaudible) same definition in the case if
(inaudible) talking about the example (inaudible) --
>> Hovav Shacham: I think that makes sense, yeah. But somehow the updates work has seemed to focus on doing that.
I mean, I think if it were somehow possible to integrate erasure codes with some kind of specific semi-efficient updating, that would be really cool. But we tried to think about that, and we don't really have any good solutions.
All right. Thanks again.
(applause)