>> Seny Kamara: All right. So it's a pleasure to have Marina Blanton with us today. Marina's from the University of Notre Dame, and she's done a lot of work in security and cryptography on key management, biometrics, and cloud storage. And I guess today she'll be speaking about biometrics.
>> Marina Blanton: All right. Good afternoon, everyone. I appreciate the opportunity to visit Microsoft Research and speak here today. As Seny said, the talk that I'll be presenting today deals with secure computation of biometric operations and with secure outsourcing of the same computation. It combines a couple of results that I have, and perhaps it's ambitious, but this is a good audience to be talking to, so I'd like to cover a couple of different topics. As I said, I'll be talking about secure computation of biometric comparisons, which involves a two-party computation. Then I will proceed to secure outsourcing of the same type of computation, to one server and to multiple servers. And the last part that I'd like to mention is verification of the result that comes back from outsourced computation. Most of what I'm going to be talking about deals with iris codes -- comparing two iris codes -- but the last part, the verification, is applicable to different types of biometrics. Okay. But most of it is for iris codes. Okay. So let's start. In general, when we talk about biometric data, we understand that it is sensitive. Right? And unlike a lot of other types of information, this is not something that we can just invalidate and replace with a new biometric. Right? It's sensitive, and it's also worse because it can't be changed to something else; if it's compromised, we have limited options. So for that reason we want to protect this information when it's used throughout a computation, and there are a lot of different scenarios where protection might be required. In particular, we might have, for instance, two databases, right, and by law, or rather by regulations, we're actually not allowed to see each other's database, but it's advantageous for us to compute whether there's some biometric that appears in both of them. It also could be that my computation is so large in scope that I can't do it on one machine, or on the computers that I have available, and I have to outsource it; and the same thing happens here, the data is sensitive enough that I would like to be able to protect it. And even if it's not for outsourcing reasons -- if something runs on machines that I own but that are not trustworthy -- we would like to be able to protect that information from break-ins or malware or some other malicious behavior, and you can apply these techniques there as well. Okay. And the main problem with biometric data is that it's noisy. Right? When you take a sample and then take another one, the two samples are going to be different, even two samples that correspond to the same person, that is, related biometrics. And often in cryptography, when you apply techniques such as hash functions, for example, by design they diffuse the differences so much that, looking at two outputs of very closely related inputs, you can't tell if they're related or not. But in this case we actually want the opposite, right -- we want related biometrics to remain similar -- and so we have to use some other techniques. Okay. All right. Let's see.
So there are a couple of different scenarios that we'll be referring to throughout this talk, and the first deals with secure multiparty computation. I'll be talking about two-party computation, and it could be that, say, one party has the database -- maybe the government, right -- and there's a private agency that's processing and investigating some people whose samples it has, but it's actually not allowed to just get access to the federal database. So in this case the agency would like to see if a biometric appears on a suspect list somewhere, but it can't access the database, and it's also not necessarily willing to give away all of the data that it has. So we would like to answer the question of whether my biometric appears in a database that a different party has. Right? So we basically would like to compare our biometric with each biometric that's stored in the database and say yes, there's a match, or no, there's no match, right, no relationship between the two. And the second scenario that actually motivates this work is outsourcing, and this is something that happens close to us, because in our department we have a large, strong biometrics group, and basically what they do is collect a lot of biometrics from students and other people actually willing to help them with this collection, and the database is very large. Right now it's about 100,000 different biometrics that they store. And when they test a new recognition algorithm, what happens is they actually take each biometric in their database and compare it to every other one. Okay? So the computation is very large in scope. They can't run it on one machine -- they run out of memory, they don't have enough computing power -- so they send it to a grid. And right now it all happens in house, but if we had techniques that protect the data as it's used throughout the computation, then you could use external resources basically anywhere they're available. So these are two different scenarios that are relevant to secure computation over biometric data. So let's look at the specific scenario. From now on I'll be looking more at iris codes. Basically this is the pattern that they construct: they take a picture of an eye and extract a pattern that's unique to a person. And it actually has a lot of information, much more information than you can get from fingerprints or some other types of biometrics like faces or palm prints. Um-hmm?
>>: By information, you mean it's more --
>> Marina Blanton: There's more entropy, right. So they say that there are 250 degrees of freedom in the data that they extract, so basically 250 bits that you can get. And normally you can't get as much from fingerprints. Okay. So we have two parties, Alice and Bob. Alice will have a database of images -- it can be large, and we'll say this is D -- and Bob has a biometric image, and Bob would like to know whether this image appears in the database or not, that is, whether there's another one that's related. Okay. So after taking a picture, all of them just take a picture and extract features, and in this case you can actually represent each iris code as a binary string [inaudible], and this is independent of the database. You can do this for any biometric.
And so in this case, when they engage in the computation, for each biometric in the database Bob will engage in a computation with Alice: they compare Bob's biometric with the first biometric in the database, the second, et cetera. Okay. And for each computation they learn a yes/no answer -- basically yes, they're related, or no, they're far enough apart that they're not related. And you can specify who learns the bit. In our case, we say, for instance, Bob learns the bit for each computation. And then normally there will be one match or there will be no matches. Okay? So they will be able to tell if the person appeared in the database or not. Okay. So let's take a closer look at how two iris codes are compared. As I said, each iris code is represented as a binary string, and in general, to compute a distance between two binary codes, you compute a Hamming distance. Right? You have one code and another one, and you count how many bits are different. Okay? And this will be the distance. So the more bits are different, the larger the distance between them is, and basically, when the distance between two iris codes is within a certain threshold, they are considered to be related. Okay. But it gets a little bit more complicated, because with these codes, when you extract features, you have this binary string, but there's quantization involved in producing those bits, and some of them are unreliable, meaning we had a hard time deciding whether we should set them to zero or one. For that reason there is a second binary string of exactly the same length, called the mask, which says that certain bits should be marked as unreliable and we don't want to use them in the computation. Okay. So there's one string that's the code itself and another one that says: if the mask bit is set to 1, we're going to use this bit of the iris code in the computation; if it's set to 0, ignore it, it's not reliable, we can't make a good decision based on those bits. Okay. So for that reason the distance between two iris codes, in this case I'll say X and Y, will be a little bit more complicated. This is actually the formulation that the people who do biometric recognition use, right, and it's also what's more intuitive; this is what I would do. So here this says the Hamming distance between the two strings: you XOR them, so where the bits are different you're going to have 1, where the bits are the same you're going to have 0, and basically you count how many 1s you got. Okay. But in addition to computing this, you also want to make sure that only the bits that are set to 1 in both masks are used. Okay. So this basically says: for each bit of the iris codes, you first XOR them, but the result is added to the distance only if the same bit is set to 1 in both of the masks. Okay. So this is the distance itself, but then you also want to scale it, because now we're not using all of the bits of the iris codes. Right? You want to make sure that this value is comparable, since it depends on how many bits were used. Right? So we also normalize it by dividing by the number of bits that are reliable in both of the iris codes.
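A minimal plaintext sketch of this masked, normalized distance, with toy 8-bit codes standing in for real iris codes (which are on the order of 2,000 bits):

```python
# Sketch of the masked, normalized Hamming distance just described:
#   dist(X, Y) = ||(X XOR Y) AND maskX AND maskY|| / ||maskX AND maskY||
def masked_distance(x, y, mx, my):
    used = [u & v for u, v in zip(mx, my)]           # bits reliable in BOTH masks
    diff = sum((a ^ b) & u for a, b, u in zip(x, y, used))
    return diff / sum(used)                          # a value between 0 and 1

x  = [1, 0, 1, 1, 0, 0, 1, 0]                        # toy iris code
y  = [1, 1, 0, 1, 0, 1, 1, 0]
mx = [1, 1, 1, 0, 1, 1, 1, 1]                        # toy masks
my = [1, 1, 1, 1, 0, 1, 1, 1]
print(masked_distance(x, y, mx, my))                 # 3 differing bits / 6 used = 0.5
```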
So we compute this Hamming distance using only the bits that are reliable and divide it by the number of reliable bits in both. Okay. So this distance will be between 0 and 1.
>>: Equivalent to a ternary system, where a bit is plus 1 or minus 1, or, God only knows, it is 0.
>> Marina Blanton: So for -- you mean for the iris code itself?
>>: Yeah. You have plus-1 bits, minus-1 bits, which are hard, and 0 basically means you have no idea. And then it just becomes [inaudible].
>> Marina Blanton: Yeah, so in this case they use only -- you know, there's no minus 1. Right? Okay. And so after you've computed the distance in the way that I've specified, we want to compare it to a certain threshold. This T is fixed. It could be, you know, 25 percent, or 30 percent, and then we consider them related. So what we do in this case is compare the distance to the threshold, and the answer is yes or no: they're related or they're not related. But in practice, when this computation is used, there's an additional complication, and basically it is that when they take a picture of the iris, I can tilt my head a little bit, and so it means that when I get one iris and another, I can't necessarily compare them directly, and for that reason they apply a rotation a couple of times to one of the biometrics. Okay. So in this case you want to compensate for small differences in the head tilt. And for that reason you choose a particular small constant C, which says we're going to shift it to the left, say, five times using some unit and then to the right, and so the computation becomes [inaudible]. So in this case I'll use left shift and right shift. This is a circular shift, and you apply it i times. So one biometric stays unmodified, and the second one is shifted to the left a few times, right, from 1 up to C, and then it also gets shifted to the right, and we want to make sure that we're finding the best match, okay, the best alignment, and, based on that, we'll compare to a threshold. Okay?
>>: [Inaudible]
>> Marina Blanton: Yes. They're shifted together. Yeah. Consistently.
>>: Does the iris not rotate without the head rotating [inaudible]?
>> Marina Blanton: It -- so, no. No, no. This is how it's extracted. Basically they take a picture of your eye, then cut out the part that corresponds to the iris and unwind it, like this. So it becomes a rectangle that represents your iris. Okay. All right. So let's look at secure execution of this operation; we want to compare two iris strings. What I'll be talking about right now is two-party computation. In general this [inaudible] is going to proceed on encrypted data, and we're going to use homomorphic encryption. Intuitively, the distance itself will be computed under homomorphic encryption, where encrypted data is submitted to one party and the second party will compute the distance, and then they will engage in a computation together to compare it to the threshold. Okay. So if we want to use encryption, we know that the values that we can encrypt have to be integers. But we said that the distance we obtain is a value between 0 and 1, so how can we handle this? The answer to this particular question is pretty simple: we will quantize the possibilities. We know that there are only certain values the distances we compute can take, and we get rid of the division in order to scale everything up.
Say we will allocate a certain number of bits to represent -- you know, to give us some granularity going from 0 to 1 -- so basically all of the computation will be lifted up by that number of bits, and the threshold is going to be an integer value in this case. Notice that there is a division. In general, division is very difficult -- well, not necessarily difficult, but it's expensive if we want to implement it directly. But fortunately in this case we can get rid of the division and not use it at all. Okay. So we restructure the computation in such a way that we don't have to do division at all. And the way it happens is, if you notice, this is the distance, how it's computed, and then this value gets compared to a threshold. Okay? So what we're going to do instead is take the dividend here and compare it to the threshold multiplied by the divisor. Okay. We'll just move it up right there, and we can compare. But this is actually not everything. If you look at this computation, this distance is computed multiple times over different versions of the same biometric and then compared to the threshold, and you actually take the minimum -- you want to know if the minimum is below the threshold. But now the divisor here is different from the divisor here, and basically all of them are different. You can't just take one of them out and multiply the threshold by it so that all of them are correct. Okay? But fortunately we can still avoid the division and use this restructuring; we're just not going to do the minimum computation anymore. What we do instead is replace this minimum with a Boolean OR. Okay. So the idea in this case is that we take the dividend of this distance and compare it to the threshold times the divisor, and then we OR this comparison with this one, and this one, and so on. Okay. So the idea here is that if at least one of them is 1, the OR is 1. This means that if at least one distance is below the threshold, the output will be yes. And it actually corresponds to exactly the same computation, because taking the minimum here just means finding the smallest one, and knowing that at least one is below the threshold gives us the same answer. So after this transformation, what we're going to get is this formula. Okay. Let me rewrite the computation we had: this D of X and Y is going to be the dividend of the distance, and M of X and Y is the divisor of the distance, so exactly the same computation that I just presented. So now we compare this dividend to the threshold times the divisor, or do the same thing over multiple shifts of the same biometric. Okay. So the computation doesn't change; the result is going to be exactly the same, but now we don't have to do division. Okay. We'll also implement XOR, which is used throughout the entire computation, using arithmetic operations. So if you have bits XI and YI, there are two different ways to rewrite their XOR using arithmetic operations, and we're actually going to use both of them; for different computations, one or the other is better. Okay. So next what I would like to look at is the protocol itself, the interaction between Alice and Bob when they engage in a two-party comparison. Okay.
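The division-free restructuring can be sketched in plaintext as follows; the shift range, the fractional threshold written as t_num/t_den, and all bit values are illustrative assumptions, not the talk's parameters:

```python
# Sketch of the division-free restructuring: instead of testing
#   min_j D_j / M_j <= T   we test   OR_j ( D_j * t_den <= t_num * M_j ),
# where D is the masked XOR count and M the count of commonly reliable bits.
def rotate(bits, j):                                  # circular shift by j
    return bits[j:] + bits[:j]

def dividend_divisor(x, y, mx, my):
    d = sum((a ^ b) & u & v for a, b, u, v in zip(x, y, mx, my))
    m = sum(u & v for u, v in zip(mx, my))
    return d, m

def within_threshold(x, mx, y, my, t_num, t_den, c):
    result = False
    for j in range(-c, c + 1):                        # all 2c+1 shifts
        d, m = dividend_divisor(x, rotate(y, j), mx, rotate(my, j))
        result = result or (d * t_den <= t_num * m)   # d/m <= t, no division
    return result

x, mx = [1, 0, 1, 1, 0, 0, 1, 0], [1, 1, 1, 0, 1, 1, 1, 1]
y, my = [1, 1, 0, 1, 0, 1, 1, 0], [1, 1, 1, 1, 0, 1, 1, 1]
print(within_threshold(x, mx, y, my, 1, 4, c=2))      # threshold of 25 percent

# The XOR itself, rewritten arithmetically in the two ways mentioned:
#   x ^ y  ==  x + y - 2*x*y  ==  x*(1 - y) + (1 - x)*y
```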
So if we look here, first, Alice has a database, Bob has a biometric, and they'd like to compare the biometric to every entry in the database. Bob will create a public and private key pair for a homomorphic encryption scheme, and he'll give the public key to Alice. Okay. So he knows the secret key. Then Bob has this biometric X and the mask that corresponds to it, and he's going to encode it, encrypt it, and send it to Alice. But it's important to make sure that the encryption allows Alice to compute the distance. You can't just encrypt one bit at a time, because otherwise it's not going to go through -- the computation is actually not as simple as just an inner product or a plain Hamming distance. So what Bob sends instead is, for each bit, an encryption of the biometric bit multiplied by the corresponding mask bit. And he does the same thing with the complement of the biometric bit multiplied by the mask bit. Once Alice gets this information, she will actually be able to compute both the dividend and the divisor of each distance. But what she'll need to compute first, to be able to do so: she'll take these two encryptions and multiply them together, and if you look, these values XI and 1 minus XI will cancel out, and what she's going to obtain is just an encryption of the mask bit of a single --
>>: [inaudible]
>> Marina Blanton: No, this is one encryption. This is another.
>>: [inaudible]
>> Marina Blanton: No. Right now, you know, this is a bit and this is a bit.
>>: Oh, bit [inaudible].
>> Marina Blanton: Right. So when I say XI, it means a particular bit of the string. Okay. So basically there is a binary string, and it's encrypted one bit at a time. Okay. So this is a bit and this is a bit, and she computes a bit here as well. Okay. So now, for each bit of the original biometric, Alice has three ciphertexts: this is the first, this is the second, and this is the third. And she'll want to compute the encrypted dividend that we have here, and this is the divisor. So let me just expand this and see how she would compute it. Okay. For each biometric in the database, Alice will take it and produce multiple versions: she'll rotate it multiple times, okay, according to this constant that they have. So there are 2C plus 1 rotations of this biometric. To compute the first part of the distance, what she needs is the XOR of XI and YI, and this J means that it's a shifted version [inaudible] shift. She can always produce it, right? And so this part is the XOR, and then you multiply it by the mask bit from the first iris code and the mask bit from the second iris code. And if you look at this, she'll take the first ciphertext that she got from Bob and raise it to a power based on the values that she has. So this is an encryption of a bit, and this is also a bit. And she'll multiply it by the second one, also raised to a certain power that depends on the values that she knows. Okay. So if you look at this, this AI1 encodes the bit of the first biometric and the mask bit together. Right? And this is what Alice adds to it by raising it to this power. And the second component here has the complement of the iris code bit and the corresponding mask bit. And this is what Alice adds to it.
So as a result, you'll see that if you take these mask bits and separate them from the rest of the computation, put them outside of the brackets, you'll get exactly the same computation. Okay. So this computes the XOR of two bits multiplied by the mask bits. And what we need to do to create the Hamming distance -- the distance across all of the bits -- is to add them together. So this is for one bit, and you multiply the N corresponding ciphertexts together and you get the overall encryption of the distance. Then the divisor is actually easier, because what we have here is the first mask and the second one, and we need to compute the intersection -- the number of bits that are set to 1 in both of them simultaneously -- multiplied by the threshold. So in this case Alice will take the third ciphertext, the one she computed herself from Bob's, which just encodes the mask bit, raise it to this power, and what you get is the desired computation for a single bit. And, once again, you multiply them together to get their sum and raise it to the power of T, which is the threshold, to get the desired computation. So all of this can be done by Alice without any interaction, and as a result we have two parts that we now need to compare: this is the dividend of the distance and this is the divisor multiplied by the threshold. And so the only operations that remain are the comparisons and the OR over all the different shifts. Okay. Now, one thing I'd like to mention before I go to the next slide is that all of those exponentiations are on bits. Okay. For a ciphertext here, the value in the power is either 0 or 1, so these are not regular exponentiations: basically, if it's 0, it means you don't even need to use the ciphertext, and if it's 1, you just take it and multiply it in. Okay. So this is going to be very fast, as we'll see later. Okay. So to finish this, the last step, as I said, is to do secure comparisons -- you have 2C plus 1 of those -- and then you need to do the OR of the resulting bits. And the most efficient way to do comparison with two parties is to actually use garbled circuit evaluation.
>>: Isn't that [inaudible] because I know the mask is just 0/1, [inaudible] encryption of the mask, so I could just check whether it's an encryption of 0 or an encryption of 1 [inaudible].
>> Marina Blanton: So are you talking about the result, this one?
>>: Well, if you go one step before, and then just the thing which has -- the next one [inaudible].
>>: The two information [inaudible] what is the guarantee?
>> Marina Blanton: So AA -- this. This is the product of two ciphertexts.
>>: So this one I'm saying, because you [inaudible] is just 0 level, right?
>> Marina Blanton: Right.
>>: [inaudible]
>> Marina Blanton: Right.
>>: So could we just check with an encryption of 0 and an encryption of 1 and figure out what it is? Because you didn't --
>> Marina Blanton: Those are randomized encryptions, right? The encryption itself is secure.
>>: [inaudible]
>> Marina Blanton: Right. The encryption -- just by looking at a ciphertext, you can't try all the possibilities.
>>: But to do the operation there, do I need the --
>> Marina Blanton: So later on, if you look at what happens once we decrypt -- we're actually going to randomize it, okay, before we proceed to the decryption. So at this point, the last step -- because now they are going to decrypt and use garbled circuit evaluation for the comparison.
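A hedged sketch of Alice's non-interactive part of this protocol, using Paillier via the python-phe library as a stand-in additively homomorphic scheme (the implementation discussed below uses a different scheme); the bit strings, threshold, and blind size are toy values, and the blinding at the end anticipates the splitting step described next:

```python
# pip install phe -- python-phe overloads + and * for homomorphic addition
# and scalar multiplication (ciphertext multiplication/exponentiation
# under the hood in the multiplicative notation used in the talk).
from phe import paillier
import secrets

pk, sk = paillier.generate_paillier_keypair(n_length=1024)   # Bob's key pair

# Bob's code x with mask mx; he sends Enc(x_i*mx_i) and Enc((1-x_i)*mx_i).
x, mx = [1, 0, 1, 1, 0, 0, 1, 0], [1, 1, 1, 0, 1, 1, 1, 1]
a1 = [pk.encrypt(b * m) for b, m in zip(x, mx)]
a2 = [pk.encrypt((1 - b) * m) for b, m in zip(x, mx)]

# Alice derives the third ciphertext: Enc(mx_i) = a1[i] (+) a2[i].
a3 = [c1 + c2 for c1, c2 in zip(a1, a2)]

# Alice's database entry y with mask my (plaintext on her side), threshold T.
y, my, T = [1, 1, 0, 1, 0, 1, 1, 0], [1, 1, 1, 1, 0, 1, 1, 1], 3

# Encrypted dividend: sum_i Enc((x_i XOR y_i) * mx_i * my_i). Per bit, pick
# a1[i] when y_i = 0 and a2[i] when y_i = 1; skipping bits where my_i = 0
# is exactly the "bits in the exponents" shortcut from the talk.
dividend = pk.encrypt(0)
for c1, c2, yb, mb in zip(a1, a2, y, my):
    if mb:
        dividend = dividend + (c2 if yb else c1)

# Encrypted divisor times threshold: Enc(T * sum_i mx_i * my_i).
divisor = pk.encrypt(0)
for c3, mb in zip(a3, my):
    if mb:
        divisor = divisor + c3
divisor_T = divisor * T

# Before decryption, Alice blinds with a random share r (described next),
# so Bob learns only the blinded value.
r = secrets.randbelow(2**40)
blinded = dividend + r
assert sk.decrypt(blinded) - r == sum((a ^ b) & u & v
                                      for a, b, u, v in zip(x, y, mx, my))
```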
So to be able to actually decrypt without either party learning what they have computed, they are going to split it. Okay. In this case Alice will just add a random number to the computation, and she'll keep this random number as her share. Bob will obtain an encryption of the actual distance that is blinded by -- that is hidden by -- the random number, and when he gets this value he doesn't know what the distance was. Okay. How exactly this is done depends on the encryption scheme; I will mention this later on. Okay. So this was the basic idea behind the protocol. But once we implemented it, or wanted to use it in practice, there are a lot of different optimizations that you can do. And the first one that actually made a huge difference for us was the choice of the encryption scheme. Okay. When I say [inaudible] homomorphic encryption scheme [inaudible], people actually often use Paillier encryption, which is very popular, right, but it turns out that there is a newer encryption scheme due to Damgard, Geisler, and Kroigaard, which was designed for comparisons. Basically, the way it was originally used was to encrypt one bit, and the plaintext space is very small and actually variable -- you can set it to anything you want. And that was very helpful for us in this case. It actually sped up all of the public-key operations by about a factor of 10. Okay. So you get an order of magnitude improvement compared to what you had previously. And then, when you're doing those operations, there's a lot that can be precomputed -- um-hmm.
>>: Is the reason you get the [inaudible] specialized for small [inaudible]?
>> Marina Blanton: Yes. Small plaintexts. They used it for one bit, but you can say, I need 20 bits. So the plaintexts are shorter and the operations are faster. And so basically there's a lot of computation that can be done in advance, before you know what your biometric will be; in particular, the encryption uses randomization, right? These are normal modular exponentiations that you can do in advance by choosing the random values. And what's interesting in this case is that Bob can even precompute the encrypted bits if he wanted, or even send them ahead of time; there is an optimization where you can just send something in advance, so that later, when the input actually becomes available, there is very little work left for you. And what I mentioned previously: if you go back to the computation where she was computing the product -- say this product combines M different ciphertexts. Okay. Notice that each ciphertext was produced by taking some value that's already available and raising it to a certain power, which is a bit. If this bit is zero, it means that I just ignore the ciphertext; it is not used here. Okay. So basically, if certain bits here are zero, it means that instead of doing M modular multiplications, I'm going to do, for instance, 75 percent of the work. And if you look at this, those powers are products of two bits. If at least one of them is zero, it means that I can just discard the operation; I don't need to do the modular multiplication at all. Okay. So you get savings here, and there's an additional technique, which I'm not going to go into in detail, where you can process multiple bits together.
Instead of doing one at a time, you can do two or three together at the cost of some one-time preprocessing that's actually independent of the number of entries in the database. Okay. So this is primarily all that I wanted to say about this kind of computation, and I'll briefly mention the implementation that we had. As I said, the choice of encryption made a huge difference. Another advantage that you get by using an encryption scheme with a small plaintext size is that when you decrypt, the values are going to wrap around modulo a small number. Okay. Now, normally, when you would like to blind something, decrypt it, and feed it into a circuit, you need statistical security -- you need [inaudible] trying to hide the value. But now, because I'm using a small plaintext space that can wrap around, the blinding actually protects it better: I don't need the statistical security parameter, I can use values that are significantly smaller, and my circuit is going to be faster because the inputs are now short. So this is an additional advantage.
>>: You [inaudible] you said the size of plaintext [inaudible].
>> Marina Blanton: Yes, yes. And basically the decryption operation is only linear in the size of your plaintext, because you're basically computing a discrete log. But by choosing the plaintext space accordingly -- if it's a smooth group -- you can do this very efficiently. Okay? You have control.
>>: [Inaudible] that would be less efficient, or let's say you use [inaudible].
>> Marina Blanton: In our case [inaudible] is basically going to be slower, right, because you have to try them all. The values that we're computing may be 20 bits; there you'd have to brute force through the entire space. In our case, you don't need to -- you do a number of modular multiplications that is, say, 20, right, to decrypt. I mean, anyhow, it was a tremendous difference in performance.
>>: And the number of bits in the iris [inaudible]?
>> Marina Blanton: C is how many times you shift, and it's left and right.
>>: So, five shifts.
>> Marina Blanton: So overall we're doing 11 shifts. Well, one without a shift, five shifts to the left, five shifts to the right, so we get 11. So if you look at the computation, we have iris codes that are 2,000 bits, and we're shifting them, so we're doing the computation 11 times. So it's about 22,000 ciphertext operations that we need to do, and the performance is actually very fast -- a fraction of a second, about 0.2 seconds to process one biometric. And actually, with the database that our biometrics group at Notre Dame has, [inaudible] especially if you take a picture of both eyes at the same time, they're going to be very well aligned, so you don't necessarily need to do any shifts. In that case you can do something like 50 milliseconds per comparison.
>>: You said that the iris is 250 bits?
>> Marina Blanton: No, no -- well, 250 bits is the amount of entropy they have, but they can't represent it using values that are that short. So their representation is redundant.
>>: [inaudible] you just do the first one, without shifting [inaudible].
>> Marina Blanton: Right, right. So here you have to do 11 of them in parallel, but then you would just do one.
>>: In terms of performance, do you have a sense of what N values you expect to have?
>> Marina Blanton: Well, it depends on the size of the database. Right? It could be a hundred, could be a million, depending on what you store there.
I mean, large databases will probably be on the order of millions if you want to do this.
>>: So this [inaudible] microseconds. That's fast. [Inaudible.]
>> Marina Blanton: Right. The [inaudible] is that we're skipping some of them. We're throwing away some of them depending on the input, because we know an input of 0 means we don't have to do the operation; we can skip it. So you get savings this way.
>>: [inaudible].
>> Marina Blanton: Right, right. But the garbled circuit is small, because it's just a simple comparison over 20-bit values. It's very fast -- probably like 2 milliseconds. We actually have timings. So this work is going to appear in [inaudible] this year. If you want more information, you're welcome to look at the experiments that we had and the exact timings; we were very detailed there. Okay. So let's now look at the outsourcing. When we say outsourcing, we now mean one party: there's only Alice. Alice has a database the same way, and she has a particular biometric, and Alice wants to know if it appears in the database. And Alice is weak -- she doesn't have the power to go through this large database -- so she would like to outsource this computation to a more powerful server. In some cases it's also possible that the computation can be outsourced to multiple providers, in which case it could take the form of secure multiparty computation, and it would be faster and more powerful in terms of what kind of functionality you can compute. So for the one-server solution we're using predicate encryption. With predicate encryption, there are certain attributes that we'll associate with one biometric, and there will be a predicate associated with a different biometric. Okay. A ciphertext has attributes encoded in it, and when you have a key that represents a certain predicate, you apply it to the ciphertext, and the decryption is successful if and only if the predicate evaluates to true on the attributes encoded in the ciphertext. Okay. And the most powerful type of predicate encryption that we have is just the inner product, so it's not necessarily very flexible in terms of what we can do with it. But, intuitively, here's how we're going to do this: we'll store an encrypted database where each ciphertext corresponds to a particular biometric, with certain attributes encoded in it. And when a new biometric comes in, Alice will create a key with the predicate, right, and the server will apply this predicate to each of the ciphertexts that it has. And the server will actually learn which of the biometrics, if any, matched. Okay. So in this case Alice just gets an answer of the form: biometric number 153 matched. Now, the server will know if there was a match or not, but this way Alice doesn't have to receive output that's proportional to the database size; she just gets the indices of the biometrics that matched. Now, I'm not sure I actually have time to go through everything. This was a separate piece of work, and it's under submission, but you can look it up -- I put it on ePrint recently, just last week. So if you want more details, you can look it up.
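To preview the encoding described next: a test of the form "the Hamming distance equals v" can be phrased as an inner product evaluating to zero, which is exactly the kind of predicate an inner-product scheme can check. A plaintext sketch of one such encoding (illustrative, not necessarily the paper's exact construction):

```python
# <attr(y), pred(x, v)> = sum_i y_i*(1-2x_i) + (sum_i x_i - v)
#                       = sum_i (x_i + y_i - 2*x_i*y_i) - v  =  HD(x, y) - v,
# so the inner product is zero exactly when HD(x, y) == v.
def attribute_vector(y):                 # encoded into the ciphertext with y
    return y + [1]

def predicate_vector(x, v):              # encoded into the decryption key
    return [1 - 2 * b for b in x] + [sum(x) - v]

def inner(a, p):
    return sum(ai * pi for ai, pi in zip(a, p))

x, y = [1, 0, 1, 1], [1, 1, 0, 1]        # toy codes with HD(x, y) = 2
for v in range(4):                       # OR over v = 0 .. T-1, with toy T = 4
    if inner(attribute_vector(y), predicate_vector(x, v)) == 0:
        print("match at distance", v)    # fires only for v = 2
```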
But the idea behind this computation is that we couldn't do the division, right, and the trick that we had before -- where you just take the divisor and multiply it by the threshold -- won't work, because in this case the only kind of comparison you can do is to ask: is the distance between these two biometrics 1, or 2, or 3, up to a certain value. But if you're comparing to a divisor that's not known in advance, you can't do this. Okay. So what we've done instead is use an approximation: there's no division, we just drop the divisor altogether. And we did experiments on the iris database using a more robust representation. Basically, instead of taking one sample, you sample the biometric several times and create one that corresponds to the majority -- for each bit, you compute the majority -- and we've done some experiments which showed that it's pretty robust to the different types of approximations that you might want to apply to this type of data. Okay. So the idea was to encode this using inner products. The inner product in this case supports evaluation of polynomials and testing whether a particular polynomial is equal to a certain value. You can also do ORs. So in our case, you can do all of the ORs, and you can do the comparison by basically trying a lot of different values. Okay. So this method is not exact -- it does not compute the distances exactly; it computes an approximation, but the approximation is good enough. The only problem is that it's actually not going to be very practical, because this computation tends to grow fast. So I'll probably go quickly over the slides that describe how you form the attributes and predicates themselves. This is how it's done. The idea here is that each attribute corresponds to a function of the biometric: this is a function of the first bit of the biometric stored in the database, then the second, and you keep going until you represent all of them. Okay. For the predicate, you do the same or a similar thing with the other biometric. And then the inner product will be tested for equality with certain values -- you test it with 1, 2, 3, et cetera, up to T minus 1. And for the different shifts you can do an OR as well, and basically you will be able to obtain the answer, is it within the threshold or not, and the server will learn this value and inform Alice which biometrics matched. Okay. So I think I'm going to skip this. And then the second choice for outsourcing is to take the computation to multiple servers. Right? And in this case, as I mentioned, it could take the form of secure multiparty computation; it could be much faster -- well, it will be much faster, in our case -- and it's obviously more powerful in terms of what you can do. And what's interesting in our case: we're going to use multiparty computation based on a linear secret sharing scheme. The computation is structured in a way similar to what I described for the two-party computation. But what's interesting is that, for techniques based on linear secret sharing, any linear combination of shared values can be done locally with no interaction. You can compute addition and multiplication by a constant locally, but when you multiply by a shared value, you need to engage in an interactive computation, and this is the most expensive part. Right?
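As an aside, a toy sketch of why this is cheap under Shamir secret sharing: shares can be multiplied and summed locally, producing a higher-degree sharing of the inner product that needs only a single interactive opening or degree reduction. The field, party count, and bit values below are illustrative assumptions:

```python
import random

P = 2**31 - 1            # toy prime field
N, T = 5, 2              # 5 parties, polynomials of degree T = 2

def share(secret):
    # Random degree-T polynomial with constant term = secret; party i
    # holds the evaluation at point i.
    coeffs = [secret] + [random.randrange(P) for _ in range(T)]
    return [sum(c * pow(i, k, P) for k, c in enumerate(coeffs)) % P
            for i in range(1, N + 1)]

def reconstruct(shares):
    # Lagrange interpolation at 0 using all N points (enough for degree 2T).
    total = 0
    for i, yi in enumerate(shares, start=1):
        num = den = 1
        for j in range(1, N + 1):
            if j != i:
                num = num * (-j) % P
                den = den * (i - j) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

x = [1, 0, 1, 1]         # toy iris-code bits
y = [1, 1, 0, 1]
xs = [share(b) for b in x]    # xs[k][i]: party i's share of bit x_k
ys = [share(b) for b in y]

# Each party locally multiplies its shares pairwise and sums them, giving a
# degree-2T sharing of <x, y>; one interactive step then finishes the job.
local = [sum(xs[k][i] * ys[k][i] for k in range(len(x))) % P
         for i in range(N)]
assert reconstruct(local) == sum(a * b for a, b in zip(x, y))
```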
So you count how many interactive operations you need to do. And what was interesting in our case is that the interaction is not linear in the size of the biometrics. Okay. You can basically do an inner product at the cost of one multiplication, and this is used heavily in this computation. So the inner product has a cost of 1. Okay. And when you're doing the Hamming distance on the bits, the number of bits you need to represent the result is logarithmic in the length of the biometric. Okay. So when we're comparing the results, the work is linear in log M, not M. And I found that this is actually significant. Right? You have those values that are thousands of bits long, but all of that is processed locally on the shares; you pay the cost of 1 to compute the inner product, and then you compare the product to a different value, which costs you a number of interactions that is linear in the length of the representation of the value. Okay. And if you use [inaudible] techniques, you can also handle active adversaries by just using the standard measures, where each of the parties has to prove that it actually behaved correctly at each step. Okay. So, as I said, if you're interested in more of this work, the paper is available on ePrint. The last thing that I'd like to mention deals with verification of outsourced computation. Okay. In this case we assume the same scenario, that comparisons are outsourced to a third party, right, and the result comes back to you. And the verification that we're concerned with: we're saying the adversary is not particularly malicious, but the adversary tries to save its computing power. We're giving it a large task, but it's possible that it's not interested in doing all that much work; it just wants to give something back to me and say, this is the answer -- and how do I know that the answer is correct? So what we're trying to protect against is so-called lazy adversaries that are not going to do all of the work; they'll skip some of it, and they'll give the answer back to us hoping we won't know that they didn't compute everything. Okay. Maybe they will just produce something at random, maybe they will take one part of the computation, copy it over, and give it to us as the correct result. And we would like to know whether the server actually did enough work to compute most of the results or not. In particular, the guarantees that we would like to achieve can be something like the following: if the server performed, say, 95 percent of the computation or less -- 50 percent, 20 percent, anything up to this value -- we would like to be able to detect this cheating with a certain probability, such as 99 percent. Obviously, if the adversary failed to compute just 2 percent, this probability will go down, right; we're not going to guarantee the same result, but we say maybe 2 percent is not that significant -- we can live with a small error. So basically, up front we have two parameters: the first one is an assumption on how much work the server has done, and the second one is with what probability we want to be able to tell.
>>: [inaudible] more like.
>> Marina Blanton: So if the server performs 95 percent of the work or less, so that 5 percent was not computed, and the server gives us something corresponding to those 5 percent, we would like to be able to tell with high probability that this took place. If it computed 99 percent, we probably might not be able to tell. But this depends on the guarantees that were set up front, before we send the computation in. Okay. So we've basically done this for a slightly different setting. In particular, in the previous scenario, where I said that Alice has a database and would like to know if a biometric appears in the database, the previous solution was such that Alice would get only the indices of the values that matched. Okay. But if you want to be able to verify the computation, you actually have to get something back for each comparison. Alice will have to get the result of the first comparison, the second, the third, and there should be some sort of interaction or communication that will let her verify each result. Okay. So instead we'll proceed with a slightly different setting, and this is what is used at Notre Dame, where, as I said, there is a massive computation that takes place and gets placed on a cloud or a grid, and the computation in this case is a so-called all-pairs computation. You have a database, right, and you put this database as the rows of your matrix, and as the columns as well, and what you need to compute is the distance between each pair. Okay. Basically the idea is to find all of the distances between each element and all other elements, and then, given this information, you perhaps want to compute some statistics about the distribution of those distances. Okay. You might actually want to separate the distances between related biometrics from the distances between unrelated biometrics, so that you have two different distributions. Okay. So the client has a database, and we can actually afford work linear in the database size, but computing all pairs is quadratic. Okay. And this is much bigger: if the database size is 100,000, then obviously N squared is going to be, well, billions. All right? So in this case the verification should also be somehow linear in the database size. And in our particular work, to let the client compute the statistics, the server will compile the information without losing any distribution information. Okay. Basically, what the server gives us is, say: 250 pairs had distance 1, a certain number of pairs had this other value, and so on. So you don't lose any information -- you get complete information about the distribution, and the client can compute whatever statistics it wants from that. Okay. So this is how the computation is given back to the client. Of course, the server doesn't know what it's computing. So just a little bit more on the ideas that we had. The ideas are actually simple; the analysis is complicated. Okay. The idea is that we have this dataset into which we'll insert some fake elements at random positions, and the server doesn't know where they are, so it doesn't matter how the server tries to cheat, by copying something over or guessing at random: these values are unpredictable and in unknown locations.
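A toy sketch of this planting-and-auditing idea; the names are illustrative, and a per-item simplification stands in for the all-pairs setting of the paper:

```python
import random

def plant(real_items, fakes):
    # fakes: list of (item, expected_answer) pairs, precomputed once; since
    # the server never sees plaintext data, fakes can be reused across runs.
    combined = [(item, None) for item in real_items] + list(fakes)
    random.shuffle(combined)                 # positions unknown to the server
    items = [item for item, _ in combined]
    expected = {i: ans for i, (_, ans) in enumerate(combined)
                if ans is not None}
    return items, expected

def audit(results, expected):
    # The server returned results[i] for every item; a lazy server that
    # skipped or copied work will get some planted position wrong.
    return all(results[i] == ans for i, ans in expected.items())
```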
And if we insert enough of them to meet our security guarantees, then [inaudible] the server will be caught. Another additional flavor of this computation in particular is that each biometric consists of a number of elements. Okay. It's not one single unit. There are multiple bits, or, if you look at different types of biometrics, like faces, there will also be multiple values in, say, a 50-dimensional space that you have to combine together. For this reason we also have to insert fake elements within items, to make sure the server didn't just compute the first 90 percent and stop there. Okay. So we challenge the server on sort of two different aspects: one on the fake items, and the second on fake elements within each item. And in this case our analysis for just computing all pairs is independent of the distance -- it works for all of them. But when you go to the statistics, it actually depends on how the distance is computed, and we treated the Hamming distance, the Euclidean distance, and set intersection. Okay. So let me see. This is probably everything. If you're interested in this type of work, this is also going to appear in PASSAT this year, in October, but it's a short paper. We have a full version, which is a technical report; if you're interested in it, you're welcome to look at that as well. Okay. So I'm going to conclude here. Hopefully I convinced you that secure computation over biometric data, and outsourcing it, is something that's worth pursuing, and that we should see how we can do this more efficiently. This is actually feasible. And, for me, I would actually like to convince our biometrics group that this is okay to use, and that secure techniques can enable them to use external computing power elsewhere. All right. Thank you for your attention.
[applause]
>> Seny Kamara: Questions?
>>: So on the previous slide, you insert some biometrics that you already know the answer for? Why can't you just go ahead and compute the answer for a real one?
>> Marina Blanton: Right. So basically what we do is insert fake ones that are precomputed; they're known. And because the server doesn't see the data itself, we can reuse them over and over. Okay. So they could be the same, just inserted in different locations each time. So that saves some computation. But all of it assumes that the server doesn't see the data.
>>: [inaudible] plaintext data that [inaudible]?
>> Marina Blanton: Right. So whether this is encryption or secret sharing, you need to make sure that they can't tell it's the same thing.
>> Seny Kamara: Let's thank the speaker.
[applause]