>> Raluca Ada Popa: Is the mic working? Maybe I'll put it -- is it better now? Yes. Okay. So I'll tell you about CryptDB, which is a system for protecting the data in a database by computing on encrypted data. This is joint work with Catherine Redfield, Nickolai Zeldovich and Hari Balakrishnan from MIT. We often hear in the news that confidential data leaks from databases. To give you a few examples, the Homeland Security Newswire reported that between 2009 and 2011, eight million medical records were leaked. Another example: last year a group of hackers infiltrated the Sony PlayStation Network and were able to access 77 million user profiles, many of which contained credit card information. There are many reasons why confidential data leaks, and in this talk we're going to consider attacks on the database server. For example, hackers notoriously break into database systems and steal sensitive information. Or system administrators oftentimes have root access to the database servers and may be able to read company data, such as financial or medical data, when really their only task is to maintain the database servers. We group all these attacks together into one threat that we call passive database server attacks. Basically, the database server can be attacked and an adversary can have full access to it. It's passive, meaning that the attacker doesn't actually change the data or issue queries. It just tries to read confidential information.
>>: The database administrators have legitimate reasons to change a database.
>> Raluca Ada Popa: Right. So here -- okay. For example, if you look at system administrators who only have to balance the load, they don't necessarily need to look at the data, but oftentimes they actually can. So that's one threat model. And for the cryptographers here, by passive we mean an honest-but-curious adversary.
So the approach of CryptDB is to process queries on encrypted data. The reason we take this approach is that the database server never gets the decryption key. It only gets encrypted data. Even if an attacker gets full access to the database server, the attacker still cannot learn anything other than encrypted data. Before I explain how CryptDB works, let me summarize CryptDB's contributions at a high level. CryptDB is the first practical DBMS to process most SQL queries on encrypted data. In CryptDB we focused a lot on practicality. One can use CryptDB, for example, to hide the database from system administrators while still allowing them to maintain the database servers. Or it can be used to put the database in the cloud. One of the main contributions of CryptDB is that it has a modest overhead. CryptDB has a 26 percent throughput loss for TPC-C compared to MySQL with no encryption. By throughput we mean the number of queries per second the server can execute, and TPC-C is a standard industry database benchmark. Perhaps surprisingly, CryptDB makes no changes to existing DBMSs, such as Postgres and MySQL, and I will explain why. It also makes no changes to existing applications. Basically, applications can run on top of CryptDB without knowing that they're running on top of CryptDB, and that's because CryptDB exports a SQL interface. Okay. To explain the approach of CryptDB, let me put it in relation to existing approaches. There are really two extremes in the design space in terms of practicality and security. At one extreme there are unencrypted databases. These are very efficient. They've been optimized through over 40 years of experience, and they're efficient because they process simple operations such as equality (how many items in a column are equal to 100, for example) and because they have specialized data structures such as indexes.
You can think of an index as a search tree, not necessarily binary, that lets the database server look up items fast. At the other extreme, there's fully homomorphic encryption, FHE for short, which was first constructed by Craig Gentry in 2009. FHE allows any kind of general computation to be performed on the data. And it has great security, semantic security, which leaks virtually nothing about the data. However, it is prohibitively inefficient. Yes, there have been a lot of improvements in the practicality of FHE since then. For example, Gentry, Halevi and Smart in 2012 implemented AES on top of FHE, and they did some clever optimizations. However, even then the scheme was nine orders of magnitude slower than unencrypted computation. In fact, even besides the cryptographic overhead, there's one main inherent reason why FHE is impractical for databases. Here's the reason. In order to compute a query on a database, the client has to express the query as a circuit over the entire database, which means that for every single query, the entire database has to be processed. In a real database, by contrast, the database server just uses an index and looks up items fast. So our idea with CryptDB is to try to come up with an intermediate point. Ideally we'd like it to be almost as fast as unencrypted databases while at the same time having a high degree of security. The insight into the practicality of CryptDB is to try to have the computation on encrypted data be the same as on unencrypted data. For example, if we do an equality check on unencrypted data, that should turn into an equality check on encrypted data. And indexes will still be useful. The insight here is that most SQL operations use a limited set of operators. So if we can support those with efficient encryption schemes, then we are mostly done.
In terms of the security it achieves, we reveal to the server the relations necessary to compute the query, but no relations among the data beyond those needed to process the query. I will come back to explain the security in more detail. I'd like to mention other related work. For example, search on encrypted data, initially pioneered by Song et al. and then followed by a lot of work by [inaudible] and others, actually has a different purpose, which is locating keywords in encrypted text, but it could be used to process certain SQL queries such as equality. However, those are only specific kinds of queries, and sometimes database indexes cannot be used, so it's not as efficient as CryptDB. There are also systems proposals that result in weaker security, functionality, and efficiency, and many times those require significant client-side processing of the data. So beyond the database server, the client also has to do a lot of data filtering. Now that I've explained CryptDB's approach, let's go into the design. As the system setup, let me remind you that we have a database server that is under attack. An attacker can get full access to the database server, but it's passive. And the application is trusted. There's actually a second part to CryptDB that deals with attacks on the application as well, but I'll just refer you to our paper, because it's an extension of the solution I show here. First we're going to introduce a lightweight machine called a proxy on the client side. It's also trusted. The proxy only stores the schema and the master key. The schema is basically the names of the tables and of the fields and the data types, but not the content, so the schema is short. Whenever the application issues a query to the database server, the proxy intercepts it, transforms it by doing certain encryptions that we will talk about soon, and then sends it to the database server.
The server processes the queries entirely on the encrypted data, then sends the encrypted results to the proxy. The proxy decrypts them and sends them to the application. Note that the proxy doesn't do any query execution at all. It all happens in the database server. So let's see how we process queries on encrypted data at the database server, and I'll start with a simple example. Consider that we have a table employees at the database server with three columns: rank, name, and salary. In fact, we anonymize the names of the table and columns, so table one with column one, column two, and column three is what the database server sees. Each block indicates a data item that is encrypted, and initially we encrypt with randomized encryption. For cryptographers, this is semantically secure, probabilistic encryption. It is a very strong encryption scheme. For your information I'm showing the values in the salary column unencrypted, but the server only sees the encrypted ciphertexts. Okay. Now consider that the application sends a query to the proxy asking for all the rows where salary equals 100. Remember that our goal in CryptDB was to keep the computation on encrypted data the same as on unencrypted data. So what we would like here is to have the database server just do an equality check. But we can see that that's not possible because of the randomized encryption: the two encryptions of 100 are different. Okay. So the first simple idea is to use deterministic encryption. We can see that in this way the two encryptions of 100 are mapped to the same value. Now all the proxy has to do is encrypt the value 100 with the same encryption scheme and key and then send that query to the server. The server can perform the equality check on encrypted data as if the data was not encrypted to begin with. Then it sends the encrypted results back to the proxy, which decrypts them and sends them to the application.
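The equality example above can be sketched end to end with an unmodified SQL engine. This is a minimal illustration under stated assumptions, not CryptDB's implementation: an in-memory SQLite database stands in for the server, `det_encrypt` is a toy deterministic cipher built from HMAC-SHA256 (not the AES-based scheme used in the talk), one key is shared by all columns for brevity, and the employee names are made-up sample data.

```python
import hashlib
import hmac
import sqlite3

KEY = b"demo-key-held-only-by-the-proxy!"  # hypothetical proxy-side key


def _stream_xor(key, nonce, data):
    # XOR data with an HMAC-SHA256 keystream (toy stream cipher, not AES).
    stream = b""
    counter = 0
    while len(stream) < len(data):
        block = nonce + counter.to_bytes(4, "big")
        stream += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def det_encrypt(key, plaintext):
    # Deterministic: the nonce is derived from the plaintext itself,
    # so equal plaintexts always produce equal ciphertexts.
    nonce = hmac.new(key, b"siv" + plaintext, hashlib.sha256).digest()[:16]
    return nonce + _stream_xor(key, nonce, plaintext)


def det_decrypt(key, blob):
    return _stream_xor(key, blob[:16], blob[16:])


# "Server": sees anonymized names (table1, c1..c3) and ciphertexts only.
server = sqlite3.connect(":memory:")
server.execute("CREATE TABLE table1 (c1 BLOB, c2 BLOB, c3 BLOB)")
rows = [("CEO", "Alice", "800"), ("Eng", "Bob", "100"),
        ("Eng", "Carol", "60"), ("Eng", "Dan", "100")]
for row in rows:
    encrypted_row = tuple(det_encrypt(KEY, field.encode()) for field in row)
    server.execute("INSERT INTO table1 VALUES (?, ?, ?)", encrypted_row)


def proxy_select_by_salary(salary):
    # Proxy: encrypt the constant, let the server run a plain equality
    # check on ciphertexts, then decrypt the returned rows.
    needle = det_encrypt(KEY, str(salary).encode())
    encrypted = server.execute(
        "SELECT c1, c2, c3 FROM table1 WHERE c3 = ?", (needle,)).fetchall()
    return [tuple(det_decrypt(KEY, f).decode() for f in row)
            for row in encrypted]


result = proxy_select_by_salary(100)
```

The server executes an ordinary `WHERE c3 = ?` comparison and never holds `KEY`, which is the sense in which the computation on encrypted data matches the computation on plaintext.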
So what happens if instead the application requests all values where salary is at least 100? We can see that the deterministic encryption scheme doesn't preserve order. 60 is smaller than 100, but the encryption of 60 is larger than the encryption of 100. So performing a greater-than operation won't work. Instead, the idea is to use order-preserving encryption, which are some recent encryption schemes that preserve order. Basically, if 60 is smaller than 100, then the encryption of 60 is smaller than the encryption of 100. Okay. So now the task of the proxy is easy. It encrypts the value 100 with this order-preserving encryption scheme and sends it to the server. Now the server can perform the greater-than operation on the encrypted data as if it was not encrypted to begin with and send back the encrypted results, and the proxy decrypts them and sends them to the application. Okay. So this gives you an insight into our two main techniques. The first is to use a SQL-aware set of encryption schemes, basically encryption schemes that can cover the most common SQL operations. The second is to adjust the encryption of the database based on the queries. We saw that different queries require the data to be encrypted with different encryption schemes, so we have to have a way to adjust between those encryption schemes. I'm going to present each of these techniques in detail. Let me start with the first one. On this slide I'm going to show you all the encryption schemes we use. We use six of them, and I'm going to show them in roughly decreasing security but increasing functionality. The first one we call RAND, which stands for randomized encryption, implemented with AES. It provides semantic security, which basically leaks nothing about the data, but it supports no computation. The second one we call HOM, which stands for homomorphic. Note that here we're using a specific kind of homomorphism [inaudible].
It's efficient, whereas fully homomorphic encryption, which supports general computation, is not so efficient. This allows us to support SQL operations such as SUM. The homomorphic encryption is roughly as secure as RAND: it has semantic security as well, so strong security properties. The third one is SEARCH, which allows us to do word search, that is, to locate words in encrypted text. This is just the scheme of Song et al. from 2000. It therefore enables us to support the LIKE operator in SQL, but a restricted type: basically we only support full-word matches. SEARCH is roughly as secure as HOM. Then we have DET, which we've seen: deterministic encryption. It allows us to perform equality-type operations, so in turn we can support a lot of SQL operations such as equals, not equals [inaudible], GROUP BY, DISTINCT and so forth. The remaining two are actually new encryption schemes that we provide. The first one is JOIN, which is useful for finding equality matches between two columns. And OPE, which we've seen: order-preserving encryption is useful for order-type operations, which covers a lot of SQL operations such as greater than, smaller than, ORDER BY, SORT, MAX, MIN, GREATEST and so forth. I'm going to show how to use all these encryption schemes. First, let me briefly discuss the encryption schemes we provide.
>>: Can I ask a question? CMC?
>> Raluca Ada Popa: CMC is just a mode of using AES. Basically the output of one block encryption becomes the IV for the next encryption, and you also go backwards: you go in one direction, then go back. It allows us to provide the security property that we want, which is that of a random permutation. Okay. So for JOIN, what we want is to have equality checks between two columns. So why can't we just use DET, deterministic encryption, for that? Why do we need a new encryption scheme? Well, here's the problem. We don't know ahead of time what columns will be joined.
So there are two possibilities. One is to encrypt all the columns with the same DET key, in which case we'll be able to do the joins. But that leaks more than we intend, because the user may never request a join between two given columns, and CryptDB's goal is to reveal only those relations among data items that are needed for the types of queries issued by the user. The other choice is to have every column encrypted with a different key, but then when a join is requested the server cannot do it. Instead, our scheme allows us to initially encrypt the columns with different keys for security. Then, when the application requests a join between two columns, the proxy gives a token to the database server, and using that token the server can adjust the encryption of the two columns to an encryption with a common key. Then the server can just see equality matches. Okay. So let's see in more detail what kind of encryption scheme we need. This encryption scheme has four algorithms. The first is KeyGen, using which the proxy obtains the secret key. Then the encryption algorithm allows the proxy to encrypt a message M for a certain column I using the secret key. When a join is requested, the proxy computes a token for two columns, column I and column J, and gives this token to the server. The server now uses the fourth algorithm, the Adjust algorithm, with the token from the client to transform the encryption of a column into an encryption with a shared key, as the figure shows. Now, since the two columns are encrypted with the same key and the encryption is deterministic, the database server can figure out equality matches and can process the join. For our join scheme, we have a report online with security definitions and proofs. Intuitively, the security says that the database server cannot learn join relations without knowing the token.
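The four algorithms (KeyGen, Encrypt, Token, Adjust) can be illustrated with a toy construction. This is a sketch under loud assumptions, not the scheme from the talk: it uses a tiny multiplicative group modulo a prime instead of elliptic curves, the parameters are far too small to be secure, values are restricted to small integers that are used directly as exponents (a real scheme would first pass them through a keyed function), and every function name here is hypothetical.

```python
import secrets

# Toy parameters (NOT secure): P = 2*Q + 1 with P and Q prime.
# G = 4 is a square mod P, so it generates the subgroup of prime order Q.
P = 2579
Q = 1289
G = 4


def keygen():
    # A per-column secret exponent in 1..Q-1, invertible modulo Q.
    return secrets.randbelow(Q - 1) + 1


def encrypt(column_key, value):
    # Deterministic per-column ciphertext G^(key * value) mod P.
    if not 1 <= value < Q:
        raise ValueError("toy scheme only handles values in 1..Q-1")
    return pow(G, (column_key * value) % Q, P)


def token(key_i, key_j):
    # Join token key_j / key_i (mod Q); pow(x, -1, m) needs Python 3.8+.
    return (key_j * pow(key_i, -1, Q)) % Q


def adjust(ciphertext, tok):
    # Server side: re-key a column-i ciphertext to column j's key.
    return pow(ciphertext, tok, P)


# Fixed keys so the demo is reproducible; keygen() would be used in practice.
key_a, key_b = 5, 7
col_a = [encrypt(key_a, v) for v in (60, 100, 800)]
col_b = [encrypt(key_b, v) for v in (100, 42)]

# Without the token, the two columns share no ciphertexts.
no_token_matches = set(col_a) & set(col_b)

# With the token, the server adjusts column A and can see equality matches.
adjusted_a = [adjust(c, token(key_a, key_b)) for c in col_a]
matches = [(i, j) for i, a in enumerate(adjusted_a)
           for j, b in enumerate(col_b) if a == b]
```

Adjusting works because raising `G^(key_a * v)` to the power `key_b / key_a` yields `G^(key_b * v)`, so after adjustment both columns are encrypted under `key_b` and the join reduces to ciphertext equality.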
We implemented the scheme using elliptic curves, which means that the ciphertexts are rather short, considering it's a public-key scheme. They are 192 bits long, and the time to encrypt and adjust is half a millisecond, which is reasonable. The second encryption scheme we provide is order-preserving encryption, which I think is the more interesting one. Remember that order-preserving encryption aims to reveal the order, but ideally should reveal nothing beyond order. This was formalized by Boldyreva, Chenette, Lee and O'Neill in 2009, and the security notion is called IND-OCPA, indistinguishability under ordered chosen-plaintext attack. Basically, this security notion says that no adversary can distinguish between encryptions of two sequences of values that have the same order relation. Well, it turns out there have been more than ten order-preserving encryption schemes proposed, and they all fall short of ideal security: they leak more than order. Part of the reason why this has been so difficult is that [inaudible] showed in 2009 that this security definition is infeasible. More concretely, what they showed is that the size of the ciphertext has to be exponential in the size of the plaintext. So if you want to encrypt 32-bit values, the ciphertext will have to be 2 to the 32 bits long, which is huge. In fact, we show an even stronger impossibility: there is simply no IND-OCPA scheme even if you only want to encrypt three values. In that case 2 to the 3 bits wouldn't be so large as a ciphertext, but it's actually not possible.
>>: [inaudible] model.
>> Raluca Ada Popa: Impossibility. Yes.
>>: [inaudible].
>> Raluca Ada Popa: I doubt it, because it's a property of the resulting function. So basically what we show is that there exists an adversary such that the advantage of the adversary is 1 over the ciphertext size. No matter what scheme you come up with, the advantage of the adversary will be 1 over the ciphertext size if the scheme is order preserving.
As a result, the BCLO paper settled on a weaker security definition that later turned out to actually leak half of the plaintext bits. So it not only leaked order: given a ciphertext, you can tell the most significant half of the plaintext bits. Order already leaks quite a bit, so we didn't really like this. We really wanted to have ideal security. The observation we made to achieve that is that in a real system such as a database, the model is less restrictive than that of an encryption scheme. In particular, you can update ciphertexts. If you encrypt a certain value and place it in the database, then at a later time, when you encrypt another value, you can go back to the database and update the first value. In a real system you can do that. We call those mutable ciphertexts. It turns out that if a small number of ciphertexts are allowed to change, then we can achieve ideal security. In fact, we also show that this mutability is necessary: we show that any IND-OCPA scheme is infeasible without mutable ciphertexts, even if it is stateful, meaning that the algorithm can look at all the values encrypted before, and even if the encryption algorithm can run for a long time, not even polynomial. It's still infeasible without mutable ciphertexts. If you allow a little bit of mutability, you can achieve ideal security. In fact, you can achieve an even stronger security definition that we call same-time security. What we mean by this is that order should only leak among items that are currently in the database. For example, if an item was inserted and then discarded and a new item was inserted, the order between the two items should not leak, whereas with IND-OCPA the order leaks among any items ever encrypted. And in a real database, the database server only needs to know the order relations among items currently in the database. It doesn't need to learn more than that.
So there's no reason to tell it more order information than that. Okay. So let me tell you briefly the gist of our scheme.
>>: [inaudible] for an adversary that watches as you're changing [inaudible].
>> Raluca Ada Popa: Yes.
>>: So the entire process.
>> Raluca Ada Popa: Yes, yes. And it's actually simulation security as well. It's not indistinguishability. Yes. It's adaptive in that sense.
>>: Okay.
>> Raluca Ada Popa: It will become clear when I show you the gist of the scheme. The server stores a binary search tree of values. Each node contains the deterministic encryption of a value. For cryptographers, the deterministic encryption is as strong as a random permutation, so it leaks strictly less than order. In this binary search tree the values are sorted based on their underlying plaintext. The left child of a node has a smaller plaintext than the parent, and the right child of a node has a larger plaintext than the parent. Now the order-preserving encoding in our case is given by the path in the tree, and the reason is that the path naturally indicates the relative order between the items. For example, the path of 12 is 00. The path of 48 is 01. We actually have to pad this path, because the root has an empty path, and how does the empty path compare to 00 or 01? So we pad each path with a 1 followed by as many 0s as needed to reach a fixed length. Therefore the OPE encoding of the value 12 is 0010 and of the value 48 is 0110. The root is just the padding, because its path is empty. And we can see that these values preserve the order. So how does the proxy encrypt a value, say 32? It first asks the server to encrypt 32. The server doesn't know what the values are, so instead it replies with the root of the tree. The client decrypts it, sees that 32 is to the left of 50, and tells the server to go left. Now the server gives the client the deterministic encryption of 23.
The client decrypts it, sees that 32 is to the right of 23, and tells the server: okay, give me the value on the right. And so forth, until they find a place in the tree that's empty, and in that place the server inserts the new encryption. All right. So we can see that all the information the client provides to the server in this case is just left or right. So it's just the order relation, nothing in addition to the order relation. Intuitively you can see why we achieve the definition, but I won't go into it formally. Okay. But what happens if the client keeps asking for values to encrypt along a certain path? The path grows really large, which means that the ciphertext size becomes large, and this starts to remind us of the infeasibility result we talked about. But that's precisely where mutability comes in. We rebalance the tree. When we rebalance the tree, certain ciphertexts may move in the tree. If they move, they have a different path, which means they have a different OPE encoding, and we have to go in and update the database. That's how mutability lets us avoid the infeasibility of large ciphertexts. It turns out that the number of ciphertexts we update per encryption is actually small. It's logarithmic. We implemented the scheme, and surprisingly it was one to two orders of magnitude faster than the BCLO scheme, which was the most secure order-preserving encryption scheme previous to ours, and which leaked half of the plaintext bits, more than ours. All right.
>>: [inaudible] BCLO scheme has these round trips, right?
>> Raluca Ada Popa: Absolutely. They're included.
>>: So you can include that --
>> Raluca Ada Popa: Yeah. Of course. The question is how much network latency we include. We have a graph in our paper showing the dependence on the network and the point at which BCLO becomes faster. But one thing to say about BCLO is that if you encrypt values beyond 32 bits, the scheme becomes extremely slow.
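The tree-based encoding described earlier (path in the tree, padded with a 1 and then 0s, with the client steering the walk) can be sketched as follows. This is a simplified illustration under stated assumptions: the XOR mask is only a placeholder for a real deterministic cipher, rebalancing and the resulting ciphertext updates are omitted, and the class and function names are hypothetical. The inserted values follow the example from the talk (50 at the root, then 23, 12, 48, and 32).

```python
class Node:
    def __init__(self, ciphertext):
        self.ciphertext = ciphertext
        self.left = None
        self.right = None


class Server:
    """Stores the tree; sees only ciphertexts and left/right directions."""

    def __init__(self):
        self.root = None

    def node_at(self, path):
        # Follow a string of '0' (left) and '1' (right) from the root.
        node = self.root
        for bit in path:
            node = node.left if bit == "0" else node.right
        return node

    def insert_at(self, path, ciphertext):
        new_node = Node(ciphertext)
        if path == "":
            self.root = new_node
            return
        parent = self.node_at(path[:-1])
        if path[-1] == "0":
            parent.left = new_node
        else:
            parent.right = new_node


class Client:
    """Holds the key; tells the server only 'left' or 'right'."""

    def __init__(self, key):
        self.key = key

    def det_encrypt(self, value):
        return value ^ self.key  # placeholder, NOT a secure cipher

    def det_decrypt(self, ciphertext):
        return ciphertext ^ self.key

    def insert(self, server, value):
        path = ""
        while True:
            node = server.node_at(path)
            if node is None:  # empty spot found: store the ciphertext here
                server.insert_at(path, self.det_encrypt(value))
                return path
            current = self.det_decrypt(node.ciphertext)
            if value == current:
                return path
            path += "0" if value < current else "1"


def encode(path, width):
    # Pad the path with a 1 and then 0s up to a fixed width (> len(path)).
    padded = path + "1" + "0" * (width - len(path) - 1)
    return int(padded, 2)


client = Client(key=0x5A5A)
server = Server()
paths = {v: client.insert(server, v) for v in (50, 23, 12, 48, 32)}
codes = {v: encode(p, width=4) for v, p in paths.items()}
```

Sorting the plaintexts and sorting their codes gives the same order, while the server's view consists of masked values and the directions the client chose. In a full scheme, rebalancing would change some paths and therefore some codes, which is where mutable ciphertexts come in.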
The performance of the scheme degrades a lot with the number of bits you encrypt. So for a message size of 128 bits it's not even close; you would have to have a really large network latency. Another thing I want to add about the network is that our scheme is parallelizable. You can encrypt things in parallel, and then network cost is not really a factor.
>>: What happens if you relate this to the real world, let's say I track you down. I know there's a lot of -- and you're in front of them. There's some [inaudible]. And then maybe I called an update on Pomerance and [inaudible] and [inaudible] right?
>> Raluca Ada Popa: Right. So basically order-preserving encryption leaks order. If you can use order to learn what an item is, for example, if as you say you preface things, then yes, you can learn it. Order-preserving encryption leaks order; it leaks something. I'll show you that in practice, very interestingly, very sensitive fields remain encrypted with RAND, which leaks virtually nothing. I'll get to that and show you real applications and what happens in that case. OPE is really used for less sensitive fields. In fact, if you're concerned as the owner of the data, you can always put thresholds saying: don't use OPE for this data. But we're going to get there.
>>: The server --
>> Raluca Ada Popa: It should be on the database server as well.
>>: Sort of [inaudible].
>> Raluca Ada Popa: It is stored as a B-tree, yes. In memory and backed by disk.
>>: [inaudible] what's stopping them from creating another proxy?
>> Raluca Ada Popa: So if you remember the model, the proxy is considered to be trusted, on the application side. Here.
>>: [inaudible].
>> Raluca Ada Popa: Nothing is trusted here.
>>: But the server is actually trying to --
>> Raluca Ada Popa: Yes, it's a passive adversary. Passive meaning that I'm trying to learn as much information as I can, but I won't do things incorrectly.
I won't change queries, I won't change database content.
>>: So if I have access to the server, I can create a proxy? A proxy on the server?
>> Raluca Ada Popa: You cannot create -- okay. So this proxy is on the trusted side and it has the master key. Now, if you create a new proxy on the server side, you have to give it some key. It's incompatible with the other key. You get junk back. But the proxy on the application side is trusted. As I mentioned, we have a second part in which we deal with attacks on the proxy, and I can talk to you more offline about that.
>>: [inaudible].
>> Raluca Ada Popa: Yeah.
>>: So here the encryption of 32 is 0101.
>> Raluca Ada Popa: Yeah.
>>: But here you've also already encrypted 32 with deterministic encryption and then you --
>> Raluca Ada Popa: Yes, yes.
>>: My question [inaudible] the database administrator to change the schema?
>> Raluca Ada Popa: Uh-huh.
>>: Is the proxy involved in that, or is it just a cache for the schema?
>> Raluca Ada Popa: If you want to change the schema, I guess it depends how you want to change the schema. If you want to add another column that's sensitive, you've got to encrypt it. So you've got to get the key or the proxy to do that. So it really depends on --
>>: Shouldn't the proxy be [inaudible] in that case.
>> Raluca Ada Popa: Okay. So there are really two types of administrator. One is the system administrator, who maintains the server, manages load, and when a server crashes boots up another one. That's separate from the database administrator. The other kind is database administrators, and it depends on how much trust you are willing to give them and how much work you're willing to give them. If it's really crucial to allow them to perform all kinds of queries and even to see the database in the clear, then sure. But if they're only there for certain maintenance reasons, then you can even protect against those kinds of administrators.
>>: Where is the key stored [inaudible]?
>> Raluca Ada Popa: In the proxy only.
>>: And the proxy is running along --
>> Raluca Ada Popa: Yes. Yes. On the trusted side. Okay. So let's go to the fun part. Now we have all these six encryption schemes, and the question is how we use them. The encryption schemes we use depend on the queries that come in: certain queries need certain types of encryption schemes. The problem is that we may not know the queries ahead of time. Therefore one naive solution would be to encrypt the data with each of the encryption schemes so we can support all the queries that might come in. That would waste space, but the bigger problem is that each column would be encrypted with OPE, which leaks order. An application may never perform an order operation on a certain column, and in that case, according to CryptDB's goal, we should not leak the order relations in that column. So instead our idea is to stack encryption schemes into onions of encryption. Each value becomes encrypted with three onions. The first onion encrypts the value with JOIN, the result with DET, and the resulting encryption with RAND. This onion is used for equality-type operations. We can also see that as you go down in the onion, the functionality strictly increases: JOIN can do equality the same way as DET, but can also join with a different column. The second onion is onion Order, with OPE and RAND. It is useful for order-type operations. The third onion depends on the type of the field: if it's text, SEARCH for keyword search, or if it's an integer, HOM for homomorphic addition. Within each of these, all the values in a column are encrypted with the same key, but the key is different across different layers of the onion, across different onions, and across different fields.
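The onion layering above, a deterministic layer wrapped in a randomized one with a separate key per layer, can be sketched as follows. This is an illustrative sketch only: the ciphers are toy HMAC-SHA256 stream constructions rather than the AES-based schemes from the talk, the JOIN layer is left out, and the `layer_key` derivation and all function names are assumptions introduced here.

```python
import hashlib
import hmac
import os


def _stream_xor(key, nonce, data):
    # XOR data with an HMAC-SHA256 keystream (toy stream cipher).
    stream = b""
    counter = 0
    while len(stream) < len(data):
        block = nonce + counter.to_bytes(4, "big")
        stream += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def rand_wrap(key, data):
    # Randomized layer: a fresh nonce makes every ciphertext different.
    nonce = os.urandom(16)
    return nonce + _stream_xor(key, nonce, data)


def det_wrap(key, data):
    # Deterministic layer: the nonce is derived from the data itself.
    nonce = hmac.new(key, b"nonce" + data, hashlib.sha256).digest()[:16]
    return nonce + _stream_xor(key, nonce, data)


def unwrap(key, blob):
    # Removes one layer (works for both layer types).
    return _stream_xor(key, blob[:16], blob[16:])


def layer_key(master, column, onion, layer):
    # Hypothetical derivation: a distinct key per column, onion and layer.
    label = f"{column}/{onion}/{layer}".encode()
    return hmac.new(master, label, hashlib.sha256).digest()


def build_eq_onion(master, column, value):
    inner = det_wrap(layer_key(master, column, "Eq", "DET"), value)
    return rand_wrap(layer_key(master, column, "Eq", "RAND"), inner)


master = b"master-key-held-by-the-proxy-only"
ranks = [b"CEO", b"Eng", b"CEO"]
onions = [build_eq_onion(master, "rank", r) for r in ranks]

# Handing out only the RAND-layer key strips the outer layer and exposes
# the deterministic layer underneath; the plaintext stays hidden.
rand_key = layer_key(master, "rank", "Eq", "RAND")
peeled = [unwrap(rand_key, o) for o in onions]

det_key = layer_key(master, "rank", "Eq", "DET")
recovered = unwrap(det_key, peeled[0])
```

With the outer layer in place, the two encryptions of `CEO` differ; once it is removed they coincide, so equality becomes checkable while the value itself still needs the DET key, which never leaves the proxy in this sketch.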
Also notice that initially, when the onions are in this state, the outer layer of each onion is RAND, HOM, or SEARCH, with semantic security. So basically nothing leaks about the data in this state, when we start the database. But then as queries come in we need to adjust the encryption scheme to support those queries, and this happens naturally with our onions. We just peel off layers of the onions. The proxy gives a key to the server using a SQL user-defined function. These are functions that the SQL interface allows the user to define, and they can be invoked from within a query. The proxy remembers the resulting onion layer for every column, and we do not put that onion layer back. So, for example, once we do an equality for the first time and the layer is taken down to deterministic encryption, all future equalities on that column don't need any further decryption. They can be processed directly.
>>: So your function has access to the key. An administrator can run the profiler, which keeps a log of queries, and get access to the keys.
>> Raluca Ada Popa: So getting access to the key for a layer tells you nothing about all the other keys. There's a key per layer that is different for every layer. Once we give that key to the server and the server removes that layer, that key is not used for any other purpose. The server can look at it and use it, but it's useless. Okay. So let's see a concrete example, again on the employees table. Each column becomes three columns, one for each onion. Within each column, the values are wrapped in an onion of encryption. For example, CEO in onion Equality is encrypted with JOIN, DET and RAND. Now consider that a query comes to the proxy requesting all values where rank equals CEO. The proxy says: okay, it's rank, so let's look at the first column. It's equality, so let's look at onion Equality. And the proxy remembers that the outer layer for that column is RAND. With RAND we cannot do equality. We need to adjust it to deterministic encryption.
For that, the proxy issues an update query to the database server invoking a decryption UDF and giving the key only for the RAND layer, so for that layer alone. Executing this update query, the database server removes the RAND layer, and the outermost layer now becomes DET. Now the query is processed as before. The proxy encrypts the value CEO with JOIN and on top of that with DET. The database server can now perform the equality on the encrypted data as it would on unencrypted data, and the results are returned to the application, decrypted. All right. So we saw how CryptDB works. Let's talk about the security guarantees. We saw that depending on the queries we expose different encryption schemes, and each encryption scheme may have a different kind of leakage. CryptDB makes two kinds of guarantees. The first is that the system design of CryptDB with the onions guarantees that the encryption scheme exposed for every column is the strongest encryption scheme from our set that enables the query. The second, overall and intuitively, is that CryptDB only reveals the relations among data items needed for the types of queries issued. Okay. The way we formalize this cryptographically is similar to secure multi-party computation: you have a real world and an ideal world. In the ideal world there is an oracle that helps the server process the queries. For example, the server asks the oracle questions such as: is the second item in the first column equal to the third item in the second column? The oracle examines whether the answer to that question is needed for the query the server is processing, and if so it tells the server the answer. So clearly in this ideal world all that the server learns is the relations it is allowed to learn. We've proved that the real world with CryptDB and the ideal world are computationally indistinguishable.
>>: The database query can be optimized in various ways. You ask different questions for each of them. Do you have any proof that --
>> Raluca Ada Popa: No.
The proof, if you look at our technical report: we prove that for each specific operator, the way that the query is written out, each requires a certain known operation. No, not the way you write it; for the specific query we get as input. Let me show you some natural examples. If we perform an equality predicate on a column, then DET is exposed, which means that repetitions within that column leak. The server can see whether the third item is equal to the fifth item, but not their actual values. Now, if an aggregation is performed on the column, with no equality or inequality, homomorphic encryption remains as the outer layer, with semantic security; it virtually leaks nothing about the data. Also, if we perform no filter on the column, don't do equality, don't do inequality, just fetch the data, the outer layer remains RAND, which virtually leaks nothing about the data. And it actually turns out that this is very common in practice for sensitive fields, as we will show in some application examples. And the bottom line is that we never decrypt the lowest layer of the onion, so we don't reveal plain text to the server. >>: What's the one key layer. >> Raluca Ada Popa: Yes. >>: Inserts. >> Raluca Ada Popa: Yes. >>: Going to be able to build the same layer or we set -- >> Raluca Ada Popa: On the same layer. >>: So you stay there. >> Raluca Ada Popa: We stay there. You can obviously envision security optimizations by refreshing [inaudible] but it's cleaner to describe it this way now. Anyone can prove it to anyone. >>: [inaudible] if you let's say process [inaudible]. >> Raluca Ada Popa: And -- >>: [inaudible]. >> Raluca Ada Popa: Basically you're saying: if one onion leaks something and another onion leaks something else, can someone correlate and learn something? It depends on the setting. Maybe you could. The point is that we leak equality for deterministic and order for -- but as I said, I'm going to show what happens in practice.
In fact, most of the fields remain encrypted with RAND, which you cannot correlate with anything. Basically these two cannot be correlated with anything. >>: Can influence the type of query. >> Raluca Ada Popa: It's a passive adversary, yes. But let me say that -- in the second part of CryptDB we actually consider active adversaries, adversaries that attack the proxy. And there we're able to provide the following guarantee. If a user is not online at the time of an attack, then his data is not compromised, but if he's online at the time of the attack, his data can be compromised. So we limit the compromise in the situation when everything is compromised, and actively as well. >>: I was thinking of an adversary force, clearly -- the database and then all the layers [inaudible]. >> Raluca Ada Popa: Right. So that's why we say this scheme is for a passive adversary. Now, if you consider an active adversary, basically our second part of [inaudible] requires more explanation. But it guarantees, too, that if you are offline during an attack then your data doesn't get affected, meaning that an adversary could not get your onions down, because when you're offline your key will not be available. It will not even be on the proxy at all. When you're online, an adversary at the proxy can lower the onion level, and there we don't give a guarantee. >>: Naively I would expect lots of queries to have some sort of average. And I would expect homomorphic encryption, being public key, to be pretty expensive. How do you keep things down to a 26 percent penalty? >> Raluca Ada Popa: We'll see the exact breakdown in costs. I will show you the exact breakdown for every query, and of course you're right about the most expensive operation. But we do have some optimizations, and we'll see the exact numbers. >>: [inaudible] the data and the queries are chosen, in your definition, right, data are chosen by adversary. >> Raluca Ada Popa: Passive.
>>: It's passive in the sense that -- so the way you're phrasing it, the guarantees hold for any dataset, any queries, right, when you define -- >> Raluca Ada Popa: Right. >>: So you're saying for all databases, all queries, right? Assuming some distribution on the data? >> Raluca Ada Popa: No. So basically -- no. We're not saying -- yes. Okay. So our security guarantee doesn't say CryptDB doesn't leak anything. It doesn't say that. In fact, our security guarantee says CryptDB leaks only what is needed to process the query. Equality if you need it, order if you need it. If you don't need a certain column, then nothing. >>: This is for any -- there's no such -- no distribution. >> Raluca Ada Popa: No distribution over data. No distribution at all. Exactly. You had a question? >>: [inaudible] you chose this layering model. Why didn't you actually encrypt it and, instead of using each layer encrypted, rather sort of use the data and maintain different copies? So use a deterministic encryption of the data, a random one and a join one, and keep them separate. >> Raluca Ada Popa: Right. So the reason we didn't do that is because then each data item would be encrypted with OP, which leaks order. And we don't want to leak order for a column unless order is needed for that column. And we don't know ahead of time what the queries are; we don't know ahead of time if the order will be needed. Therefore we want to start initially with RAND being the outermost layer, leaking nothing. Then, if the user actually needs order, we peel off the layer. >>: What I'm saying is keep an OP copy of the data, keep a RAND copy as well. >> Raluca Ada Popa: Not on top. >>: Not on top, separate. So when the proxy issues a query, and this query doesn't need the OP encryption, only needs the RAND one, just go to the RAND one. >> Raluca Ada Popa: But what does the server know? If the server has the OP encryptions of the data, then he knows the order. >>: I see.
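Editor's note: the layering answer above (RAND wrapped around DET, with the server handed only the RAND key when equality is first needed) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not CryptDB's actual construction: the XOR keystream "ciphers" below are stand-ins chosen so the example runs with the standard library only, they are not secure, and every function and variable name here is hypothetical.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudorandom bytes from (key, nonce) with HMAC-SHA256."""
    out = b""
    counter = 0
    while len(out) < length:
        block = hmac.new(key, nonce + counter.to_bytes(4, "big"), hashlib.sha256)
        out += block.digest()
        counter += 1
    return out[:length]


def _xor(data: bytes, stream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream))


def det_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy DET layer: same key and plaintext always give the same ciphertext."""
    return _xor(plaintext, _keystream(key, b"det", len(plaintext)))


def rand_encrypt(key: bytes, inner: bytes) -> bytes:
    """Toy RAND layer: a fresh random IV makes equal inputs look different."""
    iv = os.urandom(16)
    return iv + _xor(inner, _keystream(key, iv, len(inner)))


def rand_decrypt(key: bytes, blob: bytes) -> bytes:
    """Peel the RAND layer; this is all the server can do with the RAND key."""
    iv, body = blob[:16], blob[16:]
    return _xor(body, _keystream(key, iv, len(body)))


DET_KEY, RAND_KEY = os.urandom(32), os.urandom(32)

# The proxy stores each value as RAND(DET(value)).
column = [rand_encrypt(RAND_KEY, det_encrypt(DET_KEY, v)) for v in (b"CEO", b"VP", b"CEO")]
equal_before = column[0] == column[2]   # repetitions are hidden under RAND

# An equality query arrives: the proxy releases RAND_KEY only, the server peels.
peeled = [rand_decrypt(RAND_KEY, c) for c in column]
equal_after = peeled[0] == peeled[2]    # repetitions are now visible under DET
```

The server never receives `DET_KEY`, so after peeling it can compare ciphertexts for equality but cannot invert them, which is the point made in the answer about one key per layer.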
>> Raluca Ada Popa: So if the server has the OP encryption of a column then he knows the order. And we don't want him to know that until we need the order for sure. >>: But he also -- I understand -- you're right. line. Let's take this off >> Raluca Ada Popa: So implemented CryptDB on top of MySQL, and one of the cool parts of CryptDB is that we didn't make any change to the DBMS. And the reason is that, the reason we could go, we could not make a change is because we user defined functions. So basically whenever we wanted to change the behavior of the database server, for example, decrypt something and we invoke one of the user defined functions. This makes it be forcible. In fact, initial implementation presentation was in [inaudible] and we made it to MySQL with six lines of code change to CryptDB and mostly to the interface of DBS and talk to the server not to the core of CryptDB. >>: [inaudible]. >>: Seemed to require [inaudible]. >> Raluca Ada Popa: Actually, no, because -- so everything all tree looking up can happen UDF. User defined function. Yes. DBMS doesn't change at all. In fact, it doesn't even have to be restarted because you can load UDF libraries dynamically. >>: [inaudible]. >> Raluca Ada Popa: No, it's -- >>: [inaudible]. >> Raluca Ada Popa: Yeah. [inaudible]. >>: [inaudible]. >> Raluca Ada Popa: Excuse me? >>: [inaudible]. >> Raluca Ada Popa: Yes. Okay. So also there's no change needed to applications because the CryptDB proxy exports SQL interface for existing applications on top of CryptDB unchanged. So we valid CryptDB and in doing so we try to answer three questions. That's CryptDB support real questions and applications. Real queries and applications and what's the resulting confidentiality in terms of the onion levels, for example, the questions these guys have been asking me and what is the performance overhead. So in terms of real queries and operators we don't support those queries. 
For example, we don't support complex operators such as trigonometry, and sometimes we don't support combinations of encryption schemes that we do support individually. For example, A plus B greater than C: to do A plus B we need homomorphic encryption, and to compare to C you need order-preserving encryption, but homomorphic encryption does not preserve order. There are things you can do to support it: split it into two queries, have the client encrypt it, or compute columns, or use FHE for specific types of computation. In fact, there's a project at MIT that's a follow-up to CryptDB that does all these things and is able to support virtually all the queries, at the cost of some computation on the client side. In terms of real applications and queries, we looked at seven real applications, out of which I'm showing you five. phpBB is open source forum software in which we would like to secure private messages, for example, or private posts. HotCRP is a conference management website [inaudible] in which we would like to hide reviews, paper authors if it's anonymous, and so forth. grad-apply is an MIT graduate admissions database where we'd like to hide student grades and letters of recommendation. TPC-C is an industry benchmark, and sql.mit.edu is a large trace of queries we got from a popular SQL server at MIT that hosts thousands of applications; the trace has over 126 million queries and more than 120,000 columns, because we wanted to see what could really be supported or not. >>: [inaudible] columns. >> Raluca Ada Popa: Right. So in these applications we only encrypted the columns we deemed sensitive. So we evaluated which were sensitive, such as posts, secret posts, secret messages, things that [inaudible]. For the traces we said let's encrypt everything, even if it didn't seem sensitive at all, just to see how CryptDB would perform in that situation. But the realistic situation is the one in which you only encrypt the things that are sensitive.
Basically, for the large trace and for TPC-C almost nothing was sensitive, but let's see what happens if everything is encrypted. And actually the good news is that for the applications we supported every single query on the sensitive fields, and for TPC-C we supported all the queries on all the fields, because we encrypted them all. Now, for the large trace we didn't support one percent, or less than one percent, of the queries, and those were queries like this one, doing mathematics in the SQL query: select one over log of -- so those we cannot support. >>: [inaudible] number of [inaudible]. >>: Yes. Number of columns, yes. >>: What about all these 500 unencrypted columns of the database? You're happy leaving them in plain text? >> Raluca Ada Popa: So, yes. We examined which ones were sensitive. In fact, we were even exaggerating: for example, if the posts are private we were keeping them private, but we kept private the data posted, everything, and this is the number of values we got. For example, there are a lot of fields such as auto increments, where the database obviously knows what they are. There's no point in hiding them. So how about the resulting confidentiality level? We examined the min level for every column. The min level is the weakest encryption scheme exposed for that column. We can see that in fact most of the fields remain at RAND, which means that nothing leaks about these fields. The reason they remain at RAND is that there's no equality or inequality performed on them. They are basically just inserted and retrieved, maybe based on other fields, or maybe summed, and again that one doesn't leak anything either. So that's the good news about CryptDB. Now, some fields were at OP, the one that worries us the most because it leaks order, but very few fields were at OP. And in fact we examined manually to see what those fields were, and they looked to be less sensitive. For example, the contents of a post were all at RAND; the time when the post was made was at OP.
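Editor's note: the "min level" analysis described above can be sketched as a small function that walks a list of (column, operation) pairs and records the weakest layer each column is ever lowered to. The operation names, the ranking table, and the sample workload are assumptions made for this illustration; the talk only states that fetch and insert stay at RAND, that summation stays at HOM (counted as being as secure as RAND), that equality needs DET, and that order needs OP.

```python
# Higher rank = weaker layer = more leakage. HOM is ranked with RAND,
# following the speaker's remark that HOM is as secure as RAND.
LEAKAGE_RANK = {"RAND": 0, "HOM": 0, "DET": 1, "OP": 2}

# Layer that must be exposed for each kind of operation (hypothetical names).
REQUIRED_LAYER = {
    "insert": "RAND",
    "fetch": "RAND",
    "sum": "HOM",
    "equality": "DET",
    "order": "OP",
}


def min_levels(operations):
    """Return {column: weakest layer ever exposed} for a list of (column, op)."""
    levels = {}
    for column, op in operations:
        needed = REQUIRED_LAYER[op]
        current = levels.setdefault(column, "RAND")
        # Layers are only ever peeled, never put back, so keep the weakest.
        if LEAKAGE_RANK[needed] > LEAKAGE_RANK[current]:
            levels[column] = needed
    return levels


# Hypothetical forum workload, loosely modelled on the examples in the talk.
workload = [
    ("post_text", "insert"),
    ("post_text", "fetch"),
    ("post_time", "order"),
    ("user_id", "equality"),
    ("user_id", "fetch"),
    ("votes", "sum"),
]
result = min_levels(workload)
```

With this workload, `result` keeps `post_text` and `votes` at RAND, lowers `user_id` to DET, and lowers only `post_time` to OP, which mirrors the pattern reported for the real applications.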
>>: Based on -- you examine a trace of queries over time to see -- >>: We examine all the possible queries the application can issue, which is easy because it's a fixed set, because these are Web applications. And also for TPC-C. For sql.mit.edu we examined 126 million queries in a stream. And we couldn't examine whether these were sensitive or not, because there were too many. >>: [inaudible] the SQL queries, suggest [inaudible] TPC-C has not just SQL queries, it has a bunch of logic around it. >> Raluca Ada Popa: TPC-C or TPC-H? >>: TPC-C. TPC-H has -- >> Raluca Ada Popa: We take BenchmarkSQL, which is a benchmark for TPC-C [inaudible], and we run it on top of CryptDB. >>: [inaudible]. >> Raluca Ada Popa: Just look at the queries. >>: [inaudible] like additions, I would expect to see [inaudible] but I don't see any columns [inaudible]. >> Raluca Ada Popa: Using the -- >>: Using the [inaudible] addition failure -- >> Raluca Ada Popa: Oh, yeah, yeah, because the min level in that case is HOM, which is as secure as RAND. I'm including them all under that column. I think I even said that all these items here either don't require equality or inequality, or you just do addition. I include them here under the same sensitivity. Maybe I should have said that more clearly. TPC-C certainly has those. >>: [inaudible] on sections during [inaudible]. >> Raluca Ada Popa: Yes. Queries per second. Okay. In terms of performance, let's compare the performance of applications running on top of CryptDB to applications running on unencrypted MySQL. We looked at a lot of metrics in our paper. Here I'll present the two we consider most interesting. One is latency, that is, the time from when an application sends a query until it gets the response, and the second is server throughput, the number of queries per second the server can process. So in terms of latency, CryptDB adds on average 0.62 milliseconds per query for TPC-C. >>: [inaudible]. >> Raluca Ada Popa: Because the workload is -- >>: No, it's because of encryption.
The delta -- [inaudible] what's the size of the data? >> Raluca Ada Popa: The size of the database -- it's in memory. It's in memory. It's encryption. It's encryption cost. >>: The space is not -- >> Raluca Ada Popa: Right. And I think for -- >>: You can see larger difference between your text and encryption. >> Raluca Ada Popa: Yes, probably. >>: [inaudible] it's fitting. >> Raluca Ada Popa: It all fits in memory. And the encryption overhead in terms of space expansion I think is three times. Three times. We have a more precise evaluation in the paper. Okay. So in terms of throughput, yes. >>: In terms of operations, you have to go through multiple iterations between the proxy and the encrypted database. What is the assumption, if you go to the previous slide? What is the assumption you made on the latency between the CryptDB proxy and the CryptDB database? >> Raluca Ada Popa: So the application -- so you're saying that it has to do a bunch more -- >>: So, for example, if the encrypted database is held externally and the CryptDB proxy is on the same machine as the application, then there may be a large latency between [inaudible]. >> Raluca Ada Popa: So our setup is the following. Application, CryptDB proxy and database all on the same machine. One core. We restrict the machine to one core so we don't see other behavior. Okay. So in terms of throughput, this graph shows the queries per second depending on the number of server cores, for MySQL and CryptDB on TPC-C. And the maximum throughput loss is 26 percent. So let's understand why. Yes. >>: [inaudible]. >> Raluca Ada Popa: It's all of TPC-C, which includes updates, deletes, everything. Okay. So this graph shows you the throughput for different kinds of operations of TPC-C. We see delete, insert, for MySQL and CryptDB. So we can see that for the first part of the queries the throughput loss is actually less than 26 percent. And the reason is that in this case the server doesn't do any cryptography in this state.
Here's why: once you do an equality and you lower the level from RAND to DET, future equalities are processed directly on the data; they don't need any additional decryption. In that sense the server here does the same work as the unencrypted database, except tuples are a bit larger because of encryption. So the cost you're seeing is the cost of expansion, because the values are a bit larger because of encryption. >>: But there is an overhead inasmuch as the levels of the onion have to be undone. So the proxy still has to do multiple encryptions. Not just a single layer. It has to be every layer underneath. >>: [inaudible]. That difference -- >> Raluca Ada Popa: This is server throughput, which means that it's the number of queries per second at the server. It has nothing to do with the proxy. >>: [inaudible]. This doesn't represent. >> Raluca Ada Popa: This does not represent what -- >>: [inaudible]. >> Raluca Ada Popa: What represents the proxy is actually the latency. The latency difference, because this latency includes the cost of the proxy and everything. Good question. >>: You say that was [inaudible]. >> Raluca Ada Popa: That was average latency. If you want to know the exact breakdown, our paper has the breakdown. So, the second part: here we can see that update with increment and summation actually have a larger throughput loss than 26 percent, roughly 50 percent. And that's because the server is doing homomorphic addition [inaudible]: instead of adding two values, it multiplies some larger cryptographic numbers. But overall for TPC-C the throughput loss is 26 percent, which we think overall is practical. Yes. >>: Very small. I guess [inaudible] transaction. >> Raluca Ada Popa: I don't think so. >>: [inaudible] database transaction larger so updates, even if your data -- update the cost -- the cost would be a lot more for -- >> Raluca Ada Popa: We didn't disable, we did not disable the log.
We did not disable the log, and the MySQL setup was the same for CryptDB as for plain MySQL. >>: So what were you using for the DB? >> Raluca Ada Popa: I think we were using [inaudible] for DB actually. >>: [inaudible]. >> Raluca Ada Popa: The point is that we used the exact same setup for both of them. So we did not make any changes. >>: [inaudible] data size. That affected it the most [inaudible] item and logs have to be persistent, independent of the data. >>: Database of the system. >> Raluca Ada Popa: And we didn't disable the log. >>: [inaudible] encryption. [inaudible] we are doing rebalancing. Rebalancing that means [inaudible]. >> Raluca Ada Popa: Yes. >>: And [inaudible]. >> Raluca Ada Popa: Okay. So basically this is CryptDB. These are two papers mushed into one talk. So the CryptDB paper contains the scheme of [inaudible], that's the one. That's the one it includes. So basically there's no [inaudible] balancing, but at the same time the encryption cost is larger. These results are for CryptDB. The scheme I told you about is actually from a follow-up paper. It would be interesting to put them all together and see. >>: What happens if there's a wide [inaudible] proxy? >> Raluca Ada Popa: There's a wide -- >>: A wide proxy -- probably the most secure setup, if it's to the same -- same datacenter. [inaudible] have access [inaudible]. >> Raluca Ada Popa: Right. So the proxy and application are supposed to be not accessible to the database administrators. So -- >>: The link between the proxy and the server: with a wide link, latency would increase, and more so if you have multiple round trips. >> Raluca Ada Popa: But the latency should not affect server throughput. We're talking about how many queries per second the server can, while the server is support -- so I have a demo for you guys. A short CryptDB demo. I hope it displays properly, because we had some problems with the projector. Seems like it does. So on the -- get my cursor. Okay. So on the left side I have a CryptDB shell.
This is, for example, what an application uses. It exports a SQL interface, so things should work exactly the same. On the right side I have a shell to the MySQL server, so you get to see exactly what gets stored in the database. So let's create a table. And I'm also printing out messages to see what's happening in CryptDB. So I create a table that has two fields, name text and age integer. Now this gets transformed into a create for table 0, and you can see the OP onion, the search onion, and there's a salt, because we use salted AES, and that's a field in itself. For example, we can check the database server, and indeed that's the table that gets created. So three onions for each field, and a salt. So now let me insert into the table Alice age 19, Bob age 21, Chris age 20. We can see that in fact what the CryptDB proxy produces is really a query with encrypted values. And let's make sure that that's what's getting stored in the database server. So indeed we can see the database server contains encrypted data. So now, if the user wants to see what's inside the table, we can see that he still gets access to the decrypted data. And we see what the proxy actually does behind the scenes: it tells the server, give me all the equality onions, because those are easy to decrypt, and the salt. And then it gets back the encrypted results from the server, decrypts them, and gives them back to the user. So let's do an onion adjustment. We can say select star from T where age equals 19. As I said, the layer is RAND initially, so to process equality you have to go down to DET. So we can see that an update on the equality onion is issued, and the level therefore becomes DET. Then the actual query is issued. We can see where that field is equal to the encryption of 19. Then the encrypted results are received from the database server, and the proxy decrypts them and sends them to the user. Okay.
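Editor's note: the proxy behaviour shown in this part of the demo (issue an UPDATE that strips RAND the first time a column is compared for equality, then issue the rewritten SELECT, and skip the UPDATE on later queries) can be sketched as below. The UDF name `DECRYPT_RAND`, the `_eq` and `salt` column names, and the angle-bracket placeholders for keys and ciphertexts are all hypothetical; the sketch only builds SQL strings and performs no encryption.

```python
class OnionProxy:
    """Toy model of the proxy's per-column state for the equality onion."""

    def __init__(self):
        # column name -> current outermost layer of its equality onion
        self.eq_level = {}

    def rewrite_equality(self, table, column, value):
        """Return the SQL statements the proxy would send for `column = value`."""
        statements = []
        if self.eq_level.get(column, "RAND") == "RAND":
            # First equality on this column: hand the server the RAND-layer
            # key (and only that key) through a UDF call inside an UPDATE.
            statements.append(
                f"UPDATE {table} SET {column}_eq = "
                f"DECRYPT_RAND({column}_eq, <rand-key:{column}>, salt)"
            )
            self.eq_level[column] = "DET"  # the layer is never put back
        # The constant is encrypted by the proxy the same way as stored data.
        statements.append(
            f"SELECT * FROM {table} WHERE {column}_eq = <DET(JOIN({value!r}))>"
        )
        return statements


proxy = OnionProxy()
first = proxy.rewrite_equality("table0", "age", 19)   # UPDATE, then SELECT
second = proxy.rewrite_equality("table0", "age", 21)  # SELECT only
```

Keeping the onion level in the proxy rather than in the server is what lets later equality queries run with no extra cryptography on the server side.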
So now I'm going to show you a more interesting query. I'm going to select sum of greatest of age and 20. Basically, what the greatest operator does is take the maximum of age and 20. The ages we have are 19, 20, 21, so the greatest operator will return 20, 20, 21, and pass them to the sum operator, which is supposed to add them up. But at first sight you say: wait, didn't you tell us CryptDB cannot combine encryption schemes? It turns out that, if you're really smart, in certain cases you're not actually combining encryption schemes. You can use the greatest operator to figure out whether age is greater than 20, but then, if that's the case, you return the homomorphic encryption as opposed to returning the order-preserving encryption. So we can see, first, that it works, and second, we can see how the query was rewritten. Basically the greatest is transformed into: if the OP encryption of age is greater than the OP encryption of 20, then give me back the homomorphic field, else give me the homomorphic encryption of 20. These are passed as inputs to the aggregate user-defined function, which basically does homomorphic addition. And the encrypted result is sent back and decrypted to 61. All right. So in conclusion, if I can conclude, yes, I can conclude: CryptDB provides the first practical DBMS for running most standard queries on encrypted data. It has modest overhead and makes no changes to the DBMS. This is the website of CryptDB. It has the papers and the source code to play with, if you're interested. And thank you. [applause] >>: I have a question about usability. So encryption is kind of hard to understand even for cryptographers, especially when you throw in deterministic encryption and OP encryption, which some cryptographers probably wouldn't even say is encryption. So what kind of administrators -- do you think that administrators can understand the implications of using deterministic encryption or, God help them, OP encryption?
>> Raluca Ada Popa: I personally think it's not that hard, because there are just three very simple things to know. One, nothing leaks. One, you can tell them, leaks the histogram, equality. And one, order. So there are really just three things they have to get their head around. And one way to do that, we've been thinking about it, is to have this nice user interface basically showing each onion in three gradient colors, which one is OP, which one is equality and which one is RAND, and they can understand the security based on that. I think the safer thing to say is that whenever they have some column, they know it's sensitive. Then in CryptDB they can set a threshold in the proxy: don't go below RAND, period. If they're really worried about a certain encryption scheme or a certain column, they can set this threshold and say: this is really worrisome, I don't want you to go beyond a certain onion level. >>: Might be more -- I guess I'm -- you have HotCRP as an example. I'm concerned that a steering committee, maybe one that isn't in cryptography, might think: we have two alternatives. We could use CryptDB, which I've read about, it encrypts things, it's secure, or we could require the PC chair to have his papers in a separate database. And it seems like -- >> Raluca Ada Popa: Separate database on the same server. >>: You talk in the paper about how some HotCRP installations go to extreme lengths to keep the PC chair from seeing his conflicts, extreme lengths of having a different database for those. How does the steering committee make those choices? >> Raluca Ada Popa: I don't think you have -- so I think besides CryptDB you have no choice. You either process unencrypted data or you use CryptDB. So I don't think there's really a choice. There's no alternative. >>: All your PC conflicts are managed by the co-chair in a separate database. >> Raluca Ada Popa: Right. So that database itself computes on unencrypted data.
That one can be attacked as well. >>: You don't give the PC chair administrative rights to that. >> Raluca Ada Popa: Right. But you can still potentially have attacks on that database. Now, we -- right now if you're thinking of -- you can always have some sort of attackers in anything. >>: Particular attackers. The PC chair might accidentally or might arrange to figure out information about -- >> Raluca Ada Popa: If the whole world is vulnerable to specific attacks, maybe you can have solutions for those specific attacks. But I think the whole world is more complicated. And then either you seek solutions on encrypted data, in which case CryptDB is the only practical DBMS for that, or you don't, in which case you don't have the security of computing on encrypted data. >>: One more follow-up. For usability, might it be better not to have OP at all? Because doesn't it give the illusion of security where there's no security -- it seems like it would be better just to say, hey, administrator, you know, put it in plain text, put an exclamation point saying: administrator, anybody could decrypt this column. Saying it's encrypted with order-preserving encryption kind of provides a misleading sense -- >> Raluca Ada Popa: Basically you're saying you could have a CryptDB that only contains RAND and [inaudible], which leak nothing, and you can have deterministic and join, that class, and that's it. >>: It seems like the benefit of OPE is -- it seems like it's like giving a handgun to a child. It may be more dangerous than not -- >> Raluca Ada Popa: Because you're saying the administrators may not understand. Okay. Maybe then that's a good idea for the administrators, just to tell them to think about nothing or DET, or in fact you can make it even simpler for them, depending on what you're willing to assume they can do. For example, they can say: if it's something that's very secure, very worrisome, mark it as such.
For those, behind the scenes the CryptDB proxy is going to make sure they stay at RAND, nothing else; for the others it's going to allow OP, which is better than nothing, better than not encrypting at all. But maybe indeed, for administrators, they could just point out what's secret and what's worrisome and what's not. That will make it very simple. >>: Have you tried other benchmarks, other than TPC-H? >> Raluca Ada Popa: Uh-huh. >>: [inaudible]. >> Raluca Ada Popa: Good question. >>: Any or very few. >> Raluca Ada Popa: Good question. TPC-H is analytics queries; TPC-H has lots of complex -- >>: [inaudible]. >> Raluca Ada Popa: Exactly. Whereas CryptDB is more for OLTP-type benchmarks; that is CryptDB's goal. It wouldn't be fit for TPC-H. But actually there was work at MIT, follow-up work to CryptDB, that specifically looked at TPC-H. And they had clever database techniques such as splitting queries and maybe materializing certain queries; they wrote a smart query planner to figure out how to split dynamically. They split all of TPC-H and the overhead was at most twice; basically the overhead was less than twice in terms of throughput. And again, they had a huge database, they went to disk. So it was purely database work. >>: Forget being the data administrator. You come to me for some data about me; what promise can you give me about it not leaking out to the wrong people? And with OP [inaudible] crazy statistics, and that allows the whole database to be exposed. >> Raluca Ada Popa: I come and I tell you the following. I say, you know, you have two choices. One, you compute on unencrypted data and you don't have any protection, or two, you use CryptDB, and basically for fields [inaudible] nothing. And for DET you can leak something, and for OP you can leak something else. Basically the choice is nothing versus the security CryptDB provides. And I think it's worth it.
>>: Did you look at the amount of data that actually gets transferred between the encrypted DB and the proxy, as a comparison of the encrypted case versus the unencrypted case? >> Raluca Ada Popa: So we haven't measured that specifically. But really, what gets transferred between the two is just the query and the query results. The query contains encryptions of values, so it's slightly larger, and the results -- we don't return any additional results; we just return the actual results. The results are larger because they're encrypted, but no, we didn't look at the actual expansion factor, and I think that's because it's covered by other measurements. For example, it's covered by how large the storage becomes, which gives you a sense of the expansion for the results. >>: OP sessions you're doing [inaudible]. >> Raluca Ada Popa: As I said, OP is a different paper. For that paper, did we look at message sizes? I don't think we did, but we looked at storage expansion and we looked at ciphertext sizes. So you could maybe reconstruct those from the micro benchmarks. But it is an interesting -- yeah. >>: [inaudible] because now if your proxy is doing [inaudible] their database throughput, by which I don't mean the database server, but the application looking at the database would see proxy and database -- >> Raluca Ada Popa: I agree with throughput, but we measure throughput. I showed the throughput. >>: That was the database server. >> Raluca Ada Popa: Yes. >>: The application would be looking at the database as database server plus proxy. >> Raluca Ada Popa: Yes. We have experiments for that as well. We actually took phpBB and we looked at the throughput of the application itself, which includes everything. Proxy, database, everything. And actually there the throughput loss was even less, because of all the overhead of PHP, I think. So a throughput loss of something like four percent, five percent.
That's because Web applications are so slow. But there we counted everything, including -- >>: That was the point that was being made, which is that if you had unencrypted data you don't have a proxy [inaudible], but with the proxy, which might be doing multiple rounds, overall throughput might reduce. Ignoring current Web applications, PHP. >> Raluca Ada Popa: Right. I agree, but you can always have proxies in parallel, and I guess the database server, at least the way we look at it, is the bottleneck, because multiple applications share the same database server, whereas each database can have its own proxy, so it's not as important as the throughput of the database server. But I agree with you. >> Maybe we can stop here and take questions off line. >> Raluca Ada Popa: I'll hang out. I'll be around. [applause]