The Role of Cryptography in the Database as a Service Model

Hemavathy Alaganandam

Contents

1. INTRODUCTION
2. DAS Scenario
3. Data Privacy 1st challenge
   3.1 Software level encryption
   3.2 Hardware level encryption
   3.3 Encryption Penalty
4. Data Privacy 2nd Challenge
   4.1 Relation Encryption and Storage Model
   4.2 Mapping Conditions (MapCond)
   4.3 Implementing Relational Operators over Encrypted Relations
   4.4 Problems with the Strategy
5. CONCLUSIONS
6. REFERENCES

1. INTRODUCTION

The "Database as a Service" (DAS) model lets users create, store, modify, and retrieve data from anywhere in the world, as long as they have access to the Internet. It introduces several challenges, an important one being data privacy. There are two main privacy issues. First, the owner of the data needs to be assured that the data stored on the service-provider site is protected against theft by outsiders. Second, the data needs to be protected even from the service provider, if the provider itself cannot be trusted.

In this paper, I survey the research on both challenges, focusing specifically on techniques for executing SQL queries over encrypted data. The strategy in the papers I read is to process as much of the query as possible at the service provider's site without decrypting the data; decryption and the remainder of the query processing are performed at the client site. The basic idea is similar in most papers, with each paper trying to overcome the drawbacks of the other solutions.

The rest of the paper is organized as follows. Section 2 presents the Database as a Service scenario. Sections 3 and 4 discuss the data privacy challenges and their solutions. I conclude the paper in Section 5.

2. DAS Scenario

The DAS scenario involves four main entities (see Figure 1):

Data owner: an organization that produces data to be made available for controlled external release;
User: a human entity that presents requests (queries) to the system;
Client: a front-end that transforms the user's queries into queries on the encrypted data stored on the server;
Server: an organization that receives the encrypted data from a data owner and makes it available for distribution to clients.

Clients and data owners are assumed to trust the server to faithfully maintain outsourced data. Specifically, the server is relied upon for the availability of outsourced databases. However, the server is not trusted with the confidentiality of the actual database content, and it should be prevented from making unauthorized access to the data stored in the database. To this end, the data owner encrypts her data and gives the encrypted database to the server. The end users, by contrast, are trusted to access the database according to the data owner's policy.

Figure 1: The service-provider architecture.

3. Data Privacy 1st challenge

If database as a service is to be successful, and customer data is to reside on the site of the database service provider, then the service provider needs to find a way to preserve the privacy of the user data. Security measures need to be in place so that even if the data is stolen, the thief cannot make sense of it. Encryption is the natural technique to solve this problem.

There are two dimensions to encryption support in databases. One is the granularity of the data to be encrypted or decrypted: the field, the row, or the page (typically 4 KB). The field may appear to be the best choice, because it would minimize the number of bytes encrypted. However, practical methods of embedding encryption within relational databases entail a significant start-up cost for each encryption operation.
Row- or page-level encryption amortizes this cost over a larger amount of data. The second dimension is software- versus hardware-level implementation of the encryption algorithms.

3.1 Software level encryption

Symmetric ciphers perform much better than asymmetric ones, but block size still matters. Blowfish, for example, is a 64-bit block cipher, which means that data is encrypted and decrypted in 64-bit chunks. This has implications for short data: even an 8-bit value, when encrypted by the algorithm, results in 64 bits of ciphertext.

In [4], a Blowfish implementation was registered in the database as a user-defined function (UDF). Once registered, it could be used to encrypt the data in one or more fields: whenever data is inserted into the chosen fields, the values are encrypted before being stored, and on read access the stored data is decrypted before being operated upon.

For example, to encrypt the column discount of a table called lineitem with a user-defined function called "encrypt", and to decrypt it with a user-defined function called "decrypt", one would use the following SQL command to insert data into the table lineitem:

    insert into lineitem (discount) values (encrypt(10,key))

The statement to select the encrypted field is given next:

    select decrypt(discount,key) from lineitem where custid = 300

In this approach the creator of the encrypted data supplies the key, and the database provides the encryption function. Only those users who are given the key can decrypt the data using the decryption algorithm. Since the key is owned by the creator and is not stored at the site of the database service provider, an unauthorized person who gets hold of the disk files cannot get hold of the key. In fact, even employees of the database service provider do not have access to the encryption key. The data in the database inherits the full security provided by the encryption algorithm.
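The UDF mechanism described above can be sketched in Python with the standard sqlite3 module. This is only an illustration under stated assumptions, not the setup used in [4]: SQLite stands in for the DBMS, a toy XOR stream cipher (not secure) stands in for Blowfish, and the key value, the custid column, and the function bodies are hypothetical.

```python
import hashlib
import sqlite3


def _keystream(key, length):
    """Derive `length` bytes from the key (toy construction, NOT secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(f"{key}:{counter}".encode()).digest()
        counter += 1
    return out[:length]


def encrypt(value, key):
    """Stand-in for the Blowfish UDF: XOR the value with a key-derived stream."""
    plain = str(value).encode()
    return bytes(p ^ k for p, k in zip(plain, _keystream(key, len(plain))))


def decrypt(blob, key):
    """Inverse of encrypt; returns the original value as text."""
    plain = bytes(c ^ k for c, k in zip(blob, _keystream(key, len(blob))))
    return plain.decode(errors="replace")


conn = sqlite3.connect(":memory:")
# Register the two functions so that SQL statements can call them by name.
conn.create_function("encrypt", 2, encrypt)
conn.create_function("decrypt", 2, decrypt)
conn.execute("CREATE TABLE lineitem (custid INTEGER, discount BLOB)")

KEY = "owner-secret"  # hypothetical key; supplied by the data creator per query
conn.execute(
    "INSERT INTO lineitem (custid, discount) VALUES (?, encrypt(?, ?))",
    (300, 10, KEY),
)

# What is actually stored in the table: ciphertext bytes, not the number 10.
stored = conn.execute(
    "SELECT discount FROM lineitem WHERE custid = 300"
).fetchone()[0]

# Reading the field back requires the key.
recovered = conn.execute(
    "SELECT decrypt(discount, ?) FROM lineitem WHERE custid = 300", (KEY,)
).fetchone()[0]
print(stored, recovered)
```

The key is passed as a query parameter and is never written to the table, which mirrors the point that the key stays with the creator of the data.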
3.2 Hardware level encryption

Specialized encryption hardware, the IBM S/390 Cryptographic Coprocessor, is available in the IBM OS/390 environment with the Integrated Cryptographic Service Facility (ICSF) libraries. IBM DB2 for OS/390 provides a facility called "editproc" (or edit routine), which can be associated with a database table. An edit routine is invoked for a whole row of the database table whenever the row is accessed by the DBMS. An encryption/decryption edit routine can be registered for the tables. When a read/write request arrives for a row in one of these tables, the edit routine invokes the encryption/decryption algorithm, which is implemented in hardware, for the whole row. In [4], they used the algorithm option for encryption hardware.

3.3 Encryption Penalty

The response time for a query on encrypted data increases due to both the cost of decryption and the routine and/or hardware invocations in DB2. This increase is referred to as the encryption penalty. Software field-level encryption was found to be particularly CPU intensive: as the number of rows increases, query execution time grows very sharply with software-level encryption. Hardware-level encryption, on the other hand, showed an almost perfectly linear increase.

4. Data Privacy 2nd Challenge

The second challenge is that of "total" data privacy, which is more complex since it includes protection from the database provider. The requirement is that encrypted data may not be decrypted at the provider site. A straightforward approach is to transmit the requisite encrypted tables from the server (at the provider site) to the client, decrypt the tables, and execute the query at the client. But this approach negates almost every advantage of the service-provider model, since the primary data processing now has to occur on client machines.
For this reason, the encrypted database is augmented with additional information (an index) that allows a certain amount of query processing to occur at the server without jeopardizing data privacy. The client also maintains metadata for translating user queries to the appropriate representation on the server, and performs post-processing on server query results. Based on the auxiliary information stored, Hacigumus et al. [3] develop techniques to split an original query over unencrypted relations into (1) a corresponding query over encrypted relations to run on the server, and (2) a client query for post-processing the results of the server query.

4.1 Relation Encryption and Storage Model

For each relation R(A1, A2, ..., An), an encrypted relation R^S(etuple, A1^S, A2^S, ..., An^S) is stored on the server. The attribute etuple stores an encrypted string that corresponds to a tuple in relation R. Each attribute Ai^S corresponds to the index for the attribute Ai that will be used for query processing at the server. For example, consider a relation emp below that stores information about employees.

    eid   ename   salary   addr       did
    23    Tom     70K      Maple      40
    860   Mary    60K      Main       80
    320   John    50K      River      50
    875   Jerry   55K      Hopewell   110

The emp table is mapped to a corresponding table at the server:

    emp^S(etuple, eid^S, ename^S, salary^S, addr^S, did^S)

It is only necessary to create an index for attributes involved in search and join predicates. In the above example, if it is known that no query will involve the attribute addr in either a selection or a join, then the index on this attribute need not be created.

Partition Functions

The domain of values (Di) of attribute R.Ai is mapped into partitions {p1, ..., pk}, such that (1) these partitions taken together cover the whole domain; and (2) no two partitions overlap. As an example, consider the attribute eid of the emp table above. Suppose the values in the domain of this attribute lie in the range [0, 1000].
Assume that the whole range is divided into 5 partitions:

    partition(emp.eid) = {[0,200], (200,400], (400,600], (600,800], (800,1000]}

Different attributes may be partitioned using different partition functions. The partition of attribute Ai corresponds to a splitting of its domain into a set of buckets. Any histogram-construction technique, such as MaxDiff, equi-width, or equi-depth, could be used to create the partitioning of attributes.

Identification Functions

The identification function assigns an identifier ident_R.Ai(pj) to each partition pj of attribute Ai. For instance, as shown below, an identifier is assigned to each range of employee ids:

    Partition    [0,200]   (200,400]   (400,600]   (600,800]   (800,1000]
    Identifier   2         7           5           1           4

    Partition and identification functions of emp.eid

The ident function value should be unique, so a collision-free hash function is a good choice. For example, in the case where a partition corresponds to a numeric range, the hash function may use the start and/or end values of the range.

Mapping Functions

The mapping function Map_R.Ai maps a value v in the domain of attribute Ai to the identifier of the partition to which v belongs: Map_R.Ai(v) = ident_R.Ai(pj), where pj is the partition that contains v. In the example above, for instance, Map_emp.eid(23) = 2 and Map_emp.eid(860) = 4.

There are two types of mapping functions:

1. Order preserving: A mapping function Map_R.Ai is called order preserving if, for any two values vi and vj in the domain of Ai, vi < vj implies Map_R.Ai(vi) <= Map_R.Ai(vj). (The inequality on identifiers cannot be strict, since two distinct values in the same partition share one identifier.)
2. Random: A mapping function is called random if it is not order preserving.

A random mapping function provides superior privacy compared to its corresponding order-preserving mapping. The choice of whether a mapping function is order preserving affects query translation: translation is simpler with an order-preserving mapping function.
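The partition, identification, and mapping functions for emp.eid can be sketched in a few lines of Python. This is a minimal sketch that uses the partition bounds and identifiers from the example; the function names and the alternative order-preserving identifier list are my own assumptions for illustration.

```python
from bisect import bisect_left

# Upper bounds of the eid partitions [0,200], (200,400], ..., (800,1000].
UPPER_BOUNDS = [200, 400, 600, 800, 1000]

# Identifiers assigned to the partitions in the example (a random mapping).
RANDOM_IDENTS = [2, 7, 5, 1, 4]

# Hypothetical order-preserving alternative: identifiers grow with the ranges.
ORDERED_IDENTS = [1, 2, 3, 4, 5]


def partition_index(v):
    """Return the position of the partition that contains v."""
    if not 0 <= v <= UPPER_BOUNDS[-1]:
        raise ValueError(f"{v} is outside the domain [0, {UPPER_BOUNDS[-1]}]")
    # bisect_left finds the first upper bound >= v, so 200 falls in [0,200].
    return bisect_left(UPPER_BOUNDS, v)


def map_eid(v, idents=RANDOM_IDENTS):
    """Mapping function: map a value to the identifier of its partition."""
    return idents[partition_index(v)]


def is_order_preserving(idents):
    """True if the identifiers increase along with the partition ranges."""
    return all(a < b for a, b in zip(idents, idents[1:]))


print(map_eid(23), map_eid(860))           # identifiers of the two example values
print(is_order_preserving(RANDOM_IDENTS))  # checks the example's identifier list
```

With the identifiers of the example, is_order_preserving reports that the mapping is random, which is the case that needs the set-based query translation.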
Storing Encrypted Data

For each tuple t = (a1, a2, ..., an) in R, the relation R^S stores a tuple:

    (encrypt({a1, a2, ..., an}), Map_R.A1(a1), Map_R.A2(a2), ..., Map_R.An(an))

where encrypt is the function used to encrypt a tuple of the relation. For instance, the following is the encrypted relation emp^S stored on the server:

    etuple                eid^S   ename^S   salary^S   addr^S   did^S
    1100110011110010...   2       19        81         18       2
    1000000000011101...   4       31        59         41       4
    1111101000010001...   7       7         7          22       2
    1010101010111110...   4       71        49         22       4

    Corresponding employee table on the server

The first column, etuple, contains the strings corresponding to the encrypted tuples in emp. For instance, the first tuple is encrypted to "1100110011110010..." which is equal to encrypt(23, Tom, 70K, Maple, 40). Any encryption technique, such as the block ciphers AES, Blowfish, and DES, or the asymmetric cipher RSA, can be used to encrypt the tuples. The second column corresponds to the index on the employee ids. For example, the value of the attribute eid in the first tuple is 23, and its corresponding partition is [0,200]. Since this partition is assigned the identifier 2, we store the value 2 as the identifier of the eid for this tuple.

In general, the operator E ("Encrypt") maps a relation R to its encrypted representation. That is, given a relation R(A1, A2, ..., An), the relation E(R) is R^S(etuple, A1^S, A2^S, ..., An^S). In the above example, E(emp) is the table emp^S.

Decryption Functions

Given the operator E that maps a relation to its encrypted representation, the inverse operator D maps the encrypted representation to its corresponding unencrypted representation. That is, D(R^S) = R. In the example above, D(emp^S) = emp: D decrypts the encrypted tuples in emp^S and drops the auxiliary columns corresponding to the indices.

4.2 Mapping Conditions (MapCond)

For each relation, the server side stores the encrypted tuples, along with the attribute indices determined by their mapping functions.
Meanwhile, the client stores the metadata about the specific indices, such as the information about the partitioning of attributes, the mapping functions, etc. The client uses this information to translate a given query Q to its server-side representation Q^S, which is then executed by the server. Let us consider different query conditions.

Condition: Attribute op Value (op can be =, <, >, <=, >=)

1) Attribute = Value: Here the Mapcond function simply maps the value to a partition identifier. E.g., Mapcond(eid = 860) => eid^S = 4, since eid = 860 is mapped to 4 by the mapping function of this attribute.

2) Attribute < Value or Attribute > Value: Depending on whether the mapping function Map_Ai of the attribute is order preserving or random, different translations are possible.

• Order preserving: In this case, the translation is straightforward: Mapcond(Ai < v) => Ai^S <= Map_Ai(v)
• Random: Mapcond(Ai < v) => Ai^S IN Map<_Ai(v), where Map<_Ai(v) is the set of identifiers of the partitions that may contain a value less than v (Map>_Ai(v) is defined analogously for Ai > v). For example, Mapcond(eid < 280) => eid^S IN {2, 7}, since all employee ids less than 280 fall in the two partitions [0,200] and (200,400], whose identifiers are 2 and 7.

Condition: Attribute op Attribute (op can be =, <, >, <=, >=)

1) Attribute1 = Attribute2: Such a condition might arise in a join or a selection. We consider all possible pairs of partitions of Ai and Aj that overlap.

    Partition        [0,100]   (100,200]   (200,300]   (300,400]
    ident(emp.did)   2         4           3           1

    Partition        [0,200]   (200,400]
    ident(mgr.did)   9         8

For instance, the tables above show the partition and identification functions of two attributes, emp.did and mgr.did. The condition emp.did = mgr.did is then translated to the following condition C1:

    C1: (emp^S.did^S = 2 AND mgr^S.did^S = 9)
        OR (emp^S.did^S = 4 AND mgr^S.did^S = 9)
        OR (emp^S.did^S = 3 AND mgr^S.did^S = 8)
        OR (emp^S.did^S = 1 AND mgr^S.did^S = 8)
2) Attribute1 < Attribute2: This can be dealt with in the same way.

4.3 Implementing Relational Operators over Encrypted Relations

This section describes how individual relational operators (such as selections, joins, and grouping operators) can be implemented in the proposed database architecture; I show a few examples below. The strategy is to partition the computation of the operators across the client and the server. Specifically, the server computes a superset of the answers generated by the operator, using the attribute indices it stores. These answers are then filtered at the client after decryption to generate the true results. The goal is to keep the work done at the client to a minimum.

The Selection Operator: Consider a selection operation σ_C(R) on a relation R, where C is a condition specified on one or more of the attributes A1, A2, ..., An of R. A straightforward implementation of such an operator in our environment is to transmit the relation R^S from the server to the client. The client then decrypts the result using the D operator and implements the selection. This strategy, however, pushes the entire work of implementing the selection to the client. In addition, the entire encrypted relation needs to be transmitted from the server to the client. An alternative mechanism is to partially compute the selection operator at the server using the indices associated with the attributes in C, and push the results to the client. The client decrypts the results and filters out the tuples that do not satisfy C.

E.g., the selection σ_C(emp) with C = (eid < 395 AND did = 140) is translated to a client-side selection on C applied to D(server-side selection on Mapcond(C) over emp^S), where the condition on the server is:

    Mapcond(C) = (eid^S IN {2, 7} AND did^S = 4)

The query is first executed at the server based on the corresponding Mapcond. The results are decrypted at the client, and the selection is then applied once more to eliminate the spurious rows.
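The split of a selection between server and client can be sketched as follows. This is a minimal sketch under several assumptions of mine: only eid and did are indexed, the did partitioning is borrowed from the emp.did/mgr.did join example, a toy XOR stream cipher (not secure) stands in for the tuple encryption, and all function names are hypothetical. The query uses did = 40 rather than did = 140 so that the example table returns a row.

```python
import hashlib
import json
from bisect import bisect_left

KEY = b"client-only-secret"  # hypothetical key; it never leaves the client

# Client-side metadata: (partition upper bounds, identifiers) per indexed attribute.
META = {
    "eid": ([200, 400, 600, 800, 1000], [2, 7, 5, 1, 4]),
    "did": ([100, 200, 300, 400], [2, 4, 3, 1]),
}


def _xor(data):
    """Toy XOR stream cipher (NOT secure); applying it twice restores the input."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(KEY + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


def encrypt_tuple(row):
    return _xor(json.dumps(row).encode())


def decrypt_tuple(etuple):
    return json.loads(_xor(etuple).decode())


def map_value(attr, v):
    """Map a value to the identifier of the partition that contains it."""
    bounds, idents = META[attr]
    return idents[bisect_left(bounds, v)]


def map_less_than(attr, v):
    """Identifiers of every partition that may hold a value below v (a safe superset)."""
    bounds, idents = META[attr]
    return set(idents[: bisect_left(bounds, v) + 1])


EMP = [
    {"eid": 23, "ename": "Tom", "salary": "70K", "addr": "Maple", "did": 40},
    {"eid": 860, "ename": "Mary", "salary": "60K", "addr": "Main", "did": 80},
    {"eid": 320, "ename": "John", "salary": "50K", "addr": "River", "did": 50},
    {"eid": 875, "ename": "Jerry", "salary": "55K", "addr": "Hopewell", "did": 110},
]

# What the server stores: the encrypted tuple plus the index values only.
EMP_S = [
    (encrypt_tuple(r), map_value("eid", r["eid"]), map_value("did", r["did"]))
    for r in EMP
]


def server_select(eid_ids, did_id):
    """Server side: filter on index values without decrypting anything."""
    return [e for e, eid_s, did_s in EMP_S if eid_s in eid_ids and did_s == did_id]


def client_select(max_eid, did):
    """Client side: translate the condition, then decrypt and re-filter the superset."""
    candidates = server_select(map_less_than("eid", max_eid), map_value("did", did))
    rows = [decrypt_tuple(e) for e in candidates]
    return candidates, [r for r in rows if r["eid"] < max_eid and r["did"] == did]


candidates, result = client_select(max_eid=395, did=40)
print(len(candidates), [r["ename"] for r in result])
```

Here the server returns two candidate tuples, because the rows for Tom and John fall into the same eid and did buckets as the query constants; the client-side filter then discards John's row, whose did is 50, as spurious.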
The Join Operator: The join is handled in a similar way. The server joins the encrypted relations on the translated condition, and the client decrypts the returned rows and applies the join condition again to eliminate the false rows.

The Grouping and Aggregation Operator: The basic idea is that the grouping is done at the server side on the index values (of course, the resulting groups will contain rows that do not belong together). The server does not perform any aggregation, since it does not have the values of the aggregated attributes. The results are returned to the client, which performs the grouping operation again. This operation can be implemented very efficiently, since every tuple belonging to a single group x will be in a single group x^S computed by the server. As a result, the client only needs to consider the tuples in a single x^S group when computing the group corresponding to x. The aggregation functions are computed at the client, since their computation requires that the tuples first be decrypted.

The Sorting Operator: Sorting is similar to grouping. The amount of post-processing work done at the client depends on whether the attributes listed in the sort have order-preserving mappings. If they do, the results returned by the server are presorted up to the level of a partition, so sorting the results is a simple local operation over one partition at a time. Even if the mapping is not order preserving, it is still useful to compute the grouping at the server to reduce the amount of client work: since the tuples have been grouped by the server, sorting can be implemented efficiently using a merge-sort algorithm.

Thus, given a query Q, a strategy can be developed to split the computation of Q across the server and the client. The server uses the implementations of the relational operators discussed above to compute as much of the query as possible, relegating the remainder of the computation to the client.
The objective is to come up with the "best" query plan for Q, that is, the plan that minimizes the execution cost.

4.4 Problems with the Strategy

In this section I briefly go through some of the problems with the approach of Hacigumus et al. and list a few solutions proposed in other papers.

1) It assumes that the client has complete access to the query result. This assumption does not fit real-world applications, where different users may have different access privileges. In [1], Damiani et al. investigated a solution that implements a selective access policy through cryptography, introducing a method that exploits a tree hierarchy for key management.

2) A major challenge in this scenario is how to compute and represent the indexing information. Two conflicting requirements complicate the solution: on one side, the indexing information should be related to the data well enough to provide an effective query execution mechanism; on the other side, the relationship between indexes and data should not open the door to inference and linking attacks that can compromise the protection granted by encryption. In [2], Damiani et al. proposed using as the index the result of a secure hash function over the attribute values, rather than straightforwardly encrypting the attributes; this way, the distribution of the attribute values can be flattened by the hash function.

3) Another serious weakness of the approach is that it outputs false joining records, which greatly increases the cost of decrypting records and severely degrades query performance. In [2], Damiani et al. proposed constructing an encrypted B+-tree over the field before encrypting it, which helps fetch exactly the rows that the query needs, so that only those rows are decrypted. The problem with this approach is the deterioration in performance that it can introduce, due to the need to execute a series of queries to navigate the B+-tree in order to identify the tuples belonging to the interval.
Thus we see that every approach has its own problems, and there is still room for progress in this area.

5. CONCLUSIONS

The Application Service Provider (ASP) model for enterprise computing has emerged with the rise of Internet technologies. In the ASP model, a service provider can provide software as a service to a very large client base over the Internet. Unlike many other services, however, databases are special: data is a precious resource of an enterprise. As a result, the privacy and security of data at the service-provider site are paramount.

Data privacy can be achieved by using a suitable encryption algorithm. The solution is to encrypt the data before storing it at the service provider, so that it can only be decrypted by the owner. Several techniques have been developed with which the bulk of the work of executing SQL queries can be done by the service provider without the need to decrypt the stored data. These techniques deploy a "coarse index", which allows partial execution of an SQL query on the provider side. The result of this query is sent to the client, and the correct result is found by decrypting the data and executing a compensation query at the client site.

The service provider retains the responsibility for managing the persistence of the data. The client gets total privacy at the cost of cooperating with the service provider in query execution. The client does not need to manage data persistence and thus continues to benefit from the system management service of the database service provider. Thus, with cryptography, database as a service is a viable model and has a good chance of emerging as a successful commercial offering for some applications.

6. REFERENCES

[1] DAMIANI et al. Key Management for Multi-User Encrypted Databases.
[2] DAMIANI, E., DE CAPITANI DI VIMERCATI, S., JAJODIA, S., PARABOSCHI, S., AND SAMARATI, P. 2003. Balancing confidentiality and efficiency in untrusted relational DBMSs.
In Proceedings of the 10th ACM Conference on Computer and Communications Security, Washington, DC. ACM Press, New York.
[3] HACIGUMUS, H., IYER, B., LI, C., AND MEHROTRA, S. 2002a. Executing SQL over encrypted data in the database-service-provider model. In Proceedings of ACM SIGMOD 2002, Madison, WI. ACM Press, New York.
[4] HACIGUMUS, H., IYER, B., AND MEHROTRA, S. 2002b. Providing database as a service. In Proceedings of the 18th International Conference on Data Engineering, San Jose, CA. IEEE Computer Society Press, Los Alamitos, CA.