SDB: A Secure Query Processing System with Data Interoperability Zhian He† † Wai Kit Wong* Rongbin Li§ Ben Kao§ Siu Ming Yiu§ David Wai Lok Cheung§ Eric Lo† Department of Computing, The Hong Kong Polytechnic University * Department of Computing, Hang Seng Management College § Department of Computer Science, The University of Hong Kong {cszahe, ericlo}@comp.polyu.edu.hk, wongwk@hsmc.edu.hk, {kao, dcheung, rbli, smyiu}@cs.hku.hk ABSTRACT We address security issues in a cloud database system which employs the DBaaS model — a data owner (DO) exports data to a cloud database service provider (SP). To provide data security, sensitive data is encrypted by the DO before it is uploaded to the SP. Compared to existing secure query processing systems like CryptDB [7] and MONOMI [8], in which data operations (e.g., comparison or addition) are supported by specialized encryption schemes, our demo system, SDB, is implemented based on a set of datainteroperable secure operators, i.e., the output of an operator can be used as input of another operator. As a result, SDB can support a wide range of complex queries (e.g., all TPC-H queries) efficiently. In this demonstration, we show how our SDB prototype supports secure query processing on complex workload like TPC-H. We also demonstrate how our system protects sensitive information from malicious attackers. 1. INTRODUCTION Advances in cloud computing has recently led to much research works on the technological development of cloud database systems that deploy the Database-as-a-service model (DBaaS). Commercial cloud database services, such as Amazon’s RDS1 and Microsoft’s SQL Azure2 , are also available. Under the DBaaS model, a data owner (DO) uploads its database to a service provider (SP), which hosts high-performance machines and sophisticated database software to process queries on behalf of the DO. The SP thus provides storage, computation and administration services. There are numerous advantages of outsourcing database services, such as highly scalable and elastic computation to handle bursty workloads. Also, with multi-tenancy, cloud databases can greatly reduce the total cost of ownership. 1 2 http://aws.amazon.com/rds/ https://sql.azure.com/ This work is licensed under the Creative Commons AttributionNonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing info@vldb.org. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 12 Copyright 2015 VLDB Endowment 2150-8097/15/08. An important issue of cloud database applications is data security. To protect sensitive data, the plain values of data should not be revealed even to the SP [5]. The common practice is to encrypt sensitive data before it is uploaded to the SP. The SP thus provides a reliable repository with storage and administration services (such as backup and recovery). To process queries, the encrypted data has to be shipped back to the DO, which has to process the sensitive data by itself. The powerful computation services given by the SP is mostly lost. In order to leverage the computation resources of the SP in query processing, a few secure query processing systems, like CryptDB [7] and MONOMI [8], have been developed. A weakness of these systems is that each data operation (e.g., comparison or addition) is supported by a specialized encryption scheme. These schemes are generally not data interoperable, i.e., the output of an operator cannot be used as input of another because different operators employ different encryption methods. As a result, existing approaches provide limited native supports to complex queries that involve multiple types of operators. For instance, CryptDB can only support 4 out of 22 TPC-H queries without significantly involving the DO or extensive precomputation in query processing. Trusted DB [3] and Cipherbase [2] take a hardware approach to provide data security. Since specific hardware is required, these approaches are generally more expensive than software approaches, which can be implemented using off-the-shelf machines. A detailed discussion of the above systems can be found in [9]. Our system, SDB, takes another approach that is based on the Secure Multiparty Computation (SMC) model [11]. With secret sharing, each plain value is decomposed into several shares and each share is kept by one of multiple parties. While no party can recover the plain values by its own shares, a protocol executed by the parties can be defined to compute a deterministic function of the values. To implement the protocol, SDB provides a set of userdefined functions (UDFs) at the SP such that secure query processing can be performed on any relational engine that supports UDF (e.g., PostgreSQL, Vertica, Hive, Spark SQL [1], etc.). A distinguishing feature of SDB is that all UDFs operate on secret shares and so they all work on data in the same encrypted space. SDB thus provides data interoperablility, which allows a wide range of complex queries to be expressed and processed. As an example, all TPC-H queries can be natively processed by SDB. 2. TECHNICAL OVERVIEW In this section, we provide a brief overview of the SDB system [9]. We first describe our secret sharing scheme and then talk about 1876 exponent of the above formula is computed in modular φ(n). So, we write, the system architecture. 2.1 Secret Sharing We employ a secret sharing scheme between the DO and the SP. Each sensitive data item v is split into two shares, one kept at the DO and another at the SP. We use JvK to denote a sensitive value (the JK symbolizes that the value is secret and should be kept in a safe). We call the share of JvK to be kept at the DO, denoted by vk , the item key of JvK. The share of JvK kept at the SP is denoted by ve , which is regarded as the encrypted value of JvK. The security goal is to prevent attackers from recovering the set of sensitive data JvK’s given their encrypted values ve ’s. The whole procedure is described below. The DO maintains a secret numbers g and a public key n. The number n is generated according to the RSA method, i.e., n is the product of two big random prime numbers ρ1 , ρ2 . The number g is a positive number that is co-prime with n.3 Define, φ(n) = (ρ1 − 1)(ρ2 − 1). (1) We have, based on the property of RSA, (aed mod n = a) ∀a, e, d such that ed mod φ(n) = 1. Consider a sensitive column A of a relational table T of N rows t1 , . . . , tN . The DO assigns to each row ti in T a random row id ri . Moreover, the DO randomly generates a column key ck A = hm, xi, which consists of a pair of random numbers. We require 0 < ri , m, x < n. To simplify our discussion, let us assume that the schema of T is (row-id, A). (Additional columns of T , if any, can be handled similarly.) The idea is to store table T encrypted on the SP. This consists of two parts: (1) Sensitive values in column A are encrypted using secret sharing based on the column key ck A = hm, xi and the row ids. (2) Since the row ids are used in encrypting column A’s values, the row ids have to be encrypted themselves. In our implementation, row ids are encrypted by an existing encryption scheme SIES [6]. The reason why row-id and A are encrypted differently is that row ids are never operated on by our secure operators (i.e., we assume row ids are not part of user queries). Hence, a simpler encryption method suffices. On the other hand, sensitive data is encrypted using secret sharing so that computational protocols can be defined to implement our secure operators. The secret sharing encryption process consists of two steps: Step 1 (item key generation). Consider a row with row id r and a sensitive value (of A) JvK. Under secret sharing, our objective is to split JvK into an item key vk and an encrypted value ve . Conceptually, ve is kept at the SP and vk is kept at the DO. Since we want to minimize the storage requirement of the DO, the item key vk is materialized on demand and is generated from the column key ck A (which is stored at the DO) and the row id r, which is stored encrypted at the SP. Specifically, § Definition 1. (Item key generation) Given a row id r and a column key ck A = hm, xi, the item key vk is given by, vk = gen(r, hm, xi) = mg (rx mod φ(n)) mod n. For simplicity, in the following discussion, we sometimes omit “mod φ(n)” in various expressions with an understanding that the vk = gen(r, hm, xi) = mg rx mod n. (2) Step 2 (Share computation). Shares of JvK are determined by a multiplicative secret sharing scheme. While vk is one of the share, the other share ve , which is considered the encrypted value of JvK, is computed by the following encryption function E. § Definition 2. (Encrypted value) Given a sensitive value JvK and its item key vk , the encrypted value ve is given by, vk−1 ve = E(JvK, vk ) = JvKvk−1 mod n, (3) where denotes the modular multiplicative inverse of vk , i.e., vk vk−1 mod n = 1. To recover JvK, one needs both shares vk and ve and compute JvK = D(ve , vk ) = ve vk mod n. (4) Figure 1 summarizes the whole encryption procedure and illustrates how sensitive data (e.g., a column A) is transformed into encrypted values ve ’s. It also shows that the DO only needs to maintain a column key, while the SP stores the bulk of the data. row-id row-id E(r) item key vk encrypted value ve E(1) 8 9 4 E(2) 32 22 3 E(8) 32 34 A [[v]] A [[v]] r 2 1 2 4 2 3 8 secret sharing vk = gen(r,<m,x>) E(r) DO ve column key ckA=<m,x> E(1) 9 <2,2> E(2) 22 E(8) 34 SP Figure 1: Encryption procedure (g = 2, n = 35) 2.2 System Architecture The original architecture of SDB that we published in [9] consists of a standalone secure query processing engine that is built on top of a traditional relational engine. That architecture simply treated the relational engine as a data store and did not leverage any of its access methods and fault-tolerance components. In this demonstration, we present our new SDB architecture. Figure 2 shows the new architecture of SDB. The architecture consists of two parts: (1) a lightweight SDB proxy at the DO and (2) a relational engine with a set of UDFs provided by SDB at the SP. In our prototype, we integrate SDB with Spark SQL by implementing the secure operators in a set of Hive UDFs. SDB can easily support any other relational engine by implementing a set of UDFs that work with that particular system. This new architecture pushes all the computations back to the underlying engine through UDFs. Consequently, SDB now enjoys all the benefits such as fault-tolerance, parallel-execution, and scalability provided by the underlying Spark SQL engine. The SDB proxy is responsible for: • Storing column keys for sensitive data in its key store. 3 In our implementation, ρ1 and ρ2 are 1024-bit numbers and so n is 2048-bit. • Accepting SQL queries from the application. 1877 Data Owner (DO) Query Rewrite 2 1 Service Provider (SP) 3 SDB UDFs Rewritten Query Query Application SDB Proxy Result key store 5 Encrypted Result SparkSQL SparkSQL Spark SQL 4 Figure 2: SDB Architecture • Rewriting the SQL operators that involve sensitive columns to their corresponding UDFs, and then submitting rewritten queries to the SP. • Receiving encrypted results from the SP and decrypting them using their corresponding column keys. • Finally, sending the decrypted results back to the application. The SP provides an unmodified relational engine with a set of SDB UDFs. The engine at the SP is responsible for: • Storing the plain values of insensitive data and the secrete shares of sensitive data. • Processing rewritten queries. • Returning encrypted results back to the SDB proxy. Let us illustrate query rewriting by an example. Consider two sensitive columns A and B of a table T , whose column keys are ckA = hmA , xA i and ckB = hmB , xB i , respectively. Suppose an application issues the following query: SELECT A × B AS C FROM T The SDB proxy will rewrite the above query as below: SELECT row−id, sdb multiply(Ae , Be , n) AS Ce FROM T where Ae , Be and Ce stand for the encrypted values of columns A, B, and C, respectively. The number n is the public key stored at the DO as mentioned before. According to the protocol of multiplication described in [9], the actual computation carried out by sdb multiply is: Database (DB) Knowledge — The attacker sees the encrypted values ve ’s stored in the DBMS of the SP. This happens when the attacker hacks into the DBMS and gains accesses to the diskresident data. Chosen Plaintext Attack (CPA) Knowledge — The attacker is able to select a set of plaintext values JvK’s and observe their associated encrypted values ve ’s. For example, an attacker may open a few new accounts at a bank (the DO) with different opening balances and observe the new encrypted values inserted into the SP’s DB. We remark that while CPA knowledge is easy to obtain for public key cryptosystems, it is much harder to get under the cloud database environment. This is because the attacker typically does not have control on the number, order, and kinds of operations that are submitted by other users of the system and so it is difficult for it to associate events on plain values with the events on the encrypted ones. Query Result (QR) Knowledge — The attacker is able to see the queries submitted to the SP and all the intermediate (encrypted) results of each operator involved in the query. QR Knowledge may be obtained in a few ways. For example, the attacker could have compromised the SP to inspect the instructions the client sends to the SP and the computations carried out by the SP. Or the attacker could intercept messages passed between the client and the server over the communication channel. We remark that it is typically more difficult to obtain QR Knowledge than DB Knowledge. This is because data in computation is of transient existence in memory while data on disk persists. The window of opportunity for an attacker to observe desired queries and their (encrypted) results is thus limited. Moreover, there are sophisticated industrial standards to make a communication channel highly secure. Our security goal is to prevent an attacker from recovering plaintext values JvK’s given that the attacker has acquired certain combinations of knowledge listed above. First, we argue that DB knowledge is typically easier to obtain than the others and so we assume that the attacker has DB knowledge. Second, it has been proven that no schemes are secure against an attacker that has both CPA knowledge and QR knowledge [4]. Therefore, we assume that the attacker does not have both of these knowledges. Fortunately, as we have explained, CPA and QR knowledges are typically difficult to obtain in a cloud database environment and so the chances of an attacker having both is small. Our system, SDB, is designed to be secure against the following threats: • DB+CPA Threat: The attacker has both DB knowledge and CPA knowledge. • DB+QR Threat: The attacker has both DB knowledge and QR knowledge. sdb multiply(Ae , Be , n) = Ae × Be mod n Recall that to decrypt the result Ce , we need the item keys Ck of C (Equation 4), which are generated by row ids and the column key ckC of C (Equation 2). Hence, the row-id is added in the rewritten query. Besides, in the case of multiplication, the SDB proxy computes the corresponding column key ckC as below. ckC = hmA × mB , xA + xB i Note that the SP cannot get any secret keys in the process. The computation is thus secure. Readers are referred to [10] for details about the proof of correctness for the above computation. 2.3 Security We consider three kinds of knowledge that an attacker may obtain by hacking the SP. We explain our security levels against those attackers’ knowledge. Readers are referred to [10] for proofs of SDB being secure against the above threats. 3. DEMONSTRATION In the demonstration, we will show a prototype of SDB that is integrated with Spark SQL. The system will also be instrumented from the perspective of an adversary — an administrator can get access to the disk and memory at any instant. We will use two machines in the demonstration. Machine MDO demonstrates the client machine running the SDB proxy. Machine MSP demonstrates the server machine running Spark SQL with the set of SDB UDFs loaded. The steps of the demonstration is presented as follows. 1878 1. An attendee chooses the secure column, upload a dataset to the SP, examining the key store in the SDB proxy. First, we will invite an attendee to use MDO , which has a sample local database D1 . D1 has not been encrypted and is supposed to be the original dataset owned by the attendee. Next, the attendee can select D1 and go into a setting page that allows the attendee to choose the attributes that need to be protected. In this step, we let the attendee choose any attributes that she deems appropriate. After the security settings, the attendee will be invited to click the “Upload” button to upload D1 to the SP (which is operated by SDB). After the uploading is complete, she shall see another database of the same name (i.e., D1 ), but with a little security lock shown next to its icon. That is the key store of the original database D1 . The attendee will be invited to check the size of the key store and also the content. Figure 4: Machine MSP : Memory dump at the service provider Execution at the SP. Sensitive data remains encrypted during the entire computation. 3. The attendee observes the memory dump of SDB at the server side When the attendee is submitting queries from the client machine to SDB (Step 2), she will also be invited to look at the memory dump of SDB at MSP (Figure 4). This step aims to tell the attendee that the query processing step does not expose any sensitive information at any point of the computation. Acknowledgements The paper is supported by GRF Grant 17201414 and FDS grant (UGC/FDS14/E05/14) from Hong Kong Research Grant Council. We would also thank the anonymous reviewers for their comments and suggestions. 4. Figure 3: Machine MDO : The data owner sends a query to the SDB Proxy 2. The attendee submits queries to SDB from the data owner side. In this step, the attendee will be invited to use MDO to submit queries from the data owner side to SDB. The attendee will first open a query view of the secured database D1 (Figure 3) and then she can pose any SQL queries. The page will show the rewritten query that is actually executed by the server. The attendee will also be invited to note the query execution time that is broken down into a client cost and a server cost. Specifically, the client cost is composed of query parsing time, query rewriting time, and result decryption time. In the demonstration, we show that these costs are subtle compared with the total cost. REFERENCES [1] Spark SQL https://spark.apache.org/sql/. [2] A. Arasu, S. Blanas, K. Eguro, M. Joglekar, R. Kaushik, D. Kossmann, R. Ramamurthy, P. Upadhyaya, and R. Venkatesan. Secure database-as-a-service with cipherbase. In SIGMOD, 2013. [3] S. Bajaj and R. Sion. Trusteddb: a trusted hardware based database with privacy and data confidentiality. In SIGMOD, 2011. [4] A. Boldyreva, N. Chenette, and A. O’Neill. Order-preserving encryption revisited: Improved security analysis and alternative solutions. In CRYPTO, 2011. [5] A. Chen. GCreep: Google engineer stalked teens, spied on chats. Gawker, September 2010 http://gawker.com/5637234/. [6] S. Papadopoulos, A. Kiayias, and D. Papadias. Secure and efficient in-network processing of exact sum queries. In ICDE, 2011. [7] R. A. Popa, C. M. S. Redfield, N. Zeldovich, and H. Balakrishnan. Cryptdb: processing queries on an encrypted database. CACM, 2012. [8] S. Tu, M. F. Kaashoek, S. Madden, and N. Zeldovich. Processing analytical queries over encrypted data. In PVLDB, 2013. [9] W. K. Wong, B. Kao, D. W. L. Cheung, R. Li, and S. M. Yiu. Secure query processing with data interoperability in a cloud database environment. In SIGMOD, 2014. [10] W. K. Wong et al. Secure query processing with data interoperability in a cloud database environment. Technical Report TR-2014-03, Department of Computer Science, University of Hong Kong, 2014. [11] A. C. Yao. Protocols for secure computations (extended abstract). In FOCS, 1982. 1879