The rest of the paper is organized as follows

advertisement
The Role of Cryptography in the Database as a Service
Model
Hemavathy Alaganandam
Contents
The Role of Cryptography in the Database as a Service Model ................................................................... 1
Contents ........................................................................................................................................................ 2
1. INTRODUCTION .................................................................................................................................... 3
2. DAS Scenario ......................................................................................................................................... 3
3. Data Privacy 1st challenge ...................................................................................................................... 4
3.1
3.2
3.3
4.
Data Privacy 2nd Challenge .................................................................................................................... 6
4.1
4.2
4.3
4.4
5.
6.
Software level encryption ............................................................................................................................................... 5
Hardware level encryption .............................................................................................................................................. 5
Encryption Penalty ......................................................................................................................................................... 5
Relation Encryption and Storage Model ......................................................................................................................... 6
Mapping Conditions (MapCond) ..................................................................................................................................... 8
Implementing Relational Operators over Encrypted Relations ........................................................................................ 9
Problems with the Strategy .......................................................................................................................................... 10
CONCLUSIONS ................................................................................................................................... 10
REFERENCES ..................................................................................................................................... 11
2
1. INTRODUCTION
"Database as a Service" model provides users power to create, store, modify, and retrieve data
from anywhere in the world, as long as they have access to the Internet. It introduces several
challenges, an important issue being data privacy. There are two main privacy issues. First, the
owner of the data needs to be assured that the data stored on the service-provider site is
protected against data thefts from outsiders. Second, data needs to be protected even from the
service providers, if the providers themselves cannot be trusted.
In this paper, I focus on the research made towards the first and second challenge. I specifically
focused on techniques to execute SQL queries over encrypted data. The strategy in the papers I
read was to process as much of the query as possible at the service providers' site, without
having to decrypt the data. Decryption and the remainder of the query processing are performed
at the client site. The basic idea was similar in most papers with each paper trying to overcome
the drawback in the other solutions.
The rest of the paper is organized as follows. Section 2 presents the Database as a Service
Scenario. Section 3 & Section 4 discusses the data privacy challenges and solutions. I conclude
the paper in Section 5.
2. DAS Scenario
The DAS scenario involves mainly four entities (see Figure1):




Data owner: an organization that produces data to be made available for controlled
external release;
User: human entity that presents requests (queries) to the system;
Client : front-end that transforms the user queries into queries on the encrypted data
stored on the server;
Server: an organization that receives the encrypted data from a data owner and makes
them available for distribution to clients.
Clients and data owners are assumed to trust the server to faithfully maintain outsourced data.
Specifically, the server is relied upon for the availability of outsourced databases. However, the
server is assumed not to be trusted with the confidentiality of the actual database content. The
server should be prevented from making unauthorized access to the data stored in the database.
To this purpose, the data owner encrypts her data and gives the encrypted database to the
server. The end users, instead, are trusted to access the database, according to the data owner’s
policy.
3
Figure 1: The service-provider architecture.
3. Data Privacy 1st challenge
If database as a service is to be successful, and customer data is to reside on the site of the
database service provider, then the service provider needs to find a way to preserve the privacy
of the user data. There needs to be security measure in place so that even if the data is stolen,
the thief cannot make sense of it. Encryption is the perfect technique to solve this problem. There
are two dimensions to encryption support in databases. One is the granularity of data to be
encrypted or decrypted. The field, the row and the page, typically 4KB, are the alternatives. The
field may appear to be the best choice, because it would minimize the number of bytes encrypted.
However, practical methods of embedding encryption within relational databases entail a
significant start up cost for an encryption operation. Row or the page level encryption amortizes
this cost over larger data. The second dimension is software versus hardware level
implementation of encryption algorithms.
4
3.1
Software level encryption
First of all symmetric ciphers do much better than asymmetric. However for example if we use
Blowfish which is a 64-bit block cipher, which means that data is encrypted and decrypted
in 64-bit chunks. This has implication on short data. Even 8-bit data, when encrypted by the
algorithm will result in 64 bits.
In the paper [4] Blowfish implementation was registered into the database as a user defined
function (UDF). Once it was registered, it could be used to encrypt the data in one or more fields whenever data was inserted into the chosen fields, the values are encrypted before being stored.
On read access, the stored data is decrypted before being operated upon. For example, if we
were to encrypt the column discount of a table called lineitem using the user defined function
called ”encrypt”, and decrypt it by the user defined function ”decrypt” one would use the following
SQL command to insert data into the table lineitem:
insert into lineitem (discount)
values (encrypt(10,key))
The statement to select the encrypted field is given next:
select decrypt(discount,key)
from lineitem
where custid = 300
In this approach the creator of the encrypted data supplies the key, and the database provides
the encryption function. Only those users who are given the key can decrypt the data using the
decryption algorithm. Since the key is owned by the creator, and not stored at the site of the
database service provider, unauthorized person who may get hold of disk files can not get hold of
the key. In fact, even employees of the database service provider do not have access to the
encryption key. The full security provided by the encryption algorithm is inherited by the data in
the database.
3.2
Hardware level encryption
Specialized encryption hardware, the IBM S/390 Cryptographic Coprocessor, is available under
IBM OS/390 environment with Integrated Cryptographic Service Facility (ICSF) libraries. IBM
DB2f or OS/390 provides a facility called ”editproc” (or edit routine), which can be associated
with a database table. An edit routine is invoked for a whole row of the database table, whenever
the row is accessed by the DBMS. An encryption/decryption edit routine can be registered for the
tables. When a read/write request arrives for a row in one of these tables, the edit routine invokes
encryption/decryption algorithm, which is implemented in hardware, for whole row. In[4], they
used the algorithm option for encryption hardware.
3.3
Encryption Penalty
The response time for a query on encrypted data will increase due to both the cost of decryption
as well as routine and/or hardware invocations in DB2. This increase is referred to as the
encryption penalty. The software field level encryption was found to be particularly CPU intensive.
As the number of rows increases, query execution time grows very sharply in software level
encryption. On the other hand, hardware level encryption showed almost perfectly linear
increase.
5
4. Data Privacy 2nd Challenge
The second challenge is that of "total" data privacy, which is more complex since it includes
protection from the database provider. The requirement is that encrypted data may not be
decrypted at the provider site. A straightforward approach is to transmit the requisite encrypted
tables from the server (at the provider site) to the client, decrypt the tables, and execute the query
at the client. But this approach mitigates almost every advantage of the service-provider model,
since now primary data processing has to occur on client machines. For this reason the
encrypted database is augmented with additional information (index) allows certain amount of
query processing to occur at the server without jeopardizing data privacy. The client also
maintains metadata for translating user queries to the appropriate representation on the
server, and performs post-processing on server query results. Based on the auxiliary information
stored,in [3] Hacigumus et al develop techniques to split an original query over unencrypted
relations into (1) a corresponding query over encrypted relations to run on the server, and (2) a
client query for post-processing results of the Server query.
4.1
Relation Encryption and Storage Model
For each relation R(Ai, A 2 , . . . , An), an encrypted relation:Rs (etuple, A1s, A2s, …, Ans)
is stored on the server. The attribute etuple stores an encrypted string that corresponds to a tuple
in relation R. Each attribute Ais corresponds to the index for the attribute Ai that will be used for
query processing at the server. For example, consider a relation emp below that stores
information about employees.
eid
23
860
320
875
Ename
Tom
Mary
John
Jerry
Salary
70K
60K
50K
55K
Addr
Maple
Main
River
Hopewell
Did
40
80
50
110
The emp table is mapped to a corresponding table at the server:
emp s ( etuple, eid s, ename s, salary s, addr s, did s)
It is only necessary to create an index for attributes involve in search and join predicates. In the
above example, if it is known that there would be no query that involves attribute addr in either a
selection or a join, then the index on this attribute need not be created.
Partition Functions
The domain of values (Di) of attribute R.Ai are mapped into partitions { p1 , . . . ,pi}, such that
(1) these partitions taken together cover the whole domain; and
(2) any two partitions do not overlap.
As an example, consider the attribute eid of the emp table above. Suppose the values of domain
of this attribute lie in the range [0, 1000]. Assume that the whole range is divided into 5 partitions:
partition(emp.eid) ={[0, 200], (200,400], (400,600], (600,800], (800, 1000]}
Different attributes may be partitioned using different partition functions. It should be clear that the
partition of attribute Ai corresponds to a splitting of its domain into a set of buckets. Any
histogram-construction technique, such as MaxDiff, equi-width, or equi-depth, could be used
to create partitioning of attributes.
Identification Functions
6
The identification function assigns an identifier identR.Ai (pj) to each partition pj of attribute Ai.
For instance, as shown below, an identifier is assigned to each range of emp id’s
2
7
5
1
4
[0,200]
(200,400]
(400,600]
(600,800]
(800, 1000]
Partition and identification functions of emp ID
The ident function value should be unique, so a collision free hash function is a good choice. For
example in the case where a partition corresponds to a numeric range, the hash function may use
the start and / or end values of a range.
Mapping Functions
The mapping function MapR.Ai takes care of mapping a value v in the domain of attribute Ai to
the identifier of the partition to which v belongs: MapR.Ai(V) = identR.Ai (pj), where pj
is the partition that contains v. In the example above, the following table shows some values of
the mapping function for attribute emp.eid.
For instance, Mapemp.eld(23) = 2, Mapemp.eid(860) = 4
There are two types of mapping functions:
1. Order preserving: A mapping function MapR.Ai is called order preserving if for any two values
vi and vj in the domain of Ai, if vi < vj, then MapR.Ai(Vi) < MapR.Ai (Vj).
2. Random: A mapping function is called random if it is not order preserving.
A random mapping function provides superior privacy compared to its corresponding order
preserving mapping. The choice, whether a mapping function is order preserving or not, affects
query translation. Query translation is simplified using an order-preserving mapping function.
Storing Encrypted Data
For each tuple t = ( a1 , a2 , . . . ,ai) in R, the relation R s stores a tuple:
(encrypt( {al, a2, . . . , an}), MapR.A1 (a1),MapR.A2 ( a 2 ) , . . . , MapR.Ai (ai))
where encrypt is the function used to encrypt a tuple of the relation. For instance, the following is
the encrypted relation emp s stored on the server:
Etuple
1100110011110010…
1000000000011101…
1111101000010001…
1010101010111110…
eids
2
4
7
4
Enames
19
31
7
71
Salarys
81
59
7
49
addrs
18
41
22
22
Dids
2
4
2
4
Corresponding Employee table in the server
The first column etuple contains the string corresponding to the encrypted tuples in emp. For
instance, the first tuple is encrypted to "1100110011110010..." that is equal to
encrypt(23, Tom, 7OK, Maple, 40). Any block cipher technique such as AES, RSA , Blowfish ,
DES etc., can be used to encrypt the tuples. The second column corresponds to the index on the
employee ids. For example, value for attribute eid in the first tuple is 23, and its corresponding
partition is [0, 200]. Since this partition is identified to 2, we store the value "2" as
the identifier of the eid for this tuple.
7
In general, the notation "E" ("Encrypt") maps a relation R to its encrypted representation. That is,
given relation R( A1, A2, . . . , A,~), relation E( R) is RS (etuple, A1 s,A2 s , . . . , An s). In the
above example, E(emp) is the table emp s .
Decryption Functions
Given the operator E that maps a relation to its encrypted representation, the inverse operator D
maps the encrypted representation to its corresponding unencrypted representation. That is,
D(Rs) = R. In the example above, D(emp s) = emp. That is, D (temp s) will decrypt all of the
encrypted columns in temp s and drop the auxiliary columns corresponding to the indices.
4.2
Mapping Conditions (MapCond)
For each relation, the server side stores the encrypted tuples, along with the attribute indices
determined by their mapping functions. Meanwhile, the client stores the metadata about the
specific indices, such as the information about the partitioning of attributes, the mapping
functions, etc.The client utilizes this information to translate a given query Q to its server-side
representation Qs, which is then executed by the server.
Let us consider different query conditions :
Condition  Attribute op Values (op can be =,<,>,<=,>=)
1) Attribute = Value: Here the mapcond function would just map the value to a partition identifier.
eg. MapCond(eid = 860) = eid s = 4 since eid = 860 is mapped to 4 by the mapping function of
this attribute.
2)Attribute < or >Value: Depending upon whether or not the mapping function MapAi of the
attribute is order-preserving or random, different translations are possible
• Order preserving: In this case, the translation is straight forward:
Mapcond(Ai < v) => Ai s <= MapAi(V)
• Random: Mapcond(Ai > v) => Ai s in Map>Ai(v).
Mapcond(eid < 280) => eid s in {2, 7} since all employee ids less than 280 have two partitions
[0,200] and (200,400], whose identifiers are {2, 7}.
Condition  Attribute op Attribute (op can be =,<,>,<=,>=)
1) Attribute1 = Attribute2: Such a condition might arise in a join or selection . We consider all
possible pairs of partitions of Ai and Aj that overlap.
Partions
[0,100]
(100,200]
(200,300]
(300,400]
Ident(empdid)
2
4
3
1
Partition
[0,200]
(200,400]
Ident(mgrdid)
9
8
For instance, the table above shows the partition and identification functions of two attributes
emp.did and mgr.did. Then condition emp.did = mgr.did is translated to the following condition
C1:
C1 : (emp s.did s = 2 AND mgr s.did s = 9)
V (emp s.did s = 4 AND mgr s.did s = 9)
8
V (empS.did s = 3 AND mgrS.did S = 8)
V (empS.did s = 1 AND mgrS.did s = 8).
2) Attribute1 < Attribute2 can be dealt in the same way
4.3
Implementing Relational Operators over Encrypted
Relations
This section describes how individual relational operators (such as selections, joins and grouping
operators) can be implemented in the proposed database architecture. I‘ve shown a few
examples below. The strategy is to partition the computation of the operators across the client
and the server. Specifically, we will attempt to compute a superset of answers generated by the
operator using the attribute indices stored at the server. These answers will then be filtered at the
client after decryption to generate the true results. Work done at the client is tried to minimize as
much as possible.
The Selection Operator: Consider a selection operation O'c(R) on a relation R, where C is a
condition specified on one or more of the attributes A1, A2,.. •, An of R. A straightforward
implementation of such an operator in our environment is to transmit the relation R s from the
server to the client. Then the client decrypts the result using the D operator, and implements the
selection. This strategy, however, pushes the entire work of implementing the selection to the
client. In addition, the entire encrypted relation needs to be transmitted from the server to the
client. An alternative mechanism is to partially compute the selection operator at the server using
the indices associated with the attributes in C, and push the results to the client. The client
decrypts the results and filters out tuples that do not satisfy C.
Eg. Selection(eid<395 AND did=140 (emp)) is translated to
Selection on client side[Decryption(Selection on server side based on Mapcond (empS))], where
the condition on the server is: Mapcond(C) = (eid s in [2, 7] AND did s = 4)
The query is first executed at the server based on the corresponding Mapcond. The results are
decrypted at the client and then the selection is once again done to eliminate the spurious rows.
The Join Operator would be similar as above. The join will again be done on set of rows
returned by the server which eliminates the false rows.
The Grouping and Aggregation Operator : The basic idea is this : the grouping is done at the
server side (ofcourse it will have rows that shouldn’t belong in there). The server does not
perform any aggregation since it does not have any values for those attributes. The results are
returned to the client, which performs the grouping operation again. This operation can be
implemented very efficiently, since every tuple belonging to a single group say x will be in a single
x s, group computed by the server. As a result, the client only needs to consider tuples in a single
x s, group when computing the groups corresponding to x. Of course, the aggregation functions
specified will be computed at the client, since their computation requires that tuples be first
decrypted.
Sorting Operation : It is similar to the grouping operation. The amount of work done at the client
in post-processing depends upon whether or not the attributes listed in the sort have orderpreserving mappings. If the attributes have order-preserving mappings, then the results returned
by the server are presorted upto within a partition. Thus, sorting the results is a simple local
operation over a single partition. Alternatively, even if the mapping is not order preserving, it is
useful to compute grouping at the server to reduce the amount of client work. Since the tuples
have been grouped by the server, sorting can be implemented efficiently using a merge-sort
algorithm.
9
Thus Given a query Q a strategy can be developed to split the computation of Q across the
server and the client. The server will use the implementation of the relational operators discussed
in the previous section to compute as much of the query as possible, relegating the remainder of
the computation to the client. The objective is to come up with the "best" query plan for Q that
minimizes the execution cost.
4.4
Problems with the Strategy
I will briefly go through in this section some of the problems with the Hacigumus et al approach
and list a few solutions proposed in other papers.
1) It assumes that the client has complete access to the query result. However, this assumption
does not fit real world applications, where different users may have different access privileges. In
[1] Damiani et all investigated a solution for implementing through cryptography a selective
access policy where they introduced a method to exploit a tree hierarchy for key management.
2) A major challenge in this scenario is how to compute and represent indexing information. Two
conflicting requirements challenge the solution of this problem: on one side, the indexing
information should be related with the data well enough to provide for an effective query
execution mechanism; on the other side, the relationship between indexes and data should not
open the door to inference and linking attacks that can compromise the protection granted by
encryption. In [2] Damiani et all proposed to use as index the result of a secure hash function
over the attribute values rather than straightforwardly encrypting the attributes; this way, the
attribute values’ distribution can be flattened by the hash function.
3) Another serious weakness of the approach is that it will output false joining records, which
leads to the greatly increased cost of decrypting records and the enormously degraded
performance of query. In [2] Damiani et all proposed constructing a encrypted B+ tree before
encrypting the field which will help fetch the exact rows that the query needs and decrypting only
those needed. The problem with this approach is the deterioration in performance that it can
introduce, due to need to execute of a series of queries to navigate the B+-tree in order to identify
the tuples belonging to the interval.
Thus we see that every approach has a specific problem and there is still room for progress in
this area.
5. CONCLUSIONS
Application Service Provider (ASP) model for enterprise computing has emerged with the rise of
Internet technologies. In the ASP model, a service provider can provide software as a service to a
very large client-base over the Internet. Unlike many other services, however, databases are
special. Data is a precious resource of an enterprise. As a result, privacy mad security of data at
the service-provider site is paramount. Data privacy can be achieved by using a suitable
encryption algorithm. The solution was to store the data at the service provider after encrypting it,
which can only be decrypted by the owner. Several techniques have been developed using which
the bulk of the work of executing the SQL queries can be done by the service provider without the
need to decrypt the stored data. The technique deploys a "coarse index", which allows partial
execution of an SQL query on the provider side. The result of this query is sent to the client. The
correct result of the query is found by decrypting the data, and executing a compensation query
at the client site. The service provider retains the responsibility to manage the persistence of the
data. The client gets total privacy, and the cost of cooperating in query execution with the service
provider. The client does not need to manage data persistence, thus continues to benefit from the
system management service of the database service provider. Thus with Cryptography, database
10
as a service is a viable model and has a good chance of emerging as a successful commercial
offering for some applications.
6. REFERENCES
[1] Key Management for Multi-User Encrypted databases – Damiani et al
[2] DAMIANI, E., DE CAPITANI DI VIMERCATI, S., JAJODIA, S., PARABOSCHI, S., AND
SAMARATI, P. 2003. Balancing confidentiality and efficiency in untrusted relational dbmss. In
Proceedings of the 10thACMConference on Computer and Communications Security,
Washington, DC. ACM Press, NewYork.
[3]HACIGUMUS, H., IYER, B., LI, C., AND MEHROTRA, S. 2002a. Executing SQL over
encrypted data in the database-service-provider model. In Proceedings of the ACM
SIGMOD’2002, Madison, WI ACM Press, New York.
[4]HACIGUMUS, H., IYER, B., ANDMEHROTRA, S. 2002b . Providing database as a service. In
Proceedings of the 18th International Conference on Data Engineering, San Jose, CA. IEEE
Computer Society Press, Los Alamitos, CA.
11
Download