Accelerating Multilevel Secure Database Queries using P-Tree Technology 1 Imad Rahal and William Perrizo {imad.rahal, william.perrizo}@ndsu.nodak.edu Computer Science Department, North Dakota State University Fargo, ND 58105-5164, USA Abstract In the early 90s, a lot of research was done in the area of multi-level secure database systems. Most of the work was directed towards the security aspect without much concentration on query acceleration. In this paper, an algorithm, based on the P-tree technology [2,10], following the Sea View model [1] of multilevel relations is presented to accelerate queries in multilevel secure database systems. This algorithm recovers query output from single level relations in a fast and very spaceefficient manner. The algorithm does not employ temporary data structures extremely nor does it produce spurious tuples while recovering, which has always been a major problem in this area. Keywords Multilevel secure, query acceleration, P-tree, Peano ordering, multilevel relations, Sea View Model 1. INTRODUCTION Security in database systems, especially when confidential data is involved, has always been a great concern to enterprises. Database managers need to make sure that every user operates on the data she/he is given access to while guarding against malicious interaction between users having access to different kinds of data. While a lot of work in the area of multilevel security has been done to solve this problem, most of it was directed towards the security aspect without much regard to query efficiency. The Sea View model [1], one of the initial attempts to model multilevel relations, presented two algorithms: a decomposition algorithm to generate single level relations from a multilevel relation and a recovery algorithm to regenerate a multilevel relation from a set of single level relations. The decomposition algorithm is a very successful and efficient1 and will be used by our 1 Patents are pending for Peano Count Tree (P-tree) and bSQ technology. This work is partially supported by GSA Grant ACT#K961130308. algorithm; however, we will devise a new recovery algorithm to overcome the inefficiency and incorrectness – in terms of creating spurious tuples – of the current recovery algorithm. A successful attempt to replace the algorithm is presented in [3]; however, this attempt suffers from the inefficient use of bit vectors to represent highly sparse data. Our approach uses a quadrant-based loss less tree representation, Peano Count Tree (P-tree) [2,10], to represent this data and overcome the sparsity problem. The rest of the paper is organized as follows. In section 2 we give a brief description of the P-tree data representation, its initial application, and how it is used in our approach. Section 3 starts with a description of some access control mechanisms used to maintain data confidentiality [8] in secure database systems. Then, the concept of polyinstantiation [3,5,6] and its effects on multilevel recovery algorithms is described, followed by an explanation of multilevel relations [1,4]. In section 4, a brief preview of some previous work in this area is provided to give way to our algorithm, which is presented with an example in section 5. The algorithm is then formally presented. In section 6 a performance analysis is presented and finally the paper ends with a conclusion, section 7, comparing our approach to the other work, namely, the Sea View Model [1] and the Domain vector accelerator algorithm, DVA [3]. 2. DATA STRUCTURES Initially, P-trees [2,10] were used for data representations that are data-mining ready and efficient especially in the case of spatial data. A spatial image can be viewed as a 2-dimensional array of pixels. Associated with each pixel are various descriptive attributes, called “bands”. For example, visible reflectance bands (Blue, Green and Red), infrared reflectance bands (e.g., NIR, MIR1, MIR2 and TIR) and possibly some bands of data gathered from ground sensors (e.g., yield quantity, yield quality, and soil attributes such as moisture and nitrate levels, etc.). All the values have been scaled to range between 0 and 255, for simplicity. The pixel coordinates in raster order constitute the key attribute. One can view such data as a table in relational form where each pixel is a tuple and each band is an attribute. There are several formats used for spatial data, such as Band Sequential (BSQ), Band Interleaved by Line (BIL) and Band Interleaved by Pixel (BIP). In our previous work [9], we proposed a new format called bit Sequential Organization (bSQ). Since each intensity value ranges from 0 to 255, which can be represented as a byte, we split each bit in each band into a separate file, called a bSQ file. Each bSQ file can be reorganized into a bitSequential (bSQ) or quadrant-based tree (P-tree). The example in Figure 1 shows a bSQ file and its P-tree. 11 11 11 00 11 11 10 00 11 11 11 00 11 11 11 10 11 11 11 11 11 11 11 11 11 11 11 11 01 11 11 11 55 16 8 3 0 1 1 level=3 4 1 1 0 0 0 1 0 15 4 16 level=2 4 3 4 level=1 1 1 0 1 level=0 Figure 1. 8 by 8 image and its p-tree In this example, 55 is the count of 1’s in the entire In In In this example, 55 is the count of 1’s in the entire image (called root count), the numbers at the next level, 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants. Since the first and last quadrants are made up of entirely 1-bits (called pure-1 quadrants), we do not need sub-trees for them. Similarly, quadrants made up of entirely 0-bits are called pure-0 quadrants. This pattern is continued recursively. Recursive raster ordering is called Peano or Z-ordering in the literature. The process terminates at the leaf level (level-0) where each quadrant is a 1-row-1-column quadrant. If we were to expand all sub-trees, including those pure quadrants, then the leaf sequence is just the Peano space-filling curve for the original raster image. For each band (assuming 8-bit data values), we get 8 basic P-trees, one for each bit positions. For each band Bi we will label the basic P-trees: Pi,1, Pi,2, …, Pi,8; thus, Pi,j is a lossless representation of the jth bits of the values from the ith band. However, the Pij’s provide more information and are structured to facilitate data mining processes. Some of the useful features of P-trees can be found later in this paper or our earlier work [2,10,9]. The basic P-trees defined above can be combined using simple logical operations (AND, OR and COMPLEMENT) to produce P-trees for the original values (at any level of precision, 1-bit precision, 2-bit precision, etc.). We let Pb,v denote the Peano Count Tree for band, b, and value, v, where v can be expressed in 1bit, 2-bit,.., or 8-bit precision. For example, P b,110 can be constructed from the basic P-trees as: Pb,110 = Pb,1 AND Pb,2 AND Pb,3’ where ’ indicates the bit-complement (which is simply the count complement in each quadrant). This is called the value P-tree. The AND operation is simply the pixel wise AND of the bits. The data in the relational format can also be represented as P-trees. For any combination of values, (v1, v2, …,vn), where vi is from band-i, the quadrant-wise count of occurrences of this tuple of values is given by: P(v1,v2,…,vn) = P1,V1 AND P2,V2 AND … AND Pn,Vn This is called a tuple P-tree. If some of the values are expressed with bit-based precision, it is actually an interval P-tree. Note that the basic P-trees can be generated quickly and it is only a one-time cost. The logical operations are also very fast [2,10]. So this structure can be viewed as a “data mining ready” and lossless format for storing spatial data. Lastly, we note that the spatial dimensions can be 1, 2, 3 or higher. The P-tree data structure is also a great tool to represent large sparse bit vectors on which we perform logical operations (AND, OR, and COMPLEMENT) [9]. We can view a bit vector as a single file represented in bSQ format and thus represent it as we explained previously in this section. This representation is lossless, compressed and can be used to perform logical operations in a very efficient manner [2,10]. 3. SECURITY CONSIDERATIONS 3.1 Security Policies Many types of security policies or access control mechanisms have been integrated into database systems to provide data security. Two well-known access control mechanisms are the MAC and DAC mechanisms. The DAC mechanism, or Discretionary Access Control mechanism, is usually implemented in most commercial products. In this mechanism, the owner of an object (i.e., file, directory, etc…) decides who can access his/her object and in what manner (i.e., Read-only, Write-only, etc…). The Mandatory Access Control or MAC mechanism, developed by Bell and LaPadula, does not leave protection decisions of objects to the discretion of the owners. The system enforces the protection decisions. This mechanism defines a database by its subjects and objects. A subject is an active entity such as a process and an object is a passive entity such as a data item or table. Subjects have clearance levels and objects are assigned sensitivity levels (i.e., Top Secret, Secret, Confidential and Unclassified). In order for a certain subject to access an object, one of following two conditions must be satisfied (depending on the type of access): 1- The Simple Security Policy: A subject X can read access object Y, if X’s clearance level dominates (is greater than or equal to) Y’s sensitivity level. 2- The *-Policy: A subject X can write access object Y, if X’s clearance level is dominated by (is less than or equal to) Y’s sensitivity level. In short, the MAC policy states that reads should propagate downwards and writes should propagate upwards. All security authentication functions are stored in a TCB (Trusted Computing Base) away from the DBMS. When a request for object X by subject Y is issued, it is first authenticated in the TCB. If Y can access X then the request is forwarded to the DBMS; if not, then the request is rejected. 3.2 Polyinstantiation Due to security and access control mechanisms, some tuples in multilevel relations may be polyinstantiated. A polyinstantiated tuple is a tuple that exists more than once in a relation with the same apparent key (refer to section 3.3) but with some other attribute values being changed. This is due to the fact that different subjects are authorized to update or view different data. For example, suppose that a subject X with clearance C (Classified) is attempting to write a new value to a data item Y with sensitivity S (Secret). The old value in item Y is not viewable by subject X but the new value written into Y by X is viewable (it has X’s clearance level which C). To preserve the old value of Y, a new tuple is inserted into the relation with same apparent key (and same attribute values except for Y in this case). Now X can view the new value inserted into Y and Secret and Top Secret users will view the two values (the old value with sensitivity S and the new value with sensitivity C). 3.3 Multilevel relations A multilevel relation is a relation of the form R (A1, C1, …, An, Cn, TC) where Ai is any attribute and Ci is its classification (or sensitivity level). TC is the classification of the tuple. Ci belongs to the domain of classifications for data items. We denote A1 to be the apparent key of R. The concept of a key is a little bit different in multilevel relations because keys can be duplicated; this is why we refer to them as apparent keys instead of just keys. The reason behind this duplication of keys is polyinstantiation, which was explained previously (refer to section 3.2). 4. RELATED WORK The following are two approaches for maintaining and recovering multilevel relations from single-level relations. Each has its pros and cons. 4.1 The Sea View (SEcure dAta VIEW) Model The Sea View model [1] –a joint effort by the Stanford Research Institute and Gemini Computers– is considered as one of the most important moves towards multilevel secure relations. It consists of two algorithms: a decomposition algorithm and a recovery algorithm. The decomposition algorithm divides a multilevel relation R into a set of single level relations. The multilevel relation exists only at logical level; single level relations are stored physically. For every query, an output multilevel relation is reconstructed from the single-level relations using the recovery algorithm. Unfortunately, the recovery algorithm of the Sea View model suffers from the following [3]: 1- Creation of spurious tuples in the output (due to polyinstatiation). 2- Space inefficiency due to the use of many temporary data structures. 3- Time inefficiency due to unions and joins, which are two of the most expensive database operations. 4.2 The Domain (DVA) Algorithm Vector Accelerator The DVA algorithm [3], in short, is an algorithm motivated by the recovery algorithm of the Sea View Model based on domain vector accelerators, DVAs, to accelerate the recovery of multilevel relations from single-level relations. DVAs accelerate joins between relations and thus lead to reducing the response time of queries requiring many joins. The DVA algorithm solves the problems of the Sea View Model’s recovery algorithm (refer to section 4.1); it doesn’t create spurious tuples in the output table and is space and time efficient. It shows significant improvement especially in environments where queries involve selections on some (one or more) attributes of the multilevel relations. In spite of these facts, this algorithm uses a lot of temporary data structures, some of which can be omitted to improve the algorithm’s space efficiency without negatively affecting its overall performance or its functionality. In addition, the bit vector representation used is very sparse and is inefficient especially when the database relations are very large (in terms of the number of records). 5. OUR APPROACH We will first present our algorithm, accompanied by an example, followed by a formal description. In our algorithm, we will assume that the decomposition algorithm of the Sea View Model [1,5] was used to decompose a multilevel relation into single level relations. Our algorithm will be applied to recover the multilevel relation from those single level relations. 5.1 *Name Aspide Roland1/ 2/3 Roland1/ 2/3 Rname,c AS30L AS-9 Kyle Al Hussein Aspide Roland1/2/3 Rdevby,u,u AS30L AS-9 Kyle Al Hussein Rdevby,u,c France Russia Iraq Aspide Italy Algorithm We will apply the algorithm on the following multilevel relation (represents all missiles deployed in Iraq during the gulf war): AS30L AS-9 Kyle Al Hussein Rname,u U R = Iraqi Missiles Developed by Length (devby) (M) Range (KM) France 10 U 3.65 U U TC U U Russia U 6 U 90 C C U Iraq U 12.2 C 650 C C U Italy C 3.7 C 35 C C C France C 2.4 C 6.3 C C C Germany S NIL S 8 S S Figure 2. A multilevel relation Rdevby,c,c Roland 1/2/3 Rlength,u,u AS30L AS-9 Kyle Rdevby,c,s France 3.65 6 Rlength,c,c Roland 1/2/3 2.4 Rrange,u,c AS-9 Kyle Al Hussein Aspide 90 650 35 Roland 1/2/3 Germany Rlength,u,c Al Hussein 12.2 Aspide 3.7 Rrange,u,u AS30L 10 Rrange,c,c Roland 6.3 1/2/3 Rrange,c,s Roland 8 1/2/3 Figure 3. Single level relations created by the Sea view model’s decomposition algorithm All relations containing the key only - i.e., Rname,u (base single level relation containing all keys at classification level u) and Rname,c - are referred to as base relations. Ai denotes any attribute and A1 denotes the apparent key. We say that a classification x is greater than classification y (i.e. x>y) iff x dominates y in the classification hierarchy. For simplicity, assume that all relations sort their entries in an order consistent with base relations – otherwise we would need indexes to keep track of key entries. Suppose we want to recover the output of the query “Select name, devby, length from R where range<>35 ”. 1- For every relation RAi,x,y (single level relations containing all entries from multilevel relation having keys at classification level x and Ai attribute values at classification level y), excluding base relations, create a P-tree, PAi,x,y, denoting the presence or absence of the keys at level x. PAi,x,y can have a maximum record count [2,10] equal to the number of keys at level x which can be found in relation Rkey,x. Keys at level u = {AS30L, AS-9 Kyle, Al Hussein, Aspide} Keys a level c = {Roland 1/2/3} Pout,u = 3 1110 (the first 3 keys in Rname,u will appear in the output) Pout,c = 1 (the first and only key in Rname,c will appear in the output) Figure 5. Output P-trees Prange,u,u = 1 1000 Prange,u,c = 3 0111 Prange,c,c = 1 Pdevby,u,u = Prange,c,s = Pdevby,c,s = 1 3 1110 Pdevby,u,c = 1 0001 Pdevby,c,c = 1 1 3- Create the polyinstantiated P-trees, Ppoly,x,y, for all x, y combination such that x<y (contains all polyinstantiated keys at level x by subjects at level y). a. For all attributes Ai requested in the output of the query (name, devby and length) except for the key (name) create the following temporary p-trees. 1000 Plength,u,u = 2 1100 Plength,u,c = 2 0011 Plength,c,c = 1 Figure 4. P-trees for length and range attributes 2- Create the output P-trees, Pout,x, at every level x (contains all keys with classification x that will appear in the output table) as follows: a. Read all relations having an attribute participating in the selection criteria of the query (range <> 35). We should read Rrange,u,u and Rrange,u,c at level u and Rrange,c,c and Rrange,c,s at level b. Get all entries from those relations satisfying the selection criteria at each level x (i.e. all keys associated with range entries <> 35 in relations from step a) At level u we have AS30L, AS-9 Kyle, and Al Hussein; and at level c we have Roland1/2/3 c. Create Pout,x at level x where a 1 value is given to those keys succeeding from step b and 0 otherwise. Temp-Ai,x,y= AND all PAi,x,y where x<=y b. To get Ppoly,x,y, OR all Temp-Ai,x,y Ppoly,x,y = ORing Temp-Ai,x,y for all Ai Temp-devby,u,c = Temp-devby,c,s = Temp-devby,u,s = Temp-length,u,c = Temp-length,c,s = Temp-length,u,s = 0 1 0 0 0 0 Figure 6. Temporary P-trees Ppoly,u,c = 0 Ppoly,c,s = 1 Ppolyu,s = 0 Figure 7. Polyinstantiated P-trees If we view the leaf nodes as a bit vector, a 1-bit in position n in any Ppoly,x,y signifies that the nth entry in Rkey,x is polyinstantiated . Therefore, entry 1 in Rname,c which is Roland1/2/3 is polyinstantiated. 4- Create the polyinstantiated output P-trees Ppoly,out,x,y as the ANDing of Ppoly,x,y and Pout,x (where x<y). i. Ppoly,out,u,c= Ppoly,u,c AND Pout,u = 0 AND 1 1000 =0 ii. Ppoly,out,c,s = Ppoly,c,s AND Pout,c = 1 AND 1 =1 Ppoly,out,u,s= Ppoly,u,s AND Pout,u = 0 AND 1 1000 =0 Figure 8. Polyinstantiated output P-trees A 1-bit in position n in any Ppoly,out,x,y signifies that the nth entry in Rkey,x is polyinstantiated and appears in the output of the query . Therefore, entry 1 in Rname,c which is Roland1/2/3 is polyinstantiated and appears in the output. 5- Create Output table as follows: a. Having a number of columns equal to the number of fields, Ai, requested in the output of the query If Ai is the key (i=1) then get the nth record from Rkey,x and store it under Ai column in output table. This entry has classification x. Else, go to the nth entry in PAi,x,z where z = y initially . If a 1 bit is found in position n then get the value of the nth entry from RAi,x,z and store it under Ai column in output table. This entry has classification z. Else (if 1 bit is not found in position n of PAi,x,z then) decrement z to the next lower level and repeat this step. Output multilevel relation Developed by *Name (devby) Length (M) AS30L U France U 3.65 U AS-9 Kyle U Russia U 6 U Al Hussein U Iraq U 12.2 C Roland1/2/3 C France C 2.4 C Roland1/2/3 C Germany S 2.4 S Figure 9. Final output multilevel relation 5.2 Formal Description The following is a brief formal description of the algorithm. Refer to section 5.1 for a detailed description. Select name, devby and length (3 columns). b. Scan Pout,x for 1-bit entries. If a 1 bit appears in position n do the following for all Ai attributes requested in the output: i. If Ai is the key (i = 1) then get the nth record from Rkey,x and store it under Ai column in output table. This entry has classification x. ii. Else, go to the nth entry in PAi,x,z where z = x initially . If a 1 bit is found in position n then get the value of the nth entry from RAi,x,z and store it under Ai column in output table. This entry has classification z. Else (if 1 bit is not found in position n of PAi,x,z then) increment z to the next higher level and repeat this step. c. Scan Ppoly,out,x,y for 1 bit entries. If a 1 bit appears in position n do the following for all Ai attributes requested in the output: 1- If the entries in the single levels are not sorted similar to the base relations, create a mapping between the entries in those relations and keys in base relations [3] 2- Create the P-trees for all single level relations, PAi,x,y for all Ai 3- Create the output P-trees, Pout,x, for all levels x 4- Create the polyinstantiated P-trees, Ppoly,x,y, for all x,y combination such that x<y 5- Create the polyinstantiated output P-trees Ppoly,out,x,y as the ANDing of Ppoly,x,y and Pout,x (where x<y). 6- Create Output table by scanning all Pout,x (from 3) and Ppoly,out,x,y (from 5) Figure 10. Formal algorithm 6. PERFORMANCE ANALYSIS For the purpose of performance analysis, we compare the cost associated in answering a query with two different techniques: one with our algorithm and the other without any join accelerator (we refer to the P-tree as a join accelerator). It turns out that our algorithm matches the DVA algorithm [3] in this performance aspect. We only consider the secondary storage page I/O cost while ignoring CPU cost since the latter is negligible when compared to the former. The total number of page accesses needed in order to retrieve two, four, and eight attributes of the multilevel relation is calculated and presented in figures 12 and 13 (Unacc stands for an algorithm that does not use accelerators and P-Acc stands for our P-tree based accelerator. The number after Unacc or P-Acc represents the number of attributes accessed). These cost figures were calculated using the Yao’s formula [11]: k Yao (k, m, n) = m – m* II((n-(n/m)-i+1)/(n-i+1)) i=1 Given that a page is accessed at most once, this formula gives the number of page accesses needed to get k records randomly distributed in a file of n records in m pages. To retrieve records from the base relations at different levels, an algorithm that does not use an accelerator accesses the index of each base relation whether the tuples exist in each of them or not. But our algorithm does not access the index unless a record exists in the corresponding base relation. The cost of page access associated with reading the index of a relation having n records in m pages, when k records are retrieved is calculated by Yao (Yao (k, m, n), m/FO, m) where FO is the average fan out of an index node in a B+ tree. result (the value may exist at a higher level than accessed) is taken as 0.1. The savings in using our algorithm increases as the number of levels increases, as the number of categories increases, and also as the number of attributes required for output increases. As the selectivity reaches 0.2, the number of pages required to retrieve the records exceeds the total number of pages in the base relations. The reason for this is that there are many page accesses to the indices. Number of pages (Thousands) Selectivity Figure 12. Two classification levels accessed Number of pages (Thousands) Page size = 4 KB Number of tuples in each base relation = 100,000 Number of pages in main memory = 1,000 Size of an attribute = 8 bytes Size of a value identifier = 4 bytes Average fan out of an index B+ tree = 287 Average page occupancy factor = 0.7 Figure 11. Parameters for the cost model Selectivity The following two figures (figures 12 and 13) show the results when the query retrieves the records from two and three different security levels respectively. The polyinstantiation rate is taken as 10 percent and the probability that a record has a null value in the query Figure 13. Three classification levels accessed 7. Proc. ACM SIGMOND International Conference on Management Data, Denver,Colorado, May 29-31, 1991, pages 50-59. CONCLUSION The P-tree based algorithm that we presented in this paper is an enhancement over the DVA algorithm [3]. Compared to the recovery algorithm of the Sea View model, which is one of the earliest and most important attempts towards multilevel database security, our algorithm has the following advantages: 1. No spurious tuples in the output table due of polyinstantiation. 2. No time inefficiency because our algorithm does not depend on the use of joins and unions like the Sea View model algorithm to create the output table. 3. All of the other advantages of the DVA algorithm over the recovery of the Sea View model (in terms time and space efficiency) [3]. In comparison to the DVA algorithm, not only does our algorithm eliminate many temporary data structures employed by the DVA, it also solves the problem of representing very large sparse bit vector efficiently. Those advantages mainly stem from the fact that our P-tree data structure is a lossless space efficient representation on which logical operations are very easy and efficient to perform. 7. REFERENCES [1] T. F. Lunt, D. Dennis, R. R. Schell, M. Heckman and V. R. Shockleg, “The Sea View Security Model”, IEEE Trans on Software Engineering (TOSE). Vol. 16, No. 6 (1990), pages 593-607. [2] Qin Ding, Maleq Khan, Amalendu Roy and William Perrizo, “The P-Tree Algebra”, ACM SAC2002. [3] William Perrizo and Brajenda Panda, “Query Acceleration in Multilevel Secure Distributed Database Systems”, 16th National Computer Security Conference, Baltimore, Maryland, September 20-23, 1993. [4] Sushile Jajodia and Ravi Sandhu, “A Novel Decomposition of Multilevel Relations Into Single-Level Relations”, Proceedings of IEEE Symposium on Security and Privacy, Oakland, California, May 20-22,1993, pages 300-313 [5] Sushile Jajodia and Ravi Sandhu , “Toward a Multilevel Secure Relational Data Model”,. [6] Ravi Sandhu and Sushile Jajodia,“Restricted polyinstatntiation or how to close signaling channels without duplicity”, Proc. 3rd RADC Workshop on Multilevel Database Security, Castile, New York, June 1990, pages 7-12. [7] Ravi Sandhu , “Design and Implementation of Multilevel Databases”, Proc. 6th RADC Workshop on Multilevel Database Security, Southwest Harbor, Maine, June 1994. [8] Jeromie Jackson “MAC & DAC Brief”, www.garrison.com/html/docmacdac.html [9] William Perrizo, Qin Ding, Amalendu Roy, “Deriving High Confidence Rules from Spatial Data using Peano Count Trees”, SpringerVerlas, LNCS 2118, July 2001. [10] William Perrizo, "Peano Count Tree Technology Lab Notes", Technical Report NDSU-CSORTR01-1, 2001. [11] S. B. Yao, “Approximating Block Accesses in Database Organizations”, CACM, (20) 4, p.260261, Apr. 1977.