School of Science and Engineering Master of Science in Computer Networks Advanced Database Systems and Data Warehouses CSC 5301 SPRING 2003 By : Benmammass ElMehdi Supervised by : Doctor Hachim Haddouti INTEGRITY Achieving integrity in a database system : The performance of a database system depends on many features, among them accuracy, correctness and validity of the data provided by the database. These features introduce the integrity of a database. Integrity constraints guard against accidental damage to the database, by ensuring that authorized changes to the database do not result in a loss of data consistency. We can guarantee the integrity of a database in different ways. We will see now that a DBMS has an integrity subsystem component and how this component can achieve data integrity. Integrity subsystem : The role of an integrity subsystem is : Monitoring transactions and detecting integrity violation. Take appropriate actions given a violation. The integrity subsystem is provided with a set of rules that define the following : the errors to check for; when to check for these errors; what to do if an error occurs. This is an example of an integrity rule called RULE#1 that forces the value of the sold quantity to be greater than 0. RULE#1 : AFTER UPDATING sales.quantity : sales.quantity > 0 ELSE DO ; set return code to “RULE#1 violated” ; REJECT ; END ; In general, an integrity rule is defined with the following structure : Trigger condition (after updating, inserting…) Constraint (sales.quantity >0) Violation response (else do…) The integrity rules are compiled and stored in the system dictionary by the integrity rule compiler (which is a component of the integrity subsystem). At each time we add a new rule, the integrity subsystem checks if this new rule violates any of the old rules in order to accept it, or reject it. Using the integrity subsystem, all the integrity rules and stored in the same location (system dictionary). We can query them using the system query language, and their role is more efficient. The integrity rules can be divided into three main parts : The domain integrity rules; The relation integrity rules; The fansets integrity constraints. Domain integrity rules : Integrity constraints guard against accidental damage to the database, by ensuring that authorized changes to the database do not result in a loss of data consistency. The domain constraint is the most elementary form of integrity constraint. It specifies the set of possible values that may be associated with an attribute. Such constraint may also prohibit the use of null values for particular attributes. The domain integrity rules test values inserted in the database, and test queries to ensure that the comparisons make sense. New domains can be created from existing data types or “base domains”. This is an example of some domain integrity rules definitions. DCL S# PRIMARY DOMAIN CHARACTER (5) SUBSTR (S#,1,1) = ‘S’ AND IS_NUMERIC (SUBSTR (S#,2,4)) ELSE DO; Set return code to “S# domain rule violated” ; REJECT ; END ; DCL SNAME DOMAIN CHARACRTER (20) VARYING ; DCL STATUS DOMAIN FIXED (3,0) STATUS > 0 ; DCL LOCATION DOMAIN CHARACTER (15) VARYING LOACTION IN (‘LONDON’,’PARIS’,’ROME’); DCL P# PRIMARY DOMAIN CHARACTER (6) SUBSTR (P#,1,1) = ‘P’ AND IS_NUMERIC (SUBSTR (P#,2,5)) DCL PNAME DOMAIN CHARACRTER (20) VARYING ; DCL COLOR DOMAIN CHARACRTER (6) COLOR IN (‘RED’,’GREEN’,’BLUE’,’YELLOW’); DCL WEIGHT DOMAIN FIXED (5,1) WEIGHT > 0 AND WEIGHT < 2000 ; DCL QUANTITY DOMAIN FIXED (5) QUANTITY >= 0 ; Let’s see definition by definition the meaning of the syntax : S# is a string of 5 characters. Te first character is an S and the last 4 characters are numeric. SNAME is a string of 20 characters maximum. STATUS is a integer of three digits. It can be negative. LOCATION is a string. It can take the values defined (Paris, London…) P# is a string of 6 characters. Te first character is a P and the last 5 characters are numeric. PNAME is a string of 20 characters maximum. COLOR is a string. It can take the values defined (green, blue, yellow…) WEIGHT is a number of 5 digits that takes one decimal. Its value is always between 0 and 2000. QUANTITY is an integer of 5 digits. It is always greater or equal to zero. We can define other domain integrity rules like : Composite domains : a domain DATE which is composed of three domains DAY, MONTH and YEAR User-Written Procedures. Interdomain Procedures : some conversion rules (procedures) may help for example to compare two values from two distinct domains (distance expressed in kms and miles). Relation integrity rules : The RULE#1 defined in the introduction of this paper is an example of Relation Integrity Rules. Relation integrity rules promote integrity within a relation, allowing the comparison of multiple attributes in a given table. For example: 044 and 045 area codes in phone number imply that the city of residence Marrakech. Can potentially act as a substitute for normalization in situations where creating 3NF tables seems overkill (e.g., ZIPCODE → CITY) This is the set of different constraints or rules that we can define as Relation integrity rules : Immediate record state constraints After updating or inserting sales.quantity, verify : sales.quantity > 0 Immediate record transition constraints Before updating an attribute, restrict the new value to be greater than the old one : New_date > sales.date Immediate set state constraints This class includes two important cases that are : defining a key uniqueness and enforcing non-null values of the key (Entity Integrity Rule), imposing referential integrity (Foreign Key Integrity Rule) , and other functional dependencies. Immediate set transition constraints This constraint is the same as the immediate set state constraint, but there is a difference between them: the constraint in the first case is applied after updating, inserting or deleting while in the second case, it is applied before. Deferred record state constraints This rule is useless for a record state checking. It doubles the processing time of transactions, because the constraint must be applied twice. Deferred record transition constraints This type of constraints can require to access the log file to get the old value of the updated record. This is, the implementation is expensive and the processing is much higher. Deferred set state constraints The deferred set state constraint is applied at the end of the transaction (WHEN COMMITING). In some cases, we need this kind of constraints because sometimes the set of updates in a transaction violates temporarily the rule. Deferred set transition constraints This constraints are difficult to implement and expensive because they need to access the entire set of a transaction before and after updating it. Fansets integrity constraints : The fanset constraints are used in network databases. They prevent integrity violations by providing referential integrity. Triggered Procedures : We have seen that integrity rules are in fact a special case of triggered procedures. We can define a set of actions to be executed When insert, update, delete, Before insert, update, delete, After insert, update, delete. Triggered procedures are used to achieve the following tasks : Prevent the user that deleting a client will delete all its sales. Prevent that a record cannot be changed in this table, but must be changed in another table (when a set of records is duplicated in two tables) . Compute fields instead of having a field in the database that does that operation. Access security. Performance measurement of the database. Program debugging. Controlling stored record (compressing and decompressing data when storing and retrieving data). Exception reporting (expiry date for medicaments) CONCURRENCY Before going deeply in the explanation of concurrency control solutions, let ‘s review some important definitions : Contention occurs when two or more users try to access simultaneously the same record. Concurrency occurs when multiple users have the ability to access the same resource and each user has access to the resource in isolation. Concurrency is high when there is no apparent wait time for a user to get its request. Concurrency is low when wait times are evident Consistency occurs when a user access a shared resource and the resource exhibits the same characteristics and satisfies all the constraints among all operations. The problem of concurrency is major in a database system because when several transactions execute concurrently in the database, the consistency of data may no longer be preserved. It is necessary for the system to control the interaction among the concurrent transactions, and this control is achieved through one of a variety of mechanisms call concurrency-control schemes. Concurrency control achieves preventing loss of data integrity caused by interference between multiple users in a multi-user environment. When several transactions are processed concurrently, the result must be the same as if the transactions are processed in a serial order. There are several mechanisms which are used to accomplish concurrency control and that will be presented later, but now let’s take a look at this example : We have two bank accounts, and their balances are A and B. Let’s assume A = 100 DH and B = 100 DH. We have also two transactions, T1 transfers 100DH from B to A, and T2 adds 5% interest to both accounts : T1 : start, A=A+100, B=B-100, COMMIT T2 : start, A=A*1.05, B=B*1.05, COMMIT We can have the following sequences of execution : A=A+100, B=B-100, COMMIT A=A*1.05, B=B*1.05, COMMIT or Time A=A+100 A=A*1.05, B=B*1.05, COMMIT B=B-100, COMMIT The two sequences give different results (A=210 and B=0 for the first case, A=210 and B=5 for the second case). Our concurrency control mechanisms must ensure that the second sequence does never happen. In this part of the report, I will describe the most common concurrency control mechanisms and give examples to understand how they implement concurrency. 1. Lock-Based Protocols : An important component in the DBMS is the Lock Manager. This component can help us to accomplish two types of lock protocols : exclusive (X) mode: data can be both read as well as written. X-lock is requested using lock-X instruction. shared (S) mode: data can only be read. S-lock is requested using lock-S instruction. Transaction A Shared Lock Exclusive Lock Exclusive Lock Accounts Table Row 1 Row 2 Row 3 Row 4 Row 5 Row 6 Row 7 Row 8 Transaction B Shared Lock Exclusive Lock Exclusive Lock In this figure, transactions A and B can both read simultaneously from the table (shared lock authorizes the read-only , but can only update specific records to each transaction. The table below shows if a transaction can require a lock for a given object (row or table) knowing that it is already locked by another transaction. We see that if a row (or table) is already exclusively locked by a transaction, the system cannot allow a new shared/exclusive lock for a new transaction. Shared lock Exclusive lock Shared lock True False Exclusive lock False False When a transaction asks for a row lock, it waits until the concerned row is not locked by any of the other transactions. A transaction can proceed only if its locks requests are granted. PX Protocol : In this protocol, if a transaction needs to update a record, it asks for an exclusive lock. The following example shows how the concurrency control manager handles locks. We define two transactions that ask for either shared lock (lock-S) or exclusive lock (lock-X): Transaction 1: Transaction 2 : lock-X(B) lock-S(A) read(B) read(A) B = B -50; unlock(A) write(B); lock-S(B) unlock(B); read(B); lock-X(A); unlock(B); read(A); display(A+B); A = A + 50; write(A); unlock(A); The sequence of execution of these transactions is as follows : Transaction 1 Transaction 2 Concurrency control manager lock-X(B) grant-X(B, T1) Read(B) B = B -50; write(B); unlock(B); lock-S(A) grant-S(A, T2) read(A) unlock(A) lock-S(B) grant-S(B, T2) read(B); unlock(B); display(A+B); lock-X(A); grant-X(A, T2) read(A); A = A + 50; write(A); unlock(A); Here comes the notion of serializability. The goal is to find an interleaved execution sequence of a set of transactions that will obtain the same results as if the transactions are processed serially. For example, if we have two transactions, we have to look at the lock requests of each transaction and to find an order to execute them without any interference between then. The resulting sequence, if there is one, implies that the two transactions are serializable. Using the lock-based mechanism, deadlock and starvation can occur. The table below shows how deadlock occurs : Transaction A Shared Lock Already X-Locked Asks for an X-Lock Accounts Table Row 1 Row 2 Row 3 Row 4 Row 5 Row 6 Row 7 Row 8 Transaction B Shared Lock Asks for an X-Lock Already X-Locked Neither transaction A nor transaction B can proceed. We can avoid this deadlock situation by releasing the locks earlier. I will discuss deadlock avoidance more in details later on in this report. PS Protocol : In this protocol, any transaction that updates a record must firstly ask for a shared lock of that record. And during the transaction, before the update command comes a request of changing the lock-S to lock-X. Some problems due o deadlocks persist in this protocol. PU Protocol : This protocol uses a third lock state : the update lock. Any transaction that intends to update a record is required to ask for lock-U of that record. A U-lock is compatible with an Slock but not with another U-lock. We use this compatibility matrix. X S U X False False False S False True True U False True False This protocol is more efficient than the previous ones, because limits considerably deadlock. The Two-Phase Locking Protocol This is a protocol which ensures conflict-serializable schedules. As it is named, TPLP includes two phases : Growing phase : transaction may obtain locks and may not release locks, Shrinking phase : transaction may release locks and may not obtain locks. The protocol assures serializability. It can be proved that the transactions can be serialized in the order of their lock points (i.e. the point where a transaction acquired its final lock). Transaction 1 Lock-X(A) Read(A) Lock-S(B) Read(B) Write(A) Unlock(A) Transaction 2 Transaction 3 Lock-X(A) Read(A) Write(A) Unlock(A) Lock-S(A) Read(A) Unlock(A) This example shows the serializability using 2PL. There are many protocol that are derive from 2PL : Strict two-phase locking. Here a transaction must hold all its exclusive locks till it commits/aborts. Rigorous two-phase locking is even stricter: here all locks (shared and exclusive) are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit. Another alternative for 2PL is the graph-based protocol. In this protocol, we fix an order of accessing data. If a transaction has to update d2 and read d1, it has to access these data in a predefined order. The tree-protocol is a kind of graph-based protocols. Deadlock Avoidance : A system is deadlocked if there is a set of transactions such that every transaction in the set is waiting for another transaction in the set. Deadlock prevention protocols ensure that the system will never enter into a deadlock state. It can be achieved using different strategies : Require that each transaction locks all its needed items before it begins execution (predeclaration). Impose partial ordering of all data items and require that a transaction can lock needed items only in the order specified by the partial order (graph-based protocol). The following schemes use transaction timestamps, that I will present in the next part, for achieving only deadlock prevention : wait-die scheme : older transaction may wait for younger one to release data item. Younger transactions never wait for older ones; they are rolled back instead. And a transaction may die several times before acquiring needed data item wound-wait : older transaction forces rollback of younger transaction instead of waiting for it. Younger transactions may wait for older ones. There are fewer rollbacks than wait-die scheme. Both in wait-die and in wound-wait schemes, a rolled back transaction is restarted with its original timestamp. Older transactions have precedence over newer ones in these schemes, and starvation is hence avoided. Timeout-Based Schemes : a transaction waits for a lock only for a specified amount of time. After that, the wait times out and the transaction is rolled back. In this case, deadlocks are not possible. This scheme is simple to implement; but starvation is possible. Also difficult to determine good value of the timeout interval. Deadlocks can be described as a wait-for graph, which consists of a pair G = (V,E), V is a set of vertices (all the transactions in the system) E is a set of edges; each element is an ordered pair Ti Tj. If Ti Tj is in E, then there is a directed edge from Ti to Tj, implying that Ti is waiting for Tj to release a data item. When Ti requests a data item currently being held by Tj, then the edge from Ti to Tj is inserted in the wait-for graph. This edge is removed only when Tj is no longer holding a data item needed by Ti. The system is in a deadlock state if and only if the wait-for graph has a cycle. The system must invoke a deadlock-detection algorithm periodically to look for cycles. What to do when a deadlock is detected ? Some transaction will have to roll back to break deadlock. Select that transaction as victim that will incur minimum cost. We have to determine how far to roll back the transaction. We can either carry out : Total rollback: Abort the transaction and then restart it. Partial rollback: it is more effective to roll back transaction only as far as necessary to break deadlock. Locking Granularity: Another possibility of lock-based protocols is that allow data items to be of various sizes and define a hierarchy of data granularities, where the small granularities are nested within larger ones. Then, we can represent them graphically as a tree. When a transaction locks a node in the tree explicitly, it implicitly locks all the node's children in the same mode. Granularity of locking (level in tree where locking is done): - fine granularity (lower in tree): high concurrency, high locking overhead - coarse granularity (higher in tree): low locking overhead, low concurrency The highest level in the example hierarchy is the entire database. The levels below are of type area, file and finally record. In addition to S and X lock modes, there are three additional lock modes with multiple granularity: intention-shared (IS): indicates explicit locking at a lower level of the tree but only with shared locks. intention-exclusive (IX): indicates explicit locking at a lower level with exclusive or shared locks shared and intention-exclusive (SIX): the subtree rooted by that node is locked explicitly in shared mode and explicit locking is being done at a lower level with exclusive-mode locks. Here comes the notion of intent locking : Intention locks allow a higher level node to be locked in S or X mode without having to check all descendent nodes. This is the compatibility matrix for an intent locking protocol with 5 states. IS ( IX ( S ( S IX X S IX IS ( ( S IX ( X ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( Timestamp Technique In the timestamp protocol, each transaction is issued a timestamp when it enters the system. If an old transaction Ti has time-stamp TS(Ti), a new transaction Tj is assigned time-stamp TS(Tj) such that TS(Ti) <TS(Tj). The protocol manages concurrent execution such that the time-stamps determine the serializability order. In order to assure this behavior, the protocol maintains for each data Q two timestamp values: W-timestamp(Q) is the largest time-stamp of any transaction that executed write(Q) successfully. R-timestamp(Q) is the largest time-stamp of any transaction that executed read(Q) successfully. The timestamp ordering protocol ensures that any conflicting read and write operations are executed in timestamp order. The timestamp-ordering protocol guarantees serializability. Thus, there will be no cycles. Timestamp protocol ensures also freedom from deadlock as no transaction ever waits. But the schedule may not be cascade-free, and may not even be recoverable. Optimistic Concurrency Control : This kind of concurrency control is a validation-based protocol. Execution of or example transaction Ti is done in three phases. 1. Read and execution phase: Transaction Ti writes only to temporary local variables 2. Validation phase: Transaction Ti performs a ``validation test'' to determine if local variables can be written without violating serializability. 3. Write phase: If Ti is validated, the updates are applied to the database; otherwise, Ti is rolled back. The three phases of concurrently executing transactions can be interleaved, but each transaction must go through the three phases in that order. Optimistic means that each transaction executes fully hoping that all will go well during the validation. Each transaction Ti has 3 timestamps Start(Ti) : the time when Ti started its execution Validation(Ti): the time when Ti entered its validation phase Finish(Ti) : the time when Ti finished its write phase Serializability order is determined by timestamp given at validation time, to increase concurrency. Thus TS(Ti) is given the value of Validation(Ti). This protocol is useful and gives greater degree of concurrency if probability of conflicts is low. That is because the serializability order is not pre-decided and relatively less transactions will have to be rolled back.