i28 ^o^\ OCT 201987 ALFRED P. WORKING PAPER SLOAN SCHOOL OF MANAGEMENT HIERARCHICAL TIMESTAMPING ALGORITHM Meichun Hsu Stuart E. Madnick Aoril 1987 1876-87 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 50 MEMORIAL DRIVE CAMBRIDGE, MASSACHUSETTS 02139 HIERARCHICAL TIMESTAMPING ALGORITHM Meichun Hsu Stuart E. Madnick ADril 1987 1876-87 .I.T. OCT LIBRARIES 2 1987 RECEIVED Hierarchical Timestamping Algorithm By Meichun Hsu Harvard UniversityCambridge MA 02138 and Stuart E. Madnick Massachusetts Institute of Technolog^y Cambridge, MA 02139 Acknowledgement: Work reported herein has been supported, in part, by the Naval Electronic Systenis Command through contracts N0039-83-C-0463 and N0039-8o-c-0571. ABSTRACT The Hierarchical Timestamping Algorithm is proposed for handling database concurrency By analyzing transaction conflicts and partitioning the database into hierarchical parti- control. which transactions will access discriminantly using different synchronization protocols, the algorithm can offer significant performance gain. It also reduces the need for transactions to leave traces (e.g.. locks, timestamps) when accessing a data element. tions to It is shown that the algorithm is correct in terms of transaction serializability. This is done by showing that the algorithm enforces a topological order among transactions. This research advocates the potential benefit of application analysis in enhancing performance of concurrency control algorithms where the level of concurrency is vital to system performance. Introduction 1. The two basic approaches to database concurrency control are the two-phase locking tech- nique [Eswaran76j and the timestamp ordering technique (BernsteinSO, Reed78]. endorse serializability as their criterion of correctness. deadlock free and in drawback of assigned to transactions without among interferences is While having the advantages of being no need of unlock instructions when a transaction ing algorithm has the rigidly much transactions. Both techniques is timestamp- finished, the obeying the order of timestamps which have been consideration of the static or the dynamic nature of actual In this paper, the Hierarchical Timestamping (HTS) approach described which makes unique use of the nature of the application conflicts. By decomposing the database into partitions to which the transactions will access discriminantly, the approach can offer significant its prospect when performance gain handling database concurrency control. reducing the need for a transaction to leave a "trace" in (e.g., Equally important steinSO, BernsteinSlj a lock or a timestamp) among SDD-1 [Bern- as a vehicle to discover, a priori, certain (static) conflict patterns among transaction classes that may terminology) to be used. transactions has been proposed in the research of enable a more flexible timestamp protocol (e.g.. Protocol may more generalized timestamping algorithm that takes better advantage be provided via conflict analysis, presented in The SDD- .\long a difl^erent dimension, a novel at the of develop- of information that method has been [Viemont82' which blends timestamp ordering and two-phase locking chooses to switch to one or the other currency. in 1 However, with an orientation towards a distributed database system where timestamps are not stapled with data elements, the SDD-1 approach stops short ing a is accessing data elements. Conflict analysb I's in most opportune time so as to increase multi-version timestamping algorithm has also been developed and in one and level of con- shown to pro- vide a higher level of concurrency than the conventional single-version timestamping algorithm |Reed78, Reed79. Bernstein83 [Bernstein83. , Papadimitriou84J, Theoretical aspects of multi-version databases are discussed whilf one-pre\ious-version concurrency control methods in in _= 2 =— [BayerSO, and methods multiple-previous-version Chan82, [StearnsSl, in Simulation studies of multi-version methods have been reported Chan85]. and the Garcia-Molina82] in [Harder86, Carey86] results are generally favorable. Using conflict analysis and multi-version database as vehicles, the algorithm proposed HTS paper, called the Hierarchical Timestamping Algorithm, or the algorithm, allows a transac- tion to use an "array" of timestamps, rather than a single timestamp, to synchronize Which timestamp its accesses. a transaction will use to synchronize an access to a particular data element depends on which "data partition" the data element belongs hierarchical database decomposition proposed in [Hsu83j. Kedem83] protocol [SilberschatzSO, level of in this is to. The method is based on the In comparison, the structural locking a non-two-phase locking protocol which aims at increasing concurrency by reducing the amount of lime the locks on the high-level nodes of a tree must be held by each transaction. certain sequence and of the tree. it in this requires that the transactions access the nodes of the tree in involves lock and unlock protocols for each access to the high-level nodes The hierarchy chy discussed It referred to in their study is entirely different from the kind of hierar- paper. Before going into the detailed mechanism of the algorithm, the basic idea behind the algo- rithm can be illustrated by referring to an example schedule of read and write steps from three transactions t^.t^, and tr,'. io{\V,b), t^(R.a), t^R.b). ^(U•,6), t^(\V.c) where t^(R,a) refers to a request the basic timestamping algorithm, abort. b is stamped with a tr. t^ will to read data element a, t^ is and t^(W,b) to smaller than that of be denied permission to write data element read timestamp which Under our proposed method, however, related to that containing data element later), <, Suppose the timestamp of write data element 6, and so on. that time from transaction 6 in then the algorithm allows transaction if is t^'s), Using (since at and forced the data partition containing data element a certain tr. greater than that of b tr,. way to access b a to is (the exact nature to be explained with a pseudo' read timestamp TS earlier than aborted. t^'s timeslamp so that Furthermore, this write request can proceed without being delayed or /,'s pseudo read timestamp that no transactions will any longer write leave TS with 6 TS may be engineered to be early enough so with even earUer timestamps, eUminating the need to b as a read timestamp. Therefore this read access can be accomplished without having to leave any "trace" for concurrency control purposes. With these properties, the proposed algorithm has the potential of increasing the level of concurrency while reducing the costly over- head of leaving access traces. The organization tamping Algorithm section four, 2. we is of this paper described is as follows. In the following section, the Hierarchical Times- The proof in detail. of correctness discuss the optimality aspect of the algorithm. is given in section three. In Section five concludes the paper. The Hierarchical Timestamping Algorithm The tions. We HTS Algorithm requires the decomposition of a database into a number of data construct a data partition hierarchy which tions subject to certain constraints. basically a partial order of the data parti- is Hierarchical Timestamping Algorithm The parti- is a timestamp- based concurrency control algorithm parameterized by a chosen data partition hierarchy and corresponding transaction classification. data partition, the HTS When its the database decomposition consists of a single Algorithm degenerates to the conventional multi-version timestamping algorithm. We will in this section first discuss the for constructing a method database partition hierarchy. of transaction analysis Then in Section 2.2 and the mechanbm we present the algorithm The purpose of transaction itself. 2.1. Transaction Analysis Let a database be decomposed into a number of data partitions. analysis in our algorithm is to facilitate which the hierarchical timestamp protocol correct construction of a data partition hierarchy on is based. The data partition hierarchy, constructed off-line, is basically a partial order of the data partitions subject to additional graph-theoretic con- and constraints related straints and any set of transactions, a correct data partition hierarchy can be (e.g., can outperform the basic multi-version timestamp algorithm) depends on an intelligent it choice of database deomposition and data partition hierarchy. constructing a correct data partition hierarchy 2.1.1. found and the hierarchical However, performance and optimality of the protocol timestamp protocol applicable. whether Note that given any data base partition to transaction analysis. is mechanics of In this section, the discussed. Transaction analysis against a database partition Analysis of a set of update transactions T, against a database consisting of a set of data partitions results in a data partition graph, where the nodes are the data partitions, and the arcs are assigned in such a D- if and only elements in if way that there an arc from a data partition one can find a potential transaction and accesses Z), is (i.e., in Z), to another data partition the database system that updates data reads or writes) data elements in D . That is, D^—*D-, i^j, indicates that there exist transactions in the system that would potentially link updates of data elements in D, to the content of data elements Definition. Let be a decomposition of is 7", D a digraph denoted as set of directed arcs (f T", s.t. The data to . into data partitions Z)[,Z)o DPG(P .T^] jommg A D^. data partition graph of a( Of^Z) partition graph Note / is ^^i?, where w(t), r{t) (The access set a{t) is iff all ad hoc queries. Let w.r.t. P P T, and a there exists a transaction and a(t) are the write set, the read set the union of r(t) and w{t).) a tool for capturing the pattern of conflict that, in our algorithm, there is among transactions no need for read-only transactions participate in the transaction analysts, eliminating the difficulties of pinning nature of P with nodes corresponding to the data partitions of these nodes such that, for iV;, D^—'-D set of transaction to be run in the system. P be a set of update transactions to be performed on a database D. w(t)r^D^^4> and and the access in down, a priori, the Constructing a data partition hierarchy 2.1.2. Once a data partition graph is defined for a database, a data partition hierarchy, which assigns a partial order to the set of data partitions, can be derived as follows. partition hierarchy, denoted as in P , and arcs between DPH[P ,T^) (1) P Given a data decomposition Definition. is DPH[P ,T^), b partitions, denoted as a semi-tree (a semi-tree and a data partition graph a data any acyclic graph where nodes are data partitions —, is DPG(P ,T^), such that an acyclic digraph where there exists at most one path between any pair of nodes,) and If (2) there exists a directed path between D- and D^ directed path between D, and D^in Note that tive arcs in is P and T, , the existence of a P and a data partition graph DPH guaranteed. is G order of the data partitions satisfies the definition of DPH(P .T,)- remainder of the paper, the notation DPH than a data partition D^, denoted as DPH. or In general, and Z), 2.1.3. D we say D->D', if DPG(P ,T^). of the data partitions in refers to a particular chy chosen to base our hierarchical timesiamp protocol. We satisfy the above However, This directly follows from the fact that the transitive reduction of any acyclic digraph In the there exists a necessarily a transitive reduction: there are no transi- There may be multiple data partition hierarchies that definition given a database decomposition given any DPG{P,T^), then DPH{P.T,). a data partition hierarchy DPH{P .Tu). in P that a total is data partition hierar- say that a data partition D- there exists a directed path —*D- D^—* that data partition D, and data partition D^ are related if higher is either in D-=D^ are connected by a directed path Transaction Classification Given a data partition hierarchy the data partition(s) they write into exists a transaction t(T DPH , it is possible to classify transactions in T, based on Based on the definition of data partition hierarchy, which writes into more than one data partition, then all if there the data parti- tions that writes into it must be connected by a directed path Therefore, the rule for transaction classification Assign each update transaction We denote The I being assigned to as follows: to the highest data partition D- as /fD,, must be related hierarchy. and D- called the is (read or write) partition of t. to the home data partition of that transaction in the data in the next section. is related to the D, in t accesses DPH. The Hierarchical Timestamp Protocol The timestamp protocol (HTS) hierarchical by a chosen data partition hierarchy DPH is a concurrency control algorithm parameterized and the corresponding transaction the conventional multi-version timestamp algorithm t, home data Let tiD,. Then the data partition that contains a data element that Property. tamp writes into. it This property completes the background for establishing the hierarchical timestamp protocol to be described 2.2. the data partition hierarchy. following property establishes that the data partition that contains a data element that a transaction accesses partition Z), t is in to a transaction denoted as accesses may I[t]. t when is initiates. We will call this algorithm assigns a timesinitiation timestamp of be different from I(t). home In particular, the timestamp to access data elements partition will normally be true for accesses to lower partitions. somewhat smaller than The exact mechanism used is still I(t), in its data while the to calculate these preserved. Basic Definitions Two functions, called the Decrement function, or the function DEC. and the Increment function, or the function I.\'C. are devised to data partitions different from tamps HTS timestamp the "access timestamps" will be such that the overall serializability 2.2.1. the Like However, the timestamp the transaction actually uses to synchronize partitions higher than Ts reverse it (MVTS), classification. for its own home compute the timestamps a transaction uses partition. Function and for accessing higher data partitions, !.\'C DEC is to access used for computing times- lower partitions. These definitions assume that a data partition hierarchy Two values, and C(t), /(/) DPH are given. is in the system. We time, denoted as I(t). associated with each transsaction assume that the timestamp assigned to a transaction b its initiation t Therefore every data access request issued by a transaction must occur after tion also assigned a is actually finishes; saction is In typical cases, C(t) is Each transac- when the time t no data access request would normally be issued at time after C{t) by tran- i.e., However t. commit timestamp C(t)>I{t). I{t). there are exceptional occasions (as explained in procedure Clater) forced to be a smaller value than the time when t actually finishes. It is when C(t) sufficient for our pur- pose to have I(t)<C(t). m' , Definition. A Definition. The function i.e., transaction /, m'=/,(m), such that m' active at time is t m if I{t)<.m and C{t]'>m defined for a data partition D-, , is maps . a time m to another time the initiation time of the oldest active transaction assigned to m. Formally, D, at time 'm if m there exists no tiD^ active at time /.(m)=' .Min{I(t)) otheru'ise Z), and Z) for all ttO^ active Let the Decrement function Definition titions , , where w D£C._,(m) = if D DEC^ '>D,. DEC^ ^ recursively time m. at be a function defined for a pair of data par- maps a lime to another time as follows. D=D- Ij(m) \l D,^Dj IS m DPH DEC^-[DECij^{m)) otherwise, where D--*D^-*....-*D^ That is, cessively (i e., example, if We the function along the DEC^ critical ^ maps path of a time DPH) the critical path between D, and D^ will now inverse of functions m and DEC^^. to the initiation time, in DPH. DEC-j(m), of suc- the oldest active transaction assigned to D^. is describe the functions C, and 7, in Z), is D,—-Di^ ZA'C'^-,, — Dj, then For DEC, j{m)=I^{I^{m)). that can be considered conceptually the _= 8 =— Let C,:m Definition. data partition and C,(m) if is — m' be a function which maps a time m to another m' where D- is a determined as follows. m, there exists no t(D^ active at time Im Max(C{t)) otherwise, That C{m) is, is the latest for all teD- active commit time of all time m. at transactions assigned to Z), that started before or at time m. While the DEC maps function a time in a lower partition to the initiation time of INC transaction assigned to a higher partition, the the commit time of Definition. D yD , some transaction assigned The Increment denoted as INC^(m). tm INC-,{m) = if is function maps function, defined for a pair of data partitions P, and a function which maps Now we £)y, where a time value to another such that D.=D, C-{m) Hierarchical a time in a higher partition to to a lower partition: if D,— D^ ts in DPH INCi^-{INCji^(m)) otherwise, where £)•-.... 2.2.2. some — r'^^—Z)^ is in DPH Timestamp Protocol introduce the hierarchical timestamp protocol, where synchronization timestamps are calculated according to functions DEC and INC above. Hierarchical Timestamp Protocol For Update Transactions: For every database access request from an update transaction tcD^ the following protocol Protocol If (1) E D is data granule dtD-, observed; (for accessing a partition eqxial to =D^, then (equivalent If it is for a home partition) to the basic multi-version timestamping) a read request, then grant the latest version before I{t) of d, and leave I{t) as the read timestamp of this version of d if its current read timestamp is smaller than I(t). _= g =_ If it (2) is a write request, then the read timestamp of the latest version before I(t) of d if smaller than /(/), then create a new number version of d with version is Otherwise abort /(<) t. Protocol If H (For accessing Higher partitions) Dj>D- then grant access to the latest version before t DEC^ (lit)) of d. Protocol L (For accessing Lower partitions) If D,>D-. then If (1) it is a read request, then grant the latest version before INC^ j{I(t)) of d, and leave INC^j(I(t)) as the read timestamp of this version of d its if current read timestamp is smaller than IXC,j(I(t)). If it is (2) d is a write request, then than smaller /A'C, if (7(0). INC-JI(t)). Otherwise abort 2.2.3. Implementing To implement the HTS HTS C,(rn), and it 9, create a new version of d with version number t. protocol, we must We >0, generic specify will first ProcI^, used for (2) constant time quantum then {!{t)) of Protocols C-{m) are computed operationally computing the read timestamp of the latest version before INC- to a how C(t) assigned and how^ A('") ^^'^ is introduce two procedures: (1) ProcC^, used for computing I,{m). data partition remembers that the procedure has been invoked Z),, The procedure ProcC-(m) adds to the argument value. m for time value a In addition, so as to constrain future assignment of C(t) for teD,. The procedure Procl^(m) computes the initiation timestamp of the oldest active transaction in the and HTS D protocols are be D,—Dj— Proclj[... [Proclaim]...), Z), at time computed m With PtocC and Prod as follows Let the path in DPH —D,- The timestamp DEC^ ^(m) and that usi-d in Protocol L is defined, the timestamps used in used between two data partitions in computed These procedures are consistent with thf definitions of the functions Protocol as H is computed £), as ProcCi^{...(ProcC-(Tn)...). DEC and INC given in the ; _= =— 10 previous subsection. These procedures are described as follows: ProcC^(m): Procedure. If m+q- less is Mark m; /* than current time, then return ("Abort Requestor"); Create a pseudo transaction ^ Return <*(£>, s.t. I[t! )=m and computed is for liD^ there exists an invocation of ProcC\{m') minimum of m'+q^ over In other words, if mit timestamp of tiD^ all such m' simply Ts is noted that Z), the time If t is finish time. at least quantum 7, g, 's where I finishes at time <m — m then C(t) q^ as follows: asssigned the is Otherwise C(t)=m. . ProcC\ has not been invoked (which also implies that sactions in C{l!)=mJrq-. I{t)<m' s t. in £>, When them was invoked with an argument value which is */ (m-\-q-). The commit timestamp C(t) If m remembers that ProcC- has been invoked with argument is during the such invocations at least long), then t's exist, commit timestamp t and time units before q- are selected to be large time of life then the comat least time of It enough such that most of the tranq^ after they start, the chance of having to abort such transactions due to aborted all time <'s finishing pushed backwards. is that require accesses to lower data partitions would do so within cedures would be relatively small, and virtually one of units of time ProcC- pro- C{t) would be the same as the actual finish t. The procedure Proel^(m) is defined as follows: Proclaim): If there exists unfinished transaction finished at the time Proclaim) suspend requestor Compute till t is t at time rn invoked,) and it s.t. is its finish known time that C{t) ); unknown is m (i.e., samller than finishes; the initiation time of the oldest active transaction at time Return {m' is as m' t is not m. then . — = 11 =— It is Ci{m') in noted that Prod- would suspend requestor only PTOcI-{m) q^ s.t. for appropriately large 9,'s there exists a previous invocation of transactions started before m' have not finished by the time Note that Procl^(m) invoked. is <m — some m' for Z), if always invoked at a time later than m. is Therefore, the chance of an invocation of Froc/,(m) needing to block is very small. To to ensure that the procedure /,(m) compute an access timestamp, Since /.("i) is all implementable, we argue that when /.(m) is information needed to correctly compute /,(m) always invoked at a time later than information needed to correctly compute I,(m) At time m, we only need to consider mf inserted at time ing regular transactions in all >m is it only making Z), pseudo transactions. INC- {x) then there must unfinished as is if /('')>"'/ — 9, m it , with to the behavior of ProcC^ is Therefore >m— If 9,. Z), m. For all Ts actual are also finish known by time 2.2.4. s.t. t" is is inserted as a result of comput- I(f')=I(t') and at time at time mf . mf it is still Therefore the effect of which b known by time m. on the path between D^ and If D^ t' with I(t')<m must be such f is then due known by time m. regular and pseudo transactions that can influence the value of I,{m) are m all ProcC\ the argument that ProcC- receives must be greater than the time when transactions smaller than such f segment where D, invoked. Therefore a pseudo transaction all argue that at time m, Pseudo transactions f with I{f)<m would be a request to lower data y(x) available. I{t)<m would have been known. Therefore exist a transaction t" in computing ISC^. sufficient to is invoked available: the pseudo f on I,{m) would be dominated by that of inserted as a result of is is time. known to be unfinished at time known by time m due Therefore all m whose commit timestamps would be to the rule for assigning C{t) to values other than information on transactions needed to compute I,(m) m Remarks To summarize, known by several observations are made of this set of protocols: is — — = 12 = If (1) DPH the database partition hierarchy col E and the hierarchical timestamp algorithm will apply, MVTS consists of a single data partition, then only Proto- H E The key benefit of the Protocol H or Protocol L, since is used HTS H is partition, Protocol "cheaper" than either does not require timestamping the data element accessed. it algorithm comes from choosing a database partition such that much more than Protocol L. Since L\'C- j(I{t))'>I(t). Protocol L sactions. is tends to increase the chance for transactions larger there is is 9,, less probability for designers in the better off are transactions in D, them to abort; to lower data partitions. Offering tradeoffs may parameters as reflect their the most "expensive" perceived priorities the three, as it lower data partitions to abort. In fact, the when they and the worse is among access lower data partitions, as the transactions belonging off are intrinsic to hierarchical among all timestamping. System transactions by their choice of such q-. Proof of Correctness 3.1. Proof Structure To show We own home More importantly, Protocol needs to cover only read accesses. Protocol its Protocol L uses timestamps which are different from the timestamps of the accessing tran- (3) 3. reduced to the conventional algorithm. Since no transaction will write a data partition higher than (2) is first that the above algorithm id, A schedule If is one must show that a sequence of steps, each of action, version of a data granule sion of a data granule version. correct, formulate the problem of testing serializability, and then apply Definition. saction is the action bv the transaction. is is If denoted as d , which is in >. The action can be read serializability it to the HTS is algorithm. the form of a tuple <tran- (r) or write (w). is The ver- where d indicates the data granule and v indicates the write, then the version of the data granule included in the step the action enforced. read, then is created the transaction reads the version of the data _= granule indicated in An example the tuple. =— 13 of a schedule is <<,,w,(/'>, <t2,T,d'>, <t2,v/,d^ >, <t^,T,d'>. <<, Definition, .\ssume that a version order, denoted as element d that j is the predecessor of a version k of d << << A in (1) <(,,w,(f^ (2) <t^j.d' is > > TG(S) of a data j i of d such < t^.r.d' > and and <<2,w.(f following in is a directed graph, denoted as S and the arcs, representing direct dependencies iff a predecessor oi d The version exist according to the following rule: between transactions, is A k and there exists no version transaction dependency graph of a schedule S TG(S), where the nodes are the transactions r„— /; << j given. k. i Definition. if is if are in S for some d' , or > are in S for some d' < k and there exists no d j theorem b adapted from the d , where d' such that the predecessor of d \s j 1-Serializability < o < . {d' k.) Theorem proven in [Bern- stein83]: Theorem Given a version order Given the above correctness of the (1) HTS < <, a schedule S that direct dependencies partitions that are related (3) Show (ie,, cvcle. that if if TG(S) is acyclic. prove algorithm: Define a relation topologically follows, denoted as Show seriahzable rule for testing for serializability, the following steps are devised to are assigned to related data partitions in (2) is in if between any pair of transactions that DPH. may occur only between transactions that are assigned to DPH. a schedule S enforces the relation tr,—*t^(TG{S) only =>, => on all direct transaction dependencies ';=>',•) then the transaction dependency graph TG(S) has no — — = 14 = Show (4) HTS that the algorithm produces only schedules that enforce the relation => on all direct transaction dependencies. The proof HTS may protocols much structure of the proof INC, not is important in understanding how the implementation detaik of the be modified without having to reconstruct proofs from scratch. is In particular, accomplished by referencing solely the definitions of the functions their implementations. of the proof in this paper. The only exception Only a small portion of is its m the last step (i.e., step (4)), DEC and a key step proof directly cites the implementation of the protocol. This portion will be clearly identified. 3.2. The topologically-follows Relation and We now t^, say that A relation topologically follows, denoted as tr, where t^ topologically follows = t^^cD-, tntD^. /?, In D and (or Z>,. (2) if D,>D^ then !(t^)>DEC^,(I(tn)). (3) if i?,.<Z)^. then /(<„)<Z)£'C,^(/(^,)) related in is defined for a pair of transac- We iff then /(?,)>/(/o), if Z)^. =>, are related in the chosen data partition hierarchy. <j=>io) (1) Clearly, Properties define topologically follows, which assumes a given data partition hierarchy. Definition. tions its => is defined only between transactions that belong to data partitions that are DPH, because Lemma. Given a otherwise the DEC function DPH t^eO, and tncD^. and let is undefined If <2~*'p ^h^" ^, ^"'^ ^j ^'"^ related in DPH. Proof. Let d(Di^ be the contended data element that causes '2"^'] saction writes d. without loss of generality, definition of data partition graph Since tr, must at least read £),—( = )D^—( = )Z),£Z)56'. d. ba.'^.ed assume it to be *, that writes d. on transaction analysis, either therefore This means let's tiiat either if D^—* Df^cDSG . or Since at least one tran- Then by D^—*D^iDSG Dj=Df.. the or Dj^^D-. This means that D^v=^l\. then there exists a directed path between . . _= D- and D, DSG in . By path between D^ and D- =— 15 the derivation of the data partition hierarchy, there must be a directed in DP//, which means that D^ and theorem concerning the In [Hsu86] the following Z>, are related in DPH. Q.E.D. => partial ordering effect of the relation was proven: Theorem 1. Let TG(S) be a transaction dependency graph of a set of update transactions DPH, run on a database. Let S enforce the synchronization rule that, given a if t2=>ty. Then TG(S) has no In fact, a that if cycles. statement stronger than the above was proven S enforces the rule that <o—^i«TG(5) only ~ -^ T"^ AL ^ if (2~^TS in What was [Hsu86]. aL^i ^^^" TG(S) has no TS maps each proven was cycle, TS and defined to be topologtcally follows with respect to two functions the requirements that (j^^'i^^^l-S^) only transaction to a unique time value, and AL- .4L, where with , satisfies the following properties: (1) ComposabiUty: AL. (AL,^(m >D,> (2) ))=.4L,j(r7i ) , all all Z), m' where => defined here is has been assigned the initiation timestamping 3.3. . It is HTS m and for all Z), , Z?^ and D- where D- > and D^ where D^ Z), and for all times m >m' In essence, the relation DEC^ times D,. Non-decreasing: AL,j{in)>AL^^{m') for m for easy to verify that / and DEC\ an instance of / =>ts of transactions, al ^'here the function and .4L,y satisfy the requirements for TS has been assigned TS and AL Enforces topologically-follows Given the above theorems stating that the relation => can be used as a vehicle for order- ing transactions for concurrency control purposes, the following theorem completes our proof of correctness. Theorem 2. Let S be a schedule thai Then t^-^t^iTG{S) only if t^=>ty is permitted by the hierarchical timestamp protocol. . _= Let '2*^; ^"'^ 'i^A- Ptooj. '2"^'] '* =— 16 translated into the following three cases: (1) <o reads a data element itD^^ written (created) by l^ (2) fj writes - version of) a data element diD^^ whose predecessor was read by version of) a data element diD^^ whose predecessor was written hwrites (creates a ^2 (3) new creates a (i.e., (created) by What new and the version created by i^ has to be shown lemmas which bind that is the functions three cases in all DEC t^ is and AVC read by another transaction. i^^X-x 1. DEC,.^(INC^,(m])<m. Lemma 2. DEC- We Lemma m first active at time m in D, active at time m be D, , t^, let least as old as /,(C',(m))</(<^)<m between D definition of DPH in DPH, c>0. for all and D DEC t„. . . all i,r7i. If and I^{C^{m))=I^{m)=m . If the transaction with the largest then C^{m)=C(t^). saction active (since at least at >D, prove that /,(C',(m))<m for C^{m)=m in for all Z)^ in prove these two lemmas. then time , D^>D, =>: 1. We Proof. ^[m)+i)>m ,(INC, will first for all two together are used to transform the timing rela- tionship imposed by the protocol to that defined by the relation Ltmma In all cases, the following holds. t^ where is t„ active), was there exists no transaction active at there exists at least one transaction commit timestamp among Therefore, at time C^{m), there and the oldest active transaction active at time and D —*D,—-D^—>' —/)„—/),. L\C DEC,JL\r),{m))= . = . Q.E.D. 1,771. Let the must be Therefore critical path to the I,{IJ-\IJI„{C,,{C,,{...{C,^{C^[m)))...))))..]). -^;(A„(-(A2(^'.2(-(^',n(<^;("' )))•)• have DEC,.(L\'C^^(m ))<I^[C^(m ))<m at that time Then bv expanding according Let C,2(...(C',„(C;(m)))...)=m,2, then we have lJCJm,„))<m,.^. ^;(AJ-(A2('".2)) all that are at least one tran- m, which means I(t^)<m. Therefore we conclude that I,(C^(m))<.m for be is all Therefore DEC,^(LWC-,(m))< Continuing with the same reasoning, we . — = 17 =— Lemma 2. We Proof. (a) If first no transaction prove that /,(C,(»n ))+«)>»" for oldest active transaction, (b) If some transaction the proof of at time m Lemma is still concludes if is C;{m)=m. Therefore m active at time in D-, then let That Z),-D.,-D..-Z),3^ all «,m. IJC„(C,.(m,,))+e)>C,^{m,,). That is. is, "1,2=^,2(^1,3). By Now we DEC,.(L\C^,(m]+i)= above the I„(C,^(C,^{m,,))+i)>C,2{m,^)+i' to both sides, /,o(/,i(Qi('-'.2('".3))+0)^ A2(*''.2('".3)+^') >'".3ing one derives between path the Z), . Theorem 2. and and in one (b) be D^ and let we have conclusion, Therefore, applying /.^ Continuing with the same reason- I,(IJ.-(I„{C„(...(C,,{C.{m))...)+i)..)) are ready to complete the proof for (a) C,,{...{C,,(C.(m)))...)=m„ Let Thai )=rM,2. Let was as at time C(t^)+i, no transaction active , for C-{m)+i the way be defined the same t^ /,(C(/^)+f)>m or /,(C,.(m)+<)>m. Combining is, An-^r )))... at time Therefore /,(C,(m)+£)>m any, cannot have started on or before m. I-(C-(m))>m C,-2(Ct3..,(C,„(C(m then in Z), we have C^(m)=C{t^). Therefore 1, active; that m active at time Ls There are two cases. i,m. all >m. For the three cases Q.E.D.. identified pre- viously: (1) In this case. D^>D).. Since D and Z), must be related, we have three cases and each for case observing the hierarchical timestamp protocol leads to the conclusion that <2=>'i- D->Di>D^. In obeying this case, HTS leads to the assertion: ISC^j,(I{t^))>INCi,^[I{t^)) Therefore ISCji^(I{tn)]>INC^j^{I{l^))+i. (0.1). Applying function DEC^.^ to both sides we obtain DEC^.[ISC^j,{I(t^)))>DEC\,(INC,^(I[t,))+i) (0.2). right hand side of (0.2) is greater than I{t{). From Lemma Applying function DEC, j 2 above the to both sides have DEC,/DEC,,{L\'C^,{I{t^m>DEC,^(DEC,,(INC,,(I(t,))+i) >DEC,JI{ti)). ever, the left-hand side of the is less than or equal to D >D >D,. In above expression I(t,J. this ca^e (O.l) (a) = DEC/^ -{INC^ i,{I(t^))). How- Lemma which by Therefore I{t2)>DEC--(I(t^)), which means t.= >t^. holds also. Apply DECj,, DEC),{Aj(L\C)i^{I{tr,))))>DEC\J!XC,JI{t\))+(). to both side of (0.1) Right hand side is >/(/,). we 1 (b) we have Left hand ) _= side (2) <DECj,{I{t^)). is =— 18 Di>D^>D^. In this case INC\^{I(t^))<DEC-i,(I{t,)). Apply DEC,,^ to both sides D In this case, yOi^. By same reasoning the in D->D^>Di^ and D->Dj>Di.. Now we consider at which performs writing f„ P£'C-^(/(<,))<4(m)<m Combine ProcCj{I(tr,))). to both sides DEC;, Otherwise ^°^ with this we write the m get I{tr.)>DEC^^(I(t^)), have /(<,)<m and been rejected D-=Di,. (a.2) L Protocol the (in Apply i,(I{tr,))>DEC-i,(I(t^]). to=>ty i.e., i.e., , by case, would INC. have request to read d (a) If (j's this request we (0.3) In for the cases of Dj'>Df.>D-. Let the time had started at or before /[ Dj>D^. (a.l) (0.3). the cases where Consider two cases, . m. which means arrives at or before time '^'-^ *(^('2))^'" m be rf (c) '2=>'r i.e., we can show t^=>t^ (1) t^=>t^. INC-^{I{t^))+i<DEC.^(I[t^)). i.e., we have l{t^)<DECj^{I[t^)). means which DEC^,l{t„))>I(t^), Therefore In this case. J- INCfore Since i^{I(tr,))=I(tf,). /|fc( w )</('o)), makes a request when except ^o satisfies m pushed backwards to be before to be DEC- procedure for computing after is tr, m . . at time m, t^ is active at time conditions under which |^(I{t^)) for m' <m ^^'s would have to block <o already existed, and by Protocol <2 it (3) In this case, are already is it suffices to created by shown Proof of Theorem in (1) 2 is in HTS prove that t^. if ^j not to have Since in tr,= ^2=>^ >t^ when this case in to mdependent (2)(a). is ProcIi^(m' t^ finishes, m . By same reasoning which There- in (a.l) m, then the version created by chooses to read a version before that created by and therefore ^n=>^,. largely except for arguments protocol H d arrives after must be INC^j^{I{tr.))>DEC,i^(I{t^)). Therefore predecessor Prod (b) If <,'s request to read till request to read d arrives before fore I^{m)<I(t,). Therefore L\Z)i^(I[tr,))>I;^(m)>DEC,^{I{t^)). (and there- commit timestamp However when such conditions hold the contradictor}' to the given that we have <o=>^,. its m tn (b) creates a Z),>£>^and new D^>D^. the version of d whose two relevant cases Q.E.D. of the implementation procedures for The dependency leave read limef^laiu^r'^ arises If HTS ProcC and due to our desire to allow the H modified such that Protocol H is _= 19 =_ also required read timestamps, then the proof can be constructed entirely functions DEC and INC and from the definitions of their properties without resorting to implementations. Discussion of Performance 4. Performance of a concurrency control algorithm can be assessed along two dimensions. One is the computational overhead, such as setting locks and timestamping data elements. is the level of concurrency, or "optimality" of the algorithm. The other Along both dimensions, the algorithm potentially performs better than the NTV'TS algorithm, if HTS a data partition hierarchy can be found such that the frequency of accesses from transactions assigned to lower partitions to data elements contained other words, is used if higher data panition? dominates that of the reverse direction. in the data partition hierarchy much more is chosen such that Protocol often than Protocol L (the "expensive" protocol,) H In (the "cheap" protocol) then a net saving may be achieved to warrant the use of this algorithm. The HTS algorithm requires a transaction analysis and the overhead of maintaining selective information on transaction initiation and commit limes to enable computation of the B and enforcement of the However, the fact that Protocol function. tamping the data elements makes it is HTS function does not require times- very attractive for situations where transactions assigned to a lower partition need to access a certain quantity of data the H A algorithm can potentially produce a net savmg in in a higher data partitions. computational overhead if Therefore Protocol H used "often enough." Now we analyze the optimality aspect of the algorithm. currency control algorithm is In [Kung79j, optimality of a con- defined as the degree to which the algorithm allows for "serializable input schedules" to proceed without being mterrupted Intuitively, an input schedule, which interleaved sequence of read and write requests issued by a set of transactions, serializable (NrV'SR) if there exists a way is is an mulli-version- of assigning versions to these read and write requests such that the resulting transaction dip('n(i''n(> graph i- acyclic For reasons of implementability, . _= =— 20 no known multi-version concurrency control algorithm However, the degree proceed without interruption. H is HTS comparing the used much more measure of HTS extreme case when Protocol L Lemma acyclic, then S. If in is this number if of restarts Protocol (i.e., MVTS tran- algorithm. statement, the following theorem confirms that, not used at the data partition graph among all, the HTS algorithm dominates the NfV'TS DPH DPG{P ,T^) of some database decomposition P DPH{P ,T^). the set of correct data partition hierarchies such that Protocol L exists can be argued that it schedules terms of optimality. in assigning to algorithm, MVSR concurrency. algorithm would be smaller than that under the While further studies are needed to quantify algorithm N-fV^SR input schedules to level of frequently than Protocol L, then the expected saction aborts) under the in the MVTS algorithm to the all which an algorithm allows to to proceed without interruption can be used as a useful In will allow is not needed to process update transactions the partial order of data partitions exhibited in DPG{P ,T^) in T, . there exists is DPH (This can be shown by DPG{P ,T^), and since no path from a higher data partition to a lower data partition, there exists no accesses from transactions assigned to higher data partitions to the lower data partitions, and L therefore Protocol Theorem is Let 3. partition hierarchy never needed.) F DPG{P ,T,] be a database decomposition and DPFI be such that Protocol L is never used in Let the data be acyclic. processing transactions in T, Let 'S(.\f\'TS) denote the set of serializable input schedules that are allowed for execution by NPk'TS without interruption, and hierarchy DPH S(HTS) where both Protocol H that by the HTS algorithm under the data partition and Protocol E are used. Then S{\{\'TS) is a subset of S(HTS). Proof and (1) We must show that (2) there exists at least (1) one schedule allowed by Let a schedule ScS{M\'TS). Then any schedule allowed by NfXTS must be alloHt-d by HTS, Want there must exist a step {l^,w,d) to in show 6' HTS that that that is not allowed by S(S(HTS). HTS will abort. Suppose We will \r\TS. this is not true. show that this is _= Lei t^(D^. impossible. Since Protocol L HTS /(<,). Therefore HTS that will cause the is case, no read timestamp (2) equal to that is it suffices to {t^,r,d) all is left show that there exists no (tj.r,d)<(t-,w,d) where t^eD^, either 0^=0^ or I(ti) D^>D^. by HTS. In the former case, the read timestamp D,>D^ m DPH. and of S{.\fVTS). The above SiS(HTS) and S be achieved, but a design H if left DPG(P ,T^) is is HTS is afD,, <,eZ?,, member not a in is acyclic, then the terms of level of HTS P can be found such algorithm will definitely per- concurrency. When such design cannot found that enables construction of a data partition hierarchy where much HTS higher than that of Protocol L, the expected to perform better than the NATS algorithm can still rithm to a simple Iwo-data-partition database. In the analysis, it Ls HTS The city of the case enables the use of simple analysis to obtain formuli for the rate of blocking in the algo- assumed that the higher data heavily contended for by transactions than the lower data partition. two-phase locking (2PL) due to conflict be algorithm. [Hsu83b,, a simple analytical model was established for an application of the b more by the set of update transactions such a way that a non-trivial database decomposition in form better than the M\'TS algorithm usage of Protocol d. Q.E.D. that the data partition graph partition Clearly with 5 /(<,). result implies that, given a database application, r, can be designed In I{t^]<I[t^). in In the latter Consider the following schedule: S={t ^,r ,a),{tn.r .a),{t„,w ,a),{t„w ,,h), where t^iDj. biDj. before is never used, the timestamp used to synchronize (t^,w.d) by by M\''TS, which must be smaller than left .r,d) {t algorithm to leave a read timestamp greater than true because for This is =— Let (t^,r ,d)<{t,,w.d) denote that fact that step (tf,w,d) in S. must be 21 simpli- under higher data partition, algorithm and the rate of makes assumptions that are in general in favor of A abort under the the algorithm 2PL approach and the effect of the the HTS much more HTS The analysis discriminate^ against the algorithm in HTS approach. The purpose is to demonstrate relieving contention in the higher data partition, heavily contended data partition in our case. The result, reported presumably in [Hsu83b], _= shows that, in general, the rate of blocking where A/Fj is the number 22 =— under 2PL proportional to is +ar,*MP^*\fPr,. a ^*,\fP^ system at any time and type-1 transac- of type-1 transactions in the tions are those assigned to the higher data partition, A/P2 that of type-2 transactions transactions are those assigned to the lower data partition, and in D, accessed by a typical transaction assigned abort under HTS is proportional only to a, A/P,'. parameters, one can conclude that 5. to D, HTS , at where is the i'=l,2. number and type-2 of data elements In contrast, the rate of While the absolute values depend on other would be preferred when Oj'A/Pj is large. Conclusion The Hierarchical Timestamp Algorithm can be considered as a generalized timestamp rithm parameterized by a data partition hierarchy constructed via transaction analysis. It algo- relies on an understanding of the structural disciplines of an application system and represents an attempt to take advantage of these disciplines. By being tures and being tural approach to concurrency control holds promise flexible in manipulating the basic control tools databases where the level of concurrency The It sensitive to the existence of such struc- is vital to for (e.g., timestamps, locks) this struc- improving applications or activities in system performance. thrust of this paper thus goes beyond proposing a new concurrency control algorithm. demonstrates the potential benefit of exploiting the knowledge and the structure of the applica- tion systems to implement more efficient also points out an area of research and more tailored concurrency control mechanism. which examines the problem of how to It design transactions so that concurrency control can be more efficient without compromising the key requirements on the data. It is believed that transaction design performed with concurrency control problems in mind could produce a structure of applications that be implemented. is significantly less costly for concurrency control to _= 6. 23 =— References [BayerSO; Bayer, R., Heller, H., and Reiser, A. Trans. Database Syst., Parallelism and recovery in database systems. ACM June 1980. 5, 2, [BernsteinSO] Bernstein, PA., Shipman, and Rothnie, J.B. Concurrency control in a system Trans. Database Syst., 5, 1, March 1980. D.VV., for dis- ACM tributed databases (SDD-1). [BernsteinSl] Bernstein, ACM PA Comp. and Goodman, N. Concurrency control in distributed database systems. Surv., 13, 2, June 1981. [Bernstein83j Bernstein, PA. and Goodman ACM rithms. Trans Database N. Multi-version concurrency control Syst.. 8, 4, December - theory and algo- 1983. [Carey86; Carey, M. and Muhanna, W'.A The performance of multi-version concurrency control algorithms. To appear m Transactions on Computer Systems. ACM [Chan82; Chan. A, scheme. et, The implementation al. ACM SIGMOD of an integrated concurrency control and recovery Conference Proceeding?. 1982. [Chan85j Chan, A., Gray. Engr., SE-11, 2, R Implementing distributed read-only transactions. February 1985. IEEE Trans Softw. [Eswaran76] Eswaran, K.P., Gray, J.N., Lorie. R.A. and Traiger. I.L. The notions of consistency and predicate locks in a database systems Comm .ACM. 19, 11, November 1976. [Garcia-Molina82: Garcia-Moiina, H. and Wiederhold. G, ACM Trans Database Syst . 7, 2, Read-only transactions June 1982 in a distributed database. [Gray 7 8] Gray, J.N. Notes on database operating systems Course. R. Bayer. R,M Graham, In Operating Systems - An Advanced E.G. Seegmuller, Eds., Springer- Verlag, 1978. [Harder86; Evaluation of a multiple version scheme for concurrency control. Technical Report, University Kaiserslautern, West Germany, January 1986. Harder. T. and Petn.'. E. [Hsu83] Hsu, M and Madnick, S.E. Hierarchical database decomposition: a technique for database concurrency control. Proceedings of 2nd of Database Systems, March 1983. ACM SIGACT-SIGMOD Symposium on Principles (Hsu83bj Hsu, M. The Hierarchical Database Decomposition approach to Concurrency Control. Thesis, Sloan School, M.I T., Cambridge, MA, Ph.D. 1983. [Hsu86] Hsu, M. and Chan. Database Systems. A Partiuoned two-phase locking. To appear in ACM Transactions on Kedem83 kedem. of and Silberschatz, A, 30. 4, October 1983 Z. ACM. Lo' king |iiotocol?: from exclusive to shared locks. '^^ Wt:i Journal =— -= 24 [Kung79j [Papadimitriou84] Papadimitriou, ACM CH and KanpllaHs Trans. Database Syst 9 " P r n„ ^^^^^^^^^^^ March ^984 -"^-' ^^ multiple versions. [Reed78] [Reed79j Reed, D.P., Implementing atomic actions. Systems Principles, December 1979. In ^ proceedmes 7th cjvmr. ACM Symposium [SilberschatzSO] ^t:tLZ mo""' ' ^°"""'">- '" """'"'^' ^"'^"' • — on Operating Jo-rna, o, [Silberschatz82j [StearnsSlJ [Viemont82] rroceeumgs of 1982. \2kl 051 ^ I-ault Tolerance Computer Systems, June Date Due Lib-26-67 MIT LIBRARIES 3 TDflO DOM TEA QTb