Transactions, Concluded, and the Future of Data Management

Zachary G. Ives
University of Pennsylvania
CIS 550 – Database & Information Systems
December 4, 2003

Slide content courtesy of Susan Davidson, Raghu Ramakrishnan & Johannes Gehrke

Final Administrivia
- Project demos today and tomorrow
- Final exam handed out at the end of today’s class
- Finals plus project reports due by 1PM, 12/18/2003
  - Project reports should be ballpark 10-15 pages
  - Remember, quality and clarity of presentation matter!
  - Also, email me a brief message detailing:
    - Your contributions to the project
    - Your group members’ contributions and your assessment of “group dynamics”
- Turn in at my office, 576 Levine Hall, or to my assistant, Kathy Venit, in 308 Levine Hall

Last Time…
- We were discussing isolation levels
  - How to keep transactions from interfering with one another
  - Or at least, how to minimize this
- Recall the strongest version of isolation was serializability

Theory of Serializability
- A schedule of a set of transactions is a linear ordering of their actions
  - e.g., for the simultaneous deposits example: R1(X.bal) R2(X.bal) W1(X.bal) W2(X.bal)
- A serial schedule is one in which all the steps of each transaction occur consecutively
- A serializable schedule is one which is equivalent to some serial schedule (i.e., given any initial state, the final state is the same as one produced by some serial schedule)
- The example above is neither serial nor serializable

Questions of Concern
- Given a schedule S, is it serializable?
- How can we "restrict" transactions in progress to guarantee that only serializable schedules are produced?
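To make the claim about the simultaneous deposits schedule concrete, the sketch below replays R1(X.bal) R2(X.bal) W1(X.bal) W2(X.bal) and both serial orders on a single balance. The starting balance (100), the deposit amounts (50 and 30), and the function name replay are assumptions chosen for illustration; they do not come from the slides. One initial state on which the interleaved result matches neither serial result is enough to show the schedule is not serializable.

```python
def replay(schedule, balance, deposits):
    """Replay (txn, op) steps against one account balance.

    'R' copies the current balance into the transaction's local variable;
    'W' writes back that local copy plus the transaction's deposit.
    """
    local = {}
    for txn, op in schedule:
        if op == "R":
            local[txn] = balance
        elif op == "W":
            balance = local[txn] + deposits[txn]
    return balance


# Hypothetical amounts: T1 deposits 50, T2 deposits 30, starting balance 100.
deposits = {1: 50, 2: 30}
interleaved = [(1, "R"), (2, "R"), (1, "W"), (2, "W")]
serial_t1_t2 = [(1, "R"), (1, "W"), (2, "R"), (2, "W")]
serial_t2_t1 = [(2, "R"), (2, "W"), (1, "R"), (1, "W")]

final_interleaved = replay(interleaved, 100, deposits)
serial_finals = {replay(s, 100, deposits) for s in (serial_t1_t2, serial_t2_t1)}

# T1's deposit is overwritten by T2's write, so the interleaved result
# differs from every serial result.
print(final_interleaved, serial_finals)
print(final_interleaved in serial_finals)
```

In the interleaved run, T2 writes a value computed from the balance it read before T1's write, so T1's deposit is lost.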
Conflicting Actions
- Consider a schedule S in which there are two consecutive actions Ii and Ij of transactions Ti and Tj respectively
- If Ii and Ij refer to different data items, then swapping Ii and Ij does not matter
- If Ii and Ij refer to the same data item Q, then swapping Ii and Ij matters if and only if at least one of the actions is a write
  - Ri(Q) Wj(Q) lets Ti read a different value of Q than Wj(Q) Ri(Q)
  - Wi(Q) Wj(Q) leaves a different final value for Q than Wj(Q) Wi(Q)

Testing for Serializability
- Given a schedule S, we can construct a digraph G = (V, E) called a precedence graph
  - V: all transactions in S
  - E: Ti → Tj whenever an action of Ti precedes and conflicts with an action of Tj in S
- Theorem: A schedule S is conflict serializable if and only if its precedence graph contains no cycles
- Note that testing for a cycle in a digraph can be done in time O(|V|^2)

An Example
[Figure: schedule of T1, T2, T3 and its precedence graph; operations shown: R(X,Y,Z), R(X), W(X), R(Y), W(Y), R(Y), R(X), W(Z)]
- Cyclic: not serializable

Another Example
[Figure: schedule of T1, T2, T3 and its precedence graph; operations shown: R(X) W(X), R(X) W(X), R(Y) W(Y), R(Y) W(Y)]
- Acyclic: serializable

Producing the Equivalent Serial Schedule
- If the precedence graph for a schedule is acyclic, then an equivalent serial schedule can be found by a topological sort of the graph
- For the second example, the equivalent serial schedule is: R1(Y)W1(Y) R2(X)W2(X) R2(Y)W2(Y) R3(X)W3(X)

Locking and Serializability
- We said that for a serializable schedule, a transaction must hold all locks until it terminates (a condition called strict locking)
- It turns out that this is crucial to guarantee serializability
- Note that the first (bad) example could have been produced if transactions acquired and immediately released locks
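The precedence-graph test and the topological sort can be sketched in a few lines. This is a minimal illustration, not a production scheduler: the function names precedence_graph and serial_order and the (transaction, operation, item) encoding are assumptions. The interleaving used for the second example is also an assumption, chosen only to be consistent with the serial schedule R1(Y)W1(Y) R2(X)W2(X) R2(Y)W2(Y) R3(X)W3(X) given on the slide.

```python
from collections import defaultdict


def precedence_graph(schedule):
    """Return the set of edges Ti -> Tj of the precedence graph.

    schedule is a list of (txn, op, item) with op in {'R', 'W'}.
    Two actions conflict if they come from different transactions,
    touch the same item, and at least one of them is a write.
    """
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            if ti != tj and item_i == item_j and "W" in (op_i, op_j):
                edges.add((ti, tj))
    return edges


def serial_order(schedule):
    """Topologically sort the transactions; return None if the graph is cyclic."""
    txns = sorted({t for t, _, _ in schedule})
    indegree = {t: 0 for t in txns}
    successors = defaultdict(list)
    for a, b in precedence_graph(schedule):
        successors[a].append(b)
        indegree[b] += 1
    ready = [t for t in txns if indegree[t] == 0]
    order = []
    while ready:
        t = ready.pop(0)
        order.append(t)
        for nxt in sorted(successors[t]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # If a cycle exists, some transactions never reach indegree 0.
    return order if len(order) == len(txns) else None


# Simultaneous deposits: R1(X) R2(X) W1(X) W2(X)
deposits = [(1, "R", "X"), (2, "R", "X"), (1, "W", "X"), (2, "W", "X")]

# Assumed interleaving for the acyclic example (not recoverable from the slide).
acyclic = [
    (2, "R", "X"), (2, "W", "X"),
    (3, "R", "X"), (3, "W", "X"),
    (1, "R", "Y"), (1, "W", "Y"),
    (2, "R", "Y"), (2, "W", "Y"),
]

print(serial_order(deposits))  # cyclic graph: T1 -> T2 and T2 -> T1
print(serial_order(acyclic))   # edges T2 -> T3 and T1 -> T2
```

The pairwise scan over actions is quadratic in the schedule length, which is fine for a classroom-sized schedule; the cycle check itself is the topological sort, which fails to place every transaction exactly when the graph has a cycle.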
Well-Formed, Two-Phased Transactions
- A transaction is well-formed if it acquires at least a shared lock on Q before reading Q or an exclusive lock on Q before writing Q, and doesn’t release the lock until the action is performed
  - Locks are also released by the end of the transaction
- A transaction is two-phased if it never acquires a lock after unlocking one
  - i.e., there are two phases: a growing phase in which the transaction acquires locks, and a shrinking phase in which locks are released

Two-Phased Locking Theorem
- If all transactions are well-formed and two-phased, then any schedule in which conflicting locks are never granted at the same time is serializable
  - i.e., there is a very simple scheduler!
- However, if some transaction is not well-formed or two-phased, then there is some schedule in which conflicting locks are never granted at the same time but which fails to be serializable
  - i.e., one bad apple spoils the bunch

Summary of Transactions
- Transactions are all-or-nothing units of work guaranteed despite concurrency or failures in the system
- Theoretically, the “correct” execution of transactions is serializable (i.e., equivalent to some serial execution)
- Practically, this may adversely affect throughput, hence isolation levels
- With isolation levels, users can specify the level of “incorrectness” they are willing to tolerate

What to Look for Down the Road
- … well, no one really knows the answer to this…
- … But here are some hints, ideas, and hot directions
  - Sensors and streaming data
  - Peer-to-peer meets databases
  - “The Semantic Web”
  - Collaborative data sharing

Sensors and Streaming Data
- No databases at all…
- … Instead we have networks of simple sensors
  - Madden, starting at MIT
  - Gehrke, Cornell
  - Widom, Stanford
- Queries are in SQL
- Data is live and “streaming”
- We compute aggregates over “windows”

What’s Interesting Here
- We’re not talking about data on disk – we’re talking about queries over “current readings”
- Sensors are generally “stupid” and may be battery-operated
- A lot of challenges are networking-related: how to aggregate data before it gets sent, etc.
- The next step (e.g., work initiated here @ Penn): including sensors that capture images – a very different problem!
- This has many more compelling applications – security, monitoring, correlating multiple sensors, rescue operations, military logistics and coordination, etc.

Peer-to-Peer Computing
- Fundamentally, our model of DBMSs tends to be centralized
  - Even for data integration: there’s a single mediator
  - This has many implications: central administration, central coordination, etc.
- What can be gained from borrowing a page from peer-to-peer systems like Napster, Kazaa, etc.?
  - A better architecture?
  - Solutions to many problems unsolved by distributed DBMSs?
    - Replication, object location, distributed optimization, resiliency to failure, …
  - New types of applications, e.g., in integration?
P2P Work
- As a new architecture for storage and querying
  - PIER (Berkeley), P-Grid (EPFL), Medusa (MIT)
- A better way of thinking about translating and exchanging data
  - Piazza (Washington), Orchestra (Penn), Hyperion (Toronto), work at Trento

The Semantic Web
- In some ways, a very “pie-in-the-sky” vision
  - But some real and concrete problems might be partly solvable
- Goal is really very similar to data integration, where somehow we have mappings between the schemas
- Currently, most people in the SW community are from the knowledge representation community and use RDF
- Focus: very rich ways of describing schemas – “ontologies” – that blend querying with class definitions
  - “Teachers are people who teach students”
  - “Tenure-track professors are teachers at universities who can get tenure”; etc.
- Implicit take on the problem: if we create better languages for describing ontologies, it’s easier to mediate between schemas

Holes in the Semantic Web
- What issues and concerns came up in the data integration assignment you had?
  - Do you think a richer schema language would help for these?
  - Do you think “better normalization” would help?
- Fundamentally, we need:
  - Languages not only for describing relationships, but also for transformations between formats (e.g., XML schemas)
  - Automatic or partly automated ways of discovering mappings and correspondences
- These are all database problems, and the solution likely must come from the DB community
  - This is part of what P2P systems like Piazza and Hyperion try to address

My Take on the Future
- We’ve evolved away from a world where data management was about controlling the data
- Instead, data management is about translating and transforming data using declarative languages
  - It should ultimately become much like TCP or SOAP – a set of standard services for “getting stuff” from one point to another, or from one form to another
  - It’s the plumbing that connects different applications using different formats
- Orchestra project at Penn: focuses on how to build a system for supporting collaborative science
  - People publish and map data in different schemas
  - What happens if people start updating it?
  - How do you propagate, manage, trace, reconcile changes?