Transactions, Concluded, and
the Future of Data Management
Zachary G. Ives
University of Pennsylvania
CIS 550 – Database & Information Systems
December 4, 2003
Slide content courtesy of Susan Davidson, Raghu Ramakrishnan & Johannes Gehrke
Final Administrivia
 Project demos today and tomorrow
 Final exam handed out at the end of today’s class
 Finals plus project reports due by 1PM, 12/18/2003
 Project reports should be ballpark 10-15 pages
 Remember, quality and clarity of presentation matter!
 Also, email me a brief message detailing:
 Your contributions to the project
 Your group members’ contributions and your assessment of
“group dynamics”
 Turn in at my office, 576 Levine Hall
or to my assistant, Kathy Venit, in 308 Levine Hall
Last Time…
 We were discussing isolation levels
 How to keep transactions from interfering with one
another
 Or at least, how to minimize this
 Recall the strongest version of isolation was
serializability
Theory of Serializability
 A schedule of a set of transactions is a linear ordering of their
actions
 e.g. for the simultaneous deposits example:
R1(X.bal) R2(X.bal) W1(X.bal) W2(X.bal)
 A serial schedule is one in which all the steps of each
transaction occur consecutively
 A serializable schedule is one which is equivalent to some
serial schedule (i.e. given any initial state, the final state is the
same as one produced by some serial schedule)
 The example above is neither serial nor serializable
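A minimal sketch of why the deposits schedule is not serializable. The initial balance (100), the deposit amounts (10 and 20), and the helper name run are assumptions made for illustration; each transaction is assumed to read X.bal into a private copy and write back that copy plus its deposit.

```python
# Hypothetical deposit amounts per transaction; not taken from the slides.
DEPOSIT = {1: 10, 2: 20}

def run(schedule, initial_balance=100):
    """Execute a schedule of (action, txn) steps against a single balance."""
    balance = initial_balance
    local = {}  # each transaction's private copy of X.bal
    for action, txn in schedule:
        if action == "R":
            local[txn] = balance
        else:  # "W": write back the value read earlier plus the deposit
            balance = local[txn] + DEPOSIT[txn]
    return balance

interleaved = [("R", 1), ("R", 2), ("W", 1), ("W", 2)]
serial_12 = [("R", 1), ("W", 1), ("R", 2), ("W", 2)]
serial_21 = [("R", 2), ("W", 2), ("R", 1), ("W", 1)]

print(run(interleaved))                # 120: T1's deposit is lost
print(run(serial_12), run(serial_21))  # 130 130
```

Neither serial order yields 120, so under these assumptions the interleaved schedule matches no serial schedule.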
Questions of Concern
 Given a schedule S, is it serializable?
 How can we "restrict" transactions in progress to
guarantee that only serializable schedules are
produced?
Conflicting Actions
 Consider a schedule S in which there are two consecutive
actions Ii and Ij of transactions Ti and Tj respectively
 If Ii and Ij refer to different data items, then swapping Ii and Ij
does not matter
 If Ii and Ij refer to the same data item Q, then swapping Ii and
Ij matters if and only if one of the actions is a write
 Ri(Q) Wj(Q) produces a different final value for Q than Wj(Q) Ri(Q)
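The conflict rule above can be sketched as a small predicate. The Action type and the function name conflicts are hypothetical names chosen for this example.

```python
from typing import NamedTuple

class Action(NamedTuple):
    kind: str  # "R" or "W"
    txn: int   # the i in Ti
    item: str  # data item, e.g. "Q"

def conflicts(a: Action, b: Action) -> bool:
    """True if swapping adjacent actions a and b could change the outcome."""
    return (
        a.txn != b.txn                # actions of two different transactions
        and a.item == b.item          # on the same data item
        and "W" in (a.kind, b.kind)   # and at least one of them is a write
    )

print(conflicts(Action("R", 1, "Q"), Action("W", 2, "Q")))  # True
print(conflicts(Action("R", 1, "Q"), Action("R", 2, "Q")))  # False
print(conflicts(Action("W", 1, "Q"), Action("W", 2, "P")))  # False
```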
Testing for Serializability
 Given a schedule S, we can construct a di-graph
G=(V,E) called a precedence graph
 V : all transactions in S
 E : Ti → Tj whenever an action of Ti precedes and conflicts
with an action of Tj in S
 Theorem:
A schedule S is conflict serializable if and only if its
precedence graph contains no cycles
 Note that testing for a cycle in a digraph can be
done in time O(|V|^2)
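A minimal sketch of the test, assuming a schedule is given as a list of (kind, transaction, item) triples. The function names are hypothetical, and the cycle check is an ordinary depth-first search rather than any particular method from the course.

```python
from collections import defaultdict

def precedence_graph(schedule):
    """schedule: list of (kind, txn, item) with kind in {"R", "W"}.
    Returns {Ti: set of Tj}, with an edge Ti -> Tj for each conflict."""
    edges = defaultdict(set)
    for i, (k1, t1, x1) in enumerate(schedule):
        edges[t1]  # touch the key so every transaction becomes a vertex
        for k2, t2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "W" in (k1, k2):
                edges[t1].add(t2)
    return dict(edges)

def has_cycle(graph):
    """Depth-first search; reaching a vertex on the current path means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def visit(v):
        color[v] = GREY
        for w in graph[v]:
            if color[w] == GREY or (color[w] == WHITE and visit(w)):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in graph)

def conflict_serializable(schedule):
    return not has_cycle(precedence_graph(schedule))

# The simultaneous-deposits schedule: R1(X) R2(X) W1(X) W2(X)
deposits = [("R", 1, "X"), ("R", 2, "X"), ("W", 1, "X"), ("W", 2, "X")]
print(precedence_graph(deposits))       # {1: {2}, 2: {1}}
print(conflict_serializable(deposits))  # False
```

Depth-first search visits each vertex and edge once, which stays within the O(|V|^2) bound because a digraph has at most |V|^2 edges.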
An Example
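[Figure: a schedule over transactions T1, T2, T3 containing the actions R(X,Y,Z), R(X), W(X), R(Y), W(Y), R(Y), R(X), W(Z), shown beside its precedence graph on T1, T2, T3.]
Cyclic: Not serializable.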
Another Example
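[Figure: a schedule over transactions T1, T2, T3 containing the actions R(X), W(X), R(X), W(X), R(Y), W(Y), R(Y), W(Y), shown beside its precedence graph on T1, T2, T3.]
Acyclic: serializable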
Producing the Equivalent Serial
Schedule
 If the precedence graph for a schedule is acyclic, then
an equivalent serial schedule can be found by a
topological sort of the graph
 For the second example, the equivalent serial schedule
is:
 R1(Y)W1(Y) R2(X)W2(X) R2(Y)W2(Y) R3(X)W3(X)
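A sketch of the topological-sort step using Python's standard graphlib module. The precedence edges T1 → T2 (on Y) and T2 → T3 (on X) are an assumption inferred from the serial schedule above, since the original figure is not reproduced here.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Assumed precedence edges: T1 -> T2 (on Y) and T2 -> T3 (on X).
# TopologicalSorter expects {node: set of predecessors}.
predecessors = {"T1": set(), "T2": {"T1"}, "T3": {"T2"}}
order = list(TopologicalSorter(predecessors).static_order())
print(order)  # ['T1', 'T2', 'T3']

# Each transaction's actions, in its own program order.
actions = {
    "T1": ["R1(Y)", "W1(Y)"],
    "T2": ["R2(X)", "W2(X)", "R2(Y)", "W2(Y)"],
    "T3": ["R3(X)", "W3(X)"],
}
serial = [a for t in order for a in actions[t]]
print(" ".join(serial))
# R1(Y) W1(Y) R2(X) W2(X) R2(Y) W2(Y) R3(X) W3(X)
```

When the graph admits several topological orders, each one gives a valid equivalent serial schedule.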
Locking and Serializability
 We said that, to obtain a serializable schedule, a transaction
must hold all of its locks until it terminates (a condition
called strict locking)
 It turns out that this is crucial to guarantee
serializability
 Note that the first (bad) example could have been
produced if transactions acquired and immediately
released locks.
Well-Formed, Two-Phased
Transactions
 A transaction is well-formed if it acquires at least
a shared lock on Q before reading Q or an
exclusive lock on Q before writing Q and doesn’t
release the lock until the action is performed
 Locks are also released by the end of the transaction
 A transaction is two-phased if it never acquires a
lock after unlocking one
 i.e., there are two phases: a growing phase in which the
transaction acquires locks, and a shrinking phase in
which locks are released
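The two-phase condition can be checked mechanically. In this sketch the representation of a transaction as a list of ("lock" or "unlock", item) pairs and the name is_two_phased are assumptions made for illustration.

```python
def is_two_phased(ops):
    """ops: a transaction's lock operations in order,
    e.g. [("lock", "A"), ("unlock", "A")].
    Two-phased means no "lock" appears after the first "unlock"."""
    shrinking = False
    for op, _item in ops:
        if op == "unlock":
            shrinking = True       # the shrinking phase has begun
        elif op == "lock" and shrinking:
            return False           # acquiring after releasing breaks the rule
    return True

print(is_two_phased([("lock", "A"), ("lock", "B"), ("unlock", "A"), ("unlock", "B")]))  # True
print(is_two_phased([("lock", "A"), ("unlock", "A"), ("lock", "B"), ("unlock", "B")]))  # False
```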
Two-Phased Locking Theorem
 If all transactions are well-formed and two-phase,
then any schedule in which conflicting locks are
never granted is serializable
 i.e., there is a very simple scheduler!
 However, if some transaction is not well-formed or
two-phase, then there is some schedule in which
conflicting locks are never granted but which fails
to be serializable
 i.e., one bad apple spoils the bunch.
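A toy sketch of the "very simple scheduler": a lock table that never grants conflicting locks. The class and method names are hypothetical, and the sketch omits queuing and deadlock handling; a refused request simply returns False.

```python
class LockTable:
    """Toy lock table: grants a request only if it conflicts with no held lock."""

    def __init__(self):
        self.held = {}  # item -> {txn: mode}; mode is "S" (shared) or "X" (exclusive)

    def request(self, txn, item, mode):
        holders = self.held.setdefault(item, {})
        others = {t: m for t, m in holders.items() if t != txn}
        # Two locks on the same item conflict unless both are shared.
        if any(mode == "X" or m == "X" for m in others.values()):
            return False  # caller must wait or abort; this sketch does not queue
        # Keep the stronger mode if the transaction already holds a lock.
        if holders.get(txn) != "X":
            holders[txn] = mode
        return True

    def release_all(self, txn):
        for holders in self.held.values():
            holders.pop(txn, None)

locks = LockTable()
print(locks.request("T1", "Q", "S"))  # True
print(locks.request("T2", "Q", "S"))  # True: shared locks are compatible
print(locks.request("T2", "Q", "X"))  # False: T1 still holds a shared lock
locks.release_all("T1")
print(locks.request("T2", "Q", "X"))  # True
```

Releasing everything at once in release_all corresponds to holding all locks until the transaction ends, which is trivially two-phased.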
Summary of Transactions
 Transactions are all-or-nothing units of work
guaranteed despite concurrency or failures in the
system
 Theoretically, the “correct” execution of transactions
is serializable (i.e. equivalent to some serial
execution)
 Practically, this may adversely affect throughput →
isolation levels
 With isolation levels, users can specify the level of
“incorrectness” they are willing to tolerate
What to Look for Down the Road
 … well, no one really knows the answer to this…
 … But here are some hints, ideas, and hot directions
 Sensors and streaming data
 Peer-to-peer meets databases
 “The Semantic Web”
 Collaborative data sharing
Sensors and Streaming Data
 No databases at all…
 … Instead we have
networks of simple sensors
 Madden, starting at MIT
 Gehrke, Cornell
 Widom, Stanford
 queries are in SQL
 data is live and “streaming”
 we compute aggregates over
“windows”
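A minimal sketch of an aggregate over a sliding window, written in Python as a stand-in for what a windowed streaming query might compute. The function name, the window size, and the temperature readings are all hypothetical.

```python
from collections import deque

def windowed_average(readings, window_size):
    """Yield the average of the most recent `window_size` readings after each arrival."""
    window = deque(maxlen=window_size)  # the oldest reading drops out automatically
    for value in readings:
        window.append(value)
        yield sum(window) / len(window)

# Hypothetical temperature readings arriving one at a time.
stream = iter([20.0, 22.0, 24.0, 26.0])
print(list(windowed_average(stream, window_size=2)))  # [20.0, 21.0, 23.0, 25.0]
```

Because the generator consumes readings one at a time, it never needs the whole stream in memory, only the current window.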
What’s Interesting Here
 We’re not talking about data on disk – we’re talking about
queries over “current readings”
 Sensors are generally “stupid” and may be battery-operated
 A lot of challenges are networking-related: how to aggregate data
before it gets sent, etc.
 The next step (e.g., work initiated here @ Penn): including
sensors that capture images – a very different problem!
 This has many more compelling applications – security, monitoring,
correlating multiple sensors, rescue operations, military logistics and
coordination, etc.
Peer-to-Peer Computing
 Fundamentally, our model of DBMSs tends to be centralized
 Even for data integration: there’s a single mediator
 This has many implications: central administration, central
coordination, etc.
 What can be gained from borrowing a page from peer-to-peer systems like Napster, Kazaa, etc.?
 A better architecture?
 Solutions to many problems unsolved by distributed DBMSs?
 Replication, object location, distributed optimization, resiliency to failure,
…
 New types of applications, e.g., in integration?
P2P Work
 As a new architecture for storage and querying
 PIER (Berkeley), P-Grid (EPFL), Medusa (MIT)
 A better way of thinking about translating and
exchanging data
 Piazza (Washington), Orchestra (Penn), Hyperion
(Toronto), work at Trento
The Semantic Web
 In some ways, a very “pie-in-the-sky” vision
 But some real and concrete problems might be partly solvable
 Goal is really very similar to data integration, where somehow we
have mappings between the schemas
 Currently, most people in the SW community are from the
knowledge representation community and use RDF
 Focus: very rich ways of describing schemas – “ontologies” – that
blend querying with class definitions
 “Teachers are people who teach students”
“Tenure-track professors are teachers at universities who can get tenure”;
etc.
 Implicit take on the problem: if we create better languages for
describing ontologies, it’s easier to mediate between schemas
Holes in the Semantic Web
 What issues and concerns came up in the data integration
assignment you had?
 Do you think a richer schema language would help for these?
 Do you think “better normalization” would help?
 Fundamentally, we need:
 Languages for describing not only relationships but also transformations
between formats (e.g., XML schemas)
 Automatic or partly automated ways of discovering mappings and
correspondences
 These are all database problems, and the solution likely must come
from the DB community
 This is part of what P2P systems like Piazza and Hyperion try to address
My Take on the Future
 We’ve evolved from a world where data management is about
controlling the data
 Instead, data management is about translating and
transforming data using declarative languages
 It should ultimately become much like TCP or SOAP – a set of
standard services for “getting stuff” from one point to another, or
from one form to another
 It’s the plumbing that connects different applications using different
formats
 Orchestra project at Penn: focuses on how to build a
system for supporting collaborative science
 People publish and map data in different schemas
 What happens if people start updating it?
 How do you propagate, manage, trace, reconcile changes?