slides - ProvenanceWeek

A Generic Provenance Middleware
for Database Queries, Updates, and
Transactions
Bahareh Sadat Arab (1), Dieter Gawlick (2), Venkatesh Radhakrishnan (2), Hao Guo (1), Boris Glavic (1)
(1) IIT DBGroup
(2) Oracle
Outline
❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions
GProM - Provenance for Queries, Updates, and Transactions
Introduction
• Data Provenance
– Information about the origin and creation process of data
• Provenance tracking for database operations
– Considerable interest from database community in last decade
• The de-facto standard for database provenance [1,2,3,4,5]
– model provenance as annotations on data (e.g., tuples)
– compute the provenance by propagating annotations (query rewrite)
SELECT
DISTINCT Owner
FROM CanAcc;
[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of
Elegance in the Theory and Practice of Computation, Springer, 2013.
[2] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS,
2013.
[3] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB
Journal, 14(4):373–396, 2005.
[4] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty,
and Lineage. In VLDB, pages 1151–1154, 2006.
[5] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 41(3):5–14, 2012.
Use Cases
• Debugging data and transformations (queries)[1]
• Probabilistic databases (queries)[5]
• Auditing and compliance (transactions and update
statements)[6]
• Understanding data integration transformations (queries
and transactions)
• Assessing data quality and trust (queries and
transactions)[7]
-> Computing provenance for updates and transactions is essential for many use cases.
[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance
Information. In Search of Elegance in the Theory and Practice of Computation, Springer, 2013.
[5] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System
for Data, Uncertainty, and Lineage. In VLDB, 2006.
[6] D. Gawlick and V. Radhakrishnan. Fine grain provenance using temporal databases. In TaPP, 2011.
[7] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 2012.
Shortcomings of State-of-the-Art
• No practical implementation for updates
• No system or model supports transactions
• Inflexible provenance storage
– Always on [2,3]
– On-demand only [1]
• Query rewrites use atypical access patterns and operator sequences
– Leads to poor execution plans
• Most systems: only one type of provenance
[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of
Elegance in the Theory and Practice of Computation, Springer, 2013.
[2] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB
Journal, 2005.
[3] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS,
2013.
Objectives
1. Vision: Generic Provenance Database
Middleware (GProM).
– Provenance for
• Queries, updates, and transactions
– User decides when to compute and store
provenance
– Supports multiple provenance models
– Database-independent
2. Tracking provenance of concurrent
transactions
– Reenactment Queries
Contributions
1. First solution for provenance of transactions
2. Retroactive on-demand provenance
computation
– Using read-only reenactment
3. Only requires audit log + time travel
– Supported by most DBMS
– No additional storage and runtime overhead
4. Non-invasive provenance computation
– query rewrite + annotation propagation
Outline
❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions
System Architecture
• Database independent middleware
– Pluggable parser and SQL code generator
• Internal query representation
– Relational Algebra Graph Model (AGM)
• Core driver: Query rewrites
– Provenance Computation
– Flexible storage policies for provenance
– Provenance import/export
– AGM Optimizer (rewritten queries)
– Extensibility: Rewrite Specification Language (RSL)
• Initial prototype built on top of Oracle
GProM Overview
Provenance Computation
• Query rewrite
– Take original query q and rewrite it into q+, which computes the original results + provenance
– Propagate provenance through operations
[Figure: Q evaluated over DB returns Result; Q+ evaluated over DB returns Result + Provenance]
Example Rewrite
• Input:
SELECT DISTINCT u.Owner FROM USacc u, CanAcc c WHERE u.ID = c.ID;
• Rewrite Parts:
USacc
-> SELECT ID, Owner, Balance, Type,
   ID AS P1, Owner AS P2, Balance AS P3, Type AS P4
   FROM USacc
CanAcc
-> SELECT ID, Owner, Balance, Type,
   ID AS P5, Owner AS P6, Balance AS P7, Type AS P8
   FROM CanAcc
WHERE u.ID = c.ID
-> WHERE u.ID = c.ID
SELECT DISTINCT Owner
-> SELECT Owner, P1, P2, P3, P4, P5, P6, P7, P8
• Output:
SELECT u.Owner, P1, P2, P3, P4, P5, P6, P7, P8
FROM
(SELECT ID, Owner, Balance, Type,
ID AS P1, Owner AS P2, Balance AS P3, Type AS P4
FROM USacc) u,
(SELECT ID, Owner, Balance, Type,
ID AS P5, Owner AS P6, Balance AS P7, Type AS P8
FROM CanAcc) c
WHERE u.ID = c.ID;
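To make the effect of the rewrite concrete, the following minimal sketch runs the input query and the rewritten output query side by side in an in-memory SQLite database. The sample rows are made up for illustration and are not from the slides; SQLite merely stands in for the backend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative sample data (an assumption, not taken from the slides).
conn.executescript("""
    CREATE TABLE USacc  (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    CREATE TABLE CanAcc (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    INSERT INTO USacc  VALUES (1, 'Alice Bright', 1500000, 'Premium'),
                              (2, 'Mark Smith', 50, 'Standard');
    INSERT INTO CanAcc VALUES (1, 'Alice Bright', 20000, 'Standard'),
                              (1, 'Alice Bright', 300, 'Savings');
""")

original = "SELECT DISTINCT u.Owner FROM USacc u, CanAcc c WHERE u.ID = c.ID"

# Rewritten query: duplicate elimination is dropped and every input
# attribute is carried along as a provenance column P1..P8.
rewritten = """
    SELECT u.Owner, P1, P2, P3, P4, P5, P6, P7, P8
    FROM (SELECT ID, Owner, Balance, Type,
                 ID AS P1, Owner AS P2, Balance AS P3, Type AS P4
          FROM USacc) u,
         (SELECT ID, Owner, Balance, Type,
                 ID AS P5, Owner AS P6, Balance AS P7, Type AS P8
          FROM CanAcc) c
    WHERE u.ID = c.ID
"""

original_rows = conn.execute(original).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()

print(original_rows)
for row in rewritten_rows:
    print(row)
```

With this data the original query returns one owner, while the rewritten query returns one row per combination of input tuples that contributed to that owner.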
Provenance Computation
• Operates on relational algebra representation of queries
– Fixed set of rewrite rules per provenance type:
• One per type of algebra operator
• Recursive top-down rewrite
– For each relation access: duplicate attributes as provenance
– For each operator: replace with algebra graph that propagates
provenance annotations
• Composable
[Figure: algebra graphs over USacc and CanAcc]
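The recursive, one-rule-per-operator rewrite described on this slide can be sketched as follows. The class names (`Table`, `Selection`, `Projection`, `Join`) and the `rewrite` function are hypothetical and only illustrate the idea; they are not GProM's internal representation, and attribute-name clashes across join inputs are ignored.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class Table:
    name: str
    attrs: List[str]

@dataclass
class Selection:
    pred: str
    child: "Op"

@dataclass
class Projection:
    attrs: List[str]
    child: "Op"

@dataclass
class Join:
    pred: str
    left: "Op"
    right: "Op"

Op = Union[Table, Selection, Projection, Join]

def rewrite(op: Op) -> Tuple[Op, List[str]]:
    """Return the rewritten operator and the provenance attributes it outputs."""
    if isinstance(op, Table):
        # Relation access: duplicate every attribute as a provenance attribute.
        prov = [f"prov_{op.name}_{a}" for a in op.attrs]
        cols = op.attrs + [f"{a} AS {p}" for a, p in zip(op.attrs, prov)]
        return Projection(cols, op), prov
    if isinstance(op, Selection):
        child, prov = rewrite(op.child)
        return Selection(op.pred, child), prov
    if isinstance(op, Projection):
        # Keep the original output and pass the provenance attributes through.
        child, prov = rewrite(op.child)
        return Projection(op.attrs + prov, child), prov
    if isinstance(op, Join):
        left, lprov = rewrite(op.left)
        right, rprov = rewrite(op.right)
        return Join(op.pred, left, right), lprov + rprov
    raise TypeError(f"unsupported operator: {op!r}")

cols = ["ID", "Owner", "Balance", "Type"]
q = Projection(["Owner"],
               Join("u.ID = c.ID", Table("USacc", cols), Table("CanAcc", cols)))
q_plus, prov = rewrite(q)
print(q_plus.attrs)
```

Because each rule only needs the rewritten children and their provenance attribute lists, the rules compose for arbitrary operator trees.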
Supporting Past Queries, Updates,
and Transactions
• Only needs audit log and time travel
– supported by most DBMS
• Sufficient for provenance of past queries [4]
• Our contribution
– Sufficient for updates and transactions
[4] J. Zhang and H. Jagadish. Lost source provenance. In EDBT, 2010.
Provenance Generation and
Storage Policies
• GProM default
– Only compute provenance if explicitly requested
• User can register storage policies
– When to store which type of provenance
POLICY storeOnR {
FIRE ON Query, Insert q
WHEN Root(q) +=> Table(R)
COMPUTE PI-CS
STORE AS NEW TABLE
NAMING SCHEME Hash
}
Optimizing Rewritten Queries
• Query rewrites use atypical access patterns and operator sequences
– Leads to poor execution plans
• Optimization for rewritten queries
– Heuristic
– Cost-based
SELECT ID, Owner, Balance, 'Premium' AS Type,
prov_CanAcc_ID,
prov_CanAcc_Owner,
prov_CanAcc_Balance,
prov_CanAcc_Type,
prov_USacc_ID,
prov_USacc_Owner,
prov_USacc_Balance,
prov_USacc_Type
FROM u1
WHERE Balance > 1000000
UNION ALL
SELECT * FROM u1
WHERE (Balance > 1000000) IS NOT TRUE
Same update expressed with CASE instead of UNION ALL:
SELECT ID, Owner, Balance,
CASE
WHEN Balance > 1000000
THEN 'Premium'
ELSE Type
END AS Type,
prov_CanAcc_ID,
prov_CanAcc_Owner,
prov_CanAcc_Balance,
prov_CanAcc_Type,
prov_USacc_ID,
prov_USacc_Owner,
prov_USacc_Balance,
prov_USacc_Type
FROM u1
...
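As a sanity check on the two query shapes shown on this slide, the sketch below compares a UNION ALL form with a CASE form over a small table in SQLite. The table contents are invented for illustration, the provenance columns are omitted for brevity, and `IS NOT TRUE` requires a reasonably recent SQLite (3.23 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative rows; the NULL balance exercises the IS NOT TRUE branch.
conn.executescript("""
    CREATE TABLE u1 (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    INSERT INTO u1 VALUES (3, 'Alice Bright', 1500000, 'Standard'),
                          (5, 'Mark Smith', 50, 'Standard'),
                          (7, 'Unknown Balance', NULL, 'Standard');
""")

union_form = """
    SELECT ID, Owner, Balance, 'Premium' AS Type
    FROM u1 WHERE Balance > 1000000
    UNION ALL
    SELECT ID, Owner, Balance, Type
    FROM u1 WHERE (Balance > 1000000) IS NOT TRUE
"""

case_form = """
    SELECT ID, Owner, Balance,
           CASE WHEN Balance > 1000000 THEN 'Premium' ELSE Type END AS Type
    FROM u1
"""

# IDs are distinct, so sorting never has to compare a NULL balance.
union_rows = sorted(conn.execute(union_form).fetchall())
case_rows = sorted(conn.execute(case_form).fetchall())
print(union_rows == case_rows)
```

The CASE form scans `u1` once, whereas the UNION ALL form references it twice; that difference is what makes such a transformation attractive when `u1` is itself a large subquery.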
Rewrite Extensibility
• Extensible using Rewrite Specification Language (RSL)
– Concise specification of rewrite rules
[Figure: RSL workflow involving the User, RSL Manager, RSL Interpreter, Provenance Rewriter, and Policy components]
RULE mergeSelections {
FOR q => c => g
WHERE q->type = selection AND c->type = selection
REWRITE INTO
selection [pred = q->pred AND c->pred] => g
}
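To illustrate what a rule like mergeSelections appears to express, here is a minimal Python sketch that merges adjacent selection operators into one selection with a conjunctive predicate. The `Table` and `Selection` classes and the `merge_selections` function are hypothetical; this is not an RSL interpreter and makes no claim about RSL's actual semantics.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Table:
    name: str

@dataclass
class Selection:
    pred: str
    child: "Node"

Node = Union[Table, Selection]

def merge_selections(node: Node) -> Node:
    """Collapse chains of selections into a single selection (bottom-up)."""
    if isinstance(node, Selection):
        child = merge_selections(node.child)
        if isinstance(child, Selection):
            # selection(q) over selection(c) over g becomes
            # selection(q AND c) over g
            return Selection(f"({node.pred}) AND ({child.pred})", child.child)
        return Selection(node.pred, child)
    return node

plan = Selection("Balance > 1000000",
                 Selection("Type = 'Standard'", Table("USacc")))
merged = merge_selections(plan)
print(merged)
```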
Outline
❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions
Provenance of Transactions
INSERT INTO USacc
(SELECT ID,
Owner,
Balance,
'Standard' AS Type
FROM CanAcc
WHERE Type = 'US_dollar');
UPDATE USacc
SET Type = 'Premium'
WHERE Balance > 1000000;
COMMIT;
Provenance of Transactions
u1:
INSERT INTO USacc
(SELECT ID,
Owner,
Balance,
'Standard' AS Type
FROM CanAcc
WHERE Type = 'US_dollar');
u2:
UPDATE USacc
SET Type = 'Premium'
WHERE Balance > 1000000;
Provenance of Transactions
• Our Approach:
Reenactment + Provenance Propagation
• Currently supports
– Snapshot Isolation
– Statement-level Snapshot Isolation
1. Gather Transaction Information
2. Construct Update Reenactment Query
3. Construct Transaction Reenactment Query
4. Rewrite For Provenance Computation
5. Execute Query
1. Gather Transaction Information
• Retrieve SQL statements of transaction from audit log
• Update u1:
INSERT INTO USacc
(SELECT ID,
Owner,
Balance,
'Standard' AS Type
FROM CanAcc
WHERE Type = 'US_dollar');
• Update u2:
UPDATE USacc
SET Type = 'Premium'
WHERE Balance > 1000000;
2. Translate Updates: Reenactment
• Update reads table version and outputs updated table version
• Multiple versions of the database
– Each modification of a tuple t causes a new version to be created
– Old tuple versions are kept (SI)
– Add version annotation τ to provenance of each updated row
• Use semiring model
UPDATE USacc
SET Type = 'Premium'
WHERE Balance > 1000000;
2. Translate Updates
• Construct update reenactment query
– Simulates effect of update
– Read DB version seen by update using time travel
– Query result = updated table (Annotation-Equivalent)
Update u1:
INSERT INTO USacc
(SELECT ID,
Owner,
Balance,
'Standard' AS Type
FROM CanAcc
WHERE Type = 'US_dollar');
Reenactment query for u1:
SELECT ID, Owner, Balance, 'Standard' AS Type
FROM CanAcc AS OF SCN 3652
WHERE Type = 'US_dollar'
UNION ALL
SELECT * FROM USacc AS OF SCN 3652;
Update u2:
UPDATE USacc
SET Type = 'Premium'
WHERE Balance > 1000000;
Reenactment query for u2:
SELECT ID, Owner, Balance, 'Premium' AS Type
FROM USacc AS OF SCN 3652
WHERE Balance > 1000000
UNION ALL
SELECT *
FROM USacc AS OF SCN 3652
WHERE (Balance > 1000000) IS NOT TRUE;
3. Construct Reenactment Query
• Simulates the whole transaction
– Annotation-Equivalent to original transaction
• Merge reenactment queries based on concurrency control protocol
– Each concurrency control requires a different merge process
– SERIALIZABLE (snapshot isolation) -> sees modifications committed before the transaction started + previous updates of the transaction
– READ COMMITTED (statement-level snapshot isolation) -> sees committed changes by concurrent transactions
WITH u1 AS
(SELECT ID, Owner, Balance, 'Standard' AS Type
FROM CanAcc AS OF SCN 3652
WHERE Type = 'US_dollar'
UNION ALL
SELECT * FROM USacc AS OF SCN 3652)
SELECT ID, Owner, Balance, 'Premium' AS Type
FROM u1
WHERE Balance > 1000000
UNION ALL
SELECT * FROM u1
WHERE (Balance > 1000000) IS NOT TRUE;
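The merged reenactment query can be tried out with a small SQLite sketch. SQLite has no `AS OF SCN` time travel, so the pre-transaction state is simply a freshly built database; the sample rows and the helper `fresh_db` are assumptions made for illustration. The sketch checks that the read-only reenactment query returns the same `USacc` contents that actually running the transaction produces.

```python
import sqlite3

def fresh_db() -> sqlite3.Connection:
    """Build the (illustrative) database state seen at transaction start."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE USacc  (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
        CREATE TABLE CanAcc (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
        INSERT INTO USacc  VALUES (1, 'Existing Owner', 2000000, 'Standard');
        INSERT INTO CanAcc VALUES (3, 'Alice Bright', 1500000, 'US_dollar'),
                                  (5, 'Mark Smith', 50, 'US_dollar'),
                                  (6, 'Other Owner', 900, 'CAN_dollar');
    """)
    return conn

TRANSACTION = """
    INSERT INTO USacc
        SELECT ID, Owner, Balance, 'Standard' AS Type
        FROM CanAcc WHERE Type = 'US_dollar';
    UPDATE USacc SET Type = 'Premium' WHERE Balance > 1000000;
"""

# u1 reenacts the INSERT, u2 reenacts the UPDATE on top of u1's result.
REENACTMENT = """
    WITH u1 AS (
        SELECT ID, Owner, Balance, 'Standard' AS Type
        FROM CanAcc WHERE Type = 'US_dollar'
        UNION ALL
        SELECT * FROM USacc),
    u2 AS (
        SELECT ID, Owner, Balance, 'Premium' AS Type
        FROM u1 WHERE Balance > 1000000
        UNION ALL
        SELECT * FROM u1 WHERE (Balance > 1000000) IS NOT TRUE)
    SELECT * FROM u2
"""

snapshot = fresh_db()
reenacted = sorted(snapshot.execute(REENACTMENT).fetchall())
untouched = snapshot.execute("SELECT * FROM USacc").fetchall()

live = fresh_db()
live.executescript(TRANSACTION)
actual = sorted(live.execute("SELECT * FROM USacc").fetchall())

print(reenacted == actual)
```

The reenactment side only issues a SELECT, so the snapshot database is left unchanged, which mirrors the read-only character of reenactment.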
4. Rewrite For Provenance
Computation
• Rewrite reenactment query to compute
provenance using annotation propagation
WITH
u1 AS
(SELECT ID, Owner, Balance, 'Standard' AS Type,
ID AS prov_CanAcc_ID,
. . .
NULL AS prov_USacc_ID,
. . .
1 AS updated
FROM CanAcc AS OF SCN 3652
WHERE Type = 'US_dollar'
UNION ALL
SELECT ID, Owner, Balance, Type,
NULL AS prov_CanAcc_ID,
. . .
ID AS prov_USacc_ID,
. . .
0 AS updated
FROM USacc AS OF SCN 3652),
. . .
u2 AS
(SELECT . . .
5. Execute Query
• Execute query to retrieve provenance
Updated USacc Tuples                     | Provenance from CanAcc          | Provenance from USacc
ID | Owner        | Balance   | Type     | P1 | P2           | P3        | P4   | P5   | P6
3  | Alice Bright | 1,500,000 | Premium  | 3  | Alice Bright | 1,500,000 | NULL | NULL | NULL
5  | Mark Smith   | 50        | Standard | 5  | Mark Smith   | 50        | NULL | NULL | NULL
Conclusions
• We present our vision for GProM
– Database-independent middleware for computing
provenance of queries, updates, and transactions.
• First solution for provenance of transactions
• Query rewrite techniques on steroids:
– Provenance computation
– Transaction reenactment
– Provenance translation
– Provenance storage
– Optimization
• Extensible through RSL language
Future Work
• Implementing additional provenance types
• Comprehensive study of heuristic and cost-based
optimizations
• Design and implementation of RSL
• Implementing additional provenance formats
• Study reenactment for other concurrency control
mechanisms
– Locking protocols (2PL)
• Investigate additional use cases for reenactment
– Transaction backout
– Retroactive What-if analysis
Questions?
• Homepage:
Bahareh: http://www.cs.iit.edu/~dbgroup/people/barab.php
Boris: http://www.cs.iit.edu/~glavic/
• DBGroup:
http://www.cs.iit.edu/~dbgroup/
• GProM Project (partially funded by Oracle)
http://www.cs.iit.edu/~dbgroup/research/oracletprov.php
• Perm
http://www.cs.iit.edu/~dbgroup/research/perm.php
References
[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient
Generation and Querying of Provenance Information. In Search of
Elegance in the Theory and Practice of Computation, pages 291–320. Springer,
2013.
[2] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An
Annotation Management System for Relational Databases. VLDB Journal,
14(4):373–396, 2005.
[3] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative
data sharing via update exchange and provenance. TODS, 38(3): 19, 2013.
[4] J. Zhang and H. Jagadish. Lost source provenance. In EDBT, pages 311–
322, 2010.
[5] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T.
Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and
Lineage. In VLDB, pages 1151–1154, 2006.
[6] D. Gawlick and V. Radhakrishnan. Fine grain provenance using
temporal databases. In TaPP, 2011.
[7] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and
provenance. SIGMOD Record, 41(3):5–14, 2012.
Q-Bomb
• One pattern that arises from reenactment is long chains of
SELECT clauses using CASE
– Each level references attributes from the next level multiple times
– Subquery pull-up creates expressions of size exponential in the number
of SELECT clauses
– In practice: optimization never finishes
• Minimal example using a one-row table
SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
FROM (SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
…
FROM (SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
FROM R) … )
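The exponential growth can be reproduced with a few lines of Python. The sketch below mimics subquery pull-up by textually inlining the definition of `a` from each nested SELECT into the level above; `inline_chain` and `base_refs` are hypothetical helper names, and real optimizers work on expression trees rather than strings.

```python
def inline_chain(levels: int) -> str:
    """Expression for attribute a after pulling up `levels` nested SELECTs over R."""
    expr = "a"
    for _ in range(levels):
        # Each level mentions the inner `a` twice (THEN branch and ELSE branch),
        # so inlining doubles the number of references to the base attribute.
        expr = f"CASE WHEN b < 100 THEN {expr} ELSE {expr} + 2 END"
    return expr

def base_refs(levels: int) -> int:
    """Count references to the base attribute a in the inlined expression."""
    return inline_chain(levels).split().count("a")

for n in (1, 2, 4, 8, 16):
    print(n, base_refs(n), len(inline_chain(n)))
```

Since the number of references to `a` doubles per level, a chain of n SELECT clauses yields 2^n references after full pull-up.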
Example Provenance Computation
Example – Update Reenactment
Example – Trans. Reenactment
Rewrite Reenactment Query
Execute Rewritten Query
Types of Update Operations - Insert
• Insert executed at time t
• Updated version of R contains
1. All tuples from previous version
2. All newly inserted tuples
– Fixed tuple defined in VALUES clause
– Results of query over database version at t
• Union these two sets
INSERT INTO R VALUES (v1, ... ,vn);
-> (SELECT * FROM R AS OF t)
UNION ALL
(SELECT v1 AS a1, ... , vn AS an);
INSERT INTO R (q);
-> (SELECT * FROM R AS OF t)
UNION ALL
(q(t));
Types of Update Operations - Delete
• Delete executed at time t
• Tuples in updated version of R:
– All tuples from the previous version for which condition C is not
fulfilled
DELETE FROM R WHERE C;
-> SELECT * FROM R AS OF t
WHERE (C) IS NOT TRUE;
Types of Update Operations - Update
• Update executed at time t
• Find tuples where condition C holds and update
the attribute values
• Find tuples where C does not hold (C is false or unknown)
• Union these two sets
UPDATE R SET A WHERE C;
-> (SELECT A' FROM R AS OF t WHERE C)
UNION ALL
(SELECT * FROM R AS OF t WHERE (C) IS NOT TRUE)
Here A' denotes the projection list that applies the SET clause A.
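The three translation rules (insert, delete, update) can be checked against real statement execution with a short SQLite sketch. SQLite lacks `AS OF t`, so a copy of the table named `R_t` stands in for the version of `R` at time t; the table contents and the names `setup`, `CASES`, and `check` are illustrative assumptions.

```python
import sqlite3

def setup() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE R (a INTEGER, b INTEGER);
        INSERT INTO R VALUES (1, 10), (2, 200), (3, NULL);
        -- R_t is a stand-in for "R AS OF t" (the version seen by the update)
        CREATE TABLE R_t AS SELECT * FROM R;
    """)
    return conn

# statement -> reenactment query following the rules on the slides
CASES = {
    "insert": ("INSERT INTO R VALUES (4, 40)",
               "SELECT * FROM R_t UNION ALL SELECT 4 AS a, 40 AS b"),
    "delete": ("DELETE FROM R WHERE b > 100",
               "SELECT * FROM R_t WHERE (b > 100) IS NOT TRUE"),
    "update": ("UPDATE R SET a = a + 1 WHERE b > 100",
               "SELECT a + 1 AS a, b FROM R_t WHERE b > 100 "
               "UNION ALL "
               "SELECT * FROM R_t WHERE (b > 100) IS NOT TRUE"),
}

def check(kind: str) -> bool:
    """Compare the reenactment query result with the actually updated table."""
    statement, reenactment = CASES[kind]
    conn = setup()
    # key=repr avoids comparing NULL (None) with integers while sorting
    reenacted = sorted(conn.execute(reenactment).fetchall(), key=repr)
    conn.execute(statement)
    actual = sorted(conn.execute("SELECT * FROM R").fetchall(), key=repr)
    return reenacted == actual

results = {kind: check(kind) for kind in CASES}
print(results)
```

The row with a NULL in `b` shows why the rules use `(C) IS NOT TRUE` rather than `NOT C`: a tuple for which the condition evaluates to unknown is neither deleted nor updated and must be retained.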
READ COMMITTED
• Each statement of a transaction T sees changes committed by concurrent
transactions
• For a given update we need to combine
– tuples produced by previous statements of same transaction
– tuples produced by transactions that committed before update
• Observations
– Once a transaction T modifies a tuple t, no other transaction can access t until T
commits
– Let ui be the update executed at time x of T that first modifies t
– ui will read the latest version committed before x
– If we know ui then updates of T before x do not have to look at t
• Consider the database version 1 time unit (C-1) before commit of T
– This contains all the tuple versions seen by the first update of T updating each
individual tuple
– Let t be a tuple version in this version and its start time be y
– We know that updates from T which executed before y cannot have updated t
– We can use version C-1 as input for reenactment as long as we hide tuple
version t at y from the reenactment of an update executed at x with x < y
READ COMMITTED
WITH
u1 AS
(SELECT
CASE WHEN Balance <= 1000000 AND version <= 0 THEN 'Standard' ELSE Type END AS Type,
ID, Owner, Balance,
CASE WHEN Balance <= 1000000 AND version <= 0 THEN -1 ELSE version END AS version
FROM USacc AS OF SCN 3652),
u2 AS
(SELECT
CASE WHEN Balance > 1000000 AND version <= 1 THEN 'Premium' ELSE Type END AS Type,
ID, Owner, Balance,
CASE WHEN Balance > 1000000 AND version <= 1 THEN -1 ELSE version END AS version
FROM u1)
SELECT ID, Owner, Balance, Type FROM u2 WHERE version = -1;
Database Independence
• Encapsulate database-specific functionality in
pluggable modules.
• Components that need to be adapted:
1) Parser
2) SQL code generator
3) Metadata access
4) Audit log access
5) Time travel activation.
Accessing Several Tables
• Transactions Accessing Several Tables
– We require user to specify which table she is
interested in
– Replace access to table with query for last update
that modified the table
[Figure: updates U1 to U4 of a transaction accessing tables R1, R2, and R3]