Recomputing Materialized Instances After Changes to Mappings and Data
Todd J. Green & Zachary G. Ives
ICDE ’12, Washington, DC
April 2, 2012

Change is a Constant in Data Management
• Databases are highly dynamic; many kinds of changes need to be propagated efficiently:
  – To data (“view maintenance”)
  – To view definitions (“view adaptation”)
  – Others, such as schema evolution, etc.
• Collaborative data sharing systems (e.g., ORCHESTRA [Ives+ 05]), declarative programming platforms (e.g., LogicBlox [Huang+ 11]) exacerbate this need:
  – Large numbers (100s to 10s of 1000s) of materialized views
  – Frequent updates to data, schemas, mapping/view definitions

Change Propagation: a Problem of Computing Differences
View maintenance
  – Given: source data R, view definition, materialized view V, and change to source data RΔ (difference wrt current version)
  – Goal: compute change to materialized view VΔ (difference)
View adaptation
  – Given: source data R, view definition, materialized view V, and change to view definition (another kind of difference)
  – Goal: compute change to materialized view VΔ

Challenges in Change Propagation
• View maintenance: studied since at least the mid-eighties [Blakeley+ 86], but existing solutions quite narrow and limited
  – Various known methods to compute changes “incrementally”, e.g., count algorithm [Gupta+ 93]
  – How do we optimize this process? What is space of all update plans?
• View adaptation: less attention, but renewed importance in context of data exchange/collaborative data sharing systems
  – Previous approaches: limited to case-based methods for simple changes [Gupta+ 01]
  – Complex changes? Again, space of all update plans?
• Key challenge: compute changes using database queries!

Contributions
• We build on our previous theoretical work [G+ 09] to implement a unified and cost-based approach to handling updates to source data, view definitions, or both
• Practical implementation in ORCHESTRA CDSS, based on rewriting queries using materialized views and enriched data model
  – Core of our engine is unaware of which kind of update (to source data or to view definition) it is dealing with!
  – Engine also automatically exploits provenance information, if present
• We demonstrate significant practical benefits on workloads typical of data sharing in the life sciences

Background I: ORCHESTRA CDSS [Ives+ 08], a Collaborative Data Sharing System
• Set of peers (e.g., collaborating life scientists), each with database, agree to share information
• Peers linked via network of compositional schema mappings
  – define how data/updates applied to one peer instance should be transformed and applied to other peer instances
• System tracks provenance (lineage) information [Green+ 07] as updates are mapped/transformed
  – Basis of provenance-based trust policies
  – Also used to guide update propagation

Example: Sharing Morphological Data
Alice’s field observations: a
  ID  Species      Image    Character   State
  34  Lemur catta  (image)  hand color  white
  47  Lemur catta  (image)  hand color  white
Bob’s field observations: b, c
  b:  SID  Char        State
      61   hand color  black
  c:  SID  Species      Picture
      61   Lemur catta  (picture)
Standard species names: d
  Species      Common Name
  Lemur catta  Ring-Tailed Lemur
Carol wants to gather information from Alice, Bob, uBio, and put into own data repository:
Carol’s Guide to Primate Hand Colors
  Common Name  Hand Color
  (empty)
Can do this using schema mappings

Example: Sharing Morphological Data (2)
Datalog mappings relating databases (same tables a, b, c, d as above):
e(Name, Color) :– b(Id, “hand color”, Color), c(Id, Species, _), d(Species, Name).
e(Name, Color) :– a(Id, Species, _, “hand color”, Color), d(Species, Name).
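
The two mappings above can be evaluated directly on the example tables. The following is a minimal Python sketch under bag semantics (duplicates kept), not ORCHESTRA code; the function name evaluate_mappings and the use of None for the image and picture columns are illustrative assumptions.

```python
from collections import Counter

# Example instances from the slides; image/picture cells are placeholders (None).
a = [(34, "Lemur catta", None, "hand color", "white"),
     (47, "Lemur catta", None, "hand color", "white")]
b = [(61, "hand color", "black")]
c = [(61, "Lemur catta", None)]
d = [("Lemur catta", "Ring-Tailed Lemur")]


def evaluate_mappings(a, b, c, d):
    """Evaluate both rules for e, counting one copy per derivation."""
    e = Counter()
    # e(Name, Color) :- b(Id, "hand color", Color), c(Id, Species, _), d(Species, Name).
    for (b_id, char, color) in b:
        if char != "hand color":
            continue
        for (c_id, species, _picture) in c:
            if c_id != b_id:
                continue
            for (d_species, name) in d:
                if d_species == species:
                    e[(name, color)] += 1
    # e(Name, Color) :- a(Id, Species, _, "hand color", Color), d(Species, Name).
    for (_id, species, _image, char, color) in a:
        if char != "hand color":
            continue
        for (d_species, name) in d:
            if d_species == species:
                e[(name, color)] += 1
    return e


e = evaluate_mappings(a, b, c, d)
for (name, color), count in sorted(e.items()):
    print(name, color, count)
```

On this instance, Alice's two white observations yield two copies of (Ring-Tailed Lemur, white), and Bob's observation yields one copy of (Ring-Tailed Lemur, black).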
Carol’s Guide to Primate Hand Colors: e
  Common Name        Hand Color
  Ring-Tailed Lemur  black
  Ring-Tailed Lemur  white
  Ring-Tailed Lemur  white

Background II: Data Updates as Z-Relations [G+ 09]
• Can think of changes to data as a kind of annotated relation
• Z-relation: a relation where each tuple is associated with a (positive or negative) count
  – Positive counts indicate (multiple) insertions; negative counts, (multiple) deletions
  – Uniform representation for both data and changes to data
  – Update application = union (a query!): r’ = r ∪ rΔ
Example rΔ:
  a  b    2   (inserted tuple, +)
  c  d   –3   (deleted tuple, –)

Relational Algebra (RA) on Z-Relations [G+ 09]
• join (⋈) multiplies counts
• union (∪), projection (π) add counts (same as for semiring-annotated relations [G+ 07])
• selection (σ) multiplies counts by 0 or 1
• difference (–) subtracts counts
Note: difference can lead to negative counts (unlike “proper subtraction” in bag semantics, where negative counts are truncated to 0)

Incremental View Maintenance: An Application of Z-Relations
View definition: v(X,Y) :– r(X,Z), r(Z,Y)
[Figure: source relation R; materialized view V (with duplicates, e.g., 2 copies of (b,b)); change RΔ (a deletion and an insertion); resulting change VΔ (delete 1 copy of (b,b), insert 1 copy of (b,d)). Count annotations in the figure include (b,a) –1, (b,b) –1, (c,d) +1, (b,d) +1.]
Delta rules [Gupta+ 93] for v with Z-relations semantics:
  vΔ(X,Y) :– r(X,Z), rΔ(Z,Y)
  vΔ(X,Y) :– rΔ(X,Z), r’(Z,Y)

Z-Relations are Amenable to Advanced Optimization Strategies [G+ 09]
• For change propagation, fundamental need for difference in query language => full relational algebra (RA)
• Under set or bag (multiset) semantics, basic optimization tasks (e.g., testing query equivalence) are undecidable for RA
• Under Z-semantics, equivalence of RA queries is, surprisingly, decidable
• Even better, rewriting queries using materialized views can be done via sound and complete procedure

Our Approach in This Work
• Cast view maintenance/view adaptation as special cases of rewriting queries using views
• Use cost-based search to find a good (not perfect) plan within time budget
• Emulate Z-semantics with an off-the-shelf DBMS via encoding scheme
• Handle / exploit provenance information, if we have it
  – Provenance has sizable storage overhead
  – But also unlocks many useful new rewritings

View Maintenance: a Special Case of Rewriting Queries Using Views on Z-Relations
Query (to compute diff.; a leading “–” marks a rule whose result is subtracted):
  vΔ(X,Y) :– r’(X,Z), r’(Z,Y)
– vΔ(X,Y) :– r(X,Z), r(Z,Y)
Materialized views:
  v(X,Y) :– r(X,Z), r(Z,Y)
  r’(X,Y) :– r(X,Y)
  r’(X,Y) :– rΔ(X,Y)
Rewrite vΔ using the materialized views.
Delta rules rewriting:
  vΔ(X,Y) :– r(X,Z), rΔ(Z,Y)
  vΔ(X,Y) :– rΔ(X,Z), r’(Z,Y)
Another delta rules rewriting:
  vΔ(X,Y) :– rΔ(X,Z), r(Z,Y)
  vΔ(X,Y) :– r’(X,Z), rΔ(Z,Y)
... OTHER PLANS…?

View Adaptation: Another Application of Rewriting Queries Using Views
Old view definition:
  v(X,Y) :– r(X,Z), r(Z,Y).
  v(X,Y) :– s(X,Y,_).
New view definition:
  v’(X,Y) :– r(X,Z), r(Z,Y).
Reformulate using materialized view v.
A plan to “adapt” v into v’:
  v’(X,Y) :– v(X,Y).
– v’(X,Y) :– s(X,Y,_).
... OTHER PLANS…?

Searching the Space of Rewritings: Time-Boxed, Cost-Based Hill-Climbing
[Figure: animated search example. Inputs: the query q’, the views (q(…) :- …, r’(…) :- …, s’(…) :- …), and a time budget. The original plan p1 for q’ has estimated cost 27. A plan heap holds candidate plans with their estimated costs, and an explored set records plans already expanded. “One-step” and “two-step” rewritings of p1 using views are generated with their costs (p2 : 45, p3 : 17, p14 : 20, p15 : 12, p16 : 74, p17 : 18, …). The best plan found improves from p1 : 27 to p3 : 17 to p15 : 12; when out of time, the search returns the best plan found, p15 : 12.]

How Does Provenance Fit In?
• For view adaptation, often useful to “separate” disjuncts of a union, or “recover” values projected away
• Would like some sort of index structure for this
• Such a structure already exists in CDSS in form of provenance information

Graphical Model of Data Provenance
a:
  ID  Species   Character   State
  34  L. catta  hand color  white
  47  L. catta  hand color  white
d:
  Species   Comm. Name
  L. catta  Ring-Tailed Lemur
e:
  Comm. Name         Hand Color
  Ring-Tailed Lemur  white
Datalog mappings:
  m1: e(Name, Color) :– a(Id, Species, “hand color”, Color), d(Species, Name).
Provenance table for m1 (columns of a, then d, then e):
  ID  Species   Character   State | Species   Comm. Name      | Comm. Name      Hand Color
  34  L. catta  hand color  white | L. catta  Ring-Tailed L.  | Ring-Tailed L.  white
  47  L. catta  hand color  white | L. catta  Ring-Tailed L.  | Ring-Tailed L.  white
Compress table using mapping’s correspondences (figure annotations on redundant columns: = a.Species, = d.Comm. Name, = a.Character)

How to Compute Provenance Graph? Use Datalog!
To record provenance for mapping m1
  e(N, C) :– a(I, S, “...”, C), d(S, N).
we convert it to a pair of mappings
  m1(I, S, N, C) :– a(I, S, “...”, C), d(S, N).
  e(N, C) :– m1(_, _, N, C).
The first rule builds the provenance table for m1
The second rule projects over m1 to populate e
“Just more Datalog views” => automatically exploited by reformulation engine!
  – engine doesn’t even know that it’s using provenance!

Exploiting Provenance Information in View Adaptation
Example (WITHOUT provenance):
  e(…) :– a(…), d(…).
  e(…) :– c(…), g(…).
Mapping revision:
  e’(…) :– a(…), d(…).
  e’(…) :– c(…), g(…), f(…).
  cost ≈ 2-way join + 3-way join
Incremental plan to compute e’ using e (faster???):
  e’(…) :– e(…).
  e’(…) :– c(…), g(…), f(…).
– e’(…) :– c(…), g(…).
  cost ≈ 2-way join + 3-way join…

How Provenance Information Enables New Rewritings (cont’d)
Same example (but WITH provenance):
  e(…) :– m1(…).
  e(…) :– m2(…).
  m1(…) :– a(…), d(…).
  m2(…) :– c(…), g(…).
Mapping revision:
  e’(…) :– m1’(…).
  e’(…) :– m2’(…).
  m1’(…) :– a(…), d(…).
  m2’(…) :– c(…), g(…), f(…).
  cost ≈ 2-way join + 3-way join
Incremental plan to compute e’ (and mapping tables):
  e’(…) :– m1’(…).
  e’(…) :– m2’(…).
  m1’(…) :– m1(…).
  m2’(…) :– m2(…), f(…).
  cost ≈ 2-way join!

Experimental Evaluation
• Synthetic workload based on SWISS-PROT biological dataset
• Generate source tables, mappings, changes to mappings and data
  – Changes to mapping definitions guided by empirical observations of schema changes in practice
• Start with 16 source relations, 24 views
• Apply sequences of 24 “primitive modifications” to view definitions
  – add/drop column in rule head, add/drop data source for view, add correspondence table, reorder columns, ...

Highlights of Experiments
[Figure: four bar charts of normalized running time per view # (plus AVG), covering mapping changes only and mapping changes + data changes, each without and with provenance.
  Fig. 5. Mixed workload (mapping changes only); no provenance tables
  Fig. 6. Mixed workload (mapping changes only); with provenance tables
  Fig. 8. Mixed workload (mapping and data changes); no provenance tables
  Fig. 9. Mixed workload (mapping and data changes); with provenance tables
Net speedups annotated on the panels: ~20%, ~40%, ~45%, ~80%.]

Summary
• We’ve shown that optimized change propagation is feasible, and can yield large speedups
  – Can handle updates to mappings, data, or both via a generic reformulation engine based on optimizing queries using materialized views
• For systems like ORCHESTRA that store provenance information, even more opportunities for optimization
  – Easy to retrofit any Datalog-based system (e.g., LogicBlox) to store same kind of provenance information
  – (Benefits must be balanced with storage costs…)

Related Work
• Incremental view maintenance [Blakeley+ 86], [Gupta+ 93], ...
  – “deltas” [Gupta+ 93]: an early form of our Z-relations
• Answering queries using views [Levy+ 95], [Chaudhuri+ 95], [Afrati&Pavlaki 06], Chase&Backchase [Deutsch,Popa,Tannen 99], ...
• Bag-containment/bag-equivalence of CQs/UCQs [Lovász 67], [Chaudhuri&Vardi 93], [Ioannidis&Ramakrishnan 95], [Cohen+ 99], [Jayram+ 06]
• View adaptation [Mohania&Dong 96], [Gupta+ 01]

Related Work (cont’d)
• Mapping evolution [Velegrakis+ 03]
• Recursively-compiled view maintenance plans [Ahmad&Koch 09, Koch 10]
• Data exchange [Fagin+ 05], P2P data exchange [Fuxman+ 05]
• Youtopia [Koch 09]
• Mapping adaptation [Yu&Popa 05]
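
As a worked check of the delta rules for v(X,Y) :– r(X,Z), r(Z,Y) from the incremental view maintenance slide, the following is a minimal Python sketch, not ORCHESTRA code. It assumes Z-relations are represented as dictionaries from tuples to nonzero integer counts; the helper names (prune, z_union, z_path_join) and the small instance are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def prune(rel):
    """Drop tuples whose count is zero."""
    return {t: n for t, n in rel.items() if n != 0}


def z_union(r, s):
    """Union adds counts; with a delta as s, this applies an update."""
    out = defaultdict(int)
    for rel in (r, s):
        for t, n in rel.items():
            out[t] += n
    return prune(out)


def z_path_join(r, s):
    """q(X,Y) :- r(X,Z), s(Z,Y): join multiplies counts, projecting out Z adds them."""
    out = defaultdict(int)
    for (x, z1), n in r.items():
        for (z2, y), m in s.items():
            if z1 == z2:
                out[(x, y)] += n * m
    return prune(out)


# Assumed instance (hypothetical, for illustration only).
r = {("a", "b"): 1, ("b", "a"): 1, ("b", "c"): 1, ("c", "b"): 1}
r_delta = {("b", "a"): -1, ("c", "d"): +1}  # delete (b,a), insert (c,d)

r_new = z_union(r, r_delta)   # update application as union of r and its delta
v = z_path_join(r, r)         # materialized view before the update

# Delta rules: the first joins r with the delta, the second joins the delta with r_new.
v_delta = z_union(z_path_join(r, r_delta), z_path_join(r_delta, r_new))

# Applying the delta to v must agree with recomputing the view from scratch.
v_new = z_union(v, v_delta)
assert v_new == z_path_join(r_new, r_new)
print(sorted(v_delta.items()))
```

The negative counts in v_delta are what let deletions be propagated with the same join and union operators used for insertions, which is the point of evaluating the rules under Z-semantics rather than bag semantics.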