Recomputing Materialized Instances
After Changes to Mappings and Data
Todd J. Green & Zachary G. Ives
ICDE ’12 Washington, DC
April 2, 2012
Change is a Constant in Data Management
• Databases are highly dynamic; many kinds of changes
need to be propagated efficiently:
– To data (“view maintenance”)
– To view definitions (“view adaptation”)
– Others, such as schema evolution, etc.
• Collaborative data sharing systems (e.g., ORCHESTRA [Ives+
05]), declarative programming platforms (e.g., LogicBlox
[Huang+11]) exacerbate this need:
– Large numbers (100s to 10s of 1000s) of materialized views
– Frequent updates to data, schemas, mapping/view definitions
2
Change Propagation: a Problem of
Computing Differences
View maintenance
• Given: source data R, view definition, materialized view V
• Given: change to source data RΔ (difference wrt current version)
• Goal: compute change to materialized view VΔ (difference)
View adaptation
• Given: source data R, view definition, materialized view V
• Given: change to view definition (another kind of difference)
• Goal: compute change to materialized view VΔ
3
Challenges in Change Propagation
• View maintenance: studied since at least the mid-eighties [Blakeley+
86], but existing solutions quite narrow and limited
– Various known methods to compute changes “incrementally”, e.g.,
count algorithm [Gupta+ 93]
– How do we optimize this process? What is space of all update plans?
• View adaptation: less attention, but renewed importance in
context of data exchange/collaborative data sharing systems
– Previous approaches: limited to case-based methods for simple
changes [Gupta+ 01]
– Complex changes? Again, space of all update plans?
• Key challenge: compute changes using database queries!
4
Contributions
• We build on our previous theoretical work [G+ 09] to implement
a unified and cost-based approach to handling updates to
source data, view definitions, or both
• Practical implementation in Orchestra CDSS, based on
rewriting queries using materialized views and enriched data
model
– Core of our engine is unaware of which kind of update (to source data
or to view definition) it is dealing with!
– Engine also automatically exploits provenance information, if present
• We demonstrate significant practical benefits on workloads
typical of data sharing in the life sciences
5
Background I: ORCHESTRA CDSS [Ives+08]
a Collaborative Data Sharing System
• Set of peers (e.g., collaborating life scientists), each with
database, agree to share information
• Peers linked via network of compositional schema
mappings
– define how data/updates applied to one peer instance should
be transformed and applied to other peer instances
• System tracks provenance (lineage) information [Green+
07] as updates are mapped/transformed
– Basis of provenance-based trust policies
– Also used to guide update propagation
6
Example: Sharing Morphological Data
Carol wants to gather information from Alice, Bob, uBio, and put into own data repository. Can do this using schema mappings.

Alice’s field observations: a
ID   Species      Image    Character   State
34   Lemur catta  [image]  hand color  white
47   Lemur catta  [image]  hand color  white

Bob’s field observations: b, c
b:
SID  Char        State
61   hand color  black
c:
SID  Species      Picture
61   Lemur catta  [image]

Standard species names: d
Species      Common Name
Lemur catta  Ring-Tailed Lemur

Carol’s Guide to Primate Hand Colors (to be populated via schema mappings)
Common Name  Hand Color
7
Example: Sharing Morphological Data (2)
Alice’s field observations: a
ID   Species      Image    Character   State
34   Lemur catta  [image]  hand color  white
47   Lemur catta  [image]  hand color  white

Bob’s field observations: b, c
b:
SID  Char        State
61   hand color  black
c:
SID  Species      Picture
61   Lemur catta  [image]

Standard species names: d
Species      Common Name
Lemur catta  Ring-Tailed Lemur

Datalog mappings relating databases
e(Name, Color) :–
    b(Id, “hand color”, Color),
    c(Id, Species, _), d(Species, Name).
e(Name, Color) :–
    a(Id, Species, _, “hand color”, Color),
    d(Species, Name).

Carol’s Guide to Primate Hand Colors: e
Common Name        Hand Color
Ring-Tailed Lemur  white
8
Example: Sharing Morphological Data (2)
Evaluating the joins in both Datalog mappings over a, b, c, d populates Carol’s table:

Carol’s Guide to Primate Hand Colors: e
Common Name        Hand Color
Ring-Tailed Lemur  black
Ring-Tailed Lemur  white
Ring-Tailed Lemur  white
10
Background II: Data Updates as Z-Relations [G+09]
• Can think of changes to data as a kind of annotated relation: an inserted tuple is annotated “+”, a deleted tuple “–”
• Z-relation: a relation where each tuple is associated with a (positive or negative) count
– Positive counts indicate (multiple) insertions; negative counts, (multiple) deletions
– Uniform representation for both data and changes to data
– Update application = union (a query!): r’ = r ∪ rΔ
Example rΔ:
a b    2
c d   –3
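A minimal sketch of update application as union, assuming a Z-relation is stored as a Python dict from tuples to integer counts. The helper name `z_union` and the contents of `r` are hypothetical; only `r_delta` mirrors the example change above.

```python
from collections import Counter

def z_union(r, s):
    """Union of two Z-relations: add counts, drop tuples whose count becomes 0."""
    out = Counter(r)
    for tup, n in s.items():
        out[tup] += n
    return {tup: n for tup, n in out.items() if n != 0}

# Current instance r (hypothetical contents) and the change r_delta from the slide.
r = {("a", "b"): 1, ("c", "d"): 3, ("e", "f"): 1}
r_delta = {("a", "b"): 2, ("c", "d"): -3}

# r' = r ∪ rΔ: two more copies of (a,b), all three copies of (c,d) removed.
r_new = z_union(r, r_delta)
print(r_new)
```

Because data and changes share one representation, applying a change is the same operation as combining two data sets.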
11
Relational Algebra (RA) on Z-Relations [G+09]
join (⋈) multiplies counts
union (∪), projection (π) add counts
selection (σ) multiplies counts by 0 or 1
(join, union, projection, selection: same as for semiring-annotated relations [G+07])
difference (–) subtracts counts
Note, difference can lead to negative counts (unlike “proper subtraction” in bag semantics where negative counts are truncated to 0)
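The operator semantics above can be sketched as follows, again assuming a dict from tuples to integer counts. All function names and the positional-column convention are illustrative choices, not the system's API.

```python
from collections import defaultdict

def _clean(rel):
    """Drop tuples whose count is 0."""
    return {t: n for t, n in rel.items() if n != 0}

def union(r, s):
    """Union adds counts."""
    out = defaultdict(int)
    for rel in (r, s):
        for t, n in rel.items():
            out[t] += n
    return _clean(out)

def difference(r, s):
    """Difference subtracts counts; results may be negative."""
    return union(r, {t: -n for t, n in s.items()})

def select(r, pred):
    """Selection multiplies each count by 1 (pred holds) or 0 (it does not)."""
    return {t: n for t, n in r.items() if pred(t)}

def project(r, cols):
    """Projection adds the counts of tuples that collapse to the same result."""
    out = defaultdict(int)
    for t, n in r.items():
        out[tuple(t[i] for i in cols)] += n
    return _clean(out)

def join(r, s, on):
    """Join multiplies counts; `on` lists (i, j) pairs: column i of r = column j of s."""
    out = defaultdict(int)
    for t, n in r.items():
        for u, m in s.items():
            if all(t[i] == u[j] for i, j in on):
                out[t + u] += n * m
    return _clean(out)

r = {("a", "b"): 2, ("b", "c"): 1}
s = {("b", "c"): 3}
joined = join(r, s, on=[(1, 0)])           # 2 * 3 = 6 copies of (a, b, b, c)
diff = difference(r, {("a", "b"): 5})      # (a, b) ends up with count -3
```

Under bag semantics the last result would be truncated to 0 copies of (a, b); here the negative count is kept.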
12
Incremental View Maintenance:
An Application of Z-Relations
View definition:
v(X,Y) :– r(X,Z), r(Z,Y)

Source relation R:
a b    1
c b    1
b c    1
b a    1

Materialized view V (with duplicates):
a a    1
a c    1
b b    2    (2 copies of (b,b))
c a    1
c c    1

Change to source RΔ:
b a   –1    (deletion)
c d   +1    (insertion)

Change to view VΔ (entries called out on the slide):
b b   –1    (delete 1 copy of (b,b))
b d   +1    (insert 1 copy of (b,d))

Delta rules [Gupta+ 93] for v with Z-relations semantics:
vΔ(X,Y) :– r(X,Z), rΔ(Z,Y)
vΔ(X,Y) :– rΔ(X,Z), r’(Z,Y)
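A small sketch of the two delta rules evaluated over Z-relations stored as dicts. The instance is an assumed reading of the slide's example (four source tuples, one deletion, one insertion), and the helper names `add` and `compose` are hypothetical.

```python
from collections import defaultdict

def add(*rels):
    """Union of Z-relations: add counts, drop zeros."""
    out = defaultdict(int)
    for rel in rels:
        for t, n in rel.items():
            out[t] += n
    return {t: n for t, n in out.items() if n != 0}

def compose(r, s):
    """q(X,Y) :- r(X,Z), s(Z,Y): counts multiply in the join and add up in the projection."""
    out = defaultdict(int)
    for (x, z), n in r.items():
        for (z2, y), m in s.items():
            if z == z2:
                out[(x, y)] += n * m
    return {t: n for t, n in out.items() if n != 0}

r = {("a", "b"): 1, ("c", "b"): 1, ("b", "c"): 1, ("b", "a"): 1}
r_delta = {("b", "a"): -1, ("c", "d"): 1}   # delete (b,a), insert (c,d)

v = compose(r, r)                 # materialized view, with duplicates
r_new = add(r, r_delta)           # r' = r ∪ rΔ

# vΔ(X,Y) :- r(X,Z), rΔ(Z,Y)   and   vΔ(X,Y) :- rΔ(X,Z), r'(Z,Y)
v_delta = add(compose(r, r_delta), compose(r_delta, r_new))
v_new = add(v, v_delta)           # apply the change to the view

# The maintained view agrees with recomputation from scratch.
assert v_new == compose(r_new, r_new)
```

On this instance `v_delta` contains the two entries called out above, (b,b) with count -1 and (b,d) with count +1, alongside the deletions that follow from removing (b,a).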
13
Z-Relations are Amenable to Advanced
Optimization Strategies [G+09]
• For change propagation, fundamental need for
difference in query language => full relational algebra
(RA)
• Under set or bag (multiset) semantics, basic optimization
tasks (e.g., testing query equivalence) are undecidable
for RA
• Under Z-semantics, equivalence of RA queries is,
surprisingly, decidable
• Even better, rewriting queries using materialized views
can be done via sound and complete procedure
14
Our Approach in This Work
• Cast view maintenance/view adaptation as special cases
of rewriting queries using views
• Use cost-based search to find a good (not perfect) plan
within time budget
• Emulate Z-semantics with an off-the-shelf DBMS via
encoding scheme
• Handle / exploit provenance information, if we have it
– Provenance has sizable storage overhead
– But also unlocks many useful new rewritings
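The slides do not spell out the encoding scheme. One possible encoding, shown here only as an assumption, stores each Z-relation as an ordinary table with an extra integer count column and uses SUM/GROUP BY to combine counts. The table and column names (`r`, `r_delta`, `r_new`, `cnt`) and the data are hypothetical, and SQLite stands in for the off-the-shelf DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r       (x TEXT, y TEXT, cnt INTEGER);
    CREATE TABLE r_delta (x TEXT, y TEXT, cnt INTEGER);
    INSERT INTO r VALUES ('a','b',1), ('c','b',1), ('b','c',1), ('b','a',1);
    INSERT INTO r_delta VALUES ('b','a',-1), ('c','d',1);
""")

# Update application r' = r ∪ rΔ: concatenate, sum counts, drop zero-count tuples.
conn.execute("""
    CREATE TABLE r_new AS
    SELECT x, y, SUM(cnt) AS cnt
    FROM (SELECT * FROM r UNION ALL SELECT * FROM r_delta)
    GROUP BY x, y
    HAVING SUM(cnt) <> 0
""")

# v'(X,Y) :- r'(X,Z), r'(Z,Y): the join multiplies counts, the projection sums them.
rows = conn.execute("""
    SELECT r1.x, r2.y, SUM(r1.cnt * r2.cnt) AS cnt
    FROM r_new AS r1 JOIN r_new AS r2 ON r1.y = r2.x
    GROUP BY r1.x, r2.y
    HAVING SUM(r1.cnt * r2.cnt) <> 0
    ORDER BY r1.x, r2.y
""").fetchall()
print(rows)
```

Negative counts need no special treatment in this encoding because they are plain integers in the count column.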
15
View Maintenance: a Special Case of
Rewriting Queries Using Views on Z-Relations
Query (to compute diff.):
  vΔ(X,Y) :– r’(X,Z), r’(Z,Y)
– vΔ(X,Y) :– r(X,Z), r(Z,Y)
Materialized views:
  v(X,Y) :– r(X,Z), r(Z,Y)
  r’(X,Y) :– r(X,Y)
  r’(X,Y) :– rΔ(X,Y)
Rewrite vΔ using the materialized views.
Delta rules rewriting:
  vΔ(X,Y) :– r(X,Z), rΔ(Z,Y)
  vΔ(X,Y) :– rΔ(X,Z), r’(Z,Y)
Another delta rules rewriting:
  vΔ(X,Y) :– rΔ(X,Z), r(Z,Y)
  vΔ(X,Y) :– r’(X,Z), rΔ(Z,Y)
... OTHER PLANS…?
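As a sanity check rather than a proof, the sketch below tests on random Z-relations that the difference query and both delta-rule rewritings return the same answer. The helper names, the three-value domain, and the fixed random seed are all assumptions made for the illustration.

```python
import random
from collections import defaultdict

def add(*rels):
    """Union of Z-relations: add counts, drop zeros."""
    out = defaultdict(int)
    for rel in rels:
        for t, n in rel.items():
            out[t] += n
    return {t: n for t, n in out.items() if n != 0}

def neg(r):
    """Negate every count, so that add(q, neg(p)) computes q - p."""
    return {t: -n for t, n in r.items()}

def compose(r, s):
    """q(X,Y) :- r(X,Z), s(Z,Y) under Z-semantics."""
    out = defaultdict(int)
    for (x, z), n in r.items():
        for (z2, y), m in s.items():
            if z == z2:
                out[(x, y)] += n * m
    return {t: n for t, n in out.items() if n != 0}

def random_zrel(rng, domain="abc", max_abs=2):
    """Random binary Z-relation with counts in [-max_abs, max_abs]."""
    rel = {}
    for x in domain:
        for y in domain:
            n = rng.randint(-max_abs, max_abs)
            if n != 0:
                rel[(x, y)] = n
    return rel

rng = random.Random(0)
checked = 0
for _ in range(200):
    r, r_delta = random_zrel(rng), random_zrel(rng)
    r_new = add(r, r_delta)
    diff_query = add(compose(r_new, r_new), neg(compose(r, r)))
    plan_a = add(compose(r, r_delta), compose(r_delta, r_new))
    plan_b = add(compose(r_delta, r), compose(r_new, r_delta))
    assert diff_query == plan_a == plan_b
    checked += 1
```

The three plans are equivalent queries but touch different relations, which is why a cost-based choice among them can matter.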
16
View Adaptation: Another Application of
Rewriting Queries Using Views
Old view definition:
  v(X,Y) :– r(X,Z), r(Z,Y).
  v(X,Y) :– s(X,Y,_).
New view definition:
  v’(X,Y) :– r(X,Z), r(Z,Y).
Reformulate using materialized view v.
A plan to “adapt” v into v’:
  v’(X,Y) :– v(X,Y).
– v’(X,Y) :– s(X,Y,_).
... OTHER PLANS…?
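A sketch of the adaptation plan above on a toy instance: the old view is the union of two rules, the new definition drops the rule over s, and the adapted view is obtained by subtracting the projection of s from the materialized v. Relation contents and helper names are hypothetical.

```python
from collections import defaultdict

def add(*rels):
    """Union of Z-relations: add counts, drop zeros."""
    out = defaultdict(int)
    for rel in rels:
        for t, n in rel.items():
            out[t] += n
    return {t: n for t, n in out.items() if n != 0}

def neg(r):
    return {t: -n for t, n in r.items()}

def compose(r, s):
    """q(X,Y) :- r(X,Z), s(Z,Y) under Z-semantics."""
    out = defaultdict(int)
    for (x, z), n in r.items():
        for (z2, y), m in s.items():
            if z == z2:
                out[(x, y)] += n * m
    return {t: n for t, n in out.items() if n != 0}

def project_xy(s):
    """q(X,Y) :- s(X,Y,_): drop the third column, adding counts."""
    out = defaultdict(int)
    for (x, y, _), n in s.items():
        out[(x, y)] += n
    return {t: n for t, n in out.items() if n != 0}

r = {("a", "b"): 1, ("b", "c"): 1}
s = {("a", "c", "k1"): 1, ("d", "e", "k2"): 2}

# Old view: v :- r, r.   v :- s.   (already materialized)
v_old = add(compose(r, r), project_xy(s))

# Adaptation plan: v' = v - project(s); no join over r is re-evaluated.
v_adapted = add(v_old, neg(project_xy(s)))

assert v_adapted == compose(r, r)   # matches the new definition computed from scratch
```

Note that (a, c) is derived by both old rules. Keeping counts is what lets the subtraction remove exactly the copy contributed by s.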
17
Searching the Space of Rewritings:
Time-Boxed, Cost-Based Hill-Climbing
[Animated figure: PLAN HEAP and EXPLORED SET, driven by the VIEWS and a TIME BUDGET]
VIEWS: q(…) :- … .  r’(…) :- … .  s’(…) :- … .
original plan p1 for q’ with its (est’d) cost: p1 : 27
“one-step” rewritings of p1 using views, with costs
“two-step” rewritings of p1 using views, with costs
Plans and est’d costs shown: p2 : 45, p3 : 17, p14 : 20, p15 : 12, p16 : 74, p17 : 18
Best plan so far as the search proceeds: p1 : 27, then p3 : 17, then p15 : 12
OUT OF TIME: return best plan found, p15 : 12
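The search loop can be sketched as a time-boxed, best-first exploration over a plan heap and an explored set. This is an illustration under assumptions, not the engine's actual code: `time_boxed_search`, `COSTS` and `REWRITES` are hypothetical names, the plan costs are the ones shown on the slide, and which plan is a one-step rewriting of which is guessed.

```python
import heapq
import time

def time_boxed_search(start, cost, rewrites, budget_seconds):
    """Explore rewritings cheapest-first until the heap empties or time runs out."""
    deadline = time.monotonic() + budget_seconds
    best = start
    heap = [(cost(start), start)]       # plan heap ordered by estimated cost
    explored = set()
    while heap and time.monotonic() < deadline:
        c, plan = heapq.heappop(heap)
        if plan in explored:
            continue
        explored.add(plan)
        if c < cost(best):
            best = plan
        for nxt in rewrites(plan):      # one-step rewritings of the popped plan
            if nxt not in explored:
                heapq.heappush(heap, (cost(nxt), nxt))
    return best, cost(best)

# Toy search space: costs from the slide; the rewriting edges are an assumption.
COSTS = {"p1": 27, "p2": 45, "p3": 17, "p14": 20, "p15": 12, "p16": 74, "p17": 18}
REWRITES = {"p1": ["p2", "p3", "p14"], "p3": ["p15", "p16", "p17"]}

result = time_boxed_search(
    "p1", COSTS.__getitem__, lambda p: REWRITES.get(p, []), budget_seconds=1.0
)
print(result)
```

With a zero budget the function returns the original plan unchanged, which is the fallback behaviour a time box implies.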
18
How Does Provenance Fit In?
• For view adaptation, often useful to “separate”
disjuncts of a union, or “recover” values projected
away
• Would like some sort of index structure for this
• Such a structure already exists in CDSS in form of
provenance information
19
Graphical Model of Data Provenance
Source table a:
ID   Species   Character   State
34   L. catta  hand color  white
47   L. catta  hand color  white

Source table d:
Species   Comm. Name
L. catta  Ring-Tailed Lemur

Datalog mappings:
m1: e(Name, Color) :–
    a(Id, Species, “hand color”, Color),
    d(Species, Name).

Target table e:
Comm. Name         Hand Color
Ring-tailed Lemur  white

Provenance table for m1:
ID   Species   Character   State  | Species   Comm. Name      | Comm. Name      Hand Color
34   L. catta  hand color  white  | L. catta  Ring-Tailed L.  | Ring-tailed L.  white
47   L. catta  hand color  white  | L. catta  Ring-Tailed L.  | Ring-tailed L.  white
Column correspondences marked on the slide: = a.Species, = d.Comm. Name, = a.Character

Compress table using mapping’s correspondences
20
How to Compute Provenance Graph?
Use Datalog!
To record provenance for mapping m1
e(N, C) :– a(I, S, “...”, C), d(S, N).
we convert it to a pair of mappings
m1(I, S, N, C) :– a(I, S, “...”, C), d(S, N).
e(N, C) :– m1(_, _, N, C).
The first rule builds the
provenance table for m1
The second rule projects
over m1 to populate e
“Just more Datalog views” => automatically exploited by
reformulation engine!
– engine doesn’t even know that it’s using provenance!
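A sketch of the split into a provenance rule and a projection rule, using plain Python lists as bags. The data follows the lemur example; the projection reads m1's columns in the order of its head, (I, S, N, C). Variable names are illustrative.

```python
# Source tables: a(Id, Species, Character, State) and d(Species, Name).
a = [
    (34, "L. catta", "hand color", "white"),
    (47, "L. catta", "hand color", "white"),
]
d = [("L. catta", "Ring-Tailed Lemur")]

# m1(I, S, N, C) :- a(I, S, "hand color", C), d(S, N).
# One row per derivation: this is the provenance table for the mapping.
m1 = [
    (i, s, n, c)
    for (i, s, character, c) in a
    if character == "hand color"
    for (s2, n) in d
    if s2 == s
]

# e(N, C) :- m1(_, _, N, C).
# The target table is a (bag) projection of the provenance table.
e = [(n, c) for (_, _, n, c) in m1]

print(m1)
print(e)
```

Each row of `m1` records which source tuples produced a row of `e`, so values projected away in `e` (here the observation IDs) stay recoverable.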
21
Exploiting Provenance Information
in View Adaptation
Example (WITHOUT provenance):
e(…) :– a(…), d(…).
e(…) :– c(…), g(…).
mapping revision
e’(…) :– a(…), d(…).
e’(…) :– c(…), g(…), f(…).
cost ≈
2-way join + 3-way join
Incremental plan to compute e’ using e (faster???)
e’(…) :– e(…).
e’(…) :– c(…), g(…), f(…).
– e’(…) :– c(…), g(…).
cost ≈
2-way join + 3-way join…
22
How Provenance Information
Enables New Rewritings (cont’d)
Same example (but WITH provenance):
e(…) :– m1(…).
e(…) :– m2(…).
m1(...) :– a(…), d(…).
m2(…) :– c(…), g(…).
mapping revision
e’(…) :– m1’(…).
e’(…) :– m2’(…).
m1’(…) :– a(…), d(…).
m2’(…) :– c(…), g(…), f(…).
cost ≈
2-way join + 3-way join
Incremental plan to compute e’ (and mapping tables)
e’(…) :– m1’(...).
e’(…) :– m2’(...).
m1’(…) :– m1(...).
m2’(...) :– m2(...), f(…).
cost ≈
2-way join!
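A toy sketch of why the provenance tables help here: with m1 and m2 materialized, the revised m2’ needs only one join of m2 with f, and m1’ is a copy of m1. The schemas (binary a and c, unary d, g, f held as sets) and all data are assumptions made for the illustration.

```python
from collections import Counter

def project_first(rel):
    """e(X) :- m(X, _): keep the first column, adding counts."""
    out = Counter()
    for (x, _y), n in rel.items():
        out[x] += n
    return out

# Hypothetical sources.
a = Counter({(1, "p"): 1, (2, "q"): 1})
c = Counter({(1, "p"): 1, (3, "q"): 1})
d = {"p", "q"}
g = {"p", "q"}
f = {3}

# Materialized provenance tables for the old mappings.
m1 = Counter({t: n for t, n in a.items() if t[1] in d})   # m1 :- a, d.
m2 = Counter({t: n for t, n in c.items() if t[1] in g})   # m2 :- c, g.
e_old = project_first(m1) + project_first(m2)

# Incremental plan after the revision: m1' :- m1.   m2' :- m2, f.
m1_new = m1
m2_new = Counter({t: n for t, n in m2.items() if t[0] in f})  # the single 2-way join
e_new = project_first(m1_new) + project_first(m2_new)

# Recomputation from the sources: 2-way join plus 3-way join.
m1_scratch = Counter({t: n for t, n in a.items() if t[1] in d})
m2_scratch = Counter({t: n for t, n in c.items() if t[1] in g and t[0] in f})
e_scratch = project_first(m1_scratch) + project_first(m2_scratch)

assert e_new == e_scratch
```

Without `m1` and `m2`, the old `e` alone cannot tell which copies of key 1 came from which rule, which is why the plan without provenance falls back to re-evaluating the joins.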
23
Experimental Evaluation
• Synthetic workload based on SWISS-PROT biological
dataset
• Generate source tables, mappings, changes to mappings
and data
– Changes to mapping definitions guided by empirical
observations of schema changes in practice
• Start with 16 source relations, 24 views
• Apply sequences of 24 “primitive modifications” to view
definitions
– add/drop column in rule head, add/drop data source for view,
add correspondence table, reorder columns, ...
24
Highlights of Experiments
[Figures: four bar charts; x-axis: View # (0–23, plus AVG); y-axis: Normalized running time]
Mapping changes only:
Fig. 5. Mixed workload (mapping changes only); no provenance tables
Fig. 6. Mixed workload (mapping changes only); with provenance tables
Mapping changes + data changes:
Fig. 8. Mixed workload (mapping and data changes); no provenance tables
Fig. 9. Mixed workload (mapping and data changes); with provenance tables
Net speedups annotated on the four charts: ~45%, ~20%, ~80%, ~40%
25
Summary
• We’ve shown that optimized change propagation is
feasible, and can yield large speedups
– Can handle updates to mappings, data, or both via a
generic reformulation engine based on optimizing queries
using materialized views
• For systems like ORCHESTRA that store provenance
information, even more opportunities for
optimization
– Easy to retrofit any Datalog-based system (e.g., LogicBlox)
to store same kind of provenance information
– (Benefits must be balanced with storage costs…)
26
Related Work
• Incremental view maintenance [Blakeley+ 86], [Gupta+ 93], ...
– “deltas” [Gupta+ 93]: an early form of our Z-relations
• Answering queries using views [Levy+ 95], [Chaudhuri+ 95],
[Afrati&Pavlaki 06], Chase&Backchase [Deutsch,Popa,Tannen 99], ...
• Bag-containment/bag-equivalence of CQs/UCQs
[Lovász 67], [Chaudhuri&Vardi 93], [Ioannidis&Ramakrishnan 95], [Cohen+
99], [Jayram+ 06]
• View adaptation [Mohania&Dong 96], [Gupta+ 01]
27
Related Work (cont)
• Mapping evolution [Velegrakis+ 03]
• Recursively-compiled view maintenance plans
[Ahmad&Koch 09, Koch 10]
• Data exchange [Fagin+05], P2P data exchange
[Fuxman+05]
• Youtopia [Koch09]
• Mapping adaptation [Yu&Popa05]
28