Containment of Conjunctive Queries
on Annotated Relations
Todd J. Green
University of Pennsylvania
March 25, 2009
@ ICDT 09, Saint Petersburg
The Need for Data Provenance
• Many new database applications must track where data
came from (as it is combined and transformed by
queries, schema mappings, etc.): data provenance
– Debugging schema mappings
– Assessing data quality, trustworthiness
– Computing probabilities
– Enforcing access control policies
– Preserving the scientific record
• Must do this while also satisfying DBMS performance
requirements and retaining compatibility with legacy
systems
2
Challenge: Provenance May Affect
Query Optimization
• Query optimization strategies depend fundamentally on
issues of query containment and equivalence
– Query minimization, rewriting queries using materialized views, etc.
• Well-known difference between set, bag semantics: consider
Q(x,y) :– R(x,y)
Q’(u,v) :– R(u,v), R(u,w)
Under set semantics, Q and Q’ are equivalent; under bag
semantics, they are not! (“redundant” join in Q’ affects
output tuple multiplicities)
• Issues pointed out in [Buneman+ 01], reiterated in [Buneman+ 08]
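The difference between the two semantics can be checked on a small instance. The sketch below is illustrative only: the bag instance R and the helper names q and q_prime are assumptions, not part of the talk.

```python
from collections import Counter

# Hypothetical bag instance of R: (x, y) -> multiplicity.
R = Counter({("a", "b"): 2, ("a", "c"): 1})

def q(r):
    """Q(x,y) :- R(x,y) under bag semantics."""
    return Counter(r)

def q_prime(r):
    """Q'(u,v) :- R(u,v), R(u,w) under bag semantics."""
    out = Counter()
    for (u, v), m1 in r.items():
        for (u2, _w), m2 in r.items():
            if u == u2:
                # Every matching pair of R-tuples contributes m1 * m2 copies.
                out[(u, v)] += m1 * m2
    return out

bag_q, bag_q_prime = q(R), q_prime(R)

# Set semantics only asks which tuples appear at all.
set_q, set_q_prime = set(bag_q), set(bag_q_prime)

print(set_q == set_q_prime)  # same tuples under set semantics
print(bag_q == bag_q_prime)  # multiplicities differ under bag semantics
```

On this instance the two queries return the same tuples, but the extra join in Q’ multiplies each multiplicity by the number of R-tuples that share the first column.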
3
Contributions
• We study containment and equivalence of conjunctive queries
(CQs) and unions of conjunctive queries (UCQs), for
provenance models captured by semiring-annotated
relations:
– Provenance polynomials (ORCHESTRA system) [Green+ 07]
– Why-provenance [Buneman+ 01]
– Data warehousing lineage [Cui+ 00]
– Trio system lineage [Das Sarma+ 08]
• We give positive decidability results and complexity
characterizations in (nearly) all cases
• We show interesting connections with same problems under
set semantics and bag semantics
4
Outline
• Semiring-annotated relations (K-relations)
• Bounds based on semiring homomorphisms
• Results for provenance polynomials
• Overview of other results
5
A Unifying Framework for Data Provenance:
Semiring Annotated Relations [Green+ PODS 07]
• Basic idea: annotate source tuples with tuple ids, combine
and propagate during query processing
– Abstract “+” records alternative use of data (union, projection)
– Abstract “·” records joint use of data (join)
– Yields space of annotations K
• K-relation: a relation whose tuples are annotated with
elements from K
– Notation: R(t) means annotation of t in K-relation R
6
Combining Annotations in Queries
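Source tuples are annotated with tuple ids from K:

A:  ID | Species | Img     | Character  | State | annotation
    34 | L.catta | (image) | hand color | white | p
    47 | L.catta | (image) | hand color | white | q

B:  ID | Character  | State | annotation
    61 | hand color | black | r

C:  ID | Species     | Img     | annotation
    61 | Lemur catta | (image) | s

D:  Species     | Comm. Name        | annotation
    Lemur catta | Ring-tailed Lemur | u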
7
Combining Annotations in Queries
Union of conjunctive queries (UCQ)
Source relations A, B, C, D as on the previous slide.

E(name, color) :–
    B(id, “hand color”, color),
    C(id, species, _), D(species, name)

Result of the join:

E:  Comm. Name        | Hand Color | annotation
    Ring-tailed Lemur | black      | r·s·u

Operation x·y means joint use of data
annotated by x and data annotated by y
8
Combining Annotations in Queries
Union of conjunctive queries (UCQ)
Source relations A, B, C, D as on the previous slides.

E(name, color) :–
    A(id, species, _, “hand color”, color),
    D(species, name)
E(name, color) :–
    B(id, “hand color”, color),
    C(id, species, _), D(species, name)

E:  Comm. Name        | Hand Color | annotation
    Ring-tailed Lemur | black      | r·s·u
    Ring-tailed Lemur | white      | p·u
    Ring-tailed Lemur | white      | q·u

Operation x·y means joint use of data
annotated by x and data annotated by y
9
Combining Annotations in Queries
Union of conjunctive queries (UCQ)
Source relations A, B, C, D as on the previous slides.

E(name, color) :–
    A(id, species, _, “hand color”, color),
    D(species, name)
E(name, color) :–
    B(id, “hand color”, color),
    C(id, species, _), D(species, name)

E:  Comm. Name        | Hand Color | annotation
    Ring-tailed Lemur | black      | r·s·u
    Ring-tailed Lemur | white      | p·u + q·u

Operation x+y means alternate use of data
annotated by x and data annotated by y
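The annotation bookkeeping of this example can be sketched in a few lines. This is a minimal illustration, not the ORCHESTRA implementation: polynomials in N[X] are represented as Counters from monomials (sorted tuples of tuple ids) to coefficients, the helper names var, times, and emit are hypothetical, and the species value is normalized to “Lemur catta” in every relation so that the join on species matches.

```python
from collections import Counter

def var(name):
    """The polynomial consisting of a single tuple id."""
    return Counter({(name,): 1})

def times(a, b):
    """Product in N[X]: joint use of data."""
    out = Counter()
    for m1, c1 in a.items():
        for m2, c2 in b.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

SPECIES = "Lemur catta"  # assumption: one spelling used in all relations
A = {(34, SPECIES, "img", "hand color", "white"): var("p"),
     (47, SPECIES, "img", "hand color", "white"): var("q")}
B = {(61, "hand color", "black"): var("r")}
C = {(61, SPECIES, "img"): var("s")}
D = {(SPECIES, "Ring-tailed Lemur"): var("u")}

E = {}

def emit(row, annotation):
    """Sum in N[X]: alternative derivations of the same output tuple."""
    E[row] = E.get(row, Counter()) + annotation

# E(name, color) :- A(id, species, _, "hand color", color), D(species, name)
for (_id, species, _img, character, color), a in A.items():
    for (d_species, name), d in D.items():
        if character == "hand color" and species == d_species:
            emit((name, color), times(a, d))

# E(name, color) :- B(id, "hand color", color), C(id, species, _), D(species, name)
for (b_id, character, color), b in B.items():
    for (c_id, species, _img), c in C.items():
        for (d_species, name), d in D.items():
            if character == "hand color" and b_id == c_id and species == d_species:
                emit((name, color), times(times(b, c), d))

for row, annotation in E.items():
    print(row, dict(annotation))
```

The white tuple ends up with two monomials, one per derivation, which is the p·u + q·u annotation of the slide. The black tuple has the single monomial r·s·u.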
10
What Properties Do K-Relations Need?
• DBMS query optimizers choose from among many plans,
assuming certain identities:
– union is associative, commutative
– join is associative, commutative, distributive over union
– projections and selections commute with each other and with union and
join (when applicable)
• Equivalent queries should produce same provenance!
• Proposition [Green+ 07]. Above identities hold for positive
relational algebra queries on K-relations iff (K, +, ·, 0, 1) is a
commutative semiring
11
What is a Commutative Semiring?
• An algebraic structure (K, +, ·, 0, 1) where:
– K is the domain
– + is associative, commutative with 0 identity
– · is associative, commutative with 1 identity
– · is distributive over +
– ∀ a ∈ K, a · 0 = 0 · a = 0
(unlike ring, no requirement for additive inverses)
• Big benefit of semiring-based framework: one framework
unifies many database semantics
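The axioms above can be spot-checked mechanically. The sketch below is an illustration under stated assumptions: the Semiring record and the satisfies_axioms helper are hypothetical names, and testing on a finite sample can refute the axioms for an infinite domain but cannot prove them.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

def satisfies_axioms(k, sample):
    """Check the commutative-semiring axioms on all triples from sample."""
    for a, b, c in product(sample, repeat=3):
        ok = (
            k.plus(a, k.plus(b, c)) == k.plus(k.plus(a, b), c)
            and k.plus(a, b) == k.plus(b, a)
            and k.plus(a, k.zero) == a
            and k.times(a, k.times(b, c)) == k.times(k.times(a, b), c)
            and k.times(a, b) == k.times(b, a)
            and k.times(a, k.one) == a
            and k.times(a, k.plus(b, c)) == k.plus(k.times(a, b), k.times(a, c))
            and k.times(a, k.zero) == k.zero
        )
        if not ok:
            return False
    return True

# Set semantics: Booleans with "or" as + and "and" as times.
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
# Bag semantics: natural numbers with ordinary arithmetic.
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
# A non-example: (N, max, +, 0, 0) violates a * 0 = 0, since 1 + 0 = 1.
BROKEN = Semiring(max, lambda a, b: a + b, 0, 0)

print(satisfies_axioms(BOOL, [False, True]))
print(satisfies_axioms(NAT, range(5)))
print(satisfies_axioms(BROKEN, range(5)))
```

For the two-element Boolean domain the check is exhaustive. For the natural numbers it only samples 0 to 4.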
12
Semirings Unify Commonly-Used
Database Semantics
Standard database models:
  (B, ∨, ∧, ⊥, ⊤)            Set semantics
  (ℕ, +, ∙, 0, 1)             Bag semantics
Incomplete/probabilistic data:
  (PosBool(X), ∨, ∧, ⊥, ⊤)   Conditional tables [Imielinski&Lipski 84]
  (P(Ω), ∪, ∩, ∅, Ω)          Probabilistic event tables [Fuhr&Rölleke 97]
Also ranked query models, dissemination policies, ...
13
Semirings Unify Provenance Models
X a set of indeterminates, can be thought of as tuple ids
  (N[X], +, ·, 0, 1)        Provenance polynomials [Green+ 07] (“most informative”)
  (Lin(X), ∪, ∪*, ∅, ∅*)    Data warehousing lineage [Cui+ 00]: sets of contributing tuples
  (Why(X), ∪, ⋓, ∅, {∅})    Why-provenance [Buneman+ 01]: sets of sets of contributing tuples
  (Trio(X), +, ·, 0, 1)     Trio-style lineage [Das Sarma+ 08]: bags of sets of contributing tuples
  (B[X], +, ·, 0, 1)        Boolean prov. polynomials
14
A Hierarchy of Provenance
Example: 2p²r + pr + 5r² + s
[Figure: hierarchy diagram of the provenance semirings, most informative at the top and least informative at the bottom, showing the example at each level.]
  N[X]        (most informative)               2p²r + pr + 5r² + s
  B[X]        drop coefficients                p²r + pr + r² + s
  Trio(X)     drop exponents                   3pr + 5r + s
  Why(X)      drop both exp. and coeff.        pr + r + s
  PosBool(X)  apply absorption (pr + r ≡ r)    r + s
  Lin(X)      collapse terms                   prs
  B           non-zero? (least informative)    true
A path downward from K1 to K2 indicates that there exists a
surjective semiring homomorphism h : K1 → K2
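The mappings on this slide can be applied to the example polynomial directly. The sketch below is illustrative: the data representations (dicts, Counters, frozensets) and the function names are assumptions chosen for brevity, not the encodings used in the paper.

```python
from collections import Counter

# 2p^2r + pr + 5r^2 + s as {monomial: coefficient}; a monomial is a
# tuple of (variable, exponent) pairs.
POLY = {
    (("p", 2), ("r", 1)): 2,
    (("p", 1), ("r", 1)): 1,
    (("r", 2),): 5,
    (("s", 1),): 1,
}

def variables(monomial):
    return frozenset(v for v, _ in monomial)

def to_bx(poly):
    """B[X]: drop coefficients, keep the set of monomials."""
    return frozenset(poly)

def to_trio(poly):
    """Trio(X): drop exponents, add up coefficients of merged monomials."""
    out = Counter()
    for mono, coeff in poly.items():
        out[variables(mono)] += coeff
    return out

def to_why(poly):
    """Why(X): drop both exponents and coefficients."""
    return frozenset(variables(m) for m in poly)

def to_posbool(poly):
    """PosBool(X): apply absorption, keeping only minimal variable sets."""
    why = to_why(poly)
    return frozenset(w for w in why if not any(o < w for o in why))

def to_lin(poly):
    """Lin(X): collapse all terms into one set (None stands for bottom)."""
    if not poly:
        return None
    return frozenset().union(*to_why(poly))

def to_b(poly):
    """B: is the polynomial non-zero?"""
    return bool(poly)

print(dict(to_trio(POLY)))
print(to_posbool(POLY))
print(to_lin(POLY))
```

Each function reproduces one row of the example: to_trio gives 3pr + 5r + s, to_posbool gives r + s, and to_lin gives the single set {p, r, s}.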
15
What Does Query Containment
Mean for K-Relations?
• Notion of containment based on natural order for K:
a ≤K b iff exists c s.t. a + c = b
– When this is a partial order, call K naturally ordered; all semirings
considered here are naturally ordered
• Lift to K-relations:
R ≤K R’ iff for all tuples t, R(t) ≤K R’(t)
– For K = B (set semantics), this is set-containment
– For K = ℕ (bag semantics), this is bag-containment
– For K = PosBool(X), this is logical implication
• Queries on K-relations: say that Q is K-contained in Q’ iff for
all K-relations R, Q(R) ≤K Q’(R)
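The natural order and its lifting to K-relations can be sketched for K = B and K = ℕ. The names below (natural_leq, relation_leq, support) and the two sample query outputs are assumptions made for illustration. Comparing outputs on one instance shows what the order means, but it does not by itself establish K-containment, which quantifies over all K-relations.

```python
def natural_leq(a, b, plus, candidates):
    """a <= b iff some c among the candidates satisfies a + c = b."""
    return any(plus(a, c) == b for c in candidates)

def leq_bool(a, b):
    # Closed form of the natural order on B: false <= true.
    return (not a) or b

def leq_nat(a, b):
    # Closed form of the natural order on N: the usual order.
    return a <= b

def relation_leq(r1, r2, leq, zero):
    """Lift the order tuple-wise; absent tuples are annotated with zero."""
    tuples = set(r1) | set(r2)
    return all(leq(r1.get(t, zero), r2.get(t, zero)) for t in tuples)

def support(r):
    """Forget multiplicities: map an N-relation to a B-relation."""
    return {t: n > 0 for t, n in r.items()}

# Hypothetical outputs of two queries on the same bag instance.
Q_OUT = {("a", "b"): 2, ("a", "c"): 1}
QP_OUT = {("a", "b"): 6, ("a", "c"): 3}

print(relation_leq(Q_OUT, QP_OUT, leq_nat, 0))    # bag order, one direction
print(relation_leq(QP_OUT, Q_OUT, leq_nat, 0))    # bag order, other direction
print(relation_leq(support(QP_OUT), support(Q_OUT), leq_bool, False))
```

On these outputs the bag order holds in one direction only, while the set order holds in both.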
16
Provenance Hierarchy and Query Containment
[Figure: the provenance hierarchy as a diagram of notions of containment. Nodes: N[X], Trio(X), B[X], any K (positive K), Why(X), N, Lin(X), PosBool(X), B. Top: most informative, strongest notion of containment. Bottom: least informative, weakest notion of containment.]
A path downward from K1 to K2 also indicates that for UCQs Q1, Q2,
if Q1 is K1-contained in Q2, then Q1 is K2-contained in Q2
17
Prov. Hierarchy and Query Containment (2)
• Provenance hierarchy tells us something about relative
behavior of K-containment for various K
• Doesn’t tell us which implications are strict; we’d also like to
know whether containment/equivalence is even decidable!
• One case already known:
Theorem [Grahne+ 97]. If K is a distributive lattice, then for
UCQs Q,Q’, Q is K-contained in Q’ iff Q is set-contained in Q’
– Distributive lattices are between PosBool(X) (for c-tables) and B in
previous slide
– Other examples: dissemination policies, prob. event tables, ...
18
Summary: Logical Implications
of Containment/Equivalence
[Figure: four implication diagrams, one each for CQs cont., CQs equiv., UCQs cont., and UCQs equiv., over the semirings N[X], B[X], Trio(X), Why(X), Lin(X), PosBool(X), N, and B.]
“K1 → K2” indicates that for CQs (UCQs), K1 cont. (equiv.) implies K2 cont. (equiv.)
All implications not marked “↔” are strict. Red arrows are from [Grahne+ 97].
19
Summary: Logical Implications
of Containment/Equivalence
[Figure: the same four implication diagrams, with example CQs overlaid.]
CQs separating the various notions of K-containment (among other examples):

    Q(u) :– R(u,v), R(u,w)
    Q’(x) :– R(x,y)
    Q is Lin(X)-contained in Q’, but Q is not bag-contained in Q’

    Q(x,y) :– R(x,y)
    Q’(u,v) :– R(u,v), R(u,w)
    Q’ is set-contained in Q, but Q’ is not Lin(X)-contained in Q
20
Summary: Logical Implications
of Containment/Equivalence
[Figure: the same four implication diagrams, with N labeled “bag semantics”.]
Bag-equivalence of UCQs implies K-equivalence for provenance models
(in fact, bag-equivalence implies K-equivalence for any K)
21
Tools for Main Results: Containment
Mappings, Canonical Databases
Theorem [Chandra&Merlin 77]. For CQs Q, Q’, the following are equivalent:
1. Q is (set-)contained in Q’
2. Q(can(Q)) ⊆ Q’(can(Q)) where can(Q) is the canonical database for Q
3. There is a containment mapping h : vars(Q’) → vars(Q)
Most of our results follow this template, with two key differences:
– We use provenance-annotated canonical databases:
  e.g., for Q(x,y) :– R(x,z), R(z,y), can_N[X](Q) is
      R =  x z | p
           z y | q
– We use variations of containment mappings:
  e.g., exact containment mapping: a containment mapping h : vars(Q’) → vars(Q)
  that induces a bijection between atoms of Q’ and atoms of Q
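A brute-force search makes the containment-mapping criterion concrete. The sketch below assumes queries without constants, encoded as (head, atoms) pairs; the function names are hypothetical. The search runs from the variables of the containing query to those of the contained one, the direction in which the classical homomorphism criterion is usually stated, and it takes exponential time in the number of variables.

```python
from itertools import product

def containment_mappings(src, dst):
    """Yield maps h from vars(src) to vars(dst) that send the head of src
    to the head of dst and every atom of src to some atom of dst."""
    src_head, src_atoms = src
    dst_head, dst_atoms = dst
    if len(src_head) != len(dst_head):
        return
    src_vars = sorted({v for _, args in src_atoms for v in args} | set(src_head))
    dst_vars = sorted({v for _, args in dst_atoms for v in args} | set(dst_head))
    dst_atom_set = set(dst_atoms)
    for image in product(dst_vars, repeat=len(src_vars)):
        h = dict(zip(src_vars, image))
        if tuple(h[v] for v in src_head) != tuple(dst_head):
            continue
        if all((rel, tuple(h[v] for v in args)) in dst_atom_set
               for rel, args in src_atoms):
            yield h

def is_exact(h, src, dst):
    """True if h maps the atoms of src one-to-one onto the atoms of dst."""
    images = sorted((rel, tuple(h[v] for v in args)) for rel, args in src[1])
    return images == sorted(dst[1])

# Q(x,y) :- R(x,y)   and   Q'(u,v) :- R(u,v), R(u,w)
Q = (("x", "y"), (("R", ("x", "y")),))
QP = (("u", "v"), (("R", ("u", "v")), ("R", ("u", "w"))))

to_q = list(containment_mappings(QP, Q))   # witnesses that Q is set-contained in Q'
to_qp = list(containment_mappings(Q, QP))  # witnesses that Q' is set-contained in Q
print(to_q, to_qp)
print([is_exact(h, QP, Q) for h in to_q], [is_exact(h, Q, QP) for h in to_qp])
```

For this pair, containment mappings exist in both directions, so the queries are set-equivalent, but none of them is exact because the two bodies have different numbers of atoms.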
22
N[X]-Containment/Equivalence of CQs
Natural order for N[X]: monomial-wise comparison of coefficients
e.g., p² ≤N[X] 2p² + pq but p² ≰N[X] p³
Theorem. For CQs Q, Q’, the following are equivalent:
1. Q is N[X]-contained in Q’
2. Q(can_N[X](Q)) ≤N[X] Q’(can_N[X](Q))
3. There is an exact containment mapping h : vars(Q’) → vars(Q)
and checking containment is NP-complete
Corollary. Q and Q’ are N[X]-equivalent iff they are isomorphic (and
checking equivalence is graph isomorphism-complete)
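Condition 2 of the theorem suggests a direct, if naive, test: evaluate both queries on the annotated canonical database of Q and compare the results monomial by monomial. The sketch below relies on the theorem for correctness. The encoding of queries as (head, atoms) pairs without constants, the fresh tuple ids t0, t1, ..., and all function names are assumptions for illustration.

```python
from collections import Counter
from itertools import product

def poly_leq(a, b):
    """Natural order on N[X]: compare coefficients monomial by monomial."""
    return all(c <= b.get(m, 0) for m, c in a.items())

def canonical_db(q):
    """Freeze each atom of q into a tuple annotated with a fresh tuple id."""
    _head, atoms = q
    db = {}
    for i, (rel, args) in enumerate(atoms):
        db.setdefault((rel, args), Counter())[(f"t{i}",)] += 1
    return db

def evaluate(q, db):
    """Evaluate a CQ on an N[X]-relation by enumerating all valuations."""
    head, atoms = q
    variables = sorted({v for _, args in atoms for v in args})
    domain = sorted({c for _, args in db for c in args})
    out = {}
    for values in product(domain, repeat=len(variables)):
        env = dict(zip(variables, values))
        ann = Counter({(): 1})  # the polynomial 1
        for rel, args in atoms:
            fact = db.get((rel, tuple(env[v] for v in args)), Counter())
            nxt = Counter()
            for (m1, c1), (m2, c2) in product(ann.items(), fact.items()):
                nxt[tuple(sorted(m1 + m2))] += c1 * c2
            ann = nxt
        if ann:
            row = tuple(env[v] for v in head)
            out[row] = out.get(row, Counter()) + ann
    return out

def nx_contained(q, q_prime):
    """Condition 2: Q(can(Q)) <= Q'(can(Q)) in the natural order of N[X]."""
    db = canonical_db(q)
    left, right = evaluate(q, db), evaluate(q_prime, db)
    return all(poly_leq(p, right.get(t, Counter())) for t, p in left.items())

Q = (("x", "y"), (("R", ("x", "y")),))
QP = (("u", "v"), (("R", ("u", "v")), ("R", ("u", "w"))))
Q1 = (("x",), (("R", ("x", "x")),))   # Q1(x) :- R(x,x)
Q2 = (("x",), (("R", ("x", "y")),))   # Q2(x) :- R(x,y)

print(nx_contained(Q, QP), nx_contained(QP, Q))
print(nx_contained(Q1, Q2), nx_contained(Q2, Q1))
```

The set-equivalent pair Q, Q’ is incomparable under N[X]-containment, while Q1 is N[X]-contained in Q2 but not conversely.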
23
N[X]-Containment/Equivalence of UCQs
Theorem. For UCQs Q,Q’, if Q is not N[X]-contained in Q’, then
there is a small counterexample, i.e., an N[X]-relation R s.t.
– Size of R (tuples and their annotations) polynomial in |Q| + |Q’|
– Q(R) ≰N[X] Q’(R)
Corollary. N[X]-containment of UCQs is in PSPACE
– Exact complexity: don’t know!
Theorem. For UCQs Q,Q’, Q is N[X]-equivalent to Q’ iff Q and
Q’ are isomorphic (and checking is again graph isomorphism-complete)
24
Highlights of Other Results
• Why(X) and Trio(X): CQ containment based on onto
containment mappings
• Lin(X): CQ containment based on covering containment
mappings
• These kinds of containment mappings have been used before,
for checking bag-containment of CQs [Chaudhuri&Vardi 93]!
– Decidability of this problem: open
– But, onto containment mappings sufficient for bag-containment
– And, covering containment mappings necessary for bag-containment
• Hence for CQs, Why(X)/Trio(X)-containment and Lin(X)-containment “sandwich” bag-containment
25
N[X]-Equivalence and Bag-Equivalence
• Theorem. For UCQs, N[X]-equivalence is the same as bag-equivalence
– Proof idea. For polynomials A, B in N[X], we have A = B iff
for all valuations ν : X → N, Evalν(A) = Evalν(B)
• We have used this idea in another ICDT 09 paper; and results
there for Z-relations also hold for Z[X]-relations
– A fact used in ORCHESTRA system @ Penn for optimizing change
propagation with provenance
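The proof idea can be illustrated by evaluating polynomials under valuations. This is a sketch only: polynomials are encoded as dicts from monomials (tuples of variables, repeated according to their exponent) to coefficients, and eval_poly and agree_on_grid are hypothetical helper names. Agreement on a finite grid of valuations is not a proof of equality, as the example shows.

```python
from itertools import product
from math import prod

def eval_poly(poly, valuation):
    """Evaluate {monomial: coefficient} in N under a valuation X -> N."""
    return sum(c * prod(valuation[x] for x in mono) for mono, c in poly.items())

def agree_on_grid(a, b, variables, bound):
    """Compare a and b on every valuation with values in 0..bound."""
    return all(
        eval_poly(a, dict(zip(variables, vals)))
        == eval_poly(b, dict(zip(variables, vals)))
        for vals in product(range(bound + 1), repeat=len(variables))
    )

POLY_A = {("p", "p"): 1}  # p^2
POLY_B = {("p",): 1}      # p

print(agree_on_grid(POLY_A, POLY_B, ["p"], 1))  # values 0 and 1 only
print(agree_on_grid(POLY_A, POLY_B, ["p"], 2))  # the valuation p = 2 is included
```

The distinct polynomials p² and p agree on the valuations p = 0 and p = 1 and are separated by p = 2, which is why the statement quantifies over all valuations.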
26
Summary: Complexity of Checking
Containment/Equivalence of CQs/UCQs
             B    PosBool(X)  Lin(X)  Why(X)  Trio(X)  B[X]  N[X]       N
CQs   cont   NP   NP          NP      NP      NP       NP    NP         ? (Π₂ᵖ-hard)
CQs   equiv  NP   NP          NP      GI      GI       GI    GI         GI
UCQs  cont   NP   NP          NP      NP      ?        NP    in PSPACE  undec
UCQs  equiv  NP   NP          NP      NP      GI       NP    GI         GI
Bold type indicates results of this paper
“NP” indicates NP-complete, “GI” indicates graph isomorphism-complete
NP-complete/GI-complete considered “tractable” here
- Complexity in size of query; queries small in practice
27
Related Work on Query Containment
• Set semantics [Chandra&Merlin 77], [Sagiv&Yannakakis 80], ...
• Bag, bag-set semantics [Lovász 67], [Chaudhuri&Vardi 93],
[Ioannidis&Ramakrishnan 95], [Cohen+ 99], [Jayram+ 06], ...
– Label systems of [Ioannidis&Ramakrishnan 95]: similar in spirit to K-relations
• Bilattice-annotated relations [Grahne+ 97], parametric
databases [Lakshmanan&Shiri 01]
– Also similar in spirit to K-relations
• Minimal-witness why-prov. [Buneman+ 01], where-prov. [Tan 03]
• Z-relations/Z[X]-relations [Green+ 09]
28
Conclusion
• When optimizers rewrite queries, the provenance of query
answers may change! This paper helps us understand how.
• We have given positive decidability results and complexity
characterizations for CQ/UCQ containment/equivalence on
various kinds of provenance-annotated databases
• For optimizations common in commercial DBMSs (i.e., those
compatible with bag semantics), we have shown that they
imply no change in provenance
29
Open Problems for Future Work
• Decidability of Trio(X)-containment of UCQs?
• Exact complexity of N[X]-containment of UCQs? (GI-hard, in
PSPACE)
• Complexity when UCQs are represented as positive relational
algebra queries (exponentially more concise than UCQs)?
30