Containment of Conjunctive Queries on Annotated Relations TJ Green University of Pennsylvania

Symposium on Database Provenance
University of Edinburgh
May 21, 2008
Provenance and Query Optimization
• Many kinds of semiring-based provenance
annotations to choose from:
– lineage
– why-provenance
– minimal witness why-provenance
– provenance polynomials
– ...
• These seem to keep track of more/less
information
• A fundamental question: how does this affect
query optimization?
Conjunctive Queries on K-Relations
• Datalog-style syntax for conjunctive queries (CQs):
Q(x,y) :- R(x,z), R(z,y)
• Semantics of applying the CQ to a K-relation R : D × D → K:
Q(a,b) = Σ_{z∈D} R(a,z)·R(z,b)
• # of repetitions of an atom in the body matters
• For unions of conjunctive queries (UCQs) (equivalent to
positive RA), sum over CQs:
P(x,y) :- R(x,z), R(z,y)
P(x,y) :- R(x,w), R(y,w)
• Semantics of UCQ applied to R, a sum over CQs:
P(a,b) = Σ_{z∈D} R(a,z)·R(z,b) + Σ_{w∈D} R(a,w)·R(b,w)
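The sum-of-products semantics of Q(x,y) :- R(x,z), R(z,y) can be sketched in Python. This is only an illustration: the function name eval_path_query, the dictionary encoding of a K-relation (absent tuples are annotated with zero), and the sample instance are assumptions, not part of the talk.

```python
from itertools import product

def eval_path_query(R, domain, plus, times, zero):
    """Q(a,b) = sum over z in domain of R(a,z) * R(z,b), in the given semiring."""
    result = {}
    for a, b in product(domain, repeat=2):
        total = zero
        for z in domain:
            # Tuples missing from R carry the annotation zero.
            total = plus(total, times(R.get((a, z), zero), R.get((z, b), zero)))
        if total != zero:
            result[(a, b)] = total
    return result

domain = ["a", "b", "c"]

# K = N (bag semantics): annotations are multiplicities.
R_bag = {("a", "b"): 2, ("b", "c"): 3, ("a", "a"): 1}
bag = eval_path_query(R_bag, domain, lambda x, y: x + y, lambda x, y: x * y, 0)

# K = B (set semantics): annotations are booleans.
R_set = {t: True for t in R_bag}
sets = eval_path_query(R_set, domain, lambda x, y: x or y, lambda x, y: x and y, False)

print(bag)   # multiplicities of the output tuples
print(sets)  # the same tuples, annotated True
```

The same function serves both semirings; only the operations passed in change, which is the point of the K-relation framework.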
Choice of K Affects Query Optimization
K = N (bag semantics) differs from K = B (set semantics)
e.g., the conjunctive queries
Q1(x) :- R(x,y), R(x,z)
Q2(u) :- R(u,v)
are set-equivalent, but not bag-equivalent
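A small instance makes the difference concrete. The sketch below evaluates Q1 and Q2 on a relation with two tuples sharing the same first component; the function names q1 and q2 and the sample instance are hypothetical choices for illustration.

```python
def q1(R, plus, times, zero):
    """Q1(x) :- R(x,y), R(x,z): one product per pair of tuples sharing x."""
    out = {}
    for (x, _y), r1 in R.items():
        for (x2, _z), r2 in R.items():
            if x == x2:
                out[x] = plus(out.get(x, zero), times(r1, r2))
    return out

def q2(R, plus, times, zero):
    """Q2(u) :- R(u,v): sum of the annotations of tuples starting with u."""
    out = {}
    for (u, _v), r in R.items():
        out[u] = plus(out.get(u, zero), r)
    return out

bag_ops = (lambda x, y: x + y, lambda x, y: x * y, 0)
set_ops = (lambda x, y: x or y, lambda x, y: x and y, False)

R_bag = {("a", "b"): 1, ("a", "c"): 1}
R_set = {t: True for t in R_bag}

print(q1(R_bag, *bag_ops), q2(R_bag, *bag_ops))  # the multiplicities differ
print(q1(R_set, *set_ops), q2(R_set, *set_ops))  # the boolean answers agree
```

Under bag semantics Q1 squares the number of matching tuples while Q2 only counts them, so the two queries disagree as soon as some value has two outgoing tuples.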
                                | Conjunctive Queries (CQs)         | Unions of Conjunctive Queries (UCQs)
Bag Semantics Containment (⊑_N) | ? (Π₂ᵖ-hard) [Chaudhuri&Vardi 93] | undecidable [Ioannidis&Ramakrishnan 95]
Bag Semantics Equivalence (≡_N) | isomorphism (≅) [CV 93]           | ?
Our Contributions
• We make a systematic study of query
containment and query equivalence for various
provenance models
• We show that K-containment and K-equivalence
of CQs and UCQs are decidable for lineage, why-provenance,
and the provenance polynomials N[X], as well as a new model, B[X]
• The decision procedures are based on interesting
variations of containment mappings
• We analyze the complexity in each case
Our Contributions
• As a corollary of the decidability result for
N[X]-equivalence of UCQs, we also fill in a gap
in the chart for bag semantics:
                                | Conjunctive Queries (CQs)         | Unions of Conjunctive Queries (UCQs)
Bag Semantics Containment (⊑_N) | ? (Π₂ᵖ-hard) [Chaudhuri&Vardi 93] | undecidable [Ioannidis&Ramakrishnan 95]
Bag Semantics Equivalence (≡_N) | isomorphism (≅) [CV 93]           | isomorphism (≅)
K-Containment for Queries
• For semiring K, define a ≤_K b ⟺ ∃c. a + c = b.
If ≤_K is a partial order, it is called the natural
order, and K is said to be naturally-ordered
• B, N, lineage, why-provenance, B[X], and N[X]
are all naturally-ordered
• We define K-containment using the natural
order:
Q1 ⊑_K Q2 ⟺ ∀I ∀t Q1(I)(t) ≤_K Q2(I)(t)
Q1 ≡_K Q2 ⟺ ∀I ∀t Q1(I)(t) = Q2(I)(t)
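For a finite semiring the natural order can be checked by brute force, searching for the witness c in the definition. The sketch below does this for B and for why-provenance over a two-variable set X = {p, r}; the helper names and the choice of X are assumptions made for illustration.

```python
from itertools import chain, combinations

def powerset(items):
    """All subsets of items, each as a frozenset."""
    items = list(items)
    subsets = chain.from_iterable(combinations(items, n) for n in range(len(items) + 1))
    return [frozenset(s) for s in subsets]

def natural_leq(a, b, carrier, plus):
    """a <=_K b iff some c in the (finite) carrier satisfies a + c = b."""
    return any(plus(a, c) == b for c in carrier)

# K = B: + is logical or.
bool_carrier = [False, True]
bool_plus = lambda a, b: a or b

# Why-provenance over X = {p, r}: elements are sets of witness sets, + is union.
witness_sets = powerset(["p", "r"])
why_carrier = powerset(witness_sets)
why_plus = lambda a, b: a | b

small = frozenset({frozenset({"p"})})
large = frozenset({frozenset({"p"}), frozenset({"p", "r"})})

print(natural_leq(False, True, bool_carrier, bool_plus))
print(natural_leq(small, large, why_carrier, why_plus))
print(natural_leq(large, small, why_carrier, why_plus))
```

On this carrier the natural order of why-provenance coincides with set inclusion, since a ∪ c = b has a solution exactly when a ⊆ b.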
A Hierarchy of Semiring Provenance (1)
• Provenance polynomials (N[X], +, ·, 0, 1): tracks
calculations abstractly; most general
e.g., 2p²r + 3ps + ps³
• Drop coefficients to get (B[X], +, ·, 0, 1)
p²r + ps + ps³
• Drop exponents to get why-prov. (P(P(X)), ∪, ⋓, ∅, {∅})
{{p,r}, {p,s}}
• Flatten set-of-sets to get lineage (P(X), +, ·, ⊥, ∅)
{p,r,s}
• Drop, flatten, etc. correspond to surjective semiring
homomorphisms
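The drop and flatten steps can be traced on the slide's example polynomial. In this sketch the encoding of a polynomial as a dictionary from monomials to coefficients, and the function names, are assumptions; None stands in for the lineage bottom element ⊥.

```python
# The N[X] polynomial 2p^2r + 3ps + ps^3 as {monomial: coefficient};
# a monomial is a tuple of (variable, exponent) pairs.
poly = {
    (("p", 2), ("r", 1)): 2,
    (("p", 1), ("s", 1)): 3,
    (("p", 1), ("s", 3)): 1,
}

def drop_coefficients(p):
    """N[X] -> B[X]: keep the set of monomials with non-zero coefficient."""
    return frozenset(m for m, c in p.items() if c != 0)

def drop_exponents(bx):
    """B[X] -> why-provenance: each monomial becomes its set of variables."""
    return frozenset(frozenset(v for v, _ in m) for m in bx)

def flatten(why):
    """why-provenance -> lineage: union of all witness sets (None plays the role of bottom)."""
    if not why:
        return None
    return frozenset().union(*why)

bx = drop_coefficients(poly)   # p^2r + ps + ps^3
why = drop_exponents(bx)       # {{p,r}, {p,s}}: ps and ps^3 collapse
lin = flatten(why)             # {p,r,s}

print(len(bx), why, lin)
```

Each step forgets information, which is why two distinct monomials (ps and ps³) end up as the single witness set {p,s}.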
A Hierarchy of Semiring Provenance (2)
• Suppose h : K1 → K2 is a semiring homomorphism.
Then a ≤_K1 b implies h(a) ≤_K2 h(b). If h is also
surjective, then h(a) ≤_K2 h(b) implies a ≤_K1 b.
• Definition: K1 ≼ K2 means P ⊑_K2 Q implies P ⊑_K1 Q
• Proposition: for any positive K
B ≼ K ≼ N[X]
(All those we consider are positive.) Moreover:
• Proposition (Provenance Hierarchy):
B ≼ lineage ≼ Why-Prov. ≼ B[X] ≼ N[X]
Containment Mappings
• A containment mapping from CQ Q to CQ P is a
function h : Vars(Q) → Vars(P) such that
– head of Q is mapped to head of P
– every atom in body of Q is mapped to an atom in body
of P
• Theorem [CM77]: For CQs P, Q we have P ⊑_B Q iff
there is a containment mapping from Q to P
– e.g. Q1(x) :- R(x,y), R(x,z)
Q2(u) :- R(u,v)
– h which sends u ↦ x and v ↦ y is a containment
mapping
• Checking for existence of containment mapping is
NP-complete
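A brute-force search over all variable assignments illustrates the definition. It is exponential in the number of variables, in line with the NP-completeness noted above. The encoding of a CQ as a pair (head variables, list of body atoms) and the function name are assumptions; constants are not handled.

```python
from itertools import product

def containment_mappings(Q, P):
    """Yield every containment mapping h : Vars(Q) -> Vars(P) as a dict."""
    q_head, q_body = Q
    p_head, p_body = P
    q_vars = sorted(set(q_head) | {v for _, args in q_body for v in args})
    p_vars = sorted(set(p_head) | {v for _, args in p_body for v in args})
    p_atoms = set(p_body)
    for image in product(p_vars, repeat=len(q_vars)):
        h = dict(zip(q_vars, image))
        # The head of Q must be mapped to the head of P.
        head_ok = tuple(h[v] for v in q_head) == tuple(p_head)
        # Every body atom of Q must land on some body atom of P.
        body_ok = all((rel, tuple(h[v] for v in args)) in p_atoms for rel, args in q_body)
        if head_ok and body_ok:
            yield h

# Q1(x) :- R(x,y), R(x,z)   and   Q2(u) :- R(u,v)
Q1 = (("x",), [("R", ("x", "y")), ("R", ("x", "z"))])
Q2 = (("u",), [("R", ("u", "v"))])

q2_to_q1 = list(containment_mappings(Q2, Q1))  # witnesses for Q1 contained in Q2
q1_to_q2 = list(containment_mappings(Q1, Q2))  # witnesses for Q2 contained in Q1
print(q2_to_q1)
print(q1_to_q2)
```

Mappings exist in both directions, which matches the earlier remark that Q1 and Q2 are set-equivalent.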
Canonical Databases
• Take body of CQ, “freeze” into database instance
[CM77], and tag each tuple with a “tuple id”
• We’ll denote by can_K(Q) the canonical database for Q
with abstract tags from K
• e.g., Q(w) :- R(u,v), R(v,w)
can_N[X](Q) = can_B[X](Q) = R:
  u v | x1
  v w | x2
can_lin(Q) = R:
  u v | {x1}
  v w | {x2}
can_why(Q) = R:
  u v | {{x1}}
  v w | {{x2}}
Lineage-Containment of CQs
• Covering set of containment mappings: for every atom
A in the body of P there is a containment mapping
h : Q → P with A in the image of h
• Theorem: For CQs P, Q the following are equivalent:
1. P ⊑_lin Q
2. P(can_lin(P)) ⊆_lin Q(can_lin(P))
3. there is a covering set of containment mappings from Q
to P
• Note: covering sets of containment mappings were
identified in [CV 93] as a necessary (but not sufficient)
condition for bag-containment of CQs
Why-Containment of CQs
• A containment mapping is onto if it induces a
surjection on atoms
• Theorem: For CQs P, Q the following are
equivalent:
1. P ⊑_why Q
2. P(can_why(P)) ⊆_why Q(can_why(P))
3. there is an onto containment mapping h : Q → P
• Note: onto containment mappings were
identified in [CV 93] as a sufficient (but not
necessary) condition for bag-containment of CQs
B[X], N[X]-containment of CQs
• A containment mapping is exact if it induces a
bijection on atoms
• Theorem: For CQs P, Q and for K ∈ {B[X], N[X]}
the following are equivalent:
1. P ⊑_K Q
2. P(can_K(P)) ⊆_K Q(can_K(P))
3. there is an exact containment mapping h : Q → P
• Another way to think of exact containment
mappings: by unifying variables in Q, you get a
query isomorphic to P
So Far
• K-containment of CQs is decidable for all the
provenance models in the hierarchy
• Next, we indicate which steps in the hierarchy
are strict, and which collapse:
B ≺ lineage ≺ Why-Prov. ≺ B[X] ≈ N[X]
Separating the Models for ⊑ of CQs
• B ≺ lineage:
Q1(x,y) :- R(x,y), R(x,z)
Q2(x,y) :- R(x,y)
Q1 ⊑_B Q2 but Q1 ⋢_lin Q2
• lineage ≺ why:
Q1(x) :- R(x,y), R(x,z)
Q2(x) :- R(x,y)
Q1 ⊑_lin Q2 but Q1 ⋢_why Q2
• why ≺ B[X]:
Q1(x,y) :- R(x,y)
Q2(x,y) :- R(x,y), R(x,z)
Q1 ⊑_why Q2 but Q1 ⋢_B[X] Q2
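The three separations can be checked mechanically by enumerating containment mappings from Q2 to Q1 and testing the covering, onto, and exact conditions. This is a brute-force sketch: the CQ encoding, the function names, and the treatment of repeated atoms as multisets (via Counter) are assumptions made here, not taken from the talk.

```python
from collections import Counter
from itertools import product

def image_counts(Q, P):
    """For each containment mapping h : Q -> P, yield the multiset of P-atoms hit by h."""
    (q_head, q_body), (p_head, p_body) = Q, P
    q_vars = sorted(set(q_head) | {v for _, args in q_body for v in args})
    p_vars = sorted(set(p_head) | {v for _, args in p_body for v in args})
    p_atoms = set(p_body)
    for image in product(p_vars, repeat=len(q_vars)):
        h = dict(zip(q_vars, image))
        mapped = [(rel, tuple(h[v] for v in args)) for rel, args in q_body]
        if tuple(h[v] for v in q_head) == tuple(p_head) and all(a in p_atoms for a in mapped):
            yield Counter(mapped)

def classify(Q, P):
    """Report which kinds of containment mappings exist from Q to P."""
    images = list(image_counts(Q, P))
    target = Counter(P[1])
    covered = set().union(*images)  # atoms of P hit by at least one mapping
    return {
        "plain": bool(images),                                   # decides set containment
        "covering": bool(images) and covered >= set(target),     # decides lineage containment
        "onto": any(all(img[a] >= n for a, n in target.items()) for img in images),
        "exact": any(img == target for img in images),
    }

R = "R"
# B vs. lineage: Q1(x,y) :- R(x,y), R(x,z);  Q2(x,y) :- R(x,y)
sep1 = classify((("x", "y"), [(R, ("x", "y"))]),
                (("x", "y"), [(R, ("x", "y")), (R, ("x", "z"))]))
# lineage vs. why: Q1(x) :- R(x,y), R(x,z);  Q2(x) :- R(x,y)
sep2 = classify((("x",), [(R, ("x", "y"))]),
                (("x",), [(R, ("x", "y")), (R, ("x", "z"))]))
# why vs. B[X]: Q1(x,y) :- R(x,y);  Q2(x,y) :- R(x,y), R(x,z)
sep3 = classify((("x", "y"), [(R, ("x", "y")), (R, ("x", "z"))]),
                (("x", "y"), [(R, ("x", "y"))]))
print(sep1, sep2, sep3, sep="\n")
```

In each call the first argument is Q2 and the second is Q1, because Q1 ⊑ Q2 is witnessed by mappings from Q2 to Q1. Each example passes one test and fails the next stronger one, which is what makes the corresponding step of the hierarchy strict.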
From Containment to Equivalence
• {Onto|exact} containment mappings in both
directions implies CQs are isomorphic, so
why-provenance, B[X], and N[X] collapse to:
P ≡_why Q ⟺ P ≡_B[X] Q ⟺ P ≡_N[X] Q ⟺ P ≅ Q
• In contrast, for lineage, having sets of covering
containment mappings in both directions does
not imply isomorphism (but still decidable)
From CQs to UCQs
• For idempotent semirings (where + is
idempotent) this is easy. B, PosBool(B), lineage,
why-provenance, and B[X] are idempotent; N[X]
is not (omitted)
• Proposition [after SY80]: If K is idempotent, then
for UCQs 𝒫, 𝒬 we have 𝒫 ⊑_K 𝒬 iff for every CQ P in
𝒫 there is a CQ Q in 𝒬 such that P ⊑_K Q
• Corollary: For idempotent K, the problems of
checking K-equivalence of CQs and K-equivalence
of UCQs are polynomially equivalent
N[X]- and Bag-Equivalence of UCQs
• As with CQs, N[X]-equivalence of UCQs turns
out to be the same as isomorphism:
Theorem: For UCQs P, Q, P ≡_N[X] Q iff P ≅ Q
• But, it turns out that N[X]-equivalence and
N-equivalence of UCQs are intimately related:
Theorem: for UCQs P, Q, P ≡_N[X] Q iff P ≡_N Q
Thus:
Corollary: for UCQs P, Q, P ≡_N Q iff P ≅ Q
Summary: Complexity Results
• Theorem: checking for {covering set of|onto|exact}
containment mappings is NP-complete
• Checking for query isomorphism (≅): believed to lie strictly between P and NP-complete (>P, <NP)
     |     | B          | PosBool(B)   | N                    | Lineage | Why-Pr. | B[X]  | N[X]
CQs  | ⊑_K | NP [CM 77] | NP [PODS 07] | ? (Π₂ᵖ-hard) [CV 93] | NP-ct   | NP-ct   | NP-ct | NP-ct
CQs  | ≡_K | NP ibid.   | NP ibid.     | ≅ ibid.              | NP-ct   | ≅       | ≅     | ≅
UCQs | ⊑_K | NP [SY 80] | NP ibid.     | undec [IR 95]        | NP-ct   | NP-ct   | NP-ct | PSPACE
UCQs | ≡_K | NP ibid.   | NP ibid.     | ≅                    | NP-ct   | NP-ct   | NP-ct | ≅
Summary: Provenance Hierarchy
CQs
⊑_K: B ≈ PosB.(B) ≺ Lineage ≺ N ≺ Why-Pr. ≺ B[X] ≈ N[X]
≡_K: B ≈ PosB.(B) ≺ Lineage ≺ N ≈ Why-Pr. ≈ B[X] ≈ N[X]
UCQs
⊑_K: B ≈ PosB.(B) ≺ Lineage ≺ Why-Pr. ≺ B[X] ≺ N[X]
≡_K: B ≈ PosB.(B) ≺ Lineage ≺ Why-Pr. ≺ B[X] ≺ N[X]
Related Work
• Already mentioned
– Set-cont. and equiv. of CQs [Chandra&Merlin 77]
– Set-cont. and equiv. of UCQs [Sagiv&Yannakakis 80]
– Bag-cont. of UCQs [Ioannidis&Ramakrishnan 95]
– Bag-equiv. of CQs [Chaudhuri&Vardi 93]
• Containment of CQs with where-provenance [Tan 03]
• Bag-set semantics [CV 93], combined semantics [Cohen 06]
– For K-relations: support operator of [Geerts&Poggi 08]
generalizes duplicate elimination
• Bag-containment of CQs [Jayram+ 06]
Future Work
• Loose ends:
– Lower bound for N[X]-containment of UCQs (we gave only
a PSPACE upper bound)
– Generalize results for specific semirings to semirings with
certain properties?
• Beyond UCQs: Datalog
– is K-containment of Datalog programs the same as
set-containment when K is a distributive lattice?
– is bag-equivalence/N[X]-equivalence undecidable for
Datalog?
• Could the semiring framework give any insight into
bag-containment of CQs?
• Query optimization for annotated XML