slides - sigmod

The History of Datalog
Origins
Failure
Resurrection
1
An Odd Encounter
Several years ago, I met a colleague,
Monica Lam, in the hallway at Stanford.
“I hear you were involved in the early
work on Datalog.”
She had discovered this work and used
it in her system for large-scale dataflow analysis.
2
Odd Encounter – (2)
The application is naturally recursive.
Very large-scale (analyzed code of 800K
lines).
They (Monica and her student John
Whaley) had an implementation
bddbddb that compiled Datalog rules
into BDDs (binary decision diagrams).
3
Where Did Datalog Come From?
1. Codd’s tuple and domain calculus
(1972).
2. Gallaire and Minker’s “Logic and
Databases” (1978).
3. Prolog (1976).
4
Codd’s Logics
TRC. { t | R(r) and S(s) and t.A = r.A
and r.B = s.B and t.C = s.C }
 Implemented by Stonebraker as QUEL.
DRC. { ac | R(ab) and S(bc) }
 Implemented by Zloof as Query-by-Example.
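
Both expressions above describe the same query: join R(A,B) with S(B,C) on B and keep A and C. A minimal Python sketch of the two styles, assuming relations are stored as sets of tuples; the sample data and the names `RRow`, `SRow`, `trc`, and `drc` are invented for illustration and are not from the slides.

```python
from collections import namedtuple

# Hypothetical sample relations R(A, B) and S(B, C); the data is invented.
RRow = namedtuple("RRow", ["A", "B"])
SRow = namedtuple("SRow", ["B", "C"])

R = {RRow(1, "x"), RRow(2, "y")}
S = {SRow("x", 10), SRow("x", 20), SRow("z", 30)}

# Tuple-calculus style: variables r and s range over whole tuples,
# and components are reached by attribute name (r.B, s.C).
trc = {(r.A, s.C) for r in R for s in S if r.B == s.B}

# Domain-calculus style: variables a, b, c range over individual values,
# and the shared variable b expresses the join.
drc = {(a, c) for (a, b) in R for (b2, c) in S if b == b2}
```

The two comprehensions compute the same set of (A, C) pairs; only the kind of variable differs.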
5
“Logic and Databases”
Viewed queries as the result of an
entire logical theory.
Thus allows recursion, negation,
theories with multiple minimal models.
Closed/open-world evaluations.
6
Prolog
A conventional programming language
with predicates as function calls.
Bizarre execution rule.
Example: you have to write transitive closure (TC) as:
path(X,Y) :- arc(X,Y).
path(X,Y) :- arc(X,Z),
path(Z,Y).
7
Implementation of Logical Query
Languages for Databases
In 1984 I took sabbatical at Hebrew
University and wrote a paper with the
above title.
It has some crazy stuff that makes me
wonder “what was I thinking?”
Much was fixed by others, later.
Published in SIGMOD (no real theorems!).
8
Implementation – (2)
Key idea: Prolog notation + Horn-clause, unique-fixedpoint semantics.
Key idea: It’s about algorithms for
query execution, not logical models.
 Original thought in that direction was really
by Henschen and Naqvi.
9
Enter “Datalog”
The term “Datalog” to refer to positive
Horn clauses without function symbols
was first proposed by Dave Maier and
David S. (“the other”) Warren.
Appears in their book Computing
with Logic (1988), but in common use
before that.
10
Good Implementation Ideas
1. Seminaive evaluation (Bancilhon and
Ramakrishnan, 1986 – also in SIGMOD).
2. Specialized linear-recursion
implementations (many people including
Naughton, Ramakrishnan, Sagiv, Vardi,…).
3. Magic sets (Beeri and Ramakrishnan,
1987 – finally something got into PODS).
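
To make the first idea concrete, here is a minimal Python sketch of naive versus seminaive evaluation for the transitive-closure rules `path(X,Y) :- arc(X,Y)` and `path(X,Y) :- arc(X,Z), path(Z,Y)`. The function names and the set-of-pairs representation are assumptions made for illustration, not taken from any of the cited papers.

```python
def naive_tc(arcs):
    """Naive evaluation: re-apply both rules to the whole relation each round."""
    path = set()
    while True:
        # Rule 1: path(X,Y) :- arc(X,Y).
        new = set(arcs)
        # Rule 2: path(X,Y) :- arc(X,Z), path(Z,Y), joined against ALL of path.
        new |= {(x, y) for (x, z) in arcs for (z2, y) in path if z == z2}
        if new == path:          # fixed point reached
            return path
        path = new


def seminaive_tc(arcs):
    """Seminaive evaluation: join arc only with facts that were new last round."""
    path = set(arcs)             # round 0 comes from rule 1
    delta = set(arcs)            # facts discovered in the most recent round
    while delta:
        derived = {(x, y) for (x, z) in arcs for (z2, y) in delta if z == z2}
        delta = derived - path   # keep only genuinely new facts
        path |= delta
    return path
```

Both functions return the same relation; the seminaive version avoids rederiving, in every round, the facts that earlier rounds already produced.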
11
Magic Sets
 A query-rewriting scheme.
 Similar in effect to a number of query-execution ideas such as
1. Query-Subquery (Rohmer, Lescoeur, and
Kerisit, 1986).
2. Memoing (Dietrich and Warren, 1985).
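
As a rough illustration of the rewriting idea, consider the query `path(a, Y)` over the transitive-closure rules. A simplified rewrite (a sketch for this one query form, not the general algorithm) adds a "magic" predicate holding the nodes relevant to the query and guards every rule with it:

    magic(a).
    magic(Z)  :- magic(X), arc(X,Z).
    path(X,Y) :- magic(X), arc(X,Y).
    path(X,Y) :- magic(X), arc(X,Z), path(Z,Y).

The Python sketch below evaluates these rewritten rules bottom-up. The function name `magic_path` and the data layout are assumptions made for illustration.

```python
def magic_path(arcs, source):
    """Evaluate path(source, Y) bottom-up, restricted by a magic set."""
    # magic(source).  magic(Z) :- magic(X), arc(X,Z).
    magic = {source}
    frontier = {source}
    while frontier:
        frontier = {z for (x, z) in arcs if x in frontier} - magic
        magic |= frontier

    # Naive fixed point of the two guarded path rules.
    path = set()
    while True:
        # path(X,Y) :- magic(X), arc(X,Y).
        new = {(x, y) for (x, y) in arcs if x in magic}
        # path(X,Y) :- magic(X), arc(X,Z), path(Z,Y).
        new |= {(x, y)
                for (x, z) in arcs if x in magic
                for (z2, y) in path if z == z2}
        if new == path:
            break
        path = new

    answers = {y for (x, y) in path if x == source}
    return answers, path
```

Because every rule is guarded by `magic`, no `path` fact is ever derived for parts of the graph that cannot be reached from `source`, which is the effect a top-down method gets by only exploring relevant subgoals.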
12
Negation
With negated subgoals in Datalog
 Example: bachelor(X) :- male(X),
NOT married(X,Y)
you run the risk of multiple minimal models.
Stratified model (Chandra-Harel, 1982; Apt,
Blair, Walker, 1985).
 Well-founded semantics (Van Gelder,
Ross, Schlipf, 1988).
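
A minimal Python sketch of stratified evaluation for the bachelor rule: the negated relation is computed completely in a lower stratum before the rule that negates it is applied. The sample data and the function name `bachelors` are invented, and the sketch assumes the rule is read as "there is no Y with married(X,Y)".

```python
def bachelors(male, married):
    """Stratified evaluation of: bachelor(X) :- male(X), NOT married(X, Y)."""
    # Stratum 0: male and married are fully known before negation is applied.
    # Assumption: NOT married(X, Y) means "no Y exists with married(X, Y)".
    has_spouse = {x for (x, _y) in married}
    # Stratum 1: apply the rule, testing the negated subgoal against stratum 0.
    return {x for x in male if x not in has_spouse}


# Hypothetical sample data.
male = {"al", "bob", "carl"}
married = {("al", "ann")}
result = bachelors(male, married)
```

Since `married` cannot change once stratum 1 starts, the negation test has one fixed answer and the program has a single intended model.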
13
The Death of Datalog
Recursion turned out not to be all that
important in the world of the 1980s.
In the AI community, where logic was
taken more seriously than in DB, the
emphasis was on expressiveness, not
tractability.
14
The Rebirth
Datalog slept, but nothing could take
away its important virtues:
 Simplicity and declarativeness.
 Tractability.
 Simple execution engine.
While “rule-based systems” were long an
AI staple, they never got these features
of Datalog.
15
bddbddb
Why did Monica Lam think of Datalog
for data-flow analysis?
Classical DFA was for code
optimization.
 Only inner loops are important, so data
never needed to get really large.
16
bddbddb – (2)
Monica was looking at a different
application: software security.
 Example: can a string read at one point be
passed to a SQL call without first being the
argument of a function that checks safety?
Entire program analyzed as a whole.
 Example: 800K lines of Apache.
 Now it’s a database problem.
17
Overlog and Dedalus
At about the same time, Joe Hellerstein
was experimenting with Datalog, first
for prototyping and later for the real
implementation.
General direction: protocols for
distributed systems.
18
Overlog and Dedalus – (2)
Two important additions: time and
space as first-class concepts.
Example (space): Assume each node
has a table of arcs out.
 arc(@n, h) means the table at node n
contains an arc to node h.
19
Example – Continued
Each node n computes the set of nodes
it can reach by consulting the reach
sets for the nodes to which n has arcs.
reach(@n, m) :- arc(@n, h),
reach(@h, m).
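
The rule can be simulated in a single process as a sketch of the idea. This is not how Overlog or Dedalus are implemented: it ignores real messaging and time. The slide shows only the recursive rule, so the base case `reach(@n, h) :- arc(@n, h)` is an assumption added here, as are the function name `distributed_reach` and the dictionary layout.

```python
def distributed_reach(arc_tables):
    """arc_tables maps each node n to the set of nodes h with arc(@n, h)."""
    # Assumed base case (not on the slide): reach(@n, h) :- arc(@n, h).
    reach = {n: set(hs) for n, hs in arc_tables.items()}

    changed = True
    while changed:
        changed = False
        for n, neighbours in arc_tables.items():
            for h in neighbours:
                # reach(@n, m) :- arc(@n, h), reach(@h, m).
                # Node n "consults" the reach table held at node h.
                remote = reach.get(h, set())
                new = remote - reach[n]
                if new:
                    reach[n] |= new
                    changed = True
    return reach
```

Each node's table only grows, and the loop stops once no node learns anything new from its neighbours.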
20
Some Other Datalog Directions
1. Webdamlog (Abiteboul et al., these
proceedings).
 Adds creation of rules at remote sites.
2. PrPl (Lam et al.).
 Social networking in Datalog.
3. SecPAL (Becker et al.).
 Microsoft authorization language
translated to Datalog.
21
Other Directions – (2)
4. LogicBlox (Molham Aref, CEO).
 Startup in Atlanta GA.
 One of several Datalog-based startups.
 Uses Datalog for customized decision-support systems.
 Many extensions, including controlled
2nd-order predicates.
 Still has a tractable, straightforward
execution model.
22
Conclusions
Too early to tell how important Datalog
will be.
 Will simplicity and tractability beat
expressiveness?
But moving in the right direction(s)
now.
From Datalog 2.0 Workshop: needs an
open-source standard, like MySQL.
23