The History of Datalog
Origins, Failure, Resurrection

An Odd Encounter
Several years ago, I met a colleague, Monica Lam, in the hallway at Stanford.
“I hear you were involved in the early work on Datalog.”
She had discovered this work and used it in her system for large-scale dataflow analysis.

Odd Encounter – (2)
The application is naturally recursive.
Very large-scale (analyzed code of 800K lines).
They (Monica and her student John Whaley) had an implementation, bddbddb, that compiled Datalog rules into BDDs (binary decision diagrams).

Where Did Datalog Come From?
1. Codd’s tuple and domain calculus (1972).
2. Gallaire and Minker’s “Logic and Databases” (1978).
3. Prolog (1976).

Codd’s Logics
TRC (tuple relational calculus). { t | R(r) and S(s) and t.A = r.A and r.B = s.B and t.C = s.C }
  Implemented by Stonebraker as QUEL.
DRC (domain relational calculus). { ac | R(ab) and S(bc) }
  Implemented by Zloof as Query-by-Example.

“Logic and Databases”
Viewed queries as the result of an entire logical theory.
Thus allows recursion, negation, theories with multiple minimal models.
Closed/open-world evaluations.

Prolog
A conventional programming language with predicates as function calls.
Bizarre execution rule.
Example: you have to write TC (transitive closure) as:
  path(X,Y) :- arc(X,Y).
  path(X,Y) :- arc(X,Z), path(Z,Y).

Implementation of Logical Query Languages for Databases
In 1984 I took a sabbatical at Hebrew University and wrote a paper with the above title.
It has some crazy stuff that makes me wonder “what was I thinking?”
Much was fixed by others, later.
Published in SIGMOD (no real theorems!).

Implementation – (2)
Key idea: Prolog notation + Horn-clause, unique-fixedpoint semantics.
Key idea: It’s about algorithms for query execution, not logical models.
  Original thought in that direction was really by Henschen and Naqvi.

Enter “Datalog”
The term “Datalog” to refer to positive Horn clauses without function symbols was first proposed by Dave Maier and David S. (“the other”) Warren.
It appears in their book Computing with Logic (1988), but was in common use before that.

Good Implementation Ideas
1. Seminaive evaluation (Bancilhon and Ramakrishnan, 1986 – also in SIGMOD).
2. Specialized linear-recursion implementations (many people including Naughton, Ramakrishnan, Sagiv, Vardi, …).
3. Magic sets (Beeri and Ramakrishnan, 1987 – finally something got into PODS).

Magic Sets
A query-rewriting scheme.
Similar in effect to a number of query-execution ideas such as
1. Query-Subquery (Rohmer, Lescoeur, and Kerisit, 1986).
2. Memoing (Dietrich and Warren, 1985).

Negation
With negated subgoals in Datalog you run the risk of multiple minimal models.
  Example: bachelor(X) :- male(X), NOT married(X,Y)
Stratified model (Chandra-Harel, 1982; Apt, Blair, Walker, 1985).
Well-founded semantics (Van Gelder, Ross, Schlipf, 1988).

The Death of Datalog
Recursion turned out not to be all that important in the world of the 1980s.
In the AI community, where logic was taken more seriously than in DB, the emphasis was on expressiveness, not tractability.

The Rebirth
Datalog slept, but nothing could take away its important virtues:
  Simplicity and declarativeness.
  Tractability.
  Simple execution engine.
While “rule-based systems” were long an AI staple, they never got these features of Datalog.

bddbddb
Why did Monica Lam think of Datalog for data-flow analysis?
Classical DFA (data-flow analysis) was for code optimization.
  Only inner loops are important, so data never needed to get really large.

bddbddb – (2)
Monica was looking at a different application: software security.
  Example: can a string read at one point be passed to a SQL call without first being the argument of a function that checks safety?
Entire program analyzed as a whole.
  Example: 800K lines of Apache.
Now it’s a database problem.

Overlog and Dedalus
At about the same time, Joe Hellerstein was experimenting with Datalog, first for prototyping and later for the real implementation.
General direction: protocols for distributed systems.

Overlog and Dedalus – (2)
Two important additions: time and space as first-class concepts.
Example (space): Assume each node has a table of arcs out.
  arc(@n, h) means the table at node n contains an arc to node h.

Example – Continued
Each node n computes the set of nodes it can reach by consulting the reach sets for the nodes to which n has arcs.
  reach(@n, m) :- arc(@n, h), reach(@h, m).

Some Other Datalog Directions
1. Webdamlog (Abiteboul et al., these proceedings).
   Adds creation of rules at remote sites.
2. PrPl (Lam et al.).
   Social networking in Datalog.
3. SecPAL (Becker et al.).
   Microsoft authorization language translated to Datalog.

Other Directions – (2)
4. LogicBlox (Molham Aref, CEO).
   Startup in Atlanta, GA.
   One of several Datalog-based startups.
   Uses Datalog for customized decision-support systems.
   Many extensions, including controlled second-order predicates.
   Still has a tractable, straightforward execution model.

Conclusions
Too early to tell how important Datalog will be.
  Will simplicity and tractability beat expressiveness?
But moving in the right direction(s) now.
From the Datalog 2.0 Workshop: needs an open-source standard, like MySQL.
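
Illustration: seminaive evaluation
To make the seminaive evaluation listed under Good Implementation Ideas concrete, here is a minimal Python sketch applied to the path/arc rules from the Prolog slide. The function name seminaive_path and the sample arcs are hypothetical, chosen only for illustration, and this is not the algorithm as published. It shows only the central idea: each round joins arc with the path facts that were new in the previous round, rather than with all path facts found so far.

```python
from collections import defaultdict


def seminaive_path(arcs):
    """Transitive closure of arc, computed by a seminaive-style iteration.

    Rules being evaluated:
        path(X,Y) :- arc(X,Y).
        path(X,Y) :- arc(X,Z), path(Z,Y).
    """
    # Index arcs by their target Z, so arc(X,Z) can be joined with path(Z,Y).
    preds = defaultdict(set)
    for x, z in arcs:
        preds[z].add(x)

    path = set(arcs)   # facts produced by the nonrecursive rule
    delta = set(arcs)  # facts that were new in the most recent round
    while delta:
        # Join arc only with the newest path facts, not with all of path.
        derived = {(x, y) for (z, y) in delta for x in preds.get(z, ())}
        delta = derived - path  # keep only facts not derived before
        path |= delta
    return path


if __name__ == "__main__":
    # Hypothetical sample data: a chain 1 -> 2 -> 3 -> 4.
    print(sorted(seminaive_path({(1, 2), (2, 3), (3, 4)})))
```

The loop stops when a round derives nothing new, which is the fixedpoint. Because the rules have no function symbols, the set of possible facts is finite, so the loop always terminates, even on cyclic graphs.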