Trees, semistructured data, and other strange ways to go beyond tables
Serge Abiteboul
INRIA & ENS Cachan
PODS 30th Anniversary, 2011

IMS, hierarchical model, V-relations, Jacobs's calculus, Hardgrave's broom, nested relations, format model, complex objects, logical data model, object databases, lambda calculus, regular trees, F-logic, N1NF, NF2, COL, IFO, LDL, IQL, SGML, HTML, ASN.1, XML, YAML, JSON…
Another one of these No-SQL talks?

S. Abiteboul – INRIA Saclay
Luc Véro

Introduction

Trees are useless
• "A tree is a tree. How many more do you have to look at?" Ronald Reagan, governor of California, opposing the expansion of Redwood National Park (1966)
• "We don't need anything beyond relations. These things are useless. Reject!" Anonymous referee (circa 1990)

Knowledge lives in trees
• Theorem: Information lives in trees and not in relations
• Proof: the Bible does not say « But of the two dimensional table of knowledge of good and evil … »
• "But of the tree of the knowledge of good and evil, thou shalt not eat of it: for in the day that thou eatest thereof thou shalt surely die." Genesis 2:17

Organization (more or less chronological)
• Introduction
• Hierarchical data model: 60s
• Nested relations: 80s
• Complex objects: early 90s
• Semistructured data & unranked labeled trees: late 90s
• Unranked labeled ordered trees, aka XML: early 00s
• Evolving trees, aka Active XML: mid 00s
• Cycles: 90s to now
• Conclusion

For lack of time, we will ignore IMS and the hierarchical model
• The language was purely navigational anyway
We will also ignore early works such as Makinouchi, Jacobs or Hardgrave
We will start with N1NF
• François Bancilhon in France
• Hans Schek in Germany
• PhD thesis of Nicole Bidoit

Non-First-Normal-Form
A quarter on tables. Now what? Trees!
[Tables: the same data shown as a 1NF relation (DB101) and as an N1NF relation over the attributes Name, Child, Car. Names: Alice, Bob. Children: Toto, Lulu, Mimi, Zaza. Cars: Jaguar, 2CV, Mustang, Prius.]
Data live in 1NF relations
Data would prefer to live in infamous nested relations, aka V-relations, aka N1NF relations, aka NF2 relations

The devil is in the details
[Tables: example relations over the attributes A, B, C with small integer values, in flat and nested form.]
V-relations
• A is a key
• No new power
N1NF relations
• A is not a key
• The size is now possibly exponential in the size of the domain
Complex object model
• Tuple and set constructors used freely

[Figure: a complex object Families, drawn as a tree. The root is a set (*) of tuples with attributes Name, Children, and Cars; Children is a set (*) of tuples (Name, Sex) and Cars is a set (*) of tuples (Name, Year). Values shown include Peter, Toto (M), Zaza (F), Mimi (F), BMW (2010), 2CV (1976).]

A logic and algebra for complex objects
Logic: the main novelty is set variables, which are not first-order
Example: the Abou Banat query
{ T.Father | Families(T) ∧ ∀X ∈ T.Children ( X.Sex = F ) }
Algebra: powerset operation, unnest/nest
[Tables: three relations over Name, Child, Car illustrating unnest/nest. Entries shown include Alice, Toto, Bob, Mimi, Zaza, Lulu, Mustang, Prius.]

Results
Equivalence theorem: the algebra and the logic have the same expressive power
Remark: one can compute transitive closure (TC) using the algebra/logic (wow! Cool!)
Also studied: fixpoint, datalog, while…
Complexity: each new level of nesting introduces one more exponential: n, 2^n, 2^{2^n}, …
Need to control the use of powerset

From complex objects to semistructured data
[Figure: the Families complex object tree, as before.]

Revolution 1: more flexibility
[Figure: the Families tree with extra, irregular nodes added, labeled Annotations and Trash.]
Revolution 2: Remove some nodes; name all
[Figure: the Families tree relabeled with the node names Families, Family, Name, Children, Child, Cars, Car, Year, Sex, Ann. Values shown include Peter, Toto (M), Zaza (F), BMW (2010), 2CV (1976), Trash.]

Unranked labeled trees
[Figure: the resulting tree. Families has Family children; each Family has Name, Children, and Cars; Children holds Child nodes (Name, Sex, Ann.) and Cars holds Car nodes (Name, Year).]

This is better adapted to a Web context
• Self-describing data: no separation between schema and data
• Flexibility
Not such a big deal
Maybe the main contribution is the format?
<Families><Family><Name>Peter</Name><Cars><Car><Name>BMW</Name><Year>2010</Year></Car></Cars><Children><Child> …
Plus ça change, plus c'est la même chose
The more things change, the more they stay the same

What else? The trees are unbounded
[Figure: a tree rooted at r whose nodes carry labels such as a$, a, ab, ba.]
Like nested relations, trees are unbounded in width
Unlike nested relations, they are unbounded in depth
One can simulate 2-counter machines with 2 branches
• Do applications simulate 2-counter machines with XML documents?
• I am still looking for one
• XML documents are rarely deep
But even for bounded trees there are fun questions: e.g., is the equivalence of monadic datalog decidable for bounded data trees?

What else? The trees are ordered
Unranked labeled ordered trees = XML
Order is often painful for optimization
• Ignore order: classical optimization
• Respect order: totally new ball game
• Bring in tree automata
• Reconcile
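The ignore-order versus respect-order distinction can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the function names `ordered_key` and `unordered_key` are hypothetical, the two sample documents reuse the BMW/2010 values from the slides, and attributes and tail text are ignored for brevity.

```python
import xml.etree.ElementTree as ET

def ordered_key(node):
    """Canonical form of a tree that respects sibling order."""
    return (node.tag, (node.text or "").strip(),
            tuple(ordered_key(child) for child in node))

def unordered_key(node):
    """Canonical form that ignores sibling order by sorting the child keys."""
    return (node.tag, (node.text or "").strip(),
            tuple(sorted(unordered_key(child) for child in node)))

# Two documents with the same children in a different order (illustrative data).
doc_a = ET.fromstring("<Car><Name>BMW</Name><Year>2010</Year></Car>")
doc_b = ET.fromstring("<Car><Year>2010</Year><Name>BMW</Name></Car>")

# As ordered trees (XML) the documents differ; as unordered trees they are equal.
same_ordered = ordered_key(doc_a) == ordered_key(doc_b)
same_unordered = unordered_key(doc_a) == unordered_key(doc_b)
print(same_ordered, same_unordered)
```

Sorting child keys is one simple way to obtain an order-insensitive canonical form; it says nothing about how a real optimizer would exploit the freedom to reorder.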
Selling argument is the Web…
The move from relations to trees is interesting
But the move from centralized to distributed is interesting as well, and much less investigated
Where the fun is:
• Scale is beyond what we thought was thinkable
• Machines are totally autonomous
• Schema replaced by numerous ontologies
• True/false logic replaced by inconsistency, probabilities, trust, belief…

And the trees are evolving (aka Active XML)
An old idea from object databases: mix data and computation
[Figure: an Active XML tree. Resorts contains a Resort with Name = Aspen and State = Colorado, plus nodes snowcond and hotels. The tree embeds the service calls !Yahoo.com/GetHotels(<city name="Aspen"/>) and !Unisys.com/snow("Aspen"), and a snow node with Unit = Meter and Depth = 1.]

And there are cycles
For lack of time, I will not mention the network model [Codasyl 1969]
• The language was purely navigational anyway
[Figure: two Person nodes, one with Name = Adam and one with Name = Eve, each with a Spouse reference.]
If I added references to XML, I would get cycles
Lots of models for graph data, e.g., IQL
Some fun results: e.g., a copy elimination problem when trying to obtain Chandra-Harel completeness for IQL
• Similar issue for unordered trees [recent result with Vianu]
Paris C. Kanellakis

Conclusion
Is this a good time to do research on trees in databases?
"The best time to plant a tree was 20 years ago. The next best time is now." Chinese proverb

Advertisement
Book on Web data management, to appear at Cambridge University Press
http://webdam.inria.fr/Jorge
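Backup: a sketch of nest/unnest

The nest and unnest operations of the complex-object algebra discussed earlier in the talk can be sketched in a few lines of Python. This is an illustration under assumptions, not the talk's formal definition: relations are modeled as Python sets of tuples, set-valued attributes as frozensets, attributes are addressed by position, and the function names and sample values (borrowed from the Name/Child/Car tables) are chosen for the example.

```python
from collections import defaultdict

def unnest(relation, i):
    """Replace the set at position i by one tuple per element of that set."""
    return {t[:i] + (v,) + t[i + 1:] for t in relation for v in t[i]}

def nest(relation, i):
    """Group tuples that agree outside position i; collect position i into a set."""
    groups = defaultdict(set)
    for t in relation:
        groups[t[:i] + t[i + 1:]].add(t[i])
    return {rest[:i] + (frozenset(vals),) + rest[i:]
            for rest, vals in groups.items()}

# A nested relation in which Name is a key (illustrative values).
nested = {("Bob",
           frozenset({"Mimi", "Zaza", "Lulu"}),
           frozenset({"Mustang", "Prius"}))}

# Unnesting both set attributes yields the flat 1NF relation (3 x 2 tuples).
flat = unnest(unnest(nested, 1), 2)

# Nesting back recovers the original because Name is a key here.
round_trip = nest(nest(flat, 2), 1)

# When the first attribute is not a key, nest does not undo unnest.
not_keyed = {("a", frozenset({1})), ("a", frozenset({2}))}
merged = nest(unnest(not_keyed, 1), 1)
print(len(flat), round_trip == nested, merged == not_keyed)
```

The last case mirrors the "A is not a key" remark: two tuples sharing the same A value collapse into one after an unnest/nest round trip, so the nested relation carried information that the flat one cannot represent directly.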