Trees, semistructured data,
and other strange ways to go
beyond tables
Serge Abiteboul
INRIA & ENS Cachan
PODS 30th Anniversary, 2011
IMS, hierarchical model, V-relations, Jacobs’s calculus,
Hardgrave’s broom, nested relations, format model, complex
objects, logical data model, object databases, lambda calculus,
regular trees, F-logic, N1NF, NF2, COL, IFO, LDL, IQL, SGML,
HTML, ASN.1, XML, YAML, JSON…
Another one of
these No-SQL
talks?
S. Abiteboul – INRIA Saclay
Luc
Véro
Introduction
Trees are useless
A tree is a tree. How many more
do you have to look at?
Ronald Reagan, governor of California, opposing the
expansion of Redwood National Park (1966)
Theorem: Information lives
in trees and not in relations
Proof: the Bible does not say
« But of the two dimensional table of knowledge of good and evil … »
Knowledge lives in trees
But of the tree of the knowledge
of good and evil, thou shalt not
eat of it: for in the day that thou
eatest thereof thou shalt surely
die.
Genesis, 2. 17
We don’t need anything beyond
relations. These things are
useless. Reject!
Anonymous referee (circa 1990)
Organization
Introduction
Hierarchical data model (60s)
Nested relations (80s)
Complex objects (early 90s)
Semistructured data & unranked labeled trees (late 90s)
Unranked labeled ordered trees, aka XML (early 00s)
Evolving trees, aka Active XML (mid 00s)
Cycles (90s to now)
More or less chronological
Conclusion
For lack of time, we will ignore IMS and the hierarchical model
• The language was purely navigational anyway
We will also ignore early works such as Makinouchi, Jacobs or Hardgrave
We will start with N1NF
• François Bancilhon in France
• Hans Schek in Germany
• PhD thesis of Nicole Bidoit
Non-First-Normal-Form
A quarter on tables. Now what? Trees!
[Figure: two tables labeled DB101 and N1NF over columns Name, Child, Car, with names Alice and Bob, children Toto, Lulu, Mimi and Zaza, and cars Jaguar, 2CV, Mustang and Prius]
Data live in infamous 1NF relations
Data would prefer to live in nested relations
aka V-relations
aka N1NF relations
aka NF2 relations
The devil is in the details
[Figure: example relations over attributes A, B and C]
V-relations: A is a key
No new power
N1NF relations: A is not a key
The size is now possibly exponential in the size of the domain
Complex object model
tuple and set constructors used freely
Families
  *
    ×
      Name: Peter
      Children
        *
          × Name: Toto, Sex: M
          × Name: Zaza, Sex: F
      Cars
        *
          × Name: BMW, Year: 2010
    ×
      Name: Peter
      Children
        *
          × Name: Mimi, Sex: F
      Cars
        *
          × Name: 2CV, Year: 1976
A logic and algebra for complex objects
Logic: main novelty is set variables – non first-order
Example: Abou-Banat query
{ T.Father | Families(T) ∧ ∀X ∈ T.Children ( X.Sex = F ) }
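Read as a comprehension over nested values, the query above asks for the fathers all of whose children are girls. A minimal Python sketch, assuming each family is a record with hypothetical fields Father, Children and Sex; the two sample families are made up for illustration and are not the talk's data:

```python
# Hypothetical sample data: tuples are modelled as dicts, sets as lists.
families = [
    {"Father": "Peter", "Children": [{"Name": "Toto", "Sex": "M"},
                                     {"Name": "Zaza", "Sex": "F"}]},
    {"Father": "Paul", "Children": [{"Name": "Mimi", "Sex": "F"}]},
]

def fathers_of_daughters_only(families):
    # { T.Father | Families(T) and for all X in T.Children: X.Sex = F }
    return {t["Father"] for t in families
            if all(x["Sex"] == "F" for x in t["Children"])}

result = fathers_of_daughters_only(families)
```

The inner `all(...)` plays the role of the quantifier ranging over the set-valued attribute. Note that a father with no children qualifies vacuously, as in the logical reading.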
Algebra: powerset operation, unnest/nest
[Figure: three tables over columns Name, Child, Car illustrating unnest/nest on Bob's children (Mimi, Zaza, Lulu) and cars (Mustang, Prius)]
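The nest and unnest operations can be sketched on rows shaped like the ones in the figure. This is a minimal illustration, assuming relations are lists of dicts and nested values are frozensets; the function names `nest` and `unnest` and the three sample rows are choices made for this sketch, not the talk's notation:

```python
from collections import defaultdict

def nest(rows, attr):
    """Group rows agreeing on every other attribute; collect attr into a set."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k != attr))
        groups[key].add(row[attr])
    return [dict(key, **{attr: frozenset(vals)}) for key, vals in groups.items()]

def unnest(rows, attr):
    """Flatten a set-valued attribute: one output row per element of the set."""
    return [dict(row, **{attr: v}) for row in rows for v in sorted(row[attr])]

flat = [
    {"Name": "Bob", "Child": "Mimi", "Car": "Mustang"},
    {"Name": "Bob", "Child": "Zaza", "Car": "Mustang"},
    {"Name": "Bob", "Child": "Lulu", "Car": "Prius"},
]
nested = nest(flat, "Child")      # two rows: {Mimi, Zaza} / Mustang and {Lulu} / Prius
back = unnest(nested, "Child")    # the three flat rows again
```

Unnesting a freshly nested relation gives back the flat relation. The converse does not hold in general: nesting after unnesting may merge rows that were separate in the nested input.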
Results
Equivalence theorem: algebra and logic have same expressive
power
Remark: one can compute TC using algebra/logic (waoh! Cool!)
Also studied: fixpoint, datalog, while…
Complexity: each new level of nesting introduces one more
exponential: n, 2^n, 2^(2^n), …
Need to control the use of powerset
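Both the transitive-closure remark and the cost of powerset show up in one small sketch. It uses a standard characterization, which is not necessarily the construction used in the talk: TC(R) is the intersection of all transitive relations over the domain that contain R, and the candidates are enumerated with a powerset. The helper names are made up for this sketch:

```python
from itertools import chain, combinations, product

def powerset(items):
    items = list(items)
    return chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))

def is_transitive(rel):
    return all((a, d) in rel for (a, b) in rel for (c, d) in rel if b == c)

def transitive_closure(rel, domain):
    """Intersect every transitive relation over `domain` that contains `rel`."""
    all_pairs = list(product(domain, repeat=2))
    closure = set(all_pairs)      # the full relation is transitive and contains rel
    examined = 0
    for subset in powerset(all_pairs):
        examined += 1
        candidate = set(subset)
        if rel <= candidate and is_transitive(candidate):
            closure &= candidate
    return closure, examined

edges = {(1, 2), (2, 3)}
tc, examined = transitive_closure(edges, [1, 2, 3])
```

With a domain of 3 elements the loop already examines 2^9 = 512 candidate relations, which is the kind of blow-up that motivates controlling the use of powerset.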
From complex objects to semistructured data
[Figure: the Families complex-object tree from the complex object model slide, repeated unchanged]
Revolution 1: more flexibility
[Figure: the Families complex-object tree, with an extra Annotations field (value: Trash) added next to Name and Sex on the child Zaza]
Revolution 2: Remove some nodes; name all
[Figure: the Families tree with the labels Family, Child and Car added to the nodes; the second family keeps only Name and Cars (2CV, 1976), and Zaza carries the annotation Ann.: Trash]
Unranked labeled trees
Families
  Family
    Name: Peter
    Children
      Child
        Name: Toto
        Sex: M
      Child
        Name: Zaza
        Sex: F
        Ann.: Trash
    Cars
      Car
        Name: BMW
        Year: 2010
  Family
    Name: Peter
    Cars
      Car
        Name: 2CV
        Year: 1976
This is better adapted to a Web context
Self describing data: No separation between schema and data
Flexibility
Not such a big deal
Maybe the main contribution is the format?
<Families><Family><Name>Peter</Name><Cars><Car><Name>BMW</Name><Year>2010</Year></Car></Cars><Children><Child> …
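"Self describing" can be made concrete: the structure can be read off the data, with no separate schema. A minimal sketch using Python's standard `xml.etree.ElementTree`. The slide's fragment is cut off, so the closing part of the document below (the Child entry and the closing tags) is an assumption filled in from the tree figures, and `label_paths` is a helper invented for this sketch:

```python
import xml.etree.ElementTree as ET

# Completed version of the slide's fragment; everything after <Child> is assumed.
doc = ("<Families><Family><Name>Peter</Name>"
       "<Cars><Car><Name>BMW</Name><Year>2010</Year></Car></Cars>"
       "<Children><Child><Name>Toto</Name><Sex>M</Sex></Child></Children>"
       "</Family></Families>")

def label_paths(elem, prefix=""):
    """Return the set of root-to-node label paths found in the data itself."""
    path = f"{prefix}/{elem.tag}"
    paths = {path}
    for child in elem:
        paths |= label_paths(child, path)
    return paths

root = ET.fromstring(doc)
paths = label_paths(root)
```

The resulting set of paths, such as `/Families/Family/Cars/Car/Year`, is a rough summary of the structure, recovered purely from the document.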
Plus ça change,
plus c’est la même chose
The more things change,
the more they stay the same
What else? The trees are unbounded
[Figure: a tree with root r and two branches, with node labels a$, a, ab, ba]
Like nested relations, trees are unbounded in width
Unlike nested relations, they are unbounded in depth
One can simulate 2-counter machines with 2 branches
• Do applications simulate 2-counter machines with XML documents?
• I am still looking for one
• XML documents are rarely deep
But even for bounded trees there are fun questions: e.g.,
is the equivalence of monadic datalog decidable for
bounded data trees?
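To see why unbounded depth gives this much power, here is one possible encoding, which is not necessarily the one drawn on the slide: each counter is the length of a unary chain hanging under one of the two branches, so increment, decrement and zero-test become local tree updates. The class and function names are hypothetical, and the program run at the end is just a tiny example in the style of a 2-counter machine:

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.children = []

def depth(branch):
    """Counter value = length of the chain below the branch root."""
    n = 0
    while branch.children:
        branch = branch.children[0]
        n += 1
    return n

def inc(branch):
    while branch.children:              # walk to the bottom of the chain
        branch = branch.children[0]
    branch.children.append(Node("a"))   # and extend it by one node

def dec(branch):
    parent = branch                     # precondition: the counter is not zero
    while parent.children[0].children:
        parent = parent.children[0]
    parent.children.pop()               # drop the deepest node

def is_zero(branch):
    return not branch.children

root = Node("r")
c1, c2 = Node("a"), Node("a")
root.children = [c1, c2]
for _ in range(3):
    inc(c1)
# Program: while c1 != 0 do (c1 -= 1; c2 += 2)
while not is_zero(c1):
    dec(c1)
    inc(c2)
    inc(c2)
```

After the run, the first branch is empty and the second has depth 6. With bounded depth the counters would be bounded too, which is why depth is the key difference.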
What else? The trees are ordered
Unranked labeled ordered trees = XML
Order is often painful for optimization
Ignore order: classical optimization
Respect order: totally new ball game; bring in tree automata
Reconcile
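The difference between ignoring and respecting order can be seen on two documents that differ only in sibling order. A minimal sketch, assuming element tags and text are all that matter (attributes are ignored); `ordered` and `canonical` are helper names invented here:

```python
import xml.etree.ElementTree as ET

def ordered(elem):
    """Order-sensitive view: children kept in document order."""
    return (elem.tag, (elem.text or "").strip(),
            tuple(ordered(c) for c in elem))

def canonical(elem):
    """Order-insensitive view: children sorted recursively."""
    return (elem.tag, (elem.text or "").strip(),
            tuple(sorted(canonical(c) for c in elem)))

a = ET.fromstring("<Car><Name>BMW</Name><Year>2010</Year></Car>")
b = ET.fromstring("<Car><Year>2010</Year><Name>BMW</Name></Car>")

same_ignoring_order = canonical(a) == canonical(b)
same_respecting_order = ordered(a) == ordered(b)
```

Under the unordered reading the two cars are the same value, so a rewrite that swaps siblings is safe. Under the ordered reading they are different documents.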
Selling argument is the Web…
The move from relations to trees is interesting
But so is the move from centralized to distributed, and it is much less investigated
Where the fun is:
• Scale is beyond what we thought was thinkable
• Machines are totally autonomous
• Schema replaced by numerous ontologies
• True/false logic replaced by inconsistency, probabilities, trust, belief…
And the trees are evolving (aka Active XML)
An old idea from object databases: mix data and computation
Resorts
  Resort
    Name: Aspen
    State: Colorado
    snowcond
      !Unisys.com/snow("Aspen")
      snow
        Unit: Meter
        Depth: 1
    hotels
      !Yahoo.com/GetHotels(<city name="Aspen"/>)
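The idea of a document that mixes data and calls can be sketched with plain Python values. This is only an illustration under stated assumptions: the `Call` class, the `SERVICES` table and `materialize` are hypothetical, the service is a local stand-in rather than a real network call, and the call is simply replaced by its result, which is a simplification:

```python
class Call:
    """A node standing for a service call embedded in the document."""
    def __init__(self, service, arg):
        self.service, self.arg = service, arg

# Local stand-in for a remote service; nothing is fetched over the network.
SERVICES = {
    "Unisys.com/snow": lambda city: {"snow": {"Unit": "Meter", "Depth": "1"}},
}

def materialize(tree):
    """Replace each call whose service is available by the data it returns."""
    if isinstance(tree, Call):
        fn = SERVICES.get(tree.service)
        return fn(tree.arg) if fn else tree   # unavailable service: keep the call
    if isinstance(tree, dict):
        return {k: materialize(v) for k, v in tree.items()}
    return tree

resort = {
    "Name": "Aspen",
    "State": "Colorado",
    "snowcond": Call("Unisys.com/snow", "Aspen"),
    "hotels": Call("Yahoo.com/GetHotels", "Aspen"),
}
after = materialize(resort)
```

After materialization, `snowcond` holds data while `hotels` is still a pending call, so the same document can be at different stages of evaluation over time.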
And there are cycles
For lack of time, I will not mention the network model [Codasyl 1969]
• The language was purely navigational anyway
If I added references to XML, I would get cycles
[Figure: Person (Name: Adam) and Person (Name: Eve), each with a Spouse reference to the other]
Lots of models for graph data, e.g., IQL
Some fun results: e.g., some copy elimination problem when trying to
obtain a Chandra-Harel completeness for IQL
• Similar issue for unordered trees [recent result with Vianu]
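How references turn a tree into a graph with cycles can be sketched with the Adam and Eve example. The `id` and `ref` attribute names, the People wrapper element and the `follow_spouses` helper are assumptions made for this illustration:

```python
import xml.etree.ElementTree as ET

doc = ('<People>'
       '<Person id="p1"><Name>Adam</Name><Spouse ref="p2"/></Person>'
       '<Person id="p2"><Name>Eve</Name><Spouse ref="p1"/></Person>'
       '</People>')

root = ET.fromstring(doc)
by_id = {p.get("id"): p for p in root.iter("Person")}

def follow_spouses(start_id):
    """Follow Spouse references until a person repeats or the chain ends."""
    seen, names, current = set(), [], start_id
    while current is not None and current not in seen:
        seen.add(current)
        person = by_id[current]
        names.append(person.findtext("Name"))
        spouse = person.find("Spouse")
        current = spouse.get("ref") if spouse is not None else None
    return names, current in seen   # second component: did we hit a cycle?

names, cyclic = follow_spouses("p1")
```

The document itself is still a tree, but following references needs a visited set, exactly as in graph traversal. Without it, the loop above would never terminate.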
Paris C. Kanellakis
Conclusion
Is this a good time to do research on trees in databases?
The best time to plant a tree was 20 years ago.
The next best time is now.
Chinese Proverb
Advertisement
Book on Web data management
to appear at Cambridge University Press
http://webdam.inria.fr/Jorge