Categorical databases David I. Spivak Presented on 2011/01/20

advertisement
Categorical databases
David I. Spivak
dspivak@math.mit.edu
Mathematics Department
Massachusetts Institute of Technology
Presented on 2011/01/20
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
1/1
Introduction
Purpose of the talk
Purpose of the talk
The purpose of this talk is:
To make a clear connection between databases and categories.
To show the advantages brought by this connection:
conceptual clarity,
inclusion of RDF and semi-structured data in the model,
ability to define functorial data migration and schema evolution,
good understanding of views and queries,
integration with PL.
To discuss possible uses of monads in database theory.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
2/1
Introduction
Basic idea
Basic idea
A database schema is a category C.
A state (or “instance”) of C is a functor δ : C → Set.
The power of this idea is in its simplicity.
The model is clear and easy to learn.
Its simplicity invites flexibility and expansion.
It is self-contained — much of database theory fits inside without
ad-hoc extensions.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
3/1
Introduction
Categories
Categories
Idea: A category models objects of a certain sort
and the relationships between them.
g
•A
C :=
i
f
j
•D d
/ •B
$
:•
C
h
$
•E
k
Think of it like a graph: the nodes are objects and the arrows are
relationships.
Some paths can be equated with others (example: j.k = i 3 ).
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
4/1
Introduction
Categories
Formal definition of a category I: data
A category C consists of the following data:
1 A set Ob(C), called the set of objects of C.
Objects x ∈ Ob(C) may be written as •x or simply as x.
2 For each x, y ∈ Ob(C) a set Arr (x, y ), called the set of arrows in C
C
from x to y .
f
An element of ArrC (x, y ) may be written f : x → y or •x −
→ •y .
3 For each x, y , z ∈ Ob(C) a composition law
ArrC (x, y ) × ArrC (y , z) → ArrC (x, z).
We write f ; g = h to denote that following arrow f then arrow g is
the same as following arrow h:
f / y
•
BB
BB
g
B
h BB •x B
•z
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
5/1
Introduction
Categories
Formal definition of a category II: rules
These data must satisfy the following requirements:
1 For every object y ∈ Ob(C) there is an “identity arrow”
idy : y → y
in ArrC (y , y ) such that for any f : x → y , the equations idx ; f = f
and f ; idy = f hold:
f / y
•
BB
BB
idy
B
f BB •x B
•x B
BB
BBf
idx
BB
B
•x f / •y
2
•y
Composition is associative. That is, given
f
g
h
•w −
→ •x −
→ •y −
→ •z ,
the following equation holds:
f ; (g ; h) = (f ; g ); h
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
6/1
Introduction
Categories
The category of Sets
Arguably the most important category is the category of Sets.
I’ll denote it Set.
An object in Set is a set, like ∅, {1, 2, 7}, or N.
An arrow in Set is a total function.
For example there are 9 functions (arrows) from {x, y } to {1, 2, 7}.
The composition law simply takes two “composable” functions
f
g
A−
→B−
→C
and spits out their “composition” f ; g : A → C .
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
7/1
Introduction
Functors: mappings between categories
Functors: mappings between categories
One way to think of a category is as a directed graph, where certain
paths have been declared equivalent.
A functor is a graph morphism that is required to respect these
equivalences.
Definition: A functor F : C → D consists of
A function Ob(F ) : Ob(C) → Ob(D) and
a function Arr(F ) : Arr(C) → Arr(D),
that respect
the source and target of every arrow,
the identity arrow of every node, and
the composition law.
Note that the composition of functors is a functor.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
8/1
Introduction
Recap
Recap
A category C is a system of objects and arrows, and a composition
law.
A functor C → D is a mapping that preserves these structures.
Set is the category whose objects are sets, whose arrows are (total)
functions, and whose composition law is the specification of how
functions compose.
If C is the category on the left below, then a functor δ : C → Set
might look like this:
C :=
A
g
C
David I. Spivak (MIT)
f
/B ;
δ=
•a1 •a2 •a3
(b1 ,b1 ,b2 )
/ •b1 •b2
•c1
Categorical databases
Presented on 2011/01/20
9/1
Introduction
What is a database?
What is a database?
A database consists of a bunch of tables and relationships between
them.
The rows of a table are called “records” or “tuples.”
The columns are called “attributes.”
A column may be “pure data” or may be a “key.”
A table may have “foreign key columns” that link it to other tables.
A foreign key of table A links into the primary key of table B.
A schema may have “business rules.”
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
10 / 1
Introduction
Foreign Keys and business rules
Foreign Keys and business rules
Example:
Id
101
102
103
First
David
Bertrand
Alan
Employee
Last
Hilbert
Russell
Turing
Mgr
103
102
103
Dpt
q10
x02
q10
Id
q10
x02
Department
Name
Sales
Production
Secr
101
102
Note the Id (primary key) columns and the foreign key columns.
Id column could just be internal “row numbers” or could be typed.
“Row numbers” (i.e. pointers) are not part of the relational model but
they are naturally part of the categorical model.
Luckily, they are a part of every implementation (I have been told), so
this model fits better with practice than the relational model does.
Perhaps we should enforce certain integrity constraints (business
rules):
The manager of an employee E must be in the same department as E ,
The secretary of a department D must be in D.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
11 / 1
Introduction
Foreign Keys and business rules
Data columns as foreign keys
Theoretically we can consider a data-type as a 1-column table.
Examples:
String
a
b
Integer
0
1
.
.
.
z
aa
ab
.
.
.
9
10
11
.
.
.
.
.
.
So even data columns can be considered as foreign keys (to respective
1-column tables).
Conclusion: each column in a table is a key – one primary,
the rest foreign.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
12 / 1
Introduction
Foreign Keys and business rules
Example again
Id
101
102
103
First
David
Bertrand
Alan
Employee
Last
Hilbert
Russell
Turing
String
Id
a
b
Mgr
103
102
103
Dpt
q10
x02
q10
Id
q10
x02
Department
Name
Sales
Production
Secr
101
102
.
.
.
z
aa
ab
.
.
.
Mgr;Dpt=Dpt
Secr;Dpt=idDepartment
Mgr
/ Department
l•
l
l
l
lll
Last
lllName
l
l
vll
•Employee
First
o
Dpt
Secr
•String
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
13 / 1
Introduction
Database schema as a category
Database schema as a category
A database schema is a system of tables linked by foreign keys.
This is just a category!
Mgr;Dpt=Dpt
Secr;Dpt=idDepartment
Mgr
C :=
/ Department
l•
l
l
l
lll
Last
lllName
l
l
vll
•Employee
First
o
Dpt
Secr
•String
Each object x in C is a table (Employee, Departments, String);
each arrow x → y is a column of table x.
Id column of a table corresponds to the identity arrow of that object.
Declaring business rules (e.g. Mgr;Dpt=Dpt) is declaring the
composition law.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
14 / 1
Introduction
Schema=Category, Data=Functor
Schema=Category, Data=Functor
Let C be the category
Secr;Dpt=idDepartment
Mgr;Dpt=Dpt
Mgr
C :=
/ •Department
ll
l
l
lll
Last
lllName
vllll
•Employee
First
o
Dpt
Secr
•String
A functor δ : C → Set consists of
A set for each object of C and
a function for each arrow of C, such that
the declared equations hold.
In other words, δ fills the schema with data.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
15 / 1
Introduction
Schema=Category, Data=Functor
Data as a set-valued functor
Id
101
102
103
First
David
Bertrand
Alan
Employee
Last
Hilbert
Russell
Turing
String
Id
a
b
Mgr
103
102
103
Dpt
q10
x02
q10
Department
Name
Sales
Production
Id
q10
x02
Secr’y
101
102
.
.
.
z
aa
ab
.
.
.
C :=
δ : C → Set
m; d = d
s; d = idD
m
m
/ •D
s
ooo
o
o
o
l
ooon
woooo
•
•E
f
o
S
d
101,102,103
f
l
/
l
l
l
lll
llnl
l
l
l
v ll
o
d
q10,x02
s
a,b,. . . ,z,aa,ab,. . .
.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
16 / 1
Introduction
Schema=Category, Data=Functor
Summary of basic idea
C :=
δ : C → Set
m; d = d
s; d = idD
m
m
/ •D
s
ooo
o
o
o
l
ooon
woooo
•
•E
f
o
S
101,102,103
d
f
l
/
l
l
l
l
lll
llln
l
l
l
v l
o
d
q10,x02
s
a,b,. . . ,z,aa,ab,. . .
.
A category C is a schema. Objects of C are tables, arrows of C are
columns.
A functor δ : C → Set fills the tables with compatible data.
For each table x, the set δ(x) is its set of rows.
The composition law in C is enforced by δ as business rules.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
17 / 1
Introduction
What does this model do for you?
What does this model do for you?
The above ideas are hopefully clear, but what are the advantages?
Conceptual clarity is nice; the model is straightforward.
A simple model is a flexible model.
Outline of the rest of the talk:
The Grothendieck construction: incorporating semi-structured data
into the relational model.
Functors between schemas give rise to data migration and views.
Integration of DB and PL via CT.
Application of monads to database theory.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
18 / 1
Semi-structured data and RDF
The Grothendieck construction
The Grothendieck construction
Let C be a category and let δ : C → Set be a functor.
We can convert δ into a category Gr (δ) in a canonical way:
Example:
C :=
A
f
/B ;
δ=
•a1 •a2 •a3
(b1 ,b1 ,b2 )
/ •b1 •b2
g
C
•c1
Gr (δ) is also known as the category of elements of δ:
•a1 •a2 •a3
::
:: •c1
David I. Spivak (MIT)
Categorical databases
,+ •b 3 •b
1
2
Presented on 2011/01/20
19 / 1
Semi-structured data and RDF
The Grothendieck construction
Grothendieck construction applied to database states
Suppose given the following state, considered as δ : C → Set
Id
101
102
103
Id
q10
x02
Employee
First
Last
David
Hilbert
Bertrand
Russell
Alan
Turing
Department
Name
Secr’y
Sales
101
Production
102
s; d = idD
m; d = d
Mgr
103
102
103
Dpt
q10
x02
q10
m
/ •D
oo
o
o
oo
l
ooon
o
o
o
wo
•
•E
C =
f
o
d
s
S
Here is Gr (δ), the category of elements of δ:
d
•101
•102
c
103
m
Gr (δ) =
David I. Spivak (MIT)
•q10
•x02
s
l
f
*
9•
...
•Alan
•Alao
...
...
•Bertranc
•Bertrand
...
•David
...
/ •Hilbert
•Production
•Russell
•Sales
•Turing
...
Categorical databases
w
n
Presented on 2011/01/20
20 / 1
Semi-structured data and RDF
A different perspective on data
A different perspective on data
In fact, the Grothendieck construction of δ : C → Set always yields not
only a category Gr (δ) but a functor
π : Gr (δ) → C.
Gr (δ):=
C :=
d
101
102
•
•
c
8•
•Alan
•Alao
...
...
Bertranc
Bertrand
...
•David
Russell
•
...
Sales
•
•
/ •Hilbert
Turing
•
•Production
w
•
n
...
s; d = idD
m
π
...
•
•
x02
/ •D
|
|
||
||n
l
|
||
|~ |
•E
s
l
q10
−
→
m
f
*
103
m; d = d
f
o
d
s
•S
The fiber over (inverse image of) every object X ∈ C is a set of objects
π −1 (X ) ⊆ Gr (δ). That set is δ(X ).
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
21 / 1
Semi-structured data and RDF
RDF schema and stores
RDF schema and stores
Gr (δ):=
C :=
d
•101
•102
c
8•
103
•x02
•Alan
•Alao
...
•Bertranc
•Bertrand
...
•David
...
/ •Hilbert
•Production
•Russell
•Sales
•Turing
...
w
n
/ •D
|
|
||
||n
l
|
||
~||
•E
s
...
s; d = idD
m
π
...
•q10
−
→
m
l
f
*
m; d = d
f
o
d
s
•S
The relation to RDF triples is clear: each arrow f : x → y in Gr (δ) is
a triple with subject x, predicate f , and object y .
For example (101 department x02), (x02 name Production), etc..
C is the RDF schema and Gr (δ) is the triple store.
SPARQL queries (graph patterns) are easily expressible in this model.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
22 / 1
Semi-structured data and RDF
RDF schema and stores
Relaxing into the RDF perspective
Given C, requiring functors δ : C → Set may be too restrictive.
This is the well-known issue with relational databases.
What properties does the associated functor π : Gr (δ) → C have?
Gr (δ):=
C :=
d
101
102
•
•
c
*
8•
103
...
•Alan
•Alao
...
Bertranc
Bertrand
•
•Russell
•
•
x02
/ •Hilbert
...
•Sales
•Turing
Production
w
n
•
...
/ •D
|
|
|
||
|
l
|n
||
|~ |
•E
−
→
...
•
...
s; d = idD
m
•
s
l
David
q10
π
m
f
m; d = d
f
o
d
s
•S
π is a “discrete op-fibration.”
What if we relax that requirement?
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
23 / 1
Semi-structured data and RDF
Allowing for semi-structured data
Allowing for semi-structured data
We can think of any functor π : D → C as a “semi-state” on C.
Such a functor π can encode incomplete, non-atomic, or bad data.
D :=
/6 •Hello
B
mmmm
m
•102
•Goodbye
•103 6
mmmm
104
•101
•
π↓
C :=
•A
f
/ •B
Row 103 has no data in the f cell, and row 104 has too much.
Bad data (data not conforming to declared composition laws) can also occur
in a functor π : D → C.
Any semi-state on C can be functorially “corrected” to a state if necessary.
Conclusion: category theory flexibly deals with problems – the model
gracefully extends.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
24 / 1
Functorial data migration and querying
States and migration
States and migration
Given a schema C, the category of states on C is denoted C–Set.
The objects of C–Set are functors δ : C → Set.
The morphisms are natural transformations.
C–Set is a topos; it has an internal language and logic supporting the
typed lambda calculus.
A morphism between schemas is a functor F : C → D.
Given such a morphism, we want to be able to canonically transfer
states on C to states on D and vice versa.
That means, given F : C → D we want functors
C–Set → D–Set
and
D–Set → C–Set.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
25 / 1
Functorial data migration and querying
The “easy” migration functor, ∆
The “easy” migration functor, ∆
Given a morphism of schemas (i.e. a functor)
F : C → D,
we can transform states on D to states on C as follows:
Given δ : D → Set
C
F
/D
δ
/ Set
6
get F ;δ : C → Set
F ;δ
Thus we have a functor ∆F : D–Set → C–Set.
∆F basically operates by “re-indexing.” Using it, one can duplicate or
drop tables or columns.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
26 / 1
Functorial data migration and querying
The “harder” migration functors, Σ and Π
The “harder” migration functors, Σ and Π
Given a morphism of schemas (i.e. a functor) F : C → D,
the functor ∆F : D–Set → C–Set has two adjoints:
a left adjoint ΣF : C–Set → D–Set, and
a right adjoint ΠF : C–Set → D–Set,
C–Set o
ΣF
∆F
/
D–Set o
∆F
ΠF
/
C–Set
Thus, given a morphism F of schemas, these three functors,
∆F , ΣF , and ΠF
allow one to move data back and forth between C and D in canonical
ways.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
27 / 1
Functorial data migration and querying
Views as polynomial functors
Views as polynomial functors
These functors can be arbitrarily composed to create views.
B:=
x GG
xx •man SSGGGG
x
SS) #
|xxullll
•name iR
5 •; address
bFFRR
FF woman kkwkww
FF•
w
ww
•boy
•girl
•street
C:=
D:=
x FF
xx •man SSFFF
x
SS)F#
|xxukkkk
•name iS
5 •< street
bFFSS
FF woman kkxkxx
FF•
x
xx
•boy
F
←
−
G
−
→
s
ysss
•name
eKKK
K
•male
•female
KKK
K%
•street
s9
sss
•girl
↓H
city
•
E:=
•
s O KKKK
%
ysss
street
•name e
KKK •Q Xss9 •
K ss
male
•female
Given a state γ : B → Set, what is ΠH ◦ ΣG ◦ ∆F (γ) : E → Set?
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
28 / 1
Functorial data migration and querying
Views as polynomial functors
A simple “SELECT” query
SELECT title, isbn
FROM book
WHERE price > 100
D :=
C :=
/ •String
JJ
Jisbn
price JJ
J$
•book
•R>100
/
title
•R
•isbn-num
E :=
/
•W
•R>100
/ •String
JJ
Jisbn
price JJ
J$
•book
F
−
→
X
/
•R
title
•W
G
←
−
•isbn-num
title
/ String
HH •
HHisbn
HH
$
•isbn-num
∆G ◦ ΠF is the appropriate view.
For any δ : C → Set, we materialize the view as ∆G ◦ ΠF (δ).
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
29 / 1
Category theory unites Databases and Programming Languages
The category of types as a database schema
The category of types as a database schema
Category theory has greatly benefited PL theory, and functional
languages in particular.
In the category Types, the objects are types and the arrows are
computable functions.
How does this correspond with the database schema idea?
For each type T , imagine a table.
Its rows are the values of T ;
it’s columns are functions T → T 0 with domain T ;
each cell contains the evaluation of a function on an input.
Types is a canonical database schema with a canonical state.
An ordinary database schema plays the same role, except now the
types and functions are user-defined.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
30 / 1
Category theory unites Databases and Programming Languages
The category of types as a database schema
Functors can import Haskell data types into schemas
Recall that any functor F : C → D induces data-migration functors
∆F , ΣF , ΠF between states on C and states on D.
Thinking of Types as a database schema with a canonical state, any
functor C → Types will induce a theoretical database state on C.
We can then compare an arbitrary database state on C with this
theoretical one using morphisms in C–Set.
There is not time to go into it here, but hopefully it is clear that such
functors allow us to import Haskell data-types and functions into an
ordinary database.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
31 / 1
Category theory unites Databases and Programming Languages
Monads on database states
Monads on database states
Given a category X , a monad on X consists of
a functor M : X → X ,
a natural transformation η : idX → M (called “return”), and
a natural transformation µ : M 2 → M (called “join”).
These are subject to some rules....
Example in X =Types: List, Maybe, State, ....
It is often useful to consider maps a → Mb, where a and b are objects
in X .
If C is a database schema, there are many interesting monads on
X := C–Set.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
32 / 1
Category theory unites Databases and Programming Languages
Sources of monads on C–Set
Sources of monads on C–Set
Given any monad M : Set → Set on the category of Sets, there is an
induced monad on C–Set.
Thus every Haskell monad, like List and Maybe, has a
“table-by-table” analogue for databases. Can you imagine it?
Given any functor F : C → D, the data-migration adjunction
(ΣF , ∆F ) induces a monad on C and a comonad on D.
Similarly, given F : C → D, the other data-migration adjunction
(∆F , ΠF ) induces a monad on D and a comonad on C.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
33 / 1
Category theory unites Databases and Programming Languages
Examples of monads on C–Set
Examples of monads on C–Set
Fix a database state δ : C → Set, and a table T ∈ Ob(C).
We will describe the relationship between table δ(T ) and table Mδ(T ) for
various monads M.
If M = Maybe, then there will be one more row in Mδ(T ) than in
δ(T ) (called Nothing), and its attribute in every column will be
Nothing as well. The rest of the table will remain exactly the same.
If M = List, then there will be a row of Mδ(T ) for every list of rows
in δ(T ). The f attribute of [x1 , . . . , xn ] will be [f (x1 ), . . . , f (xn )].
The monad arising on •
• from the functor
•
sends •A
•B to •A∪B
• → •
•A∪B .
In each case, it is worthwhile to imagine a map δ → M.
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
34 / 1
Category theory unites Databases and Programming Languages
Examples of monads on C–Set
Algebras for a monad
An algebra for a monad M is a natural transformation M → idX
satisfying some rules.
Algebras for a monad on Types are sometimes interesting:
An algebra List(A) → A “folds” some binary associative function onto
a list of A’s.
An algebra Maybe(A) → A chooses a “default value” in A.
An algebra for a monad M on C–Set may also be interesting.
If M is induced by a monad on Set (e.g. List or Maybe), then it’s a
table-by-table version of the above.
If M is induced by data migration functors; i.e. from F : C → D, then
M-algebras have a different flavor (not “table-by-table”).
These M-algebras “emulate D–Set inside C–Set”. In fact, if the
migration-functor is monadic then the category of M-algebras is
equivalent to D–Set, even though it takes place entirely inside C–Set!
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
35 / 1
Conclusion
Summary
Summary
I hope the connection between databases and categories is clear.
Id
101
102
103
Id
q10
x02
Employee
First
Last
David
Hilbert
Bertrand
Russell
Alan
Turing
Department
Name
Secr
Sales
101
Production
102
m; d = d
Mgr
103
102
103
Dpt
q10
x02
q10
s; d = idD
m
/ •D
oo
o
o
oo
l
ooon
o
o
o
wo
•
•E
f
o
d
s
S
I discussed how one can use this connection to facilitate:
formalizing views as polynomial functors;
merging relational and RDF outlooks (NoSQL);.
merging database and PL theory;
adding cool tricks to databases using monads.
The main point is that this is a self-contained, unified approach.
With a clear formalism of databases, one might be able to apply proof
assistants in exciting new ways (e.g. as query-optimizers).
Thanks for the invitation to speak!
David I. Spivak (MIT)
Categorical databases
Presented on 2011/01/20
36 / 1
Download