Query Languages for MultiModel Databases
Qingsong Guo
UDBMS Group, University of Helsinki
2020.04.09
Outline
1. Data Models
2. Query Languages for MMDBMS

1. Various Data Models
1. Structured data: SQL
2. Semi-structured data:
3. Graph data:
Traits for Database Query Languages
• Declarative rather than procedural
• We have developed
– fantastic algorithms
– great systems
– clever query processing and storage strategies
– a lot of brilliant stuff
• DBMSs have become fast
– on TPC-C, we can currently execute ~half a million transactions per second
– for simple index lookups, e.g., in a hash table, we are currently at 20 million operations per second
1. Relational Model and Languages
1. Structured data: SQL
2. Semi-structured data:
3. Graph data:
Relational Model
• A relational database consists of relations
– Each row defines a relationship between a set of values
– Table (analogy with relation)
– Schema of a relation
Example Instances
• "Sailors" and "Reserves" relations for our examples.
"bid" = boats, "sid" = sailors.
• We'll use positional or named field notation, and assume that names of fields in query results are 'inherited' from names of fields in query input relations.

R1 (Reserves):
sid | bid | day
22  | 101 | 10/10/96
58  | 103 | 11/12/96

S1 (Sailors):
sid | sname  | rating | age
22  | dustin | 7      | 45.0
31  | lubber | 8      | 55.5
58  | rusty  | 10     | 35.0

S2 (Sailors):
sid | sname  | rating | age
28  | yuppy  | 9      | 35.0
31  | lubber | 8      | 55.5
44  | guppy  | 5      | 35.0
58  | rusty  | 10     | 35.0
Relational Model
• E. F. Codd, A Relational Model of Data for Large Shared Data Banks, Communications of the ACM 13(6), June 1970
– Set-based (relational) query language
– Normalization: avoid redundancy
– Data independence: between the logical and physical level
Formal Relational Query Languages
• Two mathematical query languages form the basis for "real" languages (e.g., SQL) and for implementation:
– Relational Algebra: more operational (procedural), and typically used as the internal representation for query evaluation plans.
– Relational Calculus: lets users describe what they want, rather than how to compute it. (Non-operational, declarative.)
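To make the contrast concrete, here is one query over the Sailors/Reserves instances above, first in relational algebra and then in SQL (a hedged sketch; only schema elements from the example tables are used):

π_sname( σ_bid=103(Reserves) ⨝ Sailors )

SELECT S.sname
FROM Sailors S, Reserves R
WHERE S.sid = R.sid AND R.bid = 103

Both return the names of sailors who reserved boat 103; the algebra form spells out an evaluation order, while the SQL form leaves that choice to the optimizer.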
2. Semistructured Data Model
1. Structured data: SQL
2. Semi-structured data:
3. Graph data:
Semistructured Data: Past and Present
Richer (than relational) type systems
• Object-Oriented data model
• Nested Relational data model
Schemaless and schema-optional data
• Schemaless labelled graphs
• more on graphs in the second part
• XML as labelled trees
Scalable nested & semistructured formats
• (schemaless) JSON
• (machine-oriented, columnar) Parquet, ORC, …
• Google's Protocol Buffers, …
Semistructured Data
Can be thought of as the SQL data model with extensions and restriction removals:
- complex types: arrays, (nested) tuples, maps
- a rigid schema is not necessary
XML and labelled trees…
- also allow modeling of complex types
- the lack of distinction between tuple and list complicates typical querying use cases
Nested & Semistructured Query Languages
Goal: declarative + expressive power + semistructured data.
Be like SQL – but for nested & semistructured data.
The string approach to extending databases with new types of data
• Treat the nested and semistructured data as strings of a particular format, e.g., JSON
• Introduce UDFs to process such strings
- A JSON-encoded tuple and an SQL tuple are two different things, even if identical in content
=> We will not deal further with this approach
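For flavor only, a hedged sketch of the string approach (json_extract is a stand-in name for whatever string-processing UDF a given system would offer, not a specific product's API):

SELECT json_extract(doc, '$.location') AS location
FROM docs
WHERE json_extract(doc, '$.readings[0].ozone') IS NOT NULL

Every navigation step is an opaque function call over a string, so the optimizer sees none of the structure – one more reason the approach is set aside here.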
SQL++
• SQL++: a backwards-compatible SQL that can access SQL data extended with nested and semistructured data
– Queries exhibit XQuery and OQL abilities, yet remain backwards compatible with SQL-92
• Surveying the multiple different languages and their semantic options with Configurable SQL++
– Options modify the semantics of the language in order to morph it into one of the surveyed languages
SQL++: http://arxiv.org/abs/1405.3631
http://db.ucsd.edu/wp-content/uploads/pdfs/375.pdf
SQL++ Data Model
• Extend SQL with arrays + nesting + heterogeneity
Following mostly JSON’s notation but applicable to any
format that can be abstracted into these concepts
{
'location': 'Alpine',
'readings': [
{
'time': timestamp('2014-03-12T20:00:00'),
'ozone': 0.035,
'no2': 0.0050
},
{
'time': timestamp('2014-03-12T22:00:00'),
'ozone': 'm',
'co': 0.4
} ]
}
• Array nested inside a tuple
• Heterogeneous tuples in collections
• Arbitrary compositions of array, bag, tuple
SQL++ Data Model
• Extend SQL with arrays + nesting + heterogeneity
+ permissive structures
[
[5, 10, 3],
[21, 2, 15, 6],
[4, {lo: 3, exp: 4, hi: 7}, 2, 13, 6]
]
• Arbitrary
compositions of
array, bag, tuple
SQL++ Data Model
Can also be thought of as an extension of JSON
• Use single quotes for literals
• Extend JSON with bags and enriched types
{
'location': 'Alpine',
'readings': {{
{
'time': timestamp('2014-03-12T20:00:00'),
'ozone': 0.035,
'no2': 0.0050
},
{
'time': timestamp('2014-03-12T22:00:00'),
'ozone': 'm',
'co': 0.4
} }}
}
(For presentation simplicity, we'll omit the '…' quotes from here on.)
• Bags {{ … }}
• Enriched types
SQL++ Data Model
• With schema
• Or, schemaless
• Or, partial schema

Value:
{
location: 'Alpine',
readings: {{
{
time: timestamp('2014-03-12T20:00:00'),
ozone: 0.035,
no2: 0.0050
},
{
time: timestamp('2014-03-12T22:00:00'),
ozone: 'm',
co: 0.4
} }}
}

Partial schema:
{
location: string,
readings: {{
{
time: timestamp,
ozone: any,
*
}
}}
}
Comparisons
• OQL without schema
• Simpler than XML and the XQuery data model
• Unlike labeled trees (the favorite XML abstraction of XPath and XQuery research), it makes the distinction between the tuple constructor and the list/array/bag constructor
From SQL to SQL++
• Goal: queries exhibit XQuery and OQL abilities, yet remain backwards compatible with SQL-92
• Backwards-compatible with SQL because it is
– easy to teach
– easy to extend the functionality of SQL systems
Backwards Compatibility with SQL
readings : {{
{ sid: 2, temp: 70.1 },
{ sid: 2, temp: 49.2 },
{ sid: 1, temp: null }
}}

Find sensors that recorded a temperature below 50:

SELECT DISTINCT r.sid
FROM readings AS r
WHERE r.temp < 50

Result:
{{
{ sid: 2 }
}}
From SQL to SQL++: What are the minimum changes needed?
• Goal: queries that input/output any complex type, yet remain backwards compatible with SQL
• Adjust SQL syntax/semantics for heterogeneity and complex input/output values
• Achieved with a few extensions and mostly by removing SQL limitations
– OQL influence
• Minimal SQL++ core
– SQL and the full SQL++ are syntactic sugar over a small SQL++ core
From Tuple Variables to Element Variables
readings : [ 1.3, 0.7, 0.3, 0.8 ]

SELECT r AS co
FROM readings AS r
WHERE r < 1.0
ORDER BY r DESC
LIMIT 2

Find the highest two sensor readings that are below 1.0:
[
{ co: 0.8 },
{ co: 0.7 }
]
Semantics: Bindings to Element Variables
readings : [ 1.3, 0.7, 0.3, 0.8 ]

FROM readings AS r
B_out(FROM) = B_in(WHERE) = {{ ⟨ r : 1.3 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}

WHERE r < 1.0
B_out(WHERE) = B_in(ORDER BY) = {{ ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}

ORDER BY r DESC
B_out(ORDER BY) = B_in(LIMIT) = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩ ]

LIMIT 2
B_out(LIMIT) = B_in(SELECT) = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩ ]

SELECT r AS co
[
{ co: 0.8 },
{ co: 0.7 }
]
Environment ⊢ Query → Result
ℾ = ⟨
scale: 2,
readings : [ 1.3, 0.7, 0.3, 0.8 ]
⟩

FROM readings AS r
B_out(FROM) = B_in(WHERE) = {{ ⟨ r : 1.3 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}

WHERE r < 1.0
B_out(WHERE) = B_in(ORDER BY) = {{ ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}

ORDER BY r DESC
B_out(ORDER BY) = B_in(LIMIT) = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩ ]

LIMIT 2
B_out(LIMIT) = B_in(SELECT) = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩ ]

SELECT ELEMENT r * scale
[
1.6,
1.4
]
SQL++ vs. SQL++ Core
SQL++:
SELECT r AS co
FROM readings AS r
WHERE r < 1.0
ORDER BY r DESC
LIMIT 2

SQL++ Core:
SELECT ELEMENT {'co': r}
FROM readings AS r
WHERE r < 1.0
ORDER BY r DESC
LIMIT 2
No need for homogeneity of bindings
Environment ⊢ Query → Result
ℾ = ⟨
readings : {{
0.3,
{ low : 0.2, high: 0.4 },
null,
'C'
}}
⟩

FROM readings AS r
b1 = ⟨ r : 0.3 ⟩
b2 = ⟨ r : { low : 0.2, high: 0.4 } ⟩
b3 = ⟨ r : null ⟩
b4 = ⟨ r : 'C' ⟩

SELECT ELEMENT r
{{
0.3,
{ low : 0.2, high: 0.4 },
null,
'C'
}}
Correlated Subqueries Everywhere
sensors : [
[1.3, 2],
[0.7, 0.7, 0.9],
[0.3, 0.8, 1.1],
[0.7, 1.4]
]

SELECT r AS co
FROM sensors AS s,
     s AS r
WHERE r < 1.0
ORDER BY r DESC
LIMIT 2

Find the highest two sensor readings that are below 1.0:
[
{ co: 0.9 },
{ co: 0.8 }
]
Correlated Subqueries Everywhere
FROM sensors AS s,
     s AS r

B_out(FROM) = B_in(WHERE) = {{
⟨ s : [1.3, 2], r : 1.3 ⟩,
⟨ s : [1.3, 2], r : 2 ⟩,
⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
⟨ s : [0.7, 0.7, 0.9], r : 0.9 ⟩,
⟨ s : [0.3, 0.8, 1.1], r : 0.3 ⟩,
⟨ s : [0.3, 0.8, 1.1], r : 0.8 ⟩,
⟨ s : [0.3, 0.8, 1.1], r : 1.1 ⟩,
⟨ s : [0.7, 1.4], r : 0.7 ⟩,
⟨ s : [0.7, 1.4], r : 1.4 ⟩
}}

WHERE r < 1.0
ORDER BY r DESC
LIMIT 2
SELECT r AS co
Correlated Subqueries Everywhere
FROM sensors AS s,
     s AS r
WHERE r < 1.0

B_out(WHERE) = B_in(ORDER BY) = {{
⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
⟨ s : [0.7, 0.7, 0.9], r : 0.9 ⟩,
⟨ s : [0.3, 0.8, 1.1], r : 0.3 ⟩,
⟨ s : [0.3, 0.8, 1.1], r : 0.8 ⟩,
⟨ s : [0.7, 1.4], r : 0.7 ⟩
}}

ORDER BY r DESC
LIMIT 2
SELECT r AS co
Correlated Subqueries Everywhere
FROM sensors AS s,
     s AS r
WHERE r < 1.0
ORDER BY r DESC

B_out(ORDER BY) = B_in(LIMIT) = [
⟨ s : [0.7, 0.7, 0.9], r : 0.9 ⟩,
⟨ s : [0.3, 0.8, 1.1], r : 0.8 ⟩,
⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
⟨ s : [0.7, 1.4], r : 0.7 ⟩,
⟨ s : [0.3, 0.8, 1.1], r : 0.3 ⟩
]

LIMIT 2
B_out(LIMIT) = B_in(SELECT) = [
⟨ s : [0.7, 0.7, 0.9], r : 0.9 ⟩,
⟨ s : [0.3, 0.8, 1.1], r : 0.8 ⟩
]

SELECT r AS co
[
{ co: 0.9 },
{ co: 0.8 }
]
Composability: A Powerful Idea from Functional Programming Languages
• Variables generate environments within which queries operate
• The initial environment provides named top-level values
• The OQL legacy: when evaluating a subquery, names and variables behave no differently
Constructing Nested Results
Discussion of Environments
ℾ0 = ⟨
sensors : {{
{sensor: 1},
{sensor: 2}
}},
logs: {{
{sensor: 1, co: 0.4},
{sensor: 1, co: 0.2},
{sensor: 2, co: 0.3}
}}
⟩

FROM sensors AS s
B_out(FROM) = B_in(SELECT) = {{
⟨ s : {sensor: 1} ⟩,
⟨ s : {sensor: 2} ⟩
}}

SELECT
s.sensor AS sensor,
(
SELECT l.co AS co
FROM logs AS l
WHERE l.sensor = s.sensor
) AS readings

Result:
{{
{
sensor : 1,
readings: {{ {co:0.4}, {co:0.2} }}
},
{
sensor : 2,
readings: {{ {co:0.3} }}
}
}}
Constructing Nested Results
ℾ1 = ⟨
s : {sensor : 1},
sensors : {{
{sensor: 1},
{sensor: 2}
}},
logs: {{
{sensor: 1, co: 0.4},
{sensor: 1, co: 0.2},
{sensor: 2, co: 0.3}
}}
⟩

FROM logs AS l
B_out(FROM) = {{
⟨ l : {sensor: 1, co: 0.4} ⟩,
⟨ l : {sensor: 1, co: 0.2} ⟩,
⟨ l : {sensor: 2, co: 0.3} ⟩
}}

WHERE l.sensor = s.sensor
B_out(WHERE) = B_in(SELECT) = {{
⟨ l : {sensor: 1, co: 0.4} ⟩,
⟨ l : {sensor: 1, co: 0.2} ⟩
}}

SELECT l.co AS co
{{
{co:0.4},
{co:0.2}
}}
How to deal with order: Create order vs. presume order
• SQL assumes bags as inputs but may output lists by using ORDER BY
• This is not enough when the input has order
• XQuery adopted order-preservation semantics
• e.g., filtering a list produces a list where the qualified elements retain their original order
• This often becomes a burden, e.g., when an order must be dictated for join results
Extending SQL with Utilization of Input Order
sensors : [
[1.3, 2],
[0.7, 0.7, 0.9],
[0.3, 0.8, 1.1],
[0.7, 1.4]
]
(x = position of r within the inner array s; y = position of s within sensors)

SELECT r AS co,
       x AS x,
       y AS y
FROM sensors AS s AT y,
     s AS r AT x
WHERE r < 1.0
ORDER BY r DESC
LIMIT 2

Find the highest two sensor readings that are below 1.0:
[
{ co: 0.9, x: 3, y: 2 },
{ co: 0.8, x: 2, y: 3 }
]
Order Preservation by Ordering According to the Input Order
sensors : [
[1.3, 2],
[0.7, 0.7, 0.9],
[0.3, 0.8, 1.1],
[0.7, 1.4]
]

SELECT r AS co,
       x AS x,
       y AS y
FROM sensors AS s AT y,
     s AS r AT x
WHERE r < 1.0
ORDER BY y ASC, x ASC
LIMIT 2

Find the first two sensor readings that are below 1.0:
[
{ co: 0.7, x: 1, y: 2 },
{ co: 0.7, x: 2, y: 2 }
]
(What if) Automated Order Preservation
SELECT r AS co
FROM sensors AS s,
     s AS r
WHERE r < 1.0
ORDER BY PRESERVATION
LIMIT 2

…as shorthand for:

SELECT r AS co
FROM sensors AS s AT y,
     s AS r AT x
WHERE r < 1.0
ORDER BY y ASC, x ASC
LIMIT 2
Iterating over Attribute-Value Pairs
Constructing tuples with a data-dependent number of attribute-value pairs:

ℾ ⊢ ⟨ readings : {{
{ no2: 0.6, co: 0.7, co2: [0.5, 2] },
{ no2: 0.5, co: [0.4], co2: 1.3 }
}} ⟩

Q:
FROM readings AS r
SELECT ELEMENT (
FROM r AS {g:v}
WHERE g LIKE "co%"
SELECT ATTRIBUTE g:v )

↓

V: {{
{ co: 0.7, co2: [0.5, 2] },
{ co: [0.4], co2: 1.3 }
}}
INNER vs. OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN
ℾ = ⟨
readings : {
co : [],
no2: [0.7],
so2: [0.5, 2]
}
⟩

FROM readings AS {g:v},
     v AS n

Bindings: {{
⟨ g : 'no2', v : [0.7], n : 0.7 ⟩,
⟨ g : 'so2', v : [0.5, 2], n : 0.5 ⟩,
⟨ g : 'so2', v : [0.5, 2], n : 2 ⟩
}}

SELECT ELEMENT
{'gas': g,
 'num': n }

[
{ gas: 'no2', num: 0.7 },
{ gas: 'so2', num: 0.5 },
{ gas: 'so2', num: 2 }
]
INNER vs. OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN
ℾ = ⟨
readings : {
co : [],
no2: [0.7],
so2: [0.5, 2]
}
⟩

FROM readings AS {g:v}
INNER CORRELATE v AS n

Bindings: {{
⟨ g : "no2", v : [0.7], n : 0.7 ⟩,
⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩,
⟨ g : "so2", v : [0.5, 2], n : 2 ⟩
}}

SELECT ELEMENT
{'gas': g,
 'num': n }

[
{ gas: 'no2', num: 0.7 },
{ gas: 'so2', num: 0.5 },
{ gas: 'so2', num: 2 }
]
INNER vs. OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN
ℾ = ⟨
readings : {
co : [],
no2: [0.7],
so2: [0.5, 2]
}
⟩

FROM readings AS {g:v}
OUTER CORRELATE v AS n

Bindings: {{
⟨ g : "co", v : [], n : missing ⟩,
⟨ g : "no2", v : [0.7], n : 0.7 ⟩,
⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩,
⟨ g : "so2", v : [0.5, 2], n : 2 ⟩
}}

SELECT ELEMENT
{ gas: g,
  num: n }

[
{ gas: 'co', num: null },
{ gas: 'no2', num: 0.7 },
{ gas: 'so2', num: 0.5 },
{ gas: 'so2', num: 2 }
]
An OQLish GROUP BY, at a deeper level than SQL's
Environment ⊢ Query → Result
ℾ = ⟨
co_readings : {{
{ sensor: 1, co : 0.3 },
{ sensor: 1, co : 0.5 },
{ sensor: 2, co : 0.6 }
}}
⟩

Query 1 (Core SQL++):

FROM co_readings AS c
GROUP BY c.sensor AS s; GROUP AS grp
SELECT ELEMENT
{
sensor: s,
readings:
( SELECT ELEMENT g.c.co
FROM grp AS g ),
number: count(grp),
average: avg(
SELECT ELEMENT g.c.co
FROM grp AS g ) }

Query 2 (with SQL++ syntactic sugar):

FROM co_readings AS c
GROUP BY c.sensor
SELECT
c.sensor AS sensor,
( SELECT ELEMENT g.c.co
FROM group AS g ) AS readings,
count(*) AS number,
avg(c.co) AS average

Bindings after FROM:
b1 = ⟨ c : {sensor: 1, co: 0.3} ⟩
b2 = ⟨ c : {sensor: 1, co: 0.5} ⟩
b3 = ⟨ c : {sensor: 2, co: 0.6} ⟩

Bindings after GROUP BY:
b'1 = ⟨ s : 1, grp : {{ { c : {sensor: 1, co: 0.3} },
{ c : {sensor: 1, co: 0.5} } }} ⟩
b'2 = ⟨ s : 2, grp : {{ { c : {sensor: 2, co: 0.6} } }} ⟩

Result:
{{
{
sensor : 1,
readings: {{ 0.3, 0.5 }},
number : 2,
average : 0.4
},
{
sensor : 2,
readings: {{ 0.6 }},
number : 1,
average : 0.6
}
}}
Like OQL (and unlike SQL), the SELECT clause semantics make sense just by knowing the incoming variables (and environment)

ℾ ⊢ ⟨ sensors : {{ { sensor: "s1", reading: {co: 0.7, no2: 2 } },
{ sensor: "s2", reading: {co: 0.5, so2: 1.3} } }} ⟩

Q:
FROM sensors AS s
INNER CORRELATE s.reading AS {g:v}
GROUP BY g AS x
SELECT ELEMENT {
gas: x,
reading: (
FROM group AS p
SELECT ATTRIBUTE
p.s.sensor : p.v )}

Bindings after FROM/CORRELATE:
{{ ⟨ s : { sensor:"s1", reading:{co:0.7,no2:2} }, g : "co", v : 0.7 ⟩,
⟨ s : { sensor:"s1", reading:{co:0.7,no2:2} }, g : "no2", v : 2 ⟩,
⟨ s : { sensor:"s2", reading:{co:0.5,so2:1.3} }, g : "co", v : 0.5 ⟩,
⟨ s : { sensor:"s2", reading:{co:0.5,so2:1.3} }, g : "so2", v : 1.3 ⟩ }}

Bindings after GROUP BY:
{{ ⟨ x : "co", group : {{
{ s:{ sensor:"s1",reading:{co:0.7,no2:2} }, g:"co", v:0.7 },
{ s:{ sensor:"s2",reading:{co:0.5,so2:1.3} }, g:"co", v:0.5 } }} ⟩,
⟨ x : "no2", group : {{
{ s:{ sensor:"s1",reading:{co:0.7,no2:2} }, g:"no2", v:2 } }} ⟩,
⟨ x : "so2", group : {{
{ s:{ sensor:"s2",reading:{co:0.5,so2:1.3} }, g:"so2", v:1.3 } }} ⟩ }}

↓

V: {{ { gas: "co", reading: { s1: 0.7, s2: 0.5 } },
{ gas: "no2", reading: { s1: 2 } },
{ gas: "so2", reading: { s2: 1.3 } } }}
Summing Up: The SQL++ Changes to SQL
• Goal: queries that input/output any JSON type, yet remain backwards compatible with SQL
• Adjust SQL syntax/semantics for heterogeneity and complex values
• Achieved with a few extensions and by removing SQL limitations
– Element variables
– Correlated subqueries (in FROM and SELECT clauses)
– Optional utilization of input order
– Grouping independent of aggregation functions
– … and a few more details
What SQL aspects have been omitted
• SQL's *
– Could be done by ranging over attributes
– It is easily replaced by the ability to project full tuples (or full anythings) in SELECT
– An optimizable implementation needs a schema
• Consider the view V = SELECT * FROM R r, S s, T t WHERE …
• How do you push a condition in
– SELECT … FROM V v WHERE v.foo = something?
• Selected SQL operations (UNION) that rely on attributes having a certain order in their tuples
• SQL's coercion of subqueries
Operations in the absence of information and/or absence of typing
What if, in
• FROM coll AS v
– coll is not a bag/array (e.g., it is a tuple or a scalar)?
• SELECT x.foo AS bar
– x has no foo?
• x.foo = x.bar
– the two sides have different types, or one side is missing?
• For good or bad, many options have been taken!
SQL++ Removes Superficial Differences
• Removing the current "Tower of Babel" effect
• Providing formal syntax and semantics

SQL:
SELECT AVG(temp) AS tavg
FROM readings
GROUP BY sid

MongoDB:
db.readings.aggregate(
{$group: {_id: "$sid",
tavg: {$avg:"$temp"}}})

Jaql:
readings -> group by sid = $.sid
into { tavg: avg($.temp) };

JSONiq:
for $r in collection("readings")
group by $r.sid
return { tavg: avg($r.temp) }

Pig:
a = LOAD 'readings' AS
(sid:int, temp:float);
b = GROUP a BY sid;
c = FOREACH b GENERATE
AVG(temp);
DUMP c;
Surveyed features
15 feature matrices (1-11 dimensions each) classifying:
• Data values
• Schemas
• Access and construct nested data
• Missing information
• Equality semantics
• Ordering semantics
• Aggregation
• Joins
• Set operators
• Extensibility
Methodology
For each feature:
1. A formal definition of the feature in SQL++
2. A SQL++ example
3. A feature matrix that classifies each dimension of the
feature
4. A discussion of the results, partial support and unexpected
behaviors
Example: Data values
1. SQL++ example:

{
location: 'Alpine',
readings: [
{
time: timestamp('2014-03-12T20:00:00'),
ozone: 0.035,
no2: 0.0050
},
{
time: timestamp('2014-03-12T22:00:00'),
ozone: 'm',
co: 0.4
} ]
}
Example: Data values
2. SQL++ BNF for values: [figure omitted]
Example: Data values
3. Feature matrix:

Language  | Composability (top-level values) | Heterogeneity | Arrays  | Bags    | Sets    | Maps    | Tuples | Primitives
Hive      | Bag of tuples                    | No            | Yes     | No      | No      | Partial | Yes    | Yes
Jaql      | Any value                        | Yes           | Yes     | No      | No      | No      | Yes    | Yes
Pig       | Bag of tuples                    | Partial       | No      | Partial | No      | Partial | Yes    | Yes
CQL       | Bag of tuples                    | No            | Partial | No      | Partial | Partial | No     | Yes
JSONiq    | Any value                        | Yes           | Yes     | No      | No      | No      | Yes    | Yes
MongoDB   | Bag of tuples                    | Yes           | Yes     | No      | No      | No      | Yes    | Yes
N1QL      | Bag of tuples                    | Yes           | Yes     | No      | No      | No      | Yes    | Yes
SQL       | Bag of tuples                    | No            | No      | No      | No      | No      | No     | Yes
AQL       | Any value                        | Yes           | Yes     | Yes     | No      | No      | Yes    | Yes
BigQuery  | Bag of tuples                    | No            | No      | No      | No      | No      | Yes    | Yes
MongoJDBC | Bag of tuples                    | Yes           | Yes     | No      | No      | No      | Yes    | Yes
SQL++     | Any value                        | Yes           | Yes     | Yes     | Partial | Yes     | Yes    | Yes
The SQL/JSON way of incorporating access to JSON files and JSON columns
For those who really want their FROM clause to only have collections of tuples:
• UDFs and constructs for turning the file into a table
– FROM … json(file, pattern) AS j … WHERE c AND c'
– assuming the goal is relational results
– the pattern can escalate to a full-blown SQL++ FROM
– and the WHERE conditions c need to be explicitly pushed into the json computations
• think of a CORRELATE that has both structural and value-based conditions
• Piggybacking on the nested features of SQL for JSON columns
– FROM … json(t.jsoncol, pattern) AS j …
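Purely as an illustration of the idea (the json(...) table function and the pattern notation are placeholders following the slide's sketch, not any specific product's syntax):

SELECT j.sid, j.temp
FROM json('readings.json', '{sid, temp}') AS j
WHERE j.temp < 50

The file is exposed to the FROM clause as a collection of flat tuples, so everything downstream of FROM is plain SQL.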
From Language to Algebra, as an extension of the Relational Algebra
• Operator SCAN for scanning FROM collections into bindings
• Operators for INNER CORRELATE, OUTER CORRELATE
• Operator for evaluating nested subqueries
• Operator for constructing (as needed in SELECT)
• Operators for picking the bindings that turn into an output collection or a tuple
A hedged plan sketch for an earlier query follows.
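As one plausible plan shape (a sketch, not the exact algebra of the SQL++ papers), the earlier correlated-subqueries query FROM sensors AS s, s AS r … could compile to:

SCAN(sensors -> s)
-> CORRELATE(s -> r)          (inner: unnest each array s into bindings ⟨s, r⟩)
-> FILTER(r < 1.0)            (WHERE)
-> SORT(r DESC) -> LIMIT(2)   (ORDER BY / LIMIT over bindings)
-> CONSTRUCT({co: r})         (SELECT builds one output element per binding)

Each operator consumes and produces collections of variable bindings, mirroring the B_in/B_out step-through shown earlier.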
How options complicate middleware
[Architecture figure] A client issues federated SQL++ queries and receives SQL++ results from the FORWARD Virtual Database: a SQL++ query processor operating over SQL++ virtual views of the sources. Per-source wrappers (SQL, NewSQL, NoSQL, SQL-on-Hadoop, Java Objects) translate between SQL++ and the native queries and results of the underlying SQL, NewSQL, NoSQL, and SQL-on-Hadoop databases and Java in-memory objects.
SQL++ Virtual and/or Materialized Views
[Architecture figure] A client issues SQL++ queries and receives SQL++ results from the FORWARD Integration Database: a SQL++ query processor driven by SQL++ view definitions. SQL++ integrated views and added-value views are defined over SQL++ virtual views of the sources (SQL, NewSQL, NoSQL, and SQL-on-Hadoop databases, and Java in-memory objects).
Use Case 1: Hide the ++ source features behind SQL views
[Figure] A BI tool / report writer issues a plain SQL query against a SQL view in the FORWARD Virtual Database and receives a SQL result; FORWARD translates it into a Couchbase query (or its SQL++ equivalent).

View definition (flattens the nested arrays):
SELECT m.sid, t AS temp
FROM measurements AS m,
     m.temperatures AS t

Couchbase data:
measurements: {{
{ sid: 2, temperatures: [70.1, 70.2] },
{ sid: 1, temperatures: [71.0] }
}}

SQL view contents:
sid | temp
2   | 70.1
2   | 70.2
1   | 71.0
Use Case 2: Semantics-Aware Pushdown
"Sensors that recorded a temperature below 50"

SQL++ query:
SELECT DISTINCT m.sid
FROM measurements AS m
WHERE m.temp < 50

The SQL++ semantics of m.temp < 50 differ from MongoDB's semantics of temp $lt 50, so the MongoDB wrapper of the MongoDB Virtual Database (whose virtual view is identical to the source) compensates in the generated MongoDB query:

...
{ $match: {temp: {$lt: 50}}},
{ $match: {temp: {$not: null}}}
...

MongoDB data:
measurements: {{
{ sid: 2, temp: 70.1 },
{ sid: 2, temp: 49.2 },
{ sid: 1, temp: null }
}}
Push-Down Challenges: Limited capabilities & semantic variations
• How to efficiently push down computation?
Limited capabilities
• Sources like SQL and MongoDB cannot execute the
entirety of SQL++
Many semantic variations
• How to mediate incompatible features?
• Step 1: understand the space of options!
SQL++ captures Semantics Variations
• Semantics of "less-than" comparison are different across sources: <_SQL, <_MongoDB, etc.
• Config parameters capture these variations:

@lt { MongoDB } ( x < y )

@lt {
complex      : deep,
type_mismatch: false,
null_lt_null : false,
null_lt_value: false,
...
} ( x < y )
Semantics Variations
In NoSQL, NewSQL, and SQL-on-Hadoop systems, semantics vary for:
• Paths
• Equality
• Comparisons
And all the operators that use them:
• Selections
• Grouping
• Ordering
• Set operations
Each of these features has a set of config parameters in Configurable SQL++
Survey around configuration options
• Fast evolving space
• Configuration options itemize and explain the space of options
• Rationalize the discussion towards a standard
• And if no standard, at least understand where we stand
Example: SELECT clause
1. SQL++ example:
– Projecting nested collections ("Position and (nested) ozone readings of each sensor?"):

SELECT TUPLE
s.lat, s.long,
(SELECT r.ozone
FROM readings AS r
WHERE r.location = s.location)
FROM sensors AS s

– Projecting non-tuples ("Bag of all the (scalar) ozone readings?"):

SELECT ELEMENT ozone FROM readings
Example: SELECT clause
3. Feature matrix:

Language  | Projecting tuples containing nested collections | Projecting non-tuples
Hive      | Partial | No
Jaql      | Yes     | Yes
Pig       | Partial | No
CQL       | No      | No
JSONiq    | Yes     | Yes
MongoDB   | Partial | Partial
N1QL      | Partial | Partial
SQL       | No      | No
AQL       | Yes     | Yes
BigQuery  | No      | No
MongoJDBC | No      | No
SQL++     | Yes     | Yes
Example: SELECT clause
4. Discussion of the results:
• Not well-supported features
• 3 languages support them entirely (the same cluster as for data values)
(Feature matrix as above.)
Config Parameters
We use config parameters to encompass and compare the various semantics of a feature:
• Minimal number of independent dimensions
• 1 dimension = 1 config parameter
• SQL++ formalism parametrized by the config parameters
• Feature matrix classifies the values of each config parameter
Example: Paths
• Config parameters for tuple navigation:

@tuple_nav {
absent       : missing,
type_mismatch: error,
} ( x.y )
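Reading the parameters on a concrete (illustrative) value: if x = {a: 1}, then x.y finds no matching attribute, so absent: missing yields missing while absent: error aborts the query; if x = 5 (a scalar rather than a tuple), x.y is a type mismatch, so type_mismatch: error aborts, whereas a language configured with type_mismatch: missing would return missing instead.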
Example: Paths
• The feature matrix classifies, for each language, the value of each config parameter:

Language  | Absent (missing attribute) | Type mismatch
Hive      | Error    | Error
Jaql      | Null     | Error#
Pig       | Error    | Error
CQL       | Error    | Error
JSONiq    | Missing# | Missing#
MongoDB   | Missing  | Missing
N1QL      | Missing  | Missing
SQL       | Error    | Error
AQL       | Null     | Error
BigQuery  | Error    | Error
MongoJDBC | Missing# | Missing#
SQL++     | @path    | @path
Example: Equality Function
• Config parameters for equality:

@eq {
complex           : error,
type_mismatch     : false,
null_eq_null      : null,
null_eq_value     : null,
missing_eq_missing: missing,
missing_eq_value  : missing,
missing_eq_null   : missing
} ( x = y )
Example: Equality Function
• The feature matrix classifies, for each language, the value of each config parameter:
• No real cluster
• Some languages have multiple (incompatible) equality functions
• Some edge cases cannot happen due to other limitations (SQL has no complex values)

Language                 | Complex         | Type mismatch | Null = Null | Null = Value | Missing = Missing | Missing = Value | Missing = Null
Hive (=, <=>)            | Err, Err        | Err, Err      | Null, True  | Null, False  | N/A               | N/A             | N/A
Jaql                     | Boolean         | Null          | Null        | Null         | N/A               | N/A             | N/A
Pig                      | Boolean partial | Err           | Null        | Null         | N/A               | N/A             | N/A
CQL                      | Err             | Err           | Err         | False/Null   | N/A               | N/A             | N/A
JSONiq (=, deep-equal()) | Err, Boolean    | Err, False    | True, True  | False, False | Missing, True     | Missing, False  | Missing, False
MongoDB                  | Boolean         | False         | True        | False        | True              | False           | False
N1QL                     | Boolean         | False         | Null        | Null         | Missing           | Missing         | Missing
SQL                      | N/A             | Err           | Null        | Null         | N/A               | N/A             | N/A
AQL                      | Err             | Err           | Null        | Null         | N/A               | N/A             | N/A
BigQuery                 | Err             | Err           | Null        | Null         | N/A               | N/A             | N/A
MongoJDBC                | Boolean         | False         | True        | False        | N/A               | False           | False
SQL++                    | @equal          | @equal        | @equal      | @equal       | @equal            | @equal          | @equal
Advise on consistent settings of the configuration options
• Not every configuration option setting is sensible
• Configuration options have to subscribe to common sense:
iff x = y then [x] = [y]
if x = y and y = z then x = z
– which enables rewritings
• They have to be consistent with each other:
e.g., if true AND missing is missing,
then false OR missing is missing
Use Case 3: Integrated Query Processing
"Sensors in a given area that recorded a low temperature?"

SQL++ query:
SELECT s.lat, s.lng, m.temp
FROM sensors AS s
JOIN measurements AS m
ON s.id = m.sid
WHERE (s.lat > 32.6 AND s.lat < 32.9
AND s.lng > -117.3 AND s.lng < -117.0)
AND m.temp < 50

The SQL++ Query Processor splits the query across a SQL Virtual Database (PostgreSQL) and a MongoDB Virtual Database.

SQL wrapper → PostgreSQL:
sensors:
id | lat  | lng
1  | 32.8 | -117.1
2  | 32.7 | -117.2

(as SQL++:
sensors: {{
{id:1, lat:32.8, lng:-117.1},
{id:2, lat:32.7, lng:-117.2} }})

MongoDB wrapper → MongoDB:
db.measurements.aggregate(
{ $match: {temp: {$lt: 50}} },
{ $match: {temp: {$not: null}} }
)

measurements: [
{ sid: 2, temp: 70.1 },
{ sid: 2, temp: 49.2 },
{ sid: 1, temp: null } ]
2. Query Languages for Graphs
The Unrestricted Graph Data Model
• Nodes correspond to entities
• Edges are binary and correspond to relationships
• Edges may be directed or undirected
• Nodes and edges may carry labels
– multiple labels are OK in some graph data models, such as Neo4j
• Nodes and edges are annotated with data
– both have sets of attributes (key-value pairs)
• A schema is not required to formulate queries
Example Graph
Vertex types:
• Product (name, category, price)
• Customer (ssn, name, address)
Edge types:
• Bought (discount, quantity)
• Customer c bought 100 units of product p at discount 5%: modeled by the edge
c –(Bought {discount=5%, quantity=100})-> p
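For the examples that follow, it helps to fix a tiny hedged instance of this model (all names invented for illustration):

Customers: {name: 'ann'}, {name: 'bob'}
Products : {name: 'ball', category: 'toys'}, {name: 'lamp', category: 'home'}
Bought   : ann -> ball, bob -> ball, bob -> lamp

On this instance, for example, ann and bob are related by Bought.^Bought (both bought ball) – the shape of relationship the path-expression and CRPQ examples below compute.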
Expressing Graph Analytics
Two Different Approaches
• High-level query languages à la SQL
• Low-level programming abstractions
– Programs written in C++, Java, Scala, Groovy…
• Initially adopted by disjoint communities (recall NoSQL
debates)
• Recent trend towards unification
Some Key Ingredients for High-Level Query Languages
• Pioneered by academic work on Conjunctive Query (CQ) extensions for graphs (in the 90's)
– Path expressions (PEs) for navigation
– Variables for manipulating data found during navigation
– Stitching multiple PEs into complex navigation patterns → conjunctive regular path queries (CRPQs)
• Beyond CRPQs, needed in modern applications:
– Aggregation of data encountered during navigation
→ support for bag semantics as a prerequisite
– Intermediate results assigned to nodes/edges
– Control-flow support for the class of iterative algorithms that converge to a result in multiple steps
• (e.g., PageRank-class, recommender systems, shortest paths, etc.)
Path Expressions
• Express reachability via constrained paths
• Early graph-specific extension over conjunctive queries
• Introduced initially in academic research in the early 90s
– StruQL (AT&T Research; Fernandez, Halevy, Suciu)
– WebSQL (Mendelzon, Mihaila, Milo)
– Lorel (Widom et al.)
• Today supported by the languages of commercial systems
– Cypher, SparQL, Gremlin, GSQL
Path Expression Syntax
Notations vary. Adopting here that of the SparQL W3C Recommendation.

path := edge label
      | _                // wildcard, any edge label
      | ^ edge label     // inverse edge
      | path . path      // concatenation
      | path | path      // alternation
      | path*            // 0 or more reps
      | path*(min,max)   // at least min, at most max reps
      | (path)
Path Expression Examples (1)
• Pairs of customers and the products they bought:
Bought
• Pairs of customers and the products they were involved with (bought or reviewed):
Bought|Reviewed
• Pairs of customers who bought the same product (lists customers with themselves):
Bought.^Bought
Path Expression Examples (2)
• Pairs of customers involved with the same product (like-minded):
(Bought|Reviewed).(^Bought|^Reviewed)
• Pairs of customers connected via a chain of like-minded customer pairs:
((Bought|Reviewed).(^Bought|^Reviewed))*
Path Expression Semantics
• In most academic research, the semantics is defined in terms
of sets of node pairs
• Traditionally specified in two ways:
– Declaratively
– Procedurally, based on algebraic operations over relations
• These are equivalent
Classical Declarative Semantics
• Given:
– graph G
– path expression PE
• the meaning of PE on G, denoted PE(G) is
the set of node pairs (src, tgt)
s.t. there exists a path in G from src to tgt
whose concatenated labels spell out a word in L(PE)
L(PE) = language accepted by PE when seen as
regular expression over alphabet of edge labels
Classical Procedural Semantics
PE(G) is a binary relation over nodes, defined inductively as:
• E(G) = set of s-t node pairs of E edges in G
• _(G) = set of s-t node pairs of any edges in G
• ^E(G) = set of t-s node pairs of E edges in G
• P1.P2(G) = P1(G) ∘ P2(G) (relational composition)
• P1|P2(G) = set union(P1(G), P2(G))
• P*(G) = reflexive transitive closure of P(G) (finite due to saturation)
A worked example follows.
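Worked on the small hedged instance from the Example Graph discussion (ann, bob, ball, lamp):

Bought(G) = { (ann, ball), (bob, ball), (bob, lamp) }
^Bought(G) = { (ball, ann), (ball, bob), (lamp, bob) }
Bought.^Bought(G) = Bought(G) ∘ ^Bought(G)
= { (ann, ann), (ann, bob), (bob, ann), (bob, bob) }

Under set semantics, (bob, bob) appears once even though two products witness it (ball and lamp) – exactly the duplicate elimination that motivates bag semantics later.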
Conjunctive Regular Path Queries
• Replace relational atoms appearing in CQs with path
expressions.
• Explicitly introduce variables binding to source and target
nodes of path expressions.
• Allow multiple path expression atoms in query body.
• Variables can be used to stitch multiple path expression
atoms into complex patterns.
CRPQ Examples
• Pairs of customers who have bought same product (do not
list a customer with herself):
Q1(c1,c2) :- c1 –Bought.^Bought-> c2, c1 != c2
• Customers who have bought and also reviewed a product:
Q2(c) :- c –Bought-> p, c –Reviewed-> p
CRPQ Semantics
• Naturally extended from single path expressions, following
model of CQs
• Declarative
– lifting the notion of satisfaction of a path expression atom by a
source-target node pair to the notion of satisfaction of a
conjunction of atoms by a tuple
• Procedural
– based on SPRJ manipulation of the binary relations
yielded by the individual path expression atoms
Limitation of Set Semantics
• Common graph analytics need to aggregate data
– e.g. count the number of products two customers have in common
• Set semantics does not suffice
– baked-in duplicate elimination affects the aggregation
• As in SQL, practical systems resort to bag semantics
Path Expressions Under Bag Semantics
PE(G) is a bag of node pairs, defined inductively as:
• E(G) = bag of s-t node pairs of E edges in G
• _(G) = bag of s-t node pairs of any edges in G
• ^E(G) = bag of t-s node pairs of E edges in G
• P1.P2(G) = P1(G) ∘ P2(G) (relational composition for bags)
• P1|P2(G) = bag union(P1(G), P2(G))
• P*(G) = reflexive transitive closure of P(G) – not necessarily finite under bag semantics!
Issues with Bag Semantics
• Performance and semantic issues due to the number of distinct paths
• The multiplicity of an s-t pair in the query output reflects the number of distinct paths connecting s with t
– Even in DAGs, these can be exponentially many (chain-of-diamonds example below)
– In cyclic graphs, these can be infinitely many
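Spelling out the chain-of-diamonds example: take vertices s = v0, v1, …, vn = t where each consecutive pair (vi, vi+1) is connected by two parallel two-edge branches (one "diamond"). Each diamond doubles the number of distinct s-t paths, so n diamonds yield 2^n of them, and the bag multiplicity of (s, t) is 2^n. Add a single cycle anywhere along the way and the number of distinct paths – hence the multiplicity – becomes infinite.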
Solutions In Practice:
Bound Traversal Length
• Upper bound the length of the traversed path
– Recall bounded Kleene construct *(min,max)
– Bounds length and hence number of distinct paths considered
– Supported by Gremlin, Cypher, SparQL, GSQL, very common in
tutorial examples and in industrial practice
Solutions In Practice: Restrict Cycle Traversal
• Simple paths only (no repeating vertices)
– Rules out paths that go around cycles
– Recommended in Gremlin style guides, tutorials, and the formal semantics paper
– Gremlin's simplePath() predicate supports this semantics
– Problem: membership of an s-t pair in the result is NP-hard
• Paths must not repeat edges
– Allows cyclic paths
– Rules out paths that go around the same cycle more than once
– This is the Cypher semantics
Solutions In Practice:
Mix Bag and Set Semantics
• Bag semantics for star-free fragments of PE
• Set semantics for Kleene-starred fragments of PE
• Combine them using (bag-aware) joins
• Example:
p1.p2*.p3(G)
treated as
p1(G) o (distinct (p2*(G))) o p3(G)
• This is the SparQL semantics (in W3C Recommendation)
Solutions In Practice: Leave It to the User
• The user explicitly programs the desired semantics
• The path is a first-class citizen and can be mentioned in the query
• Can simulate each of the above semantics, e.g., by checking the path for repeated nodes/edges
• Could lead to infinite traversals for buggy programs
• Supported by Gremlin, GSQL
– also partially by Cypher (modulo the restriction that only edge-non-repeating paths are visible)
One Semantics I Would Prefer
• Allow paths to go around cycles, even multiple times
• Achieve finiteness by restriction to pumping-minimal paths
– in the sense of the Pumping Lemma for Finite State Automata (FSA)
– PEs are regular expressions, so they have an equivalent FSA representation
– As the path is traversed, the FSA state changes at every step
– Rule out paths in which a vertex is repeatedly reached in the same FSA state
• Can be programmed by the user in Gremlin and GSQL (costly!)
Aggregation
Let's See It First as a CQ Extension
• Count toys bought in common per customer pair:
Q(c1, c2, count(p)) :- c1 –Bought-> p, c2 –Bought-> p,
p.category = "toys", c1 < c2
• c1, c2: composite group key – no explicit group-by clause
• Standard syntax for aggregation-extended CQs and Datalog
• Rich literature on semantics (tricky for Datalog when aggregation and recursion interleave)
Aggregation in Modern Graph QLs
• Cypher's RETURN clause uses a similar syntax to aggregation-extended CQs
• Gremlin and SparQL use an SQL-style GROUP BY clause
• GSQL uses aggregating containers called "accumulators"
Flavor of Representative Languages
Running Example in CRPQ Form
• Recall:
count toys bought in common per customer pair
Q(c1, c2, count (p)) :- c1 –Bought-> p, c2 –Bought-> p,
p.category = “toys”, c1 < c2
SparQL
• Query language for the semantic web
– graphs corresponding to RDF data are directed, labeled graphs
• W3C Standard Recommendation
Running Example in SparQL
SELECT ?c1, ?c2, count (?p)
WHERE { ?c1 bought ?p.
?c2 bought ?p.
?p category ?cat.
FILTER (?cat == “toys” && ?c1 < ?c2) }
GROUP BY ?c1, ?c2
SparQL Semantics by Example
• Coincides with CRPQ version
Q(c1, c2, count (p)) :- c1 –Bought-> p, c2 –Bought-> p,
p.category = “toys”, c1 < c2
Cypher
• The query language of the neo4j commercial native graph db
system
• Essentially StruQL with some bells and whistles
• Also supported in a variety of other systems:
– SAP HANA Graph, Agens Graph,
Redis Graph, Memgraph,
CAPS (Cypher for Apache Spark),
ingraph, Gradoop,
Ruruki, Graphflow
Running Example in Cypher
MATCH (c1:Customer) –[:Bought]-> (p:Product)
<-[:Bought]- (c2:Customer)
WHERE p.category = “Toys” AND c1.name < c2.name
RETURN c1.name AS cust1,
c2.name AS cust2,
COUNT (p) AS inCommon
c1.name, c2.name are composite group key
– no explicit group-by clause, just like CQ
Cypher Semantics by Example
• Coincides with CRPQ version
Q(c1, c2, count (p)) :- c1 –Bought-> p, c2 –Bought-> p,
c1 < c2
• Modulo non-repeating edge restriction
–no effect here since repeated-edge paths satisfying
the two PE atoms would necessarily have c1 = c2
Gremlin
• Supported by major Apache projects
– TinkerPop and JanusGraph
• Also by commercial systems including
– TitanGraph (DataStax)
– Neptune (Amazon),
– Azure (Microsoft),
– IBM Graph
Gremlin Semantics
• Based on traversers, i.e. tokens that flow through graph binding
variables along the way
• A Gremlin program adorns the graph with a set of traversers
that co-exist simultaneously
• A program is a pipeline of steps, each step works on the set of
traversers whose state corresponds to this step
• Steps can be
– map steps (work in parallel on individual traversers)
– reduce steps (aggregate set of traversers into a single traverser)
Gremlin Semantics by Example
V()
place one traverser on each vertex
Gremlin Semantics by Example
V().hasLabel(‘Customer’)
filter traversers by label
Gremlin Semantics by Example
V().hasLabel(‘Customer’).as(‘c1’)
extend each traverser t:
bind variable ‘c1’ to the vertex where t resides
Gremlin Semantics by Example
V().hasLabel(‘Customer’).as(‘c1’)
.out(‘Bought’)
Traversers flow along out-edges of type ‘Bought’.
If multiple such edges exist for a Customer vertex v,
the traverser at v splits into one copy per edge, placed
at edge destination.
Gremlin Semantics by Example
V().hasLabel(‘Customer’).as(‘c1’)
.out(‘Bought’).hasLabel(‘Product’).has(‘category’,’Toys’)
filter traversers at destination of ‘Bought’ edges:
vertex label must be ‘Product’ and they must have a
property named ‘category’ of value ‘Toys’
Gremlin Semantics by Example
V().hasLabel(‘Customer’).as(‘c1’)
.out(‘Bought’).hasLabel(‘Product’).has(‘category’,’Toys’).as(‘p’)
extend surviving traversers with binding of variable ‘p’ to
their location vertex.
now each surviving traverser has two variable bindings.
Gremlin Semantics by Example
V().hasLabel(‘Customer’).as(‘c1’)
.out(‘Bought’).hasLabel(‘Product’).has(‘category’,’Toys’).as(‘p’)
.in(‘Bought’)
Surviving traversers cross incoming edges of type
‘Bought’. Multiple in-edges result in further splits.
Gremlin Semantics by Example
V().hasLabel(‘Customer’).as(‘c1’)
.out(‘Bought’).hasLabel(‘Product’).has(‘category’,’Toys’).as(‘p’)
.in(‘Bought’).hasLabel(‘Customer’).as(‘c2’)
.select (‘c1’, ‘c2’,‘p’).by(‘name’)
for each traverser extract the tuple of bindings for variables
c1,c2,p, return its projection on ‘name’ property.
Gremlin Semantics by Example
V().hasLabel(‘Customer’).as(‘c1’)
.out(‘Bought’).hasLabel(‘Product’).has(‘category’,’Toys’).as(‘p’)
.in(‘Bought’).hasLabel(‘Customer’).as(‘c2’)
.select (‘c1’, ‘c2’,‘p’).by(‘name’)
.where (‘c1’, lt(‘c2’))
filter these tuples according to where condition
Gremlin Semantics by Example
V().hasLabel('Customer').as('c1')
.out('Bought').hasLabel('Product').has('category','Toys').as('p')
.in('Bought').hasLabel('Customer').as('c2')
.select('c1', 'c2', 'p').by('name')
.where('c1', lt('c2'))
.group().by(select('c1','c2')).by(count())

Group tuples: the first by() specifies the group key, the second by() specifies the group aggregation.
GSQL
• The query language of TigerGraph, a native parallel graph db
system
• A recent start-up founded by UCSD DB lab’s PhD alum Yu Xu
• Full disclosure: I have been involved in design
GSQL Accumulators
• GSQL traversals collect and aggregate data by writing it into
accumulators
• Accumulators are containers (data types) that
– hold a data value
– accept inputs
– aggregate inputs into the data value using a binary operation
• May be built-in (sum, max, min, etc.) or user-defined
• May be
– global (a single container accessible from all traversal steps)
– local (one per node, accessible only when reached by traversal)
Running Example in GSQL
Global accum; cust1, cust2 form the composite group key; inCommon is the group value (a sum aggregation):

GroupByAccum <string cust1, string cust2,
              SumAccum<int> inCommon> @@res;

SELECT _
FROM   Customer:c1 -(Bought>)- Product:p -(<Bought)- Customer:c2
WHERE  p.category == "Toys" AND c1.name < c2.name
ACCUM  @@res += (c1.name, c2.name -> 1);

The += creates an input associating value 1 with the key (c1.name, c2.name) and aggregates that input into the accumulator.
GSQL Semantics by Example
GroupByAccum <string cust1, string cust2,
              SumAccum<int> inCommon> @@res;

SELECT _
FROM   Customer:c1 -(Bought>)- Product:p -(<Bought)- Customer:c2
WHERE  p.category == "Toys" AND c1.name < c2.name
ACCUM  @@res += (c1.name, c2.name -> 1);

For every distinct path satisfying the FROM pattern and the WHERE condition, the ACCUM clause is executed.
Why Aggregate in Accumulators Instead of Select-Group By Clauses?
GroupByAccum <string cust, SumAccum<float> total> @@cSales;   (revenue per customer)
GroupByAccum <string prod, SumAccum<float> total> @@pSales;   (revenue per product)

SELECT _
FROM   Customer:c -(Bought>:b)- Product:p
ACCUM  float thisSalesRevenue = b.quantity*(1-b.discount)*p.price,
       @@cSales += (c.name -> thisSalesRevenue),
       @@pSales += (p.name -> thisSalesRevenue);

thisSalesRevenue is a local variable (this is a let clause); the query performs multiple aggregations in one pass, even on different group keys.
Local Accumulators
• Minimize bottlenecks due to shared global accums; maximize opportunities for parallel evaluation

SumAccum<float> @cSales, @pSales;   (local accums, one instance per node)

SELECT _
FROM   Customer:c -(Bought>:b)- Product:p
ACCUM  float thisSalesRevenue = b.quantity*(1-b.discount)*p.price,
       c.@cSales += thisSalesRevenue,
       p.@pSales += thisSalesRevenue;

Groups are distributed: each node accumulates its own group.
Role of SELECT Clause? Compositionality
• Queries can output a set of nodes, stored in variables
• Used by subsequent queries as a traversal starting point (query chaining):

S1 = SELECT t
     FROM S0:s – pattern1 – T1:t
     WHERE … ACCUM …

S2 = SELECT t
     FROM S1:s – pattern2 – T2:t …
     WHERE … ACCUM …

S3 = SELECT t
     FROM S1:s – pattern3 – T3:t …
     WHERE … ACCUM …

Variable S1 stores the set of nodes reached in the traversal; the node-set variable S1 is then used in the subsequent traversals.
Autonomous Database System
(1) Automatic performance tuning
• Identify the system bottleneck
  ○ CPU
  ○ Memory structures
  ○ IO capacity
  ○ Network latency
  ○ Index
• Tuning model
  1. Traditional ML models, such as Lasso
  2. Deep neural networks

Ji Zhang et al. An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning. SIGMOD '19.
Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon, and Bohan Zhang. Automatic Database Management System Tuning Through Large-scale Machine Learning. SIGMOD '17, 1009-1024.
Thanks!