Query Languages for Multi-Model Databases
Qingsong Guo, UDBMS Group, University of Helsinki
2020-04-09

Outline
1 Data Models
2 Query Languages for MMDBMS

Various Data Models
1. Structured data: SQL
2. Semi-structured data:
3. Graph data:

Traits for Database Query Languages
• Declarative rather than procedural
• We developed
  – fantastic algorithms
  – great systems
  – clever query processing and storage strategies
  – a lot of brilliant stuff
• DBMSs have become fast
  – on TPC-C, we are currently able to execute about half a million transactions per second
  – for simple index lookups, e.g. in a hash table, we are currently at about 20 million operations per second

1 Relational Model and Languages

Relational Model
• A relational database consists of relations
  – Each row defines a relationship between a set of values
  – Table (the analogy of a relation)
  – Schema of a relation

Example Instances
• "Sailors" and "Reserves" relations for our examples. "bid" = boats, "sid" = sailors.
• We'll use positional or named field notation, and assume that the names of fields in query results are 'inherited' from the names of fields in the query input relations.

R1 (Reserves):
  sid  bid  day
  22   101  10/10/96
  58   103  11/12/96

S1 (Sailors):
  sid  sname   rating  age
  22   dustin  7       45.0
  31   lubber  8       55.5
  58   rusty   10      35.0

S2 (Sailors):
  sid  sname   rating  age
  28   yuppy   9       35.0
  31   lubber  8       55.5
  44   guppy   5       35.0
  58   rusty   10      35.0

Relational Model
• E. F. Codd, A Relational Model of Data for Large Shared Data Banks, Communications of the ACM 13(6), June 1970
  – relational query language: set-based
  – normalization: avoid redundancy
  – data independence: between the logical and the physical level

Formal Relational Query Languages
• Two mathematical query languages form the basis for "real" languages (e.g.
SQL), and for implementation:
  – Relational Algebra: more operational (procedural); often used as the internal representation for query evaluation plans.
  – Relational Calculus: lets users describe what they want rather than how to compute it (non-operational, declarative).

2 Semistructured Data Model

Semistructured Data: Past and Present
Richer (than relational) type systems
• Object-Oriented data model
• Nested Relational data model
Schemaless and schema-optional data
• Schemaless labelled graphs
  • more on graphs in the second part
• XML as labelled trees
Scalable nested & semistructured formats
• (schemaless) JSON
• (machine-oriented, columnar) Parquet, ORC, …
• Google's Protocol Buffers, …

Semistructured Data
Can be thought of as the SQL data model with extensions and restriction removals
- complex types: arrays, (nested) tuples, maps
- a rigid schema is not necessary
XML and labelled trees…
- also allow modeling of complex types
- the lack of distinction between tuple and list complicates typical querying use cases

Nested & Semistructured Query Languages
Declarative + expressive power + semistructured data: be like SQL, but for nested & semistructured data

The String Approach to Extending Databases with New Types of Data
• Treat the nested and semistructured data as strings of a particular format, e.g.
JSON
• Introduce UDFs to process such strings
  - A JSON-encoded tuple and an SQL tuple are two different things, even if their content is identical
  => We will not further deal with this approach

SQL++
• SQL++: a backwards-compatible extension of SQL that can access nested and semistructured data
  – Queries exhibit XQuery and OQL abilities, yet remain backwards compatible with SQL-92
• Surveying the multiple and different languages and their semantics options with Configurable SQL++
  – Options modify the semantics of the language in order to morph into one of the surveyed languages
SQL++: http://arxiv.org/abs/1405.3631
http://db.ucsd.edu/wp-content/uploads/pdfs/375.pdf

SQL++ Data Model
• Extend SQL with arrays + nesting + heterogeneity
Following mostly JSON's notation, but applicable to any format that can be abstracted into these concepts:

{
  'location': 'Alpine',
  'readings': [
    { 'time': timestamp('2014-03-12T20:00:00'), 'ozone': 0.035, 'no2': 0.0050 },
    { 'time': timestamp('2014-03-12T22:00:00'), 'ozone': 'm', 'co': 0.4 }
  ]
}

• Array nested inside a tuple
• Heterogeneous tuples in collections
• Arbitrary compositions of array, bag, tuple

SQL++ Data Model
• Extend SQL with arrays + nesting + heterogeneity + permissive structures

[
  [5, 10, 3],
  [21, 2, 15, 6],
  [4, {lo: 3, exp: 4, hi: 7}, 2, 13, 6]
]

• Arbitrary compositions of array, bag, tuple

SQL++ Data Model
Can also be thought of as an extension of JSON
• Use single quotes for literals
• Extend JSON with bags and enriched types

{
  'location': 'Alpine',
  'readings': {{
    { 'time': timestamp('2014-03-12T20:00:00'), 'ozone': 0.035, 'no2': 0.0050 },
    { 'time': timestamp('2014-03-12T22:00:00'), 'ozone': 'm', 'co': 0.4 }
  }}
}

• Bags {{ … }}
• Enriched types
(For presentation simplicity, we'll omit the '…' quote marks from here on.)

SQL++ Data Model
The same data, with a (partial) schema:

{
  location: string,
  readings: {{
    { time: timestamp, ozone: any, * }
  }}
}

Or schemaless, as the instance itself:

{
  location: 'Alpine',
  readings: {{
    { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 },
    {
      time: timestamp('2014-03-12T22:00:00'),
      ozone: 'm',
      co: 0.4
    }
  }}
}

• With schema
• Or, schemaless
• Or, partial schema

Comparisons
• OQL without schema
• Simpler than XML and the XQuery data model
• Unlike labelled trees (the favorite XML abstraction of XPath and XQuery research), it makes the distinction between tuple constructor and list/array/bag constructor

From SQL to SQL++
• Goal: queries exhibit XQuery and OQL abilities, yet remain backwards compatible with SQL-92
• Backwards compatible with SQL because it is
  – easy to teach
  – easy to extend the functionality of SQL systems

Backwards Compatibility with SQL

readings : {{
  { sid: 2, temp: 70.1 },
  { sid: 2, temp: 49.2 },
  { sid: 1, temp: null }
}}

Find sensors that recorded a temperature below 50:

SELECT DISTINCT r.sid
FROM readings AS r
WHERE r.temp < 50

→ {{ { sid: 2 } }}

From SQL to SQL++: What are the minimum changes needed?
• Goal: queries that input/output any complex type, yet backwards compatible with SQL
• Adjust SQL syntax/semantics for heterogeneity and complex input/output values
• Achieved with few extensions and mostly by removing SQL limitations
  – OQL influence
• Minimal SQL++ core
  – SQL and the full SQL++ are syntactic sugar over a small SQL++ core

From Tuple Variables to Element Variables

readings : [ 1.3, 0.7, 0.3, 0.8 ]

Find the highest two sensor readings that are below 1.0:

SELECT r AS co
FROM readings AS r
WHERE r < 1.0
ORDER BY r DESC
LIMIT 2

→ [ { co: 0.8 }, { co: 0.7 } ]

Semantics: Bindings to Element Variables

readings : [ 1.3, 0.7, 0.3, 0.8 ]

FROM readings AS r
  Bout(FROM) = Bin(WHERE) = {{ ⟨ r : 1.3 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}
WHERE r < 1.0
  Bout(WHERE) = Bin(ORDER BY) = {{ ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}
ORDER BY r DESC
  Bout(ORDER BY) = Bin(LIMIT) = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩ ]
LIMIT 2
  Bout(LIMIT) = Bin(SELECT) = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩ ]
SELECT r AS co
  → [ { co: 0.8 }, { co: 0.7 } ]

Environment ⊢ Query → Result

FROM readings AS r
  Bout(FROM) = Bin(WHERE) = {{ ⟨ r : 1.3 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}

ℾ = ⟨ scale: 2,
readings : [ 1.3, 0.7, 0.3, 0.8 ] ⟩

WHERE r < 1.0
  Bout(WHERE) = Bin(ORDER BY) = {{ ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}
ORDER BY r DESC
  Bout(ORDER BY) = Bin(LIMIT) = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩ ]
LIMIT 2
  Bout(LIMIT) = Bin(SELECT) = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩ ]
SELECT ELEMENT r * scale
  → [ 1.6, 1.4 ]

SQL++ vs SQL++ Core

SQL++:

SELECT r AS co
FROM readings AS r
WHERE r < 1.0
ORDER BY r DESC
LIMIT 2

SQL++ core:

SELECT ELEMENT {'co': r}
FROM readings AS r
WHERE r < 1.0
ORDER BY r DESC
LIMIT 2

No Need for Homogeneity of Bindings

Environment ⊢ Query → Result

ℾ = ⟨ readings : {{ 0.3, { low : 0.2, high: 0.4 }, null, 'C' }} ⟩

FROM readings AS r
  b1 = ⟨ r : 0.3 ⟩
  b2 = ⟨ r : { low : 0.2, high: 0.4 } ⟩
  b3 = ⟨ r : null ⟩
  b4 = ⟨ r : 'C' ⟩
SELECT ELEMENT r
  → {{ 0.3, { low : 0.2, high: 0.4 }, null, 'C' }}

Correlated Subqueries Everywhere

sensors : [
  [1.3, 2],
  [0.7, 0.7, 0.9],
  [0.3, 0.8, 1.1],
  [0.7, 1.4]
]

Find the highest two sensor readings that are below 1.0:

SELECT r AS co
FROM sensors AS s, s AS r
WHERE r < 1.0
ORDER BY r DESC
LIMIT 2

→ [ { co: 0.9 }, { co: 0.8 } ]

Correlated Subqueries Everywhere

FROM sensors AS s, s AS r
  Bout(FROM) = Bin(WHERE) = {{
    ⟨ s : [1.3, 2], r : 1.3 ⟩,
    ⟨ s : [1.3, 2], r : 2 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.9 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.3 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.8 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 1.1 ⟩,
    ⟨ s : [0.7, 1.4], r : 0.7 ⟩,
    ⟨ s : [0.7, 1.4], r : 1.4 ⟩
  }}
WHERE r < 1.0
  Bout(WHERE) = Bin(ORDER BY) = {{
    ⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.9 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.3 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.8 ⟩,
    ⟨ s : [0.7, 1.4], r : 0.7 ⟩
  }}

Correlated Subqueries
Everywhere

FROM sensors AS s, s AS r
WHERE r < 1.0
ORDER BY r DESC
  Bout(ORDER BY) = Bin(LIMIT) = [
    ⟨ s : [0.7, 0.7, 0.9], r : 0.9 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.8 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
    ⟨ s : [0.7, 1.4], r : 0.7 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.3 ⟩
  ]
LIMIT 2
  Bout(LIMIT) = Bin(SELECT) = [
    ⟨ s : [0.7, 0.7, 0.9], r : 0.9 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.8 ⟩
  ]
SELECT r AS co
  → [ { co: 0.9 }, { co: 0.8 } ]

Composability: A Powerful Idea from Functional Programming Languages
• Variables generate environments within which queries operate
• The initial environment provides the named top-level values
• The OQL legacy: when evaluating a subquery, names and variables behave no differently

Constructing Nested Results: Discussion of Environments

ℾ0 = ⟨
  sensors : {{ {sensor: 1}, {sensor: 2} }},
  logs : {{
    {sensor: 1, co: 0.4},
    {sensor: 1, co: 0.2},
    {sensor: 2, co: 0.3}
  }}
⟩

FROM sensors AS s
  Bout(FROM) = Bin(SELECT) = {{ ⟨ s : {sensor: 1} ⟩, ⟨ s : {sensor: 2} ⟩ }}
SELECT s.sensor AS sensor,
       ( SELECT l.co AS co
         FROM logs AS l
         WHERE l.sensor = s.sensor ) AS readings

→ {{
  { sensor : 1, readings: {{ {co: 0.4}, {co: 0.2} }} },
  { sensor : 2, readings: {{ {co: 0.3} }} }
}}

Constructing Nested Results

For the binding s : {sensor: 1}, the inner query is evaluated in the environment

ℾ1 = ⟨ s : {sensor: 1},
  sensors : {{ {sensor: 1}, {sensor: 2} }},
  logs : {{ {sensor: 1, co: 0.4}, {sensor: 1, co: 0.2}, {sensor: 2, co: 0.3} }}
⟩

FROM logs AS l
  Bout(FROM) = {{ ⟨ l : {sensor: 1, co: 0.4} ⟩, ⟨ l : {sensor: 1, co: 0.2} ⟩, ⟨ l : {sensor: 2, co: 0.3} ⟩ }}
WHERE l.sensor = s.sensor
  Bout(WHERE) = Bin(SELECT) = {{ ⟨ l : {sensor: 1, co: 0.4} ⟩, ⟨ l : {sensor: 1, co: 0.2} ⟩ }}
SELECT l.co AS co
  → {{ {co: 0.4}, {co: 0.2} }}

How to Deal with Order: Create Order vs Presume Order
• SQL assumes bags as inputs but may output lists by using ORDER BY
  • not enough when the input has order
• XQuery adopted order-preservation semantics
  • e.g., filtering a list will produce a list where qualified elements
retain their original order • often becomes a burden eg, when order is dictated for join results Extending SQL with Utilization of Input Order x y SELECT sensors : [ [1.3, 2] [0.7, 0.7, 0.9] [0.3, 0.8, 1.1] [0.7, 1.4] ] r AS co, x AS x, y AS y FROM sensors AS s AT y, s AS r AT x WHERE r < 1.0 ORDER BY r DESC LIMIT 2 Find the highest two sensor readings that are below 1.0 [ { co: 0.9, x: 3, y: 2 }, { co: 0.8, x: 2, y: 3 } ] Order Preservation by Ordering according to the Input Order x y SELECT sensors : [ [1.3, 2] [0.7, 0.7, 0.9] [0.3, 0.8, 1.1] [0.7, 1.4] ] r AS co, x AS x, y AS y FROM sensors AS s AT y, s AS r AT x WHERE r < 1.0 ORDER BY y ASC, x ASC LIMIT 2 Find the first two sensor readings that are below 1.0 [ { co: 0.7, x: 1, y: 2 }, { co: 0.7, x: 2, y: 2 } ] (What if) Automated Order Preservation SELECT FROM r AS co sensors AS s, s AS r WHERE r < 1.0 ORDER BY PRESERVATION LIMIT 2 SELECT FROM r AS co sensors AS s AT y, s AS r AT x WHERE r < 1.0 ORDER BY y ASC, x ASC LIMIT 2 Iterating over Attribute-Value pairs Constructing tuples with data-dependent number of Attribute-Value pairs ℾ ⊢ ⟨ readings : {{ { no2: 0.6, co: 0.7, co2: [0.5, 2] }, { no2: 0.5, co: [0.4], co2: 1.3 } }} ⟩ FROM readings AS r SELECT ELEMENT ( r AS {g:v} Q FROM WHERE g LIKE "co%" SELECT ATTRIBUTE g:v ) ↓ V {{ { co: 0.7, co2: [0.5, 2] }, { co: [0.4], co2: 1.3 } }} INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN FROM readings AS {g:v}, v AS n [ { gas: num: }, { gas: num: }, { gas: num: } ℾ=⟨ readings : { co : [], {{ no2: [0.7], ⟨ g : 'no2', v : [0.7], n : 0.7 ⟩, so2: [0.5, 2] ⟨ g : 'so2', v : [0.5, 2], n : 0.5 ⟩, } ⟨ g : 'so2', v : [0.5, 2], n : 2 ⟩ }} ⟩ SELECT ELEMENT {'gas': g, 'num': n } ] 'no2', 0.7 'so2, 0.5, 'so2', 2 INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN FROM readings AS {g:v} INNER CORRELATE v AS n [ { gas: num: }, { gas: num: }, { gas: num: } ℾ=⟨ readings : { {{ co : [], ⟨ g : "no2", v : [0.7], n : 0.7 ⟩, no2: [0.7], ⟨ g : "so2", v : [0.5, 2], n 
: 0.5 ⟩, so2: [0.5, 2] ⟨ g : "so2", v : [0.5, 2], n : 2 ⟩ } }} ⟩ SELECT ELEMENT {'gas': g, 'num': n } ] 'no2', 0.7 'so2', 0.5, 'so2', 2 INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN FROM readings AS {g:v} OUTER CORRELATE v AS n [ { gas: num: }, { gas: num: }, { gas: num: }, { gas: num: } ℾ=⟨ readings : { co : [], {{ ⟨ g : "co", v : [], n : missing ⟩, no2: [0.7], ⟨ g : "no2", v : [0.7], n : 0.7 ⟩, so2: [0.5, 2] ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩, } ⟨ g : "co", v : [0.5, 2], n : 2 ⟩ }} ⟩ SELECT ELEMENT { gas: g, num: n } ] 'co', null 'no2', 0.7 'so2', 0.5, 'so2', 2 An OQLish GROUP BY, at a deeper level than SQL’s Environment ⊢ Query 1 (Core SQL++) FROM ℾ=⟨ co_readings : {{ { sensor: 1, co : 0.3 }, { sensor: 1, co : 0.5 }, { sensor: 2, co : 0.6 } }} ⟩ co_readings AS c → Query 2 (with SQL++ syntactic sugar) FROM Result co_readings AS c b1 = ⟨ c : {sensor: 1, co: 0.3} ⟩ b2 = ⟨ c : {sensor: 1, co: 0.5} ⟩ b3 = ⟨ c : {sensor: 2, co: 0.6} ⟩ {{ { GROUP BY c.sensor AS s; GROUP AS grp GROUP BY c.sensor b'1 = ⟨ s : 1, grp : {{ { c : {sensor: 1, co: 0.3} }, { c : {sensor: 1, co: 0.5} } }} ⟩ b'2 = ⟨ s : 2, grp : {{ { c : {sensor: 2, co: 0.6} } }} ⟩ SELECT ELEMENT { sensor:s, readings: ( SELECT ELEMENT g.c.co FROM grp AS g), number: count(grp), average: avg( SELECT ELEMENT g.c.co FROM grp AS g) } SELECT c.sensor AS sensor, ( SELECT ELEMENT g.c.co FROM group AS g) AS readings, count(*) AS number, avg(c.co) AS average sensor : readings: number : average : }, 1, {{ 0.3, 0.5 }} 2, 0.4 { sensor : readings: number : average : 2, {{ 0.6 }}, 1, 0.6 } }} Alike OQL and unlike SQL’s semantics, the SELECT clause semantics make sense just by knowing the incoming variables (and environment) ℾ ⊢ ⟨ sensors : {{ { sensor: "s1", reading: {co: 0.7, no2: 2 } } { sensor: "s2", reading: {co: 0.5, so2: 1.3} } }} ⟩ FROM sensors AS s INNER CORRELATE s.reading AS {g:v} GROUP BY g AS x Q ↓ V SELECT ELEMENT { gas: x, reading: ( FROM group AS p SELECT ATTRIBUTE p.s.sensor : p.v )} {{ ⟨ s : 
{ ⟨s:{ ⟨s:{ ⟨s:{ sensor:"s1",reading:{co:0.7,no2:2} }, g : "co", v : 0.7 ⟩, sensor:"s1",reading:{co:0.7,no2:2} }, g : "no2", v : 2 ⟩, sensor:"s2",reading:{co:0.5,so2:1.3} }, g : "co", v : 0.5 ⟩, sensor:"s2",reading:{co:0.5,so2:1.3} }, g : "so2", v : 1.3 ⟩ }} {{ ⟨ x : "co", group : {{ {s:{ sensor:"s1",reading:{co:0.7,no2:2} }, g:"co",v:0.7}, {s:{ sensor:"s2",reading:{co:0.5,so2:1.3} },g:"co",v:0.5} }} ⟩, ⟨ x : "no2", group : {{ {s:{ sensor:"s1",reading:{co:0.7,no2:2} }, g:"no2",v:2} }} ⟩, ⟨ x : "so2", group : {{ {s:{ sensor:"s2",reading:{co:0.5,so2:1.3} },g:"so2",v:1.3} }} ⟩ }} {{ { gas: "co", reading: { s1: 0.7, s2: 0.5 }, { gas: "no2", reading: { s1: 2 }, { gas: "so2", reading: { s2: 1.3 } }} Summing Up: The SQL++ Changes to SQL • Goal: Queries that input/output any JSON type, yet backwards compatible with SQL • Adjust SQL syntax/semantics for heterogeneity, complex values • Achieved with few extensions and SQL limitation removals – Element Variables – Correlated Subqueries (in FROM and SELECT clauses) – Optional utilization of Input Order – Grouping indendent of Aggregation Functions – ... and a few more details What SQL aspects have been omitted • SQL’s * – Could be done using ranging over attributes – It is easily replaced by ability to project in SELECT full tuples (or full anythings) – An optimizable implementation needs schema • Consider the view V = SELECT * FROM R r, S s, T t WHERE … • How do you push a condition? – SELECT … FROM V v WHERE v.foo = sth • Selected SQL operations (UNION) that rely on attributes having a certain order in their tuples • SQL’s coercion of subqueries Operations in the absence of information and/or absence of typing What if in • FROM coll v –coll is not a bag/array (eg, is a tuple or a scalar) • SELECT x.foo AS bar –x has no foo • x.foo = x.bar – when different types or missing sides • For good or bad many options have been taken! 
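The "what if" cases above (navigating a missing attribute, navigating into a non-tuple, comparing values of different types) are exactly where the surveyed languages diverge. As a minimal Python sketch, assuming option names that mirror the @tuple_nav config parameters discussed later (absent, type_mismatch); the function itself is illustrative, not part of any SQL++ implementation:

```python
# Sketch: tuple navigation x.foo under configurable semantics.
# The `MISSING` sentinel stands in for SQL++'s special `missing` value.

MISSING = object()

def tuple_nav(x, attr, absent="missing", type_mismatch="error"):
    """Evaluate x.attr; the keyword options pick one language's behavior."""
    if not isinstance(x, dict):          # navigating into a non-tuple
        if type_mismatch == "error":
            raise TypeError("cannot navigate a non-tuple")
        return MISSING if type_mismatch == "missing" else None
    if attr not in x:                    # the attribute is absent
        if absent == "error":
            raise KeyError(attr)
        return MISSING if absent == "missing" else None
    return x[attr]

t = {"sid": 2}
print(tuple_nav(t, "sid"))                     # 2
print(tuple_nav(t, "foo") is MISSING)          # True: SQL++-style missing
print(tuple_nav(t, "foo", absent="null"))      # None: null-returning style
```

Different settings of the two options reproduce the divergent behaviors that the feature matrices below classify per language.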
SQL++ Removes Superficial Differences • Removing the current “Tower of Babel” effect • Providing formal syntax and semantics SELECT AVG(temp) AS tavg FROM readings GROUP BY sid SQL db.readings.aggregate( {$group: {_id: "$sid", tavg: {$avg:"$temp"}}}) readings -> group by sid = $.sid into { tavg: avg($.temp) }; Jaql for $r in collection("readings") group by $r.sid return { tavg: avg($r.temp) } JSONiq MongoDB a = LOAD 'readings' AS (sid:int, temp:float); b = GROUP a BY sid; c = FOREACH b GENERATE AVG(temp); DUMP c; Pig Surveyed features 15 feature matrices (1-11 dimensions each) classifying: • Data values • Schemas • Access and construct nested data • Missing information • Equality semantics • Ordering semantics • Aggregation • Joins • Set operators • Extensibility Methodology For each feature: 1. A formal definition of the feature in SQL++ 2. A SQL++ example 3. A feature matrix that classifies each dimension of the feature 4. A discussion of the results, partial support and unexpected behaviors Example: Data values 1. SQL++ example: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 { location: 'Alpine', readings: [ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 }, { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } ] } Example: Data values 2. SQL++ BNF for values: Example: Data values 3. 
Feature matrix: Composability (top-level values) Heterogeneity Arrays Bags Sets Maps Tuples Primitives Hive Bag of tuples No Yes No No Partial Yes Yes Jaql Any Value Yes Yes No No No Yes Yes Pig Bag of tuples Partial No Partial No Partial Yes Yes CQL Bag of tuples No Partial No Partial Partial No Yes Any Value Yes Yes No No No Yes Yes MongoDB Bag of tuples Yes Yes No No No Yes Yes N1QL Bag of tuples Yes Yes No No No Yes Yes SQL Bag of tuples No No No No No No Yes AQL Any Value Yes Yes Yes No No Yes Yes BigQuery Bag of tuples No No No No No Yes Yes MongoJDBC Bag of tuples Yes Yes No No No Yes Yes Any Value Yes Yes Yes Partial Yes Yes Yes JSONiq SQL++ The SQL/JSON way to incorporating access to JSON files and JSON columns For those who really want their FROM clause to only have collections of tuples • UDF’s and constructs for turning the file into a table – FROM … json(file, pattern) AS j … WHERE c AND c’ – assuming the goal is relational results – pattern can escalate to a full blown SQL++ FROM – and WHERE assuming conditions c need to be explicitly pushed into json computations • think CORRELATE that has both structural and value based conditions • Piggybacking on nested features of SQL for JSON columns – FROM … json(t.jsoncol, pattern) AS j … From Language to Algebra as an extension of the Relational Algebra • • • • • Operator SCAN for scanning FROM collections into bindings Operators for INNER CORRELATE, OUTER CORRELATE Operator for evaluating nested subqueries Operator for constructing (as needed in SELECT) Operators for picking the bindings that turn into an output collection or a tuple Client How options complicate middleware Federated SQL++ Queries SQL++ Results FORWARD Virtual Database SQL++ Query Processor SQL++ Virtual Views of Sources SQL++ Virtual View SQL++ Virtual View SQL++ Virtual View SQL++ Virtual View SQL++ Virtual View SQL Wrapper NewSQL Wrapper NoSQL Wrapper SQL-onHadoop Wrapper Java Objects Wrapper SQL-onHadoop Database Java In-Memory Objects 
Native queries SQL Database NewSQL Database Native results NoSQL Database Client SQL++ Virtual and/or Materialized Views SQL++ Queries SQL++ Results FORWARD Integration Database SQL++ Query Processor SQL++ View Definitions SQL++ Integrated Views SQL++ Integrated View SQL++ Integrated View SQL++ Added Value View SQL++ Virtual View SQL++ Virtual View SQL++ Virtual View SQL++ Virtual View SQL++ Virtual View SQL Database NewSQL Database NoSQL Database SQL-onHadoop Database Java In-Memory Objects Use Case 1: Hide the ++ source features behind SQL views BI Tool / Report Writer SELECT m.sid, t AS temp FROM SQL query measurements AS m, SQL result m.temperatures AS t FORWARD Virtual Database SQL View measurements: sid 2 2 1 temp 70.1 70.2 71.0 Couchbase query or its SQL++ equivalent measurements: {{ {sid: 2, temp: 70.1}, {sid: 2, temp: 70.2}, {sid: 1, temp: 71.0} }} Couchbase results measurements: {{ { sid: 2, temperatures: [70.1, 70.2] }, { sid: 1, temperatures: [71.0] } }} Couchbase Use Case 2: Semantics-Aware Pushdown "Sensors that recorded a temperature below 50" SELECT DISTINCT m.sid FROM measurements AS m WHERE m.temp < 50 SQL++ semantics for m.temp < 50 SQL++ query MongoDB Virtual Database Virtual View (Identical) MongoDB Wrapper MongoDB query ... { $match: {temp: {$lt: 50}}}, { $match: {temp: {$not: null}}} ... MongoDB semantics for temp $lt 50 SQL++ result MongoDB results MongoDB measurements: {{ { sid: 2, temp: 70.1 }, { sid: 2, temp: 49.2 }, { sid: 1, temp: null } }} Push-Down Challenges: Limited capabilities & semantic variations • How to efficiently push down computation? Limited capabilities • Sources like SQL and MongoDB cannot execute the entirety of SQL++ Many semantic variations • How to mediate incompatible features? • Step 1: understand the space of options! SQL++ captures Semantics Variations • Semantics of "less-than" comparison are different across sources: <sql , <mongodb , etc. 
• Config Parameters to capture these variations @lt { MongoDB } ( x < y ) @lt { complex : type_mismatch: null_lt_null : null_lt_value: ... } ( x < y ) deep, false, false, false, Semantics Variations In NoSQL, NewSQL, SQL-on-Hadoop variation of semantics for: • Paths • Equality • Comparisons And all the operators that use them: • Selections • Grouping • Ordering • Set operations Each of these features has a set of config parameters in Configurable SQL++ Survey around configuration options • • • • • Fast evolving space Configuration options itemize and explain the space options Rationalize discussion towards standard And if no standard at least understand where we stand Survey Example: SELECT clause 1. SQL++ example: – Projecting nested collections: SELECT TUPLE s.lat, s.long, (SELECT r.ozone FROM readings AS r WHERE r.location = s.location) FROM sensors AS s "Position and (nested) ozone readings of each sensor?" – Projecting non-tuples: SELECT ELEMENT ozone FROM readings "Bag of all the (scalar) ozone readings?" Example: SELECT clause 3. Feature matrix: Projecting tuples containing nested collections Projecting non-tuples Hive Partial No Jaql Yes Yes Pig Partial No CQL No No JSONiq Yes Yes MongoDB Partial Partial N1QL Partial Partial SQL No No AQL Yes Yes BigQuery No No MongoJDBC No No SQL++ Yes Yes Example: SELECT clause 4. 
Discussion of the results: • Not well supported features • 3 languages support them entirely (same cluster as for data values) Projecting tuples containing nested collections Projecting non-tuples Hive Partial No Jaql Yes Yes Pig Partial No CQL No No JSONiq Yes Yes MongoDB Partial Partial N1QL Partial Partial SQL No No AQL Yes Yes BigQuery No No MongoJDBC No No SQL++ Yes Yes Config Parameters We use config parameters to encompass and compare various semantics of a feature: • • • • Minimal number of independent dimensions 1 dimension = 1 config parameter SQL++ formalism parametrized by the config parameters Feature matrix classifies the values of each config parameter Example: Paths • Config parameters for tuple navigation: @tuple_nav { absent : missing, type_mismatch: error, } ( x.y ) Example: Paths • The feature matrix classifies, for each language, the value of each config parameter: Missing Hive Error Jaql Null Pig Error CQL Error JSONiq Missing# MongoDB Missing N1QL Missing SQL Error AQL Null BigQuery Error MongoJDBC Missing# SQL++ @path Type mismatch Error Error# Error Error Missing# Missing Missing Error Error Error Missing# @path Example: Equality Function • Config parameters for equality: @eq { complex : type_mismatch : null_eq_null : null_eq_value : missing_eq_missing: missing_eq_value : missing_eq_null : } ( x = y ) error, false, null, null, missing, missing, missing Example: Equality Function • The feature matrix classifies, for each language, the value of each config parameter: • No real cluster • Some languages have multiple (incompatible) equality functions • Some edge cases cannot happen due to other limitations (SQL has no complex values) Complex Hive (=, <=>) Err, Err Jaql Boolean Pig Boolean partial CQL Err JSONiq (=, deep-equal()) Err, Boolean MongoDB Boolean N1QL Boolean SQL N/A AQL Err BigQuery Err MongoJDBC Boolean SQL++ @equal Type mismatch Err, Err Null Err Err Err, False False False Err Err Err False @equal Null = Null Null = Value Null, 
True Null Null Err True, True True Null Null Null Null True @equal Null, False Null Null False/Null False, False False Null Null Null Null False @equal Missing = Missing = Missing = Missing Value Null N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A Missing, True Missing, False Missing, False True False False Missing Missing Missing N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A False False @equal @equal @equal Advise on consistent settings of the configuration options • Not every configuration option setting is sensible • Configuration options have to subscribe to common sense iff x = y then [x] = [y] if x = y and y = z then x = z – which enables rewritings • Have to be consistent with each other Eg, if true AND missing is missing, then false OR missing is missing Use Case 3: Integrated Query Processing "Sensors in a given area that recorded a low temperature?" SELECT s.lat, s.lng, m.temp FROM sensors AS s JOIN measurements AS m SQL++ Query Processor ON s.id = m.sid SQL++ WHERE (s.lat > 32.6 ANDSQL++ s.lat < 32.9 AND s.lng > -117.0 AND s.lng < -117.3) SQL Virtual Database MongoDB Virtual Database AND m.temp < 50 SQL++ sensors: {{ {id:1, lat:32.8, lng:-117.1}, {id:2, lat:32.7, lng:-117.2} }} SQL Wrapper db.measurements.aggregate( { $match: {temp: {$lt: 50}} }, SQL { $match: {temp: {$not: PostgreSQL null}} } ) sensors: id lat lng 1 32.8 -117.1 2 32.7 -117.2 measurements: [ { sid: 2, temp: 70.1 }, { sid: 2, temp: 49.2 }, { sid: 1, temp: null } ] MongoDB Wrapper MongoDB measurements: [ { sid: 2, temp: 70.1 }, { sid: 2, temp: 49.2 }, { sid: 1, temp: null } ] MongoDB 2 Query Languages for Graphs The Unrestricted Graph Data Model • Nodes correspond to entities • Edges are binary, correspond to relationships • Edges may be directed or undirected • Nodes and edges may carry labels – multiple labels OK in some graph data models such as neo4j • Nodes and edges annotated with data – both have sets of attributes (key-value pairs) • A schema is not required to formulate queries Example 
Graph

Vertex types:
• Product (name, category, price)
• Customer (ssn, name, address)
Edge types:
• Bought (discount, quantity)
• Customer c bought 100 units of product p at a 5% discount: modeled by the edge
  c --(Bought {discount=5%, quantity=100})--> p

Expressing Graph Analytics: Two Different Approaches
• High-level query languages à la SQL
• Low-level programming abstractions
  – programs written in C++, Java, Scala, Groovy, …
• Initially adopted by disjoint communities (recall the NoSQL debates)
• Recent trend towards unification

Some Key Ingredients for High-Level Query Languages
• Pioneered by academic work on Conjunctive Query (CQ) extensions for graphs (in the 90s)
  – path expressions (PEs) for navigation
  – variables for manipulating data found during navigation
  – stitching multiple PEs into complex navigation patterns: conjunctive regular path queries (CRPQs)
• Beyond CRPQs, needed in modern applications:
  – aggregation of data encountered during navigation; support for bag semantics as a prerequisite
  – intermediate results assigned to nodes/edges
  – control flow support for the class of iterative algorithms that converge to a result in multiple steps
    • (e.g. PageRank-class algorithms, recommender systems, shortest paths, etc.)

Path Expressions
• Express reachability via constrained paths
• Early graph-specific extension over conjunctive queries
• Introduced initially in academic research in the early 90s
  – StruQL (AT&T Research; Fernandez, Halevy, Suciu)
  – WebSQL (Mendelzon, Mihaila, Milo)
  – Lorel (Widom et al.)
• Today supported by the languages of commercial systems
  – Cypher, SparQL, Gremlin, GSQL

Path Expression Syntax
Notations vary; adopting here that of the SparQL W3C Recommendation.

path :=  edge label
      |  _                   // wildcard, any edge label
      |  ^ edge label        // inverse edge
      |  path . path         // concatenation
      |  path | path         // alternation
      |  path*               // 0 or more repetitions
      |  path*(min,max)      // at least min, at most max repetitions
      |  (path)

Path Expression Examples (1)
• Pairs of customer and product they bought:
  Bought
• Pairs of customer and product they were involved with (bought or reviewed):
  Bought|Reviewed
• Pairs of customers who bought the same product (lists customers with themselves):
  Bought.^Bought

Path Expression Examples (2)
• Pairs of customers involved with the same product (like-minded):
  (Bought|Reviewed).(^Bought|^Reviewed)
• Pairs of customers connected via a chain of like-minded customer pairs:
  ((Bought|Reviewed).(^Bought|^Reviewed))*

Path Expression Semantics
• In most academic research, the semantics is defined in terms of sets of node pairs
• Traditionally specified in two ways:
  – declaratively
  – procedurally, based on algebraic operations over relations
• These are equivalent

Classical Declarative Semantics
• Given:
  – a graph G
  – a path expression PE
• the meaning of PE on G, denoted PE(G), is the set of node pairs (src, tgt) such that there exists a path in G from src to tgt whose concatenated labels spell out a word in L(PE)
  L(PE) = the language accepted by PE when seen as a regular expression over the alphabet of edge labels

Classical Procedural Semantics
PE(G) is a binary relation over nodes, defined inductively:
• E(G) = set of s-t node pairs of E-labelled edges in G
• _(G) = set of s-t node pairs of all edges in G
• ^E(G) = set of t-s node pairs of E-labelled edges in G
• P1.P2(G) = P1(G) ∘ P2(G)      // relational composition
• P1|P2(G) = set union of P1(G) and P2(G)
• P*(G) = reflexive transitive closure of P(G)   // finite due to saturation

Conjunctive Regular Path Queries
• Replace the relational atoms appearing in CQs with path expressions.
• Explicitly introduce variables binding to the source and target nodes of path expressions.
• Allow multiple path expression atoms in the query body.
• Variables can be used to stitch multiple path expression atoms into complex patterns.
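The procedural set semantics above maps directly onto operations over sets of node pairs. A minimal Python sketch, with an invented toy graph of (source, label, target) triples; the helper names are illustrative, not part of any system:

```python
def edges(G, label):
    """E(G): the s-t node pairs of edges carrying `label`."""
    return {(s, t) for (s, l, t) in G if l == label}

def inverse(R):
    """^E(G): swap source and target in each pair."""
    return {(t, s) for (s, t) in R}

def compose(R1, R2):
    """P1.P2(G): relational composition of two node-pair relations."""
    return {(s, t) for (s, m) in R1 for (m2, t) in R2 if m == m2}

def star(R):
    """P*(G): reflexive transitive closure; finite due to saturation."""
    nodes = {n for pair in R for n in pair}
    closure = {(n, n) for n in nodes} | set(R)
    while True:
        bigger = closure | compose(closure, R)
        if bigger == closure:
            return closure
        closure = bigger

# Toy graph (data invented for illustration)
G = {("c1", "Bought", "p1"), ("c2", "Bought", "p1"), ("c3", "Bought", "p2")}

# Bought.^Bought: pairs of customers who bought the same product
bought = edges(G, "Bought")
same_product = compose(bought, inverse(bought))
print(sorted(same_product))
# [('c1', 'c1'), ('c1', 'c2'), ('c2', 'c1'), ('c2', 'c2'), ('c3', 'c3')]
```

Note that the result lists customers paired with themselves, exactly as the Bought.^Bought example above warns; the CRPQ version adds the filter c1 != c2.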
CRPQ Examples
• Pairs of customers who have bought the same product (do not list a customer with herself):
  Q1(c1,c2) :- c1 –Bought.^Bought-> c2, c1 != c2
• Customers who have bought and also reviewed a product:
  Q2(c) :- c –Bought-> p, c –Reviewed-> p

CRPQ Semantics
• Naturally extended from single path expressions, following the model of CQs
• Declarative
  – lifting the notion of satisfaction of a path expression atom by a source-target node pair to the notion of satisfaction of a conjunction of atoms by a tuple
• Procedural
  – based on SPRJ manipulation of the binary relations yielded by the individual path expression atoms

Limitation of Set Semantics
• Common graph analytics need to aggregate data
  – e.g. count the number of products two customers have in common
• Set semantics does not suffice
  – the baked-in duplicate elimination affects the aggregation
• As in SQL, practical systems resort to bag semantics

Path Expressions Under Bag Semantics
PE(G) is a bag of node pairs, defined inductively ("set" becomes "bag" throughout):
• E(G) = bag of s-t node pairs of E-labelled edges in G
• _(G) = bag of s-t node pairs of all edges in G
• ^E(G) = bag of t-s node pairs of E-labelled edges in G
• P1.P2(G) = P1(G) ∘ P2(G)      // relational composition for bags
• P1|P2(G) = bag union of P1(G) and P2(G)
• P*(G) = reflexive transitive closure of P(G)   // not necessarily finite under bag semantics!

Issues with Bag Semantics
• Performance and semantic issues due to the number of distinct paths
• The multiplicity of an s-t pair in the query output reflects the number of distinct paths connecting s with t
  – Even in DAGs, these can be exponentially many.
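The chain-of-diamonds effect can be simulated with a small bag-relational sketch in Python, where relations are Counters mapping node pairs to multiplicities; the graph construction and helper names are invented for illustration:

```python
from collections import Counter

def compose(R1, R2):
    """Bag relational composition: path multiplicities multiply."""
    out = Counter()
    for (s, m1), c1 in R1.items():
        for (m2, t), c2 in R2.items():
            if m1 == m2:
                out[(s, t)] += c1 * c2
    return out

def diamond_chain(k):
    """Edge relation of k diamonds glued end to end, as a bag.

    Each diamond i offers two 2-edge routes from n_i to n_{i+1}:
    via a_i (upper branch) or via b_i (lower branch).
    """
    E = Counter()
    for i in range(k):
        E[(f"n{i}", f"a{i}")] += 1
        E[(f"n{i}", f"b{i}")] += 1
        E[(f"a{i}", f"n{i+1}")] += 1
        E[(f"b{i}", f"n{i+1}")] += 1
    return E

E = diamond_chain(10)
paths = E
for _ in range(2 * 10 - 1):      # a path through 10 diamonds has 20 edges
    paths = compose(paths, E)
print(paths[("n0", "n10")])      # 1024 = 2**10 distinct source-to-sink paths
```

Even though this graph is a DAG with only 40 edges, the multiplicity of the (n0, n10) pair is already 2^10; under set semantics the same pair would simply appear once.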
Chain of diamonds example.
  – In cyclic graphs, these can be infinitely many

Solutions In Practice: Bound Traversal Length
• Upper-bound the length of the traversed path
  – Recall the bounded Kleene construct *(min,max)
  – Bounds the length, and hence the number, of distinct paths considered
  – Supported by Gremlin, Cypher, SPARQL, GSQL; very common in tutorial examples and in industrial practice

Solutions In Practice: Restrict Cycle Traversal
• Simple paths only (no repeating vertices)
  – Rules out paths that go around cycles
  – Recommended in Gremlin style guides, tutorials, and the formal semantics paper
  – Gremlin's simplePath() predicate supports this semantics
  – Problem: membership of an s-t pair in the result is NP-hard
• Paths must not repeat edges
  – Allows cyclic paths
  – Rules out paths that go around the same cycle more than once
  – This is the Cypher semantics

Solutions In Practice: Mix Bag and Set Semantics
• Bag semantics for star-free fragments of the PE
• Set semantics for Kleene-starred fragments of the PE
• Combine them using (bag-aware) joins
• Example: p1.p2*.p3(G) treated as p1(G) ∘ (distinct(p2*(G))) ∘ p3(G)
• This is the SPARQL semantics (in the W3C Recommendation)

Solutions In Practice: Leave It to the User
• The user explicitly programs the desired semantics
• The path is a first-class citizen and can be mentioned in the query
• Can simulate each of the above semantics, e.g.
by checking the path for repeated nodes/edges
• Could lead to infinite traversals for buggy programs
• Supported by Gremlin and GSQL; also partially by Cypher (modulo the restriction that only edge-nonrepeating paths are visible)

One Semantics I Would Prefer
• Allow paths to go around cycles, even multiple times
• Achieve finiteness by restriction to pumping-minimal paths
  – in the sense of the Pumping Lemma for Finite State Automata (FSA)
  – PEs are regular expressions, so they have an equivalent FSA representation
  – As the path is traversed, the FSA state changes at every step
  – Rule out paths in which a vertex is repeatedly reached in the same FSA state
• Can be programmed by the user in Gremlin and GSQL (costly!)

Aggregation

Let's See It First as a CQ Extension
• Count toys bought in common per customer pair
  Q(c1, c2, count(p)) :- c1 –Bought-> p, c2 –Bought-> p, p.category = "toys", c1 < c2
• c1, c2: composite group key – no explicit group-by clause
• Standard syntax for aggregation-extended CQs and Datalog
• Rich literature on semantics (tricky for Datalog when aggregation and recursion interleave)

Aggregation in Modern Graph QLs
• Cypher's RETURN clause uses similar syntax to aggregation-extended CQs
• Gremlin and SPARQL use an SQL-style GROUP BY clause
• GSQL uses aggregating containers called "accumulators"

Flavor of Representative Languages

Running Example in CRPQ Form
• Recall: count toys bought in common per customer pair
  Q(c1, c2, count(p)) :- c1 –Bought-> p, c2 –Bought-> p, p.category = "toys", c1 < c2

SPARQL
• Query language for the semantic web
  – graphs corresponding to RDF data are directed, labeled graphs
• W3C Standard Recommendation

Running Example in SPARQL
SELECT ?c1 ?c2 (COUNT(?p) AS ?inCommon)
WHERE {
  ?c1 bought ?p .
  ?c2 bought ?p .
  ?p category ?cat .
  FILTER (?cat = "toys" && ?c1 < ?c2)
}
GROUP BY ?c1 ?c2

SPARQL Semantics by Example
• Coincides with the CRPQ version
  Q(c1, c2, count(p)) :- c1 –Bought-> p, c2 –Bought-> p, p.category = "toys", c1 < c2

Cypher
• The query language of the Neo4j commercial native graph DB system
• Essentially StruQL with some bells and whistles
• Also supported in a variety of other systems:
  – SAP HANA Graph, Agens Graph, Redis Graph, Memgraph, CAPS (Cypher for Apache Spark), ingraph, Gradoop, Ruruki, Graphflow

Running Example in Cypher
MATCH (c1:Customer) -[:Bought]-> (p:Product) <-[:Bought]- (c2:Customer)
WHERE p.category = "Toys" AND c1.name < c2.name
RETURN c1.name AS cust1, c2.name AS cust2, COUNT(p) AS inCommon
• c1.name, c2.name form the composite group key – no explicit group-by clause, just like a CQ

Cypher Semantics by Example
• Coincides with the CRPQ version
  Q(c1, c2, count(p)) :- c1 –Bought-> p, c2 –Bought-> p, c1 < c2
• Modulo the non-repeating-edge restriction – no effect here, since repeated-edge paths satisfying the two PE atoms would necessarily have c1 = c2

Gremlin
• Supported by major Apache projects
  – TinkerPop and JanusGraph
• Also by commercial systems including
  – TitanGraph (DataStax)
  – Neptune (Amazon)
  – Azure (Microsoft)
  – IBM Graph

Gremlin Semantics
• Based on traversers, i.e.
tokens that flow through the graph, binding variables along the way
• A Gremlin program adorns the graph with a set of traversers that co-exist simultaneously
• A program is a pipeline of steps; each step works on the set of traversers whose state corresponds to that step
• Steps can be
  – map steps (work in parallel on individual traversers)
  – reduce steps (aggregate a set of traversers into a single traverser)

Gremlin Semantics by Example
V()
  place one traverser on each vertex

Gremlin Semantics by Example
V().hasLabel('Customer')
  filter traversers by label

Gremlin Semantics by Example
V().hasLabel('Customer').as('c1')
  extend each traverser t: bind variable 'c1' to the vertex where t resides

Gremlin Semantics by Example
V().hasLabel('Customer').as('c1')
  .out('Bought')
  Traversers flow along out-edges of type 'Bought'. If multiple such edges exist for a Customer vertex v, the traverser at v splits into one copy per edge, placed at the edge destination.

Gremlin Semantics by Example
V().hasLabel('Customer').as('c1')
  .out('Bought').hasLabel('Product').has('category','Toys')
  Filter traversers at the destination of 'Bought' edges: the vertex label must be 'Product' and the vertex must have a property named 'category' with value 'Toys'.

Gremlin Semantics by Example
V().hasLabel('Customer').as('c1')
  .out('Bought').hasLabel('Product').has('category','Toys').as('p')
  Extend surviving traversers with a binding of variable 'p' to their location vertex. Now each surviving traverser has two variable bindings.

Gremlin Semantics by Example
V().hasLabel('Customer').as('c1')
  .out('Bought').hasLabel('Product').has('category','Toys').as('p')
  .in('Bought')
  Surviving traversers cross incoming edges of type 'Bought'. Multiple in-edges result in further splits.
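The traverser model above can be simulated with a toy Python sketch. This is not the TinkerPop API: the function names (V, out, in_, as_, has) merely mirror Gremlin steps, and the graph is assumed to be a set of (src, label, tgt) edge triples. A traverser is a (current vertex, bindings) pair; out-steps split a traverser into one copy per matching edge, and filters drop traversers.

```python
def V(vertices):
    """Place one traverser (vertex, empty bindings) on each vertex."""
    return [(v, {}) for v in vertices]

def out(traversers, G, label):
    """Flow each traverser along out-edges of the given label,
    splitting it into one copy per matching edge."""
    return [(t, dict(b)) for (v, b) in traversers
            for (s, l, t) in G if s == v and l == label]

def in_(traversers, G, label):
    """Flow each traverser backwards along in-edges of the given label."""
    return [(s, dict(b)) for (v, b) in traversers
            for (s, l, t) in G if t == v and l == label]

def as_(traversers, name):
    """Bind `name` to each traverser's current vertex."""
    return [(v, {**b, name: v}) for (v, b) in traversers]

def has(traversers, pred):
    """Keep only traversers whose current vertex satisfies pred."""
    return [(v, b) for (v, b) in traversers if pred(v)]
```

Chaining `as_(V(...), 'c1')`, `out(..., 'Bought')`, `as_(..., 'p')`, `in_(..., 'Bought')`, `as_(..., 'c2')` reproduces the splitting behavior of the running example: two customers buying one shared product end up as four traversers, one per (c1, c2) combination.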
Gremlin Semantics by Example
V().hasLabel('Customer').as('c1')
  .out('Bought').hasLabel('Product').has('category','Toys').as('p')
  .in('Bought').hasLabel('Customer').as('c2')
  .select('c1','c2','p').by('name')
  For each traverser, extract the tuple of bindings for variables c1, c2, p and return its projection on the 'name' property.

Gremlin Semantics by Example
V().hasLabel('Customer').as('c1')
  .out('Bought').hasLabel('Product').has('category','Toys').as('p')
  .in('Bought').hasLabel('Customer').as('c2')
  .select('c1','c2','p').by('name')
  .where('c1', lt('c2'))
  Filter these tuples according to the where condition.

Gremlin Semantics by Example
V().hasLabel('Customer').as('c1')
  .out('Bought').hasLabel('Product').has('category','Toys').as('p')
  .in('Bought').hasLabel('Customer').as('c2')
  .select('c1','c2','p').by('name')
  .where('c1', lt('c2'))
  .group().by(select('c1','c2')).by(count())
  Group the tuples: the first by() specifies the group key, the second by() specifies the group aggregation.

GSQL
• The query language of TigerGraph, a native parallel graph DB system
• A recent start-up founded by UCSD DB lab PhD alum Yu Xu
• Full disclosure: I have been involved in the design of GSQL

GSQL Accumulators
• GSQL traversals collect and aggregate data by writing it into accumulators
• Accumulators are containers (data types) that
  – hold a data value
  – accept inputs
  – aggregate inputs into the data value using a binary operation
• May be built-in (sum, max, min, etc.)
or user-defined
• May be
  – global (a single container accessible from all traversal steps)
  – local (one per node, accessible only when reached by traversal)

Running Example in GSQL
GroupByAccum<string cust1, string cust2, SumAccum<int> inCommon> @@res;  // global accum
SELECT _
FROM   Customer:c1 -(Bought>)- Product:p -(<Bought)- Customer:c2
WHERE  p.category == "Toys" AND c1.name < c2.name
ACCUM  @@res += (c1.name, c2.name -> 1);
• cust1, cust2 form the composite group key; inCommon is the group value (a sum aggregation)
• The ACCUM clause creates an input associating value 1 to key (c1.name, c2.name) and aggregates this input into the accumulator

GSQL Semantics by Example
GroupByAccum<string cust1, string cust2, SumAccum<int> inCommon> @@res;
SELECT _
FROM   Customer:c1 -(Bought>)- Product:p -(<Bought)- Customer:c2
WHERE  p.category == "Toys" AND c1.name < c2.name
ACCUM  @@res += (c1.name, c2.name -> 1);
• For every distinct path satisfying the FROM pattern and WHERE condition, execute the ACCUM clause

Why Aggregate in Accumulators Instead of Select-Group By Clauses?
GroupByAccum<string cust, SumAccum<float> total> @@cSales;  // revenue per customer
GroupByAccum<string prod, SumAccum<float> total> @@pSales;  // revenue per product
SELECT _
FROM   Customer:c -(Bought>:b)- Product:p
ACCUM  float thisSalesRevenue = b.quantity*(1-b.discount)*p.price,
       @@cSales += (c.name -> thisSalesRevenue),
       @@pSales += (p.name -> thisSalesRevenue);
• thisSalesRevenue is a local variable: this is a let clause
• Multiple aggregations in one pass, even on different group keys

Local Accumulators
• Minimize bottlenecks due to shared global accums, maximize opportunities for parallel evaluation
SumAccum<float> @cSales, @pSales;  // local accums, one instance per node
SELECT _
FROM   Customer:c -(Bought>:b)- Product:p
ACCUM  float thisSalesRevenue = b.quantity*(1-b.discount)*p.price,
       c.@cSales += thisSalesRevenue,
       p.@pSales += thisSalesRevenue;
• Groups are distributed; each node accumulates its own group

Role of SELECT Clause?
Compositionality
• Queries can output a set of nodes, stored in variables
• Used by subsequent queries as a traversal starting point (query chaining):
  S1 = SELECT t FROM S0:s – pattern1 – T1:t WHERE … ACCUM …
  S2 = SELECT t FROM S1:s – pattern2 – T2:t … WHERE … ACCUM …
  S3 = SELECT t FROM S1:s – pattern3 – T3:t … WHERE … ACCUM …
• Variable S1 stores the set of nodes reached in the traversal; the node-set variable S1 is used in the subsequent traversals

2 Autonomous Database Systems

(1) Automatic performance tuning
• Identify the system bottleneck
  – CPU
  – Memory structures
  – IO capacity issues
  – Network latency
  – Index
• Tuning model
  1. Traditional ML models such as Lasso
  2. Deep neural networks

Ji Zhang et al. An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning. In Proceedings of the 2019 ACM International Conference on Management of Data (SIGMOD '19).
Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon, and Bohan Zhang. Automatic Database Management System Tuning Through Large-scale Machine Learning. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). ACM, New York, NY, USA, 1009-1024.

Thanks!