Database Systems 236363
NoSQL
Source: http://geekandpoke.typepad.com/

“Definition”
• Literally: NoSQL = Not Only SQL
• Practically: anything that deviates from standard Relational Database Management Systems (RDBMS)

Reminder: What is an RDBMS?
• Relational data model
  – Structured
  – Data represented through a collection of tables/relations
• Relational query language + relational data manipulation language + relational data definition language
  – As a standard, SQL
• Strongly consistent concurrency control
  – The notion of transactions
  – The ACID model
  – Taught in the 234322 course
This is in fact where most of the hype is
NoSQL = any deviation from RDBMS on any of these axes

What’s Driving this Trend?
1. The relational data model is not perfectly suited for all applications
  – Semi-structured models (and corresponding query languages) are a better fit for some cases
    • We’ve already talked about XML
    • Other types of document data models
  – In other cases, a graph-oriented data model (and query language) is a better fit
    • For example, for exploring the social graphs of social networks
  – In others, column-family or even schema-less key-value semantics is enough
    • In the latter, both the key and the value can be arbitrary objects

What’s Driving this Trend?
2.
Performance, distribution, and Web scale
  – Traditional databases cannot keep up with the performance required by very large scale analytics applications (OLAP)
  – Internet companies need to handle massive amounts of data, and the data needs to be available all the time with fast response times
    • This leads to building large distributed data centers
    • Some choose weak consistency: BASE rather than ACID (details in more advanced courses)
Warning: this is a hyped new domain, so there is great confusion over what is what, with no precise agreed-upon definitions

Key-Value
• Interface consists of
  – put(key,value) and get(key); sometimes also scan()
• Typically, the value can be of arbitrary type
• The key can either be a string or an arbitrary type
• The responsibility is delegated entirely to the application
  – Both for semantics enforcement and logic execution
• These systems are performance and availability driven
  – Most such systems provide BASE instead of ACID
  – Detailed discussion in 234322 (File Systems), 236351 (Distributed Systems), 236620 (Big Data); here, only a brief discussion

On the Notion of Transactions
• A transaction is a collection of operations that form a single logical unit
  – Reserve a seat on the plane
    • Verify that the seat is vacant at the price I was quoted and then reserve it, which also prevents others from reserving it
      – This might also involve charging a credit card
  – Order something from an online store
    • Verify that the item exists in stock
  – Operations in a transactional file system
    • Moving a file from one directory to another
      – Involves verifying that the file exists, creating a copy in the new directory, and deleting the old one
  – Usually, each SQL query forms a single transaction

In Practice Things Get Complicated
• In practice, computers are often parallel
  – Further, many systems these days are distributed, adding another element of parallelism
• What happens when two people try to reserve the same seat on an
airplane concurrently?
  – The common answer: only one should succeed
• What happens if a transaction fails in the middle?
  – The seat was already taken: should we charge the credit card?
  – The credit company refused the payment: should we hold the seat?
  – The receiving file system directory is full: should we remove the old copy from the old directory?

ACID vs. BASE
• Traditional semantics
  – Atomicity
  – Consistency
  – Isolation
  – Durability
  Hard to implement efficiently, especially in a distributed system. The system may need to block during network interruptions to avoid violating the strong consistency requirement
• Key-Value typical semantics
  – Basically Available
  – Soft state
  – Eventual consistency
  When needed, willing to sacrifice strong consistency in favor of availability and performance
More on this in other courses (File Systems, Distributed Systems, Big Data)

Column-Family
• The data model here provides the abstraction of a table with multiple column-families
  – Each row is identified by a key
  – Each row consists of multiple column-families
    • Sometimes a column-family is called a collection
  – Each column-family consists of one or more columns
    • Yet, different rows may include different columns in the same column-family
  – Data is typically immutable
    • Cells include multiple versions
• The motivation here is again performance and availability
  – Decentralized implementations
  – Denormalization instead of normalization and joins
  – Some systems provide strong consistency and atomicity guarantees while others do not

More on Column-Family
• There are many variants of this model
  – The first well-known example of this model is Google’s Bigtable
    • Originally used only inside Google
  – A very well-known open source implementation is called Cassandra
    • Initially developed by Facebook
    • Currently an Apache open source project
    • Cassandra Query Language (CQL)

Graph Database
• A property graph contains nodes and directed edges
  – Nodes represent entities
  – Edges represent (directed &
labeled) relationships
• Both can have properties
  – Properties are key-value pairs
    • Keys are strings; values are arbitrary data types
• Best suited for highly connected data
  – RDBMS is better suited for aggregated data

Graph Database
• Example
  – Nodes can be users of a social network and edges can represent “friend” relationships
  – Nodes can represent users and books and edges represent “purchased” relationships
  – Nodes can represent users and restaurants and edges represent “recommended” relationships
    • The edge properties here can be the ranking as well as the textual review

Queries on a Graph Database
• The basic mechanism is called a Traverse
  – It starts from a given node and explores portions of the graph based on the query
• For example
  – Who are the friends of friends of friends of Amy?
  – What is the average rating for a given movie given by users whose friendship-distance from me is at most 5 hops?

Motivation for Graph Databases
• Generality and convenience
  – Many things can be naturally modeled as a graph
• Performance
  – The cost of joins does not increase with the total size of the data, but rather depends on the local part of the graph that is traversed by the query processor
• Extensibility and flexibility
  – New node types and new relationship types can be added to an existing graph
• Agility
  – The ability to follow agile programming and design methods

Neo4j and Cypher
• Neo4j is an open source graph database
• Cypher is a widely used graph query language, implemented in Neo4j
  – Cypher is simple to learn
• The most basic Cypher query includes the following structure:
  – Starting node(s)
    • used to limit the search to certain areas of the graph
  – Pattern matching expression
    • w.r.t.
the starting node(s)
  – Return expression
    • based on variables bound in the pattern matching

Cypher Simple Example
START a=node:user(name='Michael')
MATCH (c)-[:KNOWS]->(b)-[:KNOWS]->(a), (c)-[:KNOWS]->(a)
RETURN b, c
Note: the START clause is optional and will eventually be deprecated

Matching Nodes in Cypher
• (a) : node a
  – If a is already bound, we search for this specific node; otherwise, any node matches and is then bound to a
• () : some node
• (:Ntype) : some node of type Ntype
• (a:Ntype) : node a of type Ntype
• (a { prop:'value' } ) : node a that has a property called prop with the value 'value'
• (a:Ntype { prop:'value' } ) : node a of type Ntype that has a property called prop with the value 'value'

Matching Relationships in Cypher
• (a)--(b) : nodes a and b are related by a relation (in either direction)
• (a)-->(b) : node a has a relation to b
• (a)<--(b) : node b has a relation to a
• (a)-->() : node a has a relation to some node
• (a)-[r]->(b) : a is related to b by the relation r
• (a)-[:Rtype]->(b) : a is related to b by a relation of type Rtype
• (a)-[:R1|:R2]->(b) : a is related to b by a relation of type R1 or type R2
• (a)-[r:Rtype]->(b) : a is related to b by a relation r of type Rtype

Advanced Matching Relationships
• (a)-->(b)<--(c) and (a)-->(b)-->(c) : multiple relations
• (a)-[:Rtype*2]->(b) : (a) is 2 hops away from (b) over relations of type Rtype
  – If Rtype is not specified, can be any relation (and different relations in each hop)
  – When no number follows the *, it means a path of any length
• (a)-[:Rtype*minHops..maxHops]->(b) : (a) is at least minHops and at most maxHops away from (b) over relations of type Rtype
  – If minHops is not specified, default is 1
  – If maxHops is not specified, default is infinity
  – Can even be 0!
• (a)-[r*2]->(b) : (a) is 2 hops away from (b) over the sequence of relations r
• (a)-[*{prop:val}]->(b) : we search for paths in which all relations have a property prop whose value is val

More Advanced Matching Relationships
• Named paths
MATCH p=(a {prop:val} )-->() RETURN p
• Shortest path
  – shortestPath((a)-[*minHops..maxHops]-(b))
    • Finds the shortest path of length between minHops and maxHops between (a) and (b)
  – allShortestPaths((a)-[*]-(b))
    • Finds all shortest paths between (a) and (b)

Augmented Return
• Column alias
MATCH (a { name: "A" }) RETURN a.age AS SomethingTotallyDifferent
• Unique results
MATCH (a { name: "A" })-->(b) RETURN DISTINCT b
• Other expressions
  – Any expression can be used as a return item: literals, predicates, properties, functions, and everything else
MATCH (a { name: "A" }) RETURN a.age > 30, "I'm a literal",(a)-->()
  – The result consists of the value true, "I'm a literal", and the result of evaluating the pattern (a)-->()

ORDER BY
• Order results by properties
MATCH (n) RETURN n ORDER BY n.age, n.name
• Descending order
MATCH (n) RETURN n ORDER BY n.name DESC
• NULL is always ordered last in ascending order (default) and first in descending order
  – Note that missing node/relationship properties evaluate to null

Where Clauses
• Provide criteria for filtering in pattern matching expressions
• Examples:
MATCH (n) WHERE n.name = 'Peter' XOR n.age < 30 RETURN n
MATCH (n) WHERE n.name =~ 'Tob.*' RETURN n
MATCH (tobias { name: 'Tobias' }),(others) WHERE others.name IN ['Andres', 'Peter'] AND (tobias)<--(others) RETURN others

Skip and Limit
• LIMIT crops the suffix of the result
• SKIP eliminates the prefix
MATCH (n) RETURN n ORDER BY n.name SKIP 1 LIMIT 2
• This expression returns the 2nd and 3rd elements of the previously computed result

With
• Used to manipulate the result sequence before it is passed on to the following query parts
  – One common usage of WITH is to limit the number of entries that are then
passed on to other MATCH clauses
  – WITH is also used to separate reading from updating of the graph
    • Every part of a query must be either read-only or write-only
    • When going from a reading part to a writing part, the switch must be done with a WITH clause
MATCH (david { name: "David" })--(otherPerson)-->()
WITH otherPerson, count(*) AS foaf
WHERE foaf > 1
RETURN otherPerson

MATCH (n)
WITH n
ORDER BY n.name DESC LIMIT 3
RETURN collect(n.name)

Union
• Combines the results of two or more queries into a single result set that includes all the rows returned by any of the queries in the union
  – The number and the names of the columns must be identical in all queries combined by using UNION
• To keep all the result rows, use UNION ALL
• Using just UNION will combine and remove duplicates from the result set
With duplicates:
MATCH (n:Actor) RETURN n.name AS name
UNION ALL
MATCH (n:Movie) RETURN n.title AS name
Without duplicates:
MATCH (n:Actor) RETURN n.name AS name
UNION
MATCH (n:Movie) RETURN n.title AS name

CREATE (nodes)
CREATE (n)
• Creates a node n
CREATE (n:Person)
• Creates a node n with label Person
CREATE (n:Person:Swedish)
• Creates a node n with two labels: Person and Swedish
CREATE (n:Person { name : 'Andres', title : 'Developer' })
• Creates a node n with label Person and properties name='Andres' and title='Developer'
CREATE (a { name : 'Andres' })
• Creates a node with a property name='Andres'

CREATE (relationships)
MATCH (a:Person),(b:Person)
WHERE a.name = 'Node A' AND b.name = 'Node B'
CREATE (a)-[r:RELTYPE]->(b)
RETURN r
• Creates a relationship of type RELTYPE
MATCH (a:Person),(b:Person)
WHERE a.name = 'Node A' AND b.name = 'Node B'
CREATE (a)-[r:RELTYPE { name : a.name + '<->' + b.name }]->(b)
RETURN r
• Creates a relationship with properties
CREATE p =(andres { name:'Andres' })-[:WORKS_AT]->(neo)<-[:WORKS_AT]-(michael { name:'Michael' })
RETURN p
• Creates a full path (nodes + relationships)

CREATE UNIQUE
• Creates only the parts of the graph that are missing in a CREATE
query
  – Left for the interested students to explore on their own…

Additional Cypher Clauses
• DELETE
  – Deletes nodes and relationships
• REMOVE
  – Removes labels and properties
• SET
  – Updates labels on nodes and properties on nodes and relationships
• FOREACH
  – Performs an updating action on each item in a collection or a path
MATCH p =(source)-[*]->(destination)
WHERE source.name='A' AND destination.name='D'
FOREACH (n IN nodes(p)| SET n.marked = TRUE )

A Note on Labels in Neo4j
• A node can have multiple labels
• Labels can be viewed as a combination of a tagging mechanism and an is-a relationship
  – They enable choosing nodes based on their label(s)
  – In the future, they would enable imposing restrictions on properties and values
    • I.e., act also as a light-weight optional schema
• A label can be assigned upon creation or by using the SET expression
• A label can be removed using the REMOVE expression

Operators
• Mathematical
  – +, -, *, /, %, ^
• Comparison
  – =, <>, <, >, >=, <=
• Boolean
  – AND, OR, XOR, NOT
• String
  – Concatenation through +
• Collection
  – Concatenation through +
  – IN to check if an element exists in a collection

Simple CASE Expression
CASE test
WHEN value THEN result
[WHEN ...]
[ELSE default]
END
Example:
MATCH (n)
RETURN CASE n.eyes
WHEN 'blue' THEN 1
WHEN 'brown' THEN 2
ELSE 3
END AS result
In CASE expressions, the evaluated test is compared against the value of each WHEN clause, one after the other, until the first one that matches. If none matches, the default is returned if it exists; otherwise, NULL is returned.

Generic CASE Expression
CASE
WHEN predicate THEN result
[WHEN ...]
[ELSE default]
END
Example:
MATCH (n)
RETURN CASE
WHEN n.eyes = 'blue' THEN 1
WHEN n.age < 40 THEN 2
ELSE 3
END AS result
Here, the predicates are evaluated in order until the first one that holds. If none holds, the default value is returned if it exists; otherwise, NULL.
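The first-match rule shared by the two CASE forms can be sketched outside the database. The Python below is only an illustration under stated assumptions: simple_case and generic_case are hypothetical helper names (not part of Cypher or Neo4j), nodes are modeled as plain dictionaries, and Cypher's special treatment of null in comparisons is ignored.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def simple_case(test: Any,
                whens: Sequence[Tuple[Any, Any]],
                default: Optional[Any] = None) -> Any:
    """Hypothetical model of the simple CASE form: compare `test`
    with each WHEN value in order and return the first matching result."""
    for value, result in whens:
        if test == value:
            return result
    # No WHEN matched: fall back to ELSE, or None (standing in for NULL).
    return default


def generic_case(whens: Sequence[Tuple[Callable[[], bool], Any]],
                 default: Optional[Any] = None) -> Any:
    """Hypothetical model of the generic CASE form: evaluate each
    predicate in order and return the result of the first that holds."""
    for predicate, result in whens:
        if predicate():
            return result
    return default


# Made-up sample data standing in for nodes with `eyes` and `age` properties.
nodes = [
    {"eyes": "blue", "age": 36},
    {"eyes": "brown", "age": 45},
    {"eyes": "green", "age": 25},
]

# Mirrors: CASE n.eyes WHEN 'blue' THEN 1 WHEN 'brown' THEN 2 ELSE 3 END
simple_results = [
    simple_case(n["eyes"], [("blue", 1), ("brown", 2)], default=3)
    for n in nodes
]

# Mirrors: CASE WHEN n.eyes = 'blue' THEN 1 WHEN n.age < 40 THEN 2 ELSE 3 END
generic_results = [
    generic_case(
        [(lambda n=n: n["eyes"] == "blue", 1),
         (lambda n=n: n["age"] < 40, 2)],
        default=3,
    )
    for n in nodes
]

print(simple_results)   # [1, 2, 3]
print(generic_results)  # [1, 3, 2]
```

The second and third nodes show the difference between the two forms: the simple form only looks at n.eyes, while the generic form can fall through to a predicate on a different property.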
Collections
• A literal collection is created by using brackets and separating the elements in the collection with commas
RETURN [0,1,2,3,4,5,6,7,8,9] AS collection
  – The result is the collection [0,1,2,3,4,5,6,7,8,9]
• Many ways of selecting elements from a collection, e.g.,
RETURN range(0,10)[3] : the element at index 3 (3 in this case)
RETURN range(0,10)[-3] : 3rd from the end (8 here)
RETURN range(0,10)[0..3] : [0,1,2]
RETURN range(0,10)[0..-5] : [0,1,2,3,4,5]
RETURN range(0,10)[..4] : [0,1,2,3]
RETURN range(0,10)[-5..] : [6,7,8,9,10]

More on Collections
RETURN [x IN range(0,10)| x^3] AS result
• Result: [0.0,1.0,8.0,27.0,64.0,125.0,216.0,343.0,512.0,729.0,1000.0]
RETURN [x IN range(0,10) WHERE x % 2 = 0] AS result
• Result: [0,2,4,6,8,10]
RETURN [x IN range(0,10) WHERE x % 2 = 0 | x^3] AS result
• Result: [0.0,8.0,64.0,216.0,512.0,1000.0]

Aggregation
• Aggregate functions take multiple input values and calculate an aggregated value from them
  – E.g., avg(), min(), max(), count(), sum(), stdev()
MATCH (me:Person)-->(friend:Person)-->(friend_of_friend:Person)
WHERE me.name = 'A'
RETURN count(DISTINCT friend_of_friend), count(friend_of_friend)

Back to the Train Operation Example
The graph includes the following elements (with their properties):
• Station: S_Name, Height, S_Type
• Serves: Km
• Line: L_Num, Direction, L_Type
• Arrives: A_Time, D_Time, Platform
• Travels
• Train: T_Num, Days
• Gives
• Service: T_Category, Class, Food

Sample Queries
• Which stations are served by line 1-South?
MATCH (line:Line {L_Num:'1',Direction:'South'})-[:Serves]->(station:Station)
RETURN station
• Which lines have stations below sea level?
MATCH (line:Line)-[:Serves]->(station:Station)
WHERE station.Height<0
RETURN DISTINCT line.L_Num,line.Direction

Sample Queries
• Which stations serve multiple lines?
MATCH (line)-[:Serves]->(station)
WITH station,count(line) AS linesCount
WHERE linesCount>1
RETURN station.S_Name
• How can I get from station A to station B with the minimal number of train changes?
MATCH (a:Station {S_Name:'A'}), (b:Station {S_Name:'B'}), p=shortestPath((a)-[:Serves*]-(b))
RETURN nodes(p)

Sample Queries
• What is the highest station?
MATCH (s:Station)
RETURN s
ORDER BY s.Height DESC LIMIT 1
• Which trains serve all stations?
MATCH (s:Station)
WITH collect(s) AS sc
MATCH (t:Train)
WHERE ALL (x IN sc WHERE (t)-[:Arrives]->(x))
RETURN t

How Do I Choose?
• As a rule of thumb (figure; source: http://www.neo4j.org/learn/nosql)

Are RDBMS Dead? (Should I forget everything I learned in this course?)
• Definitely not!!!
1. RDBMS and SQL are the default time-tested database technology
2. See the previous slide. Similarly, C++ or Java might be your default programming language, yet you might opt to use PHP, Ruby/Rails, Perl, Eiffel, Erlang, ML, etc. for various specific tasks
3. RDBMS are making leapfrog improvements in performance due to advances in storage technologies and other optimizations, making them suitable for highly demanding OLAP applications
  • E.g., SAP’s HANA
4. Many modern Internet web sites rely on multiple databases, each of a different kind, for their various aspects

Additional Reading
• Graph Databases by Robinson, Webber, and Eifrem (O’Reilly), free eBook
• http://www.neo4j.org/
• http://www.neo4j.org/learn/cypher
• http://docs.neo4j.org/chunked/milestone/cypher-query-lang.html