Database Systems-9

Database Systems
236363
NoSQL
Source: http://geekandpoke.typepad.com/
“Definition”
• Literally: NoSQL = Not Only SQL
• Practically: Anything that deviates from
standard Relational Database Management
Systems (RDBMS)
Reminder: What is an RDBMS?
• Relational data model
– Structured
– Data represented through a collection of tables/relations
• Relational query language + relational data manipulation
language + relational data definition language
– As a standard, SQL
• Strongly consistent concurrency control
– The notion of transactions
– The ACID model
– Taught in the 234322 course
This is in fact where
most of the hype is
NoSQL = any deviation from RDBMS on any of these axes
What’s Driving this Trend?
1. The relational data model is not perfectly suited for all
applications
– Semi-structured models (and corresponding query
languages) are a better fit for some cases
• We’ve already talked about XML
• Other types of document data models
– In other cases, a graph-oriented data model (and query
language) is a better fit
• For example, for exploring the social graphs of social networks
– In others, a Column-Family or even a schema-less Key-Value semantics is enough
• In the latter, both the key and value can be arbitrary objects
What’s Driving this Trend?
2. Performance, distribution, and Web scale
– Traditional databases cannot keep up with the
performance required by very large scale analytics
applications (OLAP)
– Internet companies need to handle massive
amounts of data and the data needs to be
available all the time with fast response time
• This leads to building large distributed data-centers
• Some choose weak consistency
– BASE rather than ACID – details in more advanced courses
Warning: This is a new and hyped domain, so there is great confusion
over what is what, with no precise agreed-upon definitions
Key-Value
• Interface consists of
– put(key,value) and get(key); sometimes also
scan()
• Typically, the value can be of arbitrary type
• The key can either be a string or an arbitrary type
• The responsibility is delegated entirely to the
application
– Both for semantics enforcement and logic execution
• These systems are performance and availability driven
– Most such systems provide BASE instead of ACID
– Detailed discussion in 234322 (File Systems), 236351
(Distributed Systems), 236620 (Big Data); here, only a
brief discussion
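The put/get/scan interface described above can be sketched as a toy in-memory store. This is only an illustration under stated assumptions: the class name `KeyValueStore`, the half-open range semantics of `scan(start, end)`, and the sample keys are all hypothetical, and no real system's API is implied.

```python
from bisect import bisect_left, insort


class KeyValueStore:
    """Toy in-memory key-value store (illustrative names, not a real API)."""

    def __init__(self):
        self._data = {}   # key -> value; the value can be any Python object
        self._keys = []   # keys kept sorted so that scan() can walk a range

    def put(self, key, value):
        if key not in self._data:
            insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        # Returns None for a missing key; the application decides what that means.
        return self._data.get(key)

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            key = self._keys[i]
            yield key, self._data[key]
            i += 1


store = KeyValueStore()
store.put("user:2", {"name": "Bob"})
store.put("user:1", {"name": "Amy"})
store.put("video:9", b"\x00\x01")

user_keys = [key for key, _ in store.scan("user:", "user;")]
```

Note that the store never inspects the values: any schema, validation, or relationship between entries is the application's responsibility.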
On the Notion of Transactions
• A transaction is a collection of operations that
form a single logical unit
– Reserve a seat on the plane
• Verify that the seat is vacant with the price I was quoted and
then reserve it, which also prevents others from reserving it
– This might also involve charging a credit card
– Order something from an online store
• Verify that the item exists in stock
– Operations in a transactional file system
• Moving a file from one directory to another
– Involves verifying that the file exists, creating a copy in the new
directory, and deleting the old one
– Usually, each SQL query forms a single transaction
In Practice Things Get Complicated
• In practice, computers are often parallel
– Further, many systems these days are distributed, adding another
element of parallelism
• What happens when two people try to reserve the same seat on an
airplane concurrently?
– The common answer: only one should succeed
• What happens if a transaction fails in the middle?
– The seat was already taken – should we charge the credit card?
– The credit company refused the payment – should we hold the seat?
– The receiving file system directory is full – should we remove the file
from the old directory?
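The seat-reservation questions above can be illustrated with a small all-or-nothing sketch using Python's standard `sqlite3` module. The table layout, the `reserve_seat` and `charge_card` functions, and the `PaymentRefused` exception are hypothetical stand-ins; a real reservation system would be considerably more involved.

```python
import sqlite3


class PaymentRefused(Exception):
    """Raised by the (hypothetical) payment step when the card is declined."""


def charge_card(card, amount):
    # Stand-in for a real payment call; the card "refused" always fails.
    if card == "refused":
        raise PaymentRefused(card)


def reserve_seat(conn, seat, passenger, card, price):
    """Reserve the seat and charge the card as one logical unit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute(
                "UPDATE seats SET passenger = ? "
                "WHERE seat = ? AND passenger IS NULL",
                (passenger, seat),
            )
            if cur.rowcount != 1:
                return False  # seat already taken: nothing changed, no charge
            charge_card(card, price)  # failure here undoes the UPDATE above
        return True
    except PaymentRefused:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (seat TEXT PRIMARY KEY, passenger TEXT)")
conn.execute("INSERT INTO seats VALUES ('12A', NULL)")
conn.commit()

first = reserve_seat(conn, "12A", "amy", "refused", 100)   # payment fails
second = reserve_seat(conn, "12A", "amy", "ok", 100)       # succeeds
third = reserve_seat(conn, "12A", "bob", "ok", 100)        # seat is taken
holder = conn.execute(
    "SELECT passenger FROM seats WHERE seat = '12A'"
).fetchone()[0]
```

In this sketch a refused payment leaves the seat vacant, and a taken seat never triggers a charge, which is the atomic behavior the questions above ask for.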
ACID vs. BASE
• Traditional semantics
– Atomicity
– Consistency
– Isolation
– Durability
Hard to implement efficiently,
especially in a distributed system.
The system may need to block
during network interruptions to
avoid violating the strong
consistency requirement
• Key-Value typical semantics
– Basic Availability
– Soft state
– Eventual consistency
When needed, willing to
sacrifice strong consistency
in favor of availability and
performance
More on this, in other courses (File Systems, Distributed Systems, Big Data)
Column-Family
• The data model here provides the abstraction of a table with
multiple column-families
– Each row is identified by a key
– Each row consists of multiple column-families
• Sometimes a column-family is called a collection
– Each column-family consists of one or more columns
• Yet, different rows may include different columns in the same column
family
– Data is typically immutable
• Cells include multiple versions
• The motivation here is again performance and availability
– Decentralized implementations
– Denormalization instead of normalization and joins
– Some systems provide strong consistency and atomicity
guarantees while others do not
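The row / column-family / column / version structure described above can be sketched with nested dictionaries. This is a minimal model under stated assumptions: the class `ColumnFamilyTable`, its method names, and the explicit integer timestamps are hypothetical and do not correspond to any particular system's API.

```python
class ColumnFamilyTable:
    """Toy column-family table: row key -> family -> column -> versions."""

    def __init__(self):
        self._rows = {}

    def put(self, row, family, column, value, ts):
        # Data is treated as immutable: a write appends a new version
        # instead of overwriting the previous one.
        cell = (
            self._rows.setdefault(row, {})
            .setdefault(family, {})
            .setdefault(column, [])
        )
        cell.append((ts, value))

    def versions(self, row, family, column):
        """All (timestamp, value) versions of a cell, possibly empty."""
        return list(self._rows.get(row, {}).get(family, {}).get(column, []))

    def get(self, row, family, column):
        """Value with the highest timestamp, or None if the cell is absent."""
        cell = self.versions(row, family, column)
        return max(cell, key=lambda v: v[0])[1] if cell else None


table = ColumnFamilyTable()
table.put("amy", "profile", "email", "amy@old.example", ts=1)
table.put("amy", "profile", "email", "amy@new.example", ts=2)
table.put("bob", "profile", "phone", "555-0100", ts=1)  # different column set
```

Rows `amy` and `bob` share the `profile` column-family but hold different columns, and the `email` cell keeps both of its versions.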
More on Column-Family
• There are many variants of this model
– The first well-known example of this model is
Google's Bigtable
• Developed for internal use at Google
– A very well known open source implementation is
called Cassandra
• Initially developed by Facebook
• Currently an Apache open source project
• Cassandra Query Language (CQL)
Graph Database
• A property graph contains nodes and directed
edges
– Nodes represent entities
– Edges represent (directed & labeled) relationships
• Both can have properties
– Properties are key-value pairs
• Keys are strings; values are arbitrary data types
• Best suited for highly connected data
– RDBMS is better suited for aggregated data
Graph Database
• Example
– Nodes can be users of a social network and edges
can represent “friend” relationships
– Nodes can represent users and books and edges
represent “purchased” relationships
– Nodes can represent users and restaurants and
edges represent “recommended” relationships
• The edge properties here can be the ranking as well as
the textual review
Queries on a Graph Database
• The basic mechanism is called a Traverse
– It starts from a given node and explores portions
of the graph based on the query
• For example
– Who are the friends of friends of friends of Amy?
– What is the average rating for a given movie given
by users whose friendship-distance from me is at
most 5 hops?
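A traversal of this kind can be sketched as a hop-limited breadth-first search. This is only an illustration: the function `within_hops`, the adjacency-dictionary representation, and the sample friendship data are assumptions, and the sketch reports the shortest friendship distance to each person rather than reproducing Cypher's path-matching semantics exactly.

```python
from collections import deque


def within_hops(friends, start, max_hops):
    """Return {node: distance} for nodes 1..max_hops hops away from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in friends.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[start]
    return dist


# Hypothetical friendship graph (each friendship is listed in both directions).
friends = {
    "Amy": ["Bob", "Cid"],
    "Bob": ["Amy", "Dan"],
    "Cid": ["Amy"],
    "Dan": ["Bob", "Eve"],
    "Eve": ["Dan", "Fay"],
    "Fay": ["Eve"],
}

reachable = within_hops(friends, "Amy", 3)
```

Only the part of the graph within the hop limit is visited, so the work depends on the neighborhood of the start node and not on the total size of the graph.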
Motivation for Graph Databases
• Generality and convenience
– Many things can be naturally modeled as a graph
• Performance
– The cost of joins does not increase with the total size of the data, but
rather depends on the local part of the graph that is traversed by the
query processor
• Extendibility and flexibility
– New node types and new relationship types can be added to an
existing graph
• Agility
– The ability to follow agile programming and design methods
Neo4j and Cypher
• Neo4j is an open source graph database
• Cypher is a widely used graph query
language, implemented in Neo4j
– Simple to learn
Cypher
• The most basic Cypher query includes the
following structure:
– Starting node(s)
• used to limit the search to certain areas of the graph
– Pattern matching expression
• w.r.t. the starting node(s)
– Return expression
• based on variables bound in the pattern matching
Cypher Simple Example
Note: the START clause is optional and will eventually be deprecated
START a=node:user(name='Michael')
MATCH (c)-[:KNOWS]->(b)-[:KNOWS]->(a), (c)-[:KNOWS]->(a)
RETURN b, c
Matching Nodes in Cypher
• (a) : node a
– If a is already bound, we search for this specific node; otherwise, any
node which will then be bound to a
• () : some node
• (:Ntype) : some node of type Ntype
• (a:Ntype) : node a of type Ntype
• (a { prop:'value' } ) : node a that has a property called prop with a
value 'value'
• (a:Ntype { prop:'value' } ) : node a of type Ntype that has a property
called prop with a value 'value'
Matching Relationships in Cypher
• (a)--(b) : nodes a and b are related by a relation
• (a)-->(b) : node a has a relation to b
• (a)<--(b) : node b has a relation to a
• (a)-->() : node a has a relation to some node
• (a)-[r]->(b) : a is related to b by the relation r
• (a)-[:Rtype]->(b) : a is related to b by a relation of type Rtype
• (a)-[:R1|:R2]->(b) : a is related to b by a relation of type R1 or type R2
• (a)-[r:Rtype]->(b) : a is related to b by a relation r of type Rtype
Advanced Matching Relationships
• (a)-->(b)<--(c) and (a)-->(b)-->(c) : multiple relations
• (a)-[:Rtype*2]->(b) : (a) is 2 hops away from (b) over relations of type
Rtype
– If Rtype is not specified, can be any relation (and different relations in each
hop)
– When no number is given, it means any length path
• (a)-[:Rtype*minHops..maxHops]-> (b) : (a) is at least minHops and at most
maxHops away from (b) over relations of type Rtype
– If minHops is not specified, default is 1
– If maxHops is not specified, default is infinity
– Can even be 0!
• (a)-[r*2]->(b) : (a) is 2 hops away from (b) over the sequence of relations r
• (a)-[*{prop:val}]->(b) : we search for paths in which all relations have a
property prop whose value is val
More Advanced Matching Relationships
• Named paths
MATCH p=(a {prop:val} )-->()
RETURN p
• Shortest path
– shortestPath((a)-[*minHops..maxHops]-(b))
• Finds the shortest path of length between minHops and
maxHops between (a) and (b)
– allShortestPaths((a)-[*]-(b))
• Finds all shortest paths between (a) and (b)
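What a shortest-path search computes can be sketched with a plain breadth-first search over an undirected adjacency dictionary. The function `shortest_path` and the sample graph are hypothetical, the sketch ignores the minHops/maxHops bounds, and it is not a description of how Neo4j implements shortestPath.

```python
from collections import deque


def shortest_path(neighbors, a, b):
    """Return one shortest path from a to b as a list of nodes, or None."""
    parent = {a: None}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in neighbors.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # b is not reachable from a


# Hypothetical undirected graph (each edge is listed in both directions).
neighbors = {
    "a": ["x", "y"],
    "x": ["a", "b"],
    "y": ["a", "z"],
    "z": ["y", "b"],
    "b": ["x", "z"],
    "q": [],
}

path = shortest_path(neighbors, "a", "b")
```

Here the two-hop route through `x` is found before the three-hop route through `y` and `z`.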
Augmented Return
• Column alias
MATCH (a { name: "A" })
RETURN a.age AS SomethingTotallyDifferent
• Unique results
MATCH (a { name: "A" })-->(b)
RETURN DISTINCT b
• Other expressions
– Any expression can be used as a return item — literals, predicates,
properties, functions, and everything else
MATCH (a { name: "A" })
RETURN a.age > 30, "I'm a literal",(a)-->()
– The result is the collection of the value True, "I'm a literal", and the result
of evaluating the pattern expression (a)-->()
ORDER BY
• Order results by properties
MATCH (n)
RETURN n
ORDER BY n.age, n.name
• Descending order
MATCH (n)
RETURN n
ORDER BY n.name DESC
• NULL is always ordered last in ascending order (default) and first in
descending order
– Note that missing node/relationship properties are evaluated to null
Where Clauses
• Provide criteria for filtering the results of pattern matching expressions
• Examples:
MATCH (n)
WHERE n.name = 'Peter' XOR n.age < 30
RETURN n
MATCH (n)
WHERE n.name =~ 'Tob.*'
RETURN n
MATCH (tobias { name: 'Tobias' }),(others)
WHERE others.name IN ['Andres', 'Peter'] AND (tobias)<--(others)
RETURN others
Skip and Limit
• LIMIT crops the suffix of the result
• SKIP eliminates the prefix
MATCH (n)
RETURN n
ORDER BY n.name
SKIP 1
LIMIT 2
• This query returns the 2nd and 3rd
elements of the ordered result
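The effect of SKIP and LIMIT on an ordered result corresponds to list slicing. A minimal sketch, where the helper `skip_limit` and the sample names are hypothetical:

```python
def skip_limit(rows, skip=0, limit=None):
    """Drop the first `skip` rows, then keep at most `limit` rows."""
    end = None if limit is None else skip + limit
    return rows[skip:end]


names = sorted(["Dana", "Amy", "Carl", "Bob"])  # plays the role of ORDER BY n.name
page = skip_limit(names, skip=1, limit=2)       # plays the role of SKIP 1 LIMIT 2
```

With the sample data, `page` holds the 2nd and 3rd names of the sorted list.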
With
• Used to manipulate the result sequence before it is passed on to
the following query parts
– One common usage of WITH is to limit the number of entries that are
then passed on to other MATCH clauses
– WITH is also used to separate reading from updating of the graph
• Every part of a query must be either read-only or write-only
• When going from a reading part to a writing part, the switch must be done
with a WITH clause
MATCH (david { name: "David" })--(otherPerson)-->()
WITH otherPerson, count(*) AS foaf
WHERE foaf > 1
RETURN otherPerson
MATCH (n)
WITH n
ORDER BY n.name DESC LIMIT 3
RETURN collect(n.name)
Union
• Combines the results of two or more queries into a single result set
that includes all the rows that belong to any of the queries in the union
– The number and the names of the columns must be identical in all
queries combined by using UNION
• To keep all the result rows, use UNION ALL
• Using just UNION will combine and remove duplicates from the result set
With duplicates:
MATCH (n:Actor)
RETURN n.name AS name
UNION ALL
MATCH (n:Movie)
RETURN n.title AS name
Without duplicates:
MATCH (n:Actor)
RETURN n.name AS name
UNION
MATCH (n:Movie)
RETURN n.title AS name
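The difference between UNION ALL and UNION can be sketched on plain lists of rows. The helper names and the sample data are hypothetical; the sketch keeps the first occurrence of each row in order, which is an assumption made for readability and not a guarantee about result ordering in Cypher.

```python
def union_all(*results):
    """Concatenate the results of all queries, keeping duplicates."""
    return [row for result in results for row in result]


def union(*results):
    """Concatenate and drop duplicate rows (first occurrence is kept)."""
    return list(dict.fromkeys(union_all(*results)))


# Hypothetical single-column results; both columns are aliased to "name".
actor_names = ["Ann Lee", "Hitchcock"]
movie_titles = ["Hitchcock"]

with_duplicates = union_all(actor_names, movie_titles)
without_duplicates = union(actor_names, movie_titles)
```

A value that appears both as an actor name and as a movie title shows up twice with `union_all` and once with `union`.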
CREATE (nodes)
CREATE (n)
• Creates a node n
CREATE (n:Person)
• Creates a node n of label Person
CREATE (n:Person:Swedish)
• Creates a node n with two labels: Person and Swedish
CREATE (n:Person { name : 'Andres', title : 'Developer' })
• Creates a node n of label Person with properties name=‘Andres’ and
title=‘Developer’
CREATE (a { name : 'Andres' })
• Creates a node with a property name='Andres'
CREATE (relationships)
MATCH (a:Person),(b:Person)
WHERE a.name = 'Node A' AND b.name = 'Node B'
CREATE (a)-[r:RELTYPE]->(b)
RETURN r
Creates a labeled relationship
MATCH (a:Person),(b:Person)
WHERE a.name = 'Node A' AND b.name = 'Node B'
CREATE (a)-[r:RELTYPE { name : a.name + '<->' + b.name }]->(b)
RETURN r
Creates a relationship with properties
CREATE p =(andres { name:'Andres' })-[:WORKS_AT]->(neo)<-[:WORKS_AT]-(michael { name:'Michael' })
RETURN p
Creates a full path (nodes + relationships)
CREATE UNIQUE
• Creates only the parts of the graphs that are
missing in a CREATE query
– Left for the interested students to explore on their
own…
Additional Cypher Clauses
• DELETE
– Deletes nodes and relationships
• REMOVE
– Removes labels and properties
• SET
– Updates labels on nodes and properties on nodes and
relationships
• FOREACH
– Performs an updating action on each item in a collection
or a path
MATCH p =(source)-[*]->(destination)
WHERE source.name='A' AND destination.name='D'
FOREACH (n IN nodes(p)| SET n.marked = TRUE )
A Note on Labels in Neo4j
• A node can have multiple labels
• Labels can be viewed as a combination of a tagging
mechanism and an is-a relationship
– They enable choosing nodes based on their label(s)
– In the future, it would enable imposing restrictions on
properties and values
• I.e., act also as a light-weight optional schema
• A label can be assigned upon creation and using the
SET expression
• A label can be removed using the REMOVE expression
Operators
• Mathematical
– +, -, *, /, %, ^
• Comparison
– =, <>, <, >, >=, <=
• Boolean
– AND, OR, XOR, NOT
• String
– Concatenation through +
• Collection
– Concatenation through +
– IN to check if an element exists in a collection
Simple CASE Expression
CASE test
WHEN value THEN result
[WHEN ...]
[ELSE default]
END
Example:
MATCH (n)
RETURN CASE n.eyes
WHEN 'blue'
THEN 1
WHEN 'brown'
THEN 2
ELSE 3 END AS result
In a simple CASE expression, the value of test is
compared against the value of each WHEN clause,
one after the other, until the first one matches.
If none matches, the default is returned if it
exists; otherwise, NULL is returned.
Generic CASE Expression
CASE
WHEN predicate THEN result
[WHEN ...]
[ELSE default]
END
Example
MATCH (n)
RETURN CASE
WHEN n.eyes = 'blue'
THEN 1
WHEN n.age < 40
THEN 2
ELSE 3 END AS result
Here, the predicates are evaluated in order
until the first one that holds. If none holds,
the default value is returned if it exists;
otherwise, NULL.
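The first-match evaluation order of a generic CASE can be sketched as a loop over (predicate, result) pairs. The function `generic_case` and the sample nodes, modeled as plain dictionaries, are hypothetical; a missing property is treated as "predicate does not hold", loosely mirroring how a NULL comparison does not match.

```python
def generic_case(node, branches, default=None):
    """Return the result of the first branch whose predicate holds."""
    for predicate, result in branches:
        if predicate(node):
            return result
    return default  # None plays the role of NULL when no ELSE is given


# Mirrors: CASE WHEN n.eyes = 'blue' THEN 1 WHEN n.age < 40 THEN 2 ELSE 3 END
branches = [
    (lambda n: n.get("eyes") == "blue", 1),
    (lambda n: n.get("age") is not None and n["age"] < 40, 2),
]

young_blue = generic_case({"eyes": "blue", "age": 30}, branches, default=3)
young_brown = generic_case({"eyes": "brown", "age": 30}, branches, default=3)
old_brown = generic_case({"eyes": "brown", "age": 50}, branches, default=3)
no_else = generic_case({"eyes": "brown", "age": 50}, branches)
```

The first sample node satisfies both predicates, yet only the first branch is applied.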
Collections
• A literal collection is created by using brackets and
separating the elements in the collection with commas
RETURN [0,1,2,3,4,5,6,7,8,9] AS collection
– The result is the collection [0,1,2,3,4,5,6,7,8,9]
• Many ways of selecting elements from a collection,
e.g.,
RETURN range(0,10)[3]       // the element at index 3 (3 in this case)
RETURN range(0,10)[-3]      // 3rd from the end (8 here)
RETURN range(0,10)[0..3]    // [0,1,2]
RETURN range(0,10)[0..-5]   // [0,1,2,3,4,5]
RETURN range(0,10)[..4]     // [0,1,2,3]
RETURN range(0,10)[-5..]    // [6,7,8,9,10]
More on Collections
RETURN [x IN range(0,10)| x^3] AS result
• Result: [0.0,1.0,8.0,27.0,64.0,125.0,216.0,343.0,512.0,729.0,1000.0]
RETURN [x IN range(0,10) WHERE x % 2 = 0] AS result
• Result: [0,2,4,6,8,10]
RETURN [x IN range(0,10) WHERE x % 2 = 0 | x^3] AS result
• Result: [0.0,8.0,64.0,216.0,512.0,1000.0]
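For readers who know Python, the collection expressions above map closely onto Python lists. A rough sketch follows; the helper `cypher_range` is hypothetical and only handles a positive step. Two differences are worth noting: Cypher's range() includes its end value, and x^3 yields a floating-point number, which the sketch imitates with `float`.

```python
def cypher_range(start, end, step=1):
    """Like Cypher's range(): the end value is included (positive step only)."""
    return list(range(start, end + 1, step))


r = cypher_range(0, 10)

element = r[3]           # range(0,10)[3]
from_end = r[-3]         # range(0,10)[-3]
first_three = r[0:3]     # range(0,10)[0..3]
drop_last_five = r[0:-5] # range(0,10)[0..-5]
last_five = r[-5:]       # range(0,10)[-5..]

# [x IN range(0,10) WHERE x % 2 = 0 | x^3]
cubes_of_evens = [float(x ** 3) for x in r if x % 2 == 0]
```

The slice bounds behave the same way in both languages: the start index is included and the end index is excluded.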
Aggregation
• Aggregate functions take multiple input values
and calculate an aggregated value from them
– E.g., avg(), min(), max(), count(), sum(), stdev()
MATCH (me:Person)-->(friend:Person)-->(friend_of_friend:Person)
WHERE me.name = 'A'
RETURN count(DISTINCT friend_of_friend), count(friend_of_friend)
Back to the Train Operation Example
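The graph includes the following elements (recovered from the figure; properties in parentheses):
• Nodes: Station (S_Name, Height, S_Type), Line (L_Num, Direction, L_Type),
Train (T_Num, Days), Service (T_Category, Class, Food)
• Relationships: Serves (Km), Arrives (A_Time, D_Time, Platform), Travels, Gives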
Sample Queries
• Which stations are served by line 1-South?
MATCH (line:Line {L_Num:'1',Direction:'South'})-[:Serves]->(station:Station)
RETURN station
• Which lines have stations below sea level?
MATCH (line:Line)-[:Serves]->(station:Station)
WHERE station.Height<0
RETURN DISTINCT line.L_Num,line.Direction
Sample Queries
• Which stations serve multiple lines?
MATCH (line)-[:Serves]->(station)
WITH station,count(line) as linesCount
WHERE linesCount>1
RETURN station.S_Name
• How can I get from station A to station B with the
minimal number of train changes?
MATCH (a:Station {S_Name:'A'}), (b:Station {S_Name:'B'}),
p=shortestPath((a)-[:Serves*]-(b))
RETURN nodes(p)
Sample Queries
• What is the highest station?
MATCH (s:Station) RETURN s ORDER BY s.Height DESC LIMIT 1
• Which trains serve all stations?
MATCH (s:Station)
WITH collect(s) AS sc
MATCH (t:Train)
WHERE ALL (x IN sc WHERE (t)-[:Arrives]->(x))
RETURN t
How Do I Choose?
• As a rule of thumb
Source: http://www.neo4j.org/learn/nosql
Are RDBMS Dead?
(should I forget everything I learned in this course?)
• Definitely not!!!
1. RDBMS and SQL are the default time-tested database
technology
2. See previous slide
Just as C++ or Java might be
your default programming language, yet you
might opt to use PHP, Ruby/Rails, Perl, Eiffel,
Erlang, ML, etc. for various specific tasks
3. RDBMS are making leapfrog improvements in
performance due to advances in storage technologies and
other optimizations, making them suitable for highly
demanding OLAP applications
• E.g., SAP's HANA
4. Many modern Internet web sites rely on multiple
databases, each of a different kind, for their various
aspects
Additional Reading
• Graph Databases by Robinson, Webber, and Eifrem
(O’Reilly) – free eBook
• http://www.neo4j.org/
• http://www.neo4j.org/learn/cypher
• http://docs.neo4j.org/chunked/milestone/cypher-query-lang.html