Hbase, Hive and Pig

NoSQL and Big Data Processing
Hbase, Hive and Pig, etc.
Adapted from slides by Perry Hoekstra,
Jiaheng Lu, Avinash Lakshman, Prashant
Malik, and Jimmy Lin
History of the World, Part 1
• Relational Databases – mainstay of business
• Web-based applications caused spikes
– Especially true for public-facing e-Commerce sites
• Developers began to front the RDBMS with memcached or integrate
other caching mechanisms within the application (e.g., Ehcache)
Scaling Up
• Issues with scaling up when the dataset is just too big
• RDBMS were not designed to be distributed
• Began to look at multi-node database solutions
• Known as ‘scaling out’ or ‘horizontal scaling’
• Different approaches include:
– Master-slave
– Sharding
Scaling RDBMS – Master/Slave
• Master-Slave
– All writes are written to the master. All reads performed against
the replicated slave databases
– Critical reads may be incorrect as writes may not have been
propagated down
– Large data sets can pose problems as master needs to duplicate
data to slaves
Scaling RDBMS - Sharding
• Partition or sharding
– Scales well for both reads and writes
– Not transparent, application needs to be partition-aware
– Can no longer have relationships/joins across partitions
– Loss of referential integrity across shards
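The "partition-aware application" point can be illustrated with a minimal sketch. It assumes hash-based sharding on a single key; the shard names and the `shard_for` and `route` helpers are hypothetical and not taken from the slides.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to database hosts.
SHARDS = ["shard0", "shard1", "shard2", "shard3"]

def shard_for(key: str) -> str:
    """Map a partition key to a shard using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

def route(keys):
    """Group keys by the shard that owns them, as the application must do itself."""
    plan = {}
    for key in keys:
        plan.setdefault(shard_for(key), []).append(key)
    return plan
```

Because each key lives on exactly one shard, a query that touches keys on two shards has to be split and recombined in application code, which is why joins and referential integrity across shards are lost.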
Other ways to scale RDBMS
• Multi-Master replication
• INSERT only, not UPDATES/DELETES
• No JOINs, thereby reducing query time
– This involves de-normalizing data
• In-memory databases
What is NoSQL?
• Stands for Not Only SQL
• Class of non-relational data storage systems
• Usually do not require a fixed table schema nor do they use
the concept of joins
• All NoSQL offerings relax one or more of the ACID properties
(will talk about the CAP theorem)
Why NoSQL?
• For data storage, an RDBMS cannot be the be-all/end-all
• Just as there are different programming languages, need to
have other data storage tools in the toolbox
• A NoSQL solution is more acceptable to a client now than
even a year ago
– Think about proposing a Ruby/Rails or Groovy/Grails solution
now versus a couple of years ago
How did we get here?
• Explosion of social media sites (Facebook, Twitter) with
large data needs
• Rise of cloud-based solutions such as Amazon S3 (Simple
Storage Service)
• Just as moving to dynamically-typed languages
(Ruby/Groovy), a shift to dynamically-typed data with
frequent schema changes
• Open-source community
Dynamo and BigTable
• Three major papers were the seeds of the NoSQL movement
– BigTable (Google)
– Dynamo (Amazon)
• Gossip protocol (discovery and error detection)
• Distributed key-value data store
• Eventual consistency
– CAP Theorem (discuss in a sec ..)
The Perfect Storm
• Large datasets, acceptance of alternatives, and dynamically-typed data have come together in a perfect storm
• Not a backlash/rebellion against RDBMS
• SQL is a rich query language that cannot be rivaled by the
current list of NoSQL offerings
CAP Theorem
• Three properties of a system: consistency, availability and
partition tolerance
• You can have at most two of these three properties for any
shared-data system
• To scale out, you have to partition. That leaves either
consistency or availability to choose from
– In almost all cases, you would choose availability over
consistency
The CAP Theorem
Availability
Consistency
Partition
tolerance
The CAP Theorem
[Diagram: Availability, Consistency, Partition tolerance]
• Consistency: once a writer has written, all
readers will see that write
Consistency
• Two kinds of consistency:
– strong consistency – ACID (Atomicity, Consistency, Isolation,
Durability)
– weak consistency – BASE (Basically Available, Soft state,
Eventual consistency)
ACID Transactions
• A DBMS is expected to support “ACID
transactions,” processes that are:
– Atomic : Either the whole process is done or none
is.
– Consistent : Database constraints are preserved.
– Isolated : It appears to the user as if only one
process executes at a time.
– Durable : Effects of a process do not get lost if the
system crashes.
Atomicity
• A real-world event either happens or does
not happen
– Student either registers or does not register
• Similarly, the system must ensure that either
the corresponding transaction runs to
completion or, if not, it has no effect at all
– Not true of ordinary programs. A crash could
leave files partially updated on recovery
Commit and Abort
• If the transaction successfully completes it
is said to commit
– The system is responsible for ensuring that all
changes to the database have been saved
• If the transaction does not successfully
complete, it is said to abort
– The system is responsible for undoing, or rolling
back, all changes the transaction has made
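Commit and abort can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch, not the slides' own example: the table names, the `register` function, and the simulated failure are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (id TEXT PRIMARY KEY, cur_reg INTEGER)")
conn.execute("CREATE TABLE enrolled (student TEXT, course TEXT)")
conn.execute("INSERT INTO course VALUES ('db101', 0)")
conn.commit()

def register(student, course, fail_midway=False):
    """Register a student: both changes are saved, or neither is."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO enrolled VALUES (?, ?)", (student, course))
            if fail_midway:
                raise RuntimeError("simulated failure between the two updates")
            conn.execute(
                "UPDATE course SET cur_reg = cur_reg + 1 WHERE id = ?", (course,)
            )
        return True   # transaction committed
    except RuntimeError:
        return False  # transaction aborted and rolled back
```

When `fail_midway` is set, the first insert is undone by the rollback, so the student is neither enrolled nor counted.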
Database Consistency
• Enterprise (Business) Rules limit the
occurrence of certain real-world events
– Student cannot register for a course if the current
number of registrants equals the maximum allowed
• Correspondingly, allowable database states
are restricted
cur_reg <= max_reg
• These limitations are called (static) integrity
constraints: assertions that must be satisfied
by all database states (state invariants).
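A static integrity constraint such as cur_reg <= max_reg can be declared so that the database itself rejects violating states. The sketch below uses a `CHECK` constraint in SQLite through Python's `sqlite3` module; the schema and the `try_register` helper are illustrative assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE course ("
    " id TEXT PRIMARY KEY,"
    " cur_reg INTEGER NOT NULL,"
    " max_reg INTEGER NOT NULL,"
    " CHECK (cur_reg <= max_reg))"   # the state invariant from the slide
)
db.execute("INSERT INTO course VALUES ('db101', 1, 2)")
db.commit()

def try_register(course_id):
    """Increment cur_reg; the CHECK constraint refuses to exceed max_reg."""
    try:
        with db:
            db.execute(
                "UPDATE course SET cur_reg = cur_reg + 1 WHERE id = ?", (course_id,)
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The first call fills the last seat; a second call would make cur_reg exceed max_reg and is rejected.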
Database Consistency
(state invariants)
• Other static consistency requirements are
related to the fact that the database might
store the same information in different ways
– cur_reg = |list_of_registered_students|
– Such limitations are also expressed as integrity
constraints
• Database is consistent if all static integrity
constraints are satisfied
Transaction Consistency
• A consistent database state does not necessarily
model the actual state of the enterprise
– A deposit transaction that increments the balance by
the wrong amount maintains the integrity constraint
balance ≥ 0, but does not maintain the relation between
the enterprise and database states
• A consistent transaction maintains database
consistency and the correspondence between the
database state and the enterprise state (implements
its specification)
– Specification of deposit transaction includes
balance = balance + amt_deposit ,
(balance is the next value of balance)
Dynamic Integrity Constraints
(transition invariants)
• Some constraints restrict allowable state
transitions
– A transaction might transform the database
from one consistent state to another, but the
transition might not be permissible
– Example: A letter grade in a course (A, B, C, D,
F) cannot be changed to an incomplete (I)
• Dynamic constraints cannot be checked
by examining the database state
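A small sketch of why a transition invariant needs both the old and the new state. The function name and the grade set are taken from the slide's example, but the code itself is only an illustration.

```python
LETTER_GRADES = {"A", "B", "C", "D", "F"}

def transition_allowed(old_grade, new_grade):
    """Reject a change from a letter grade to an incomplete ("I")."""
    return not (old_grade in LETTER_GRADES and new_grade == "I")
```

A grade of "I" is a perfectly valid database state on its own; only the pair (old, new) reveals whether the change was permissible.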
Transaction Consistency
• Consistent transaction: if DB is in consistent
state initially, when the transaction completes:
– All static integrity constraints are satisfied (but
constraints might be violated in intermediate states)
• Can be checked by examining snapshot of database
– New state satisfies specifications of transaction
• Cannot be checked from database snapshot
– No dynamic constraints have been violated
• Cannot be checked from database snapshot
Isolation
• Serial Execution: transactions execute in sequence
– Each one starts after the previous one completes.
• Execution of one transaction is not affected by the
operations of another since they do not overlap in time
– The execution of each transaction is isolated from
all others.
• If the initial database state and all transactions are
consistent, then the final database state will be
consistent and will accurately reflect the real-world
state, but
• Serial execution is inadequate from a performance
perspective
Isolation
• Concurrent execution offers performance benefits:
– A computer system has multiple resources capable of
executing independently (e.g., CPUs, I/O devices), but
– A transaction typically uses only one resource at a time
– Hence, only concurrently executing transactions can
make effective use of the system
– Concurrently executing transactions yield interleaved
schedules
Concurrent Execution
[Figure: transactions T1 and T2 each run from "begin trans" to "commit", doing local computation on local variables and emitting a sequence of db operations (T1: op1,1 op1,2; T2: op2,1 op2,2). The DBMS receives an interleaved sequence of db operations, for example op1,1 op2,1 op2,2 op1,2.]
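To see how many interleaved schedules two short transactions can produce, the following sketch enumerates every merge of two operation sequences that preserves each transaction's own order. The function name `interleavings` is an assumption for illustration.

```python
def interleavings(t1, t2):
    """Yield every schedule that keeps the internal order of t1 and of t2."""
    if not t1:
        yield list(t2)
        return
    if not t2:
        yield list(t1)
        return
    # Either the next operation comes from t1 ...
    for rest in interleavings(t1[1:], t2):
        yield [t1[0]] + rest
    # ... or it comes from t2.
    for rest in interleavings(t1, t2[1:]):
        yield [t2[0]] + rest

T1 = ["op1,1", "op1,2"]
T2 = ["op2,1", "op2,2"]
schedules = list(interleavings(T1, T2))
```

Two transactions of two operations each already give six possible schedules, two of which are the serial ones; the DBMS must make every schedule it admits behave like one of the serial ones.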
Durability
• The system must ensure that once a transaction
commits, its effect on the database state is not
lost in spite of subsequent failures
– Not true of ordinary programs. A media failure after a
program successfully terminates could cause the file
system to be restored to a state that preceded the
program’s execution
Implementing Durability
• Database stored redundantly on mass storage
devices to protect against media failure
• Architecture of mass storage devices affects
type of media failures that can be tolerated
• Related to Availability: extent to which a
(possibly distributed) system can provide
service despite failure
• Non-stop DBMS (mirrored disks)
• Recovery based DBMS (log)
Consistency Model
• A consistency model determines rules for visibility and apparent
order of updates.
• For example:
– Row X is replicated on nodes M and N
– Client A writes row X to node N
– Some period of time t elapses.
– Client B reads row X from node M
– Does client B see the write from client A?
– Consistency is a continuum with tradeoffs
– For NoSQL, the answer would be: maybe
– CAP Theorem states: Strict Consistency can't be achieved at the
same time as availability and partition-tolerance.
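The M/N example above can be played through with a toy model. This is a sketch under simple assumptions: two in-memory replicas, and an explicit `propagate` step standing in for asynchronous replication. All class and method names are hypothetical.

```python
class ToyCluster:
    """Two replicas, M and N, with replication that happens only on propagate()."""

    def __init__(self):
        self.nodes = {"M": {}, "N": {}}
        self.pending = []  # (target node, key, value) not yet delivered

    def write(self, node, key, value):
        self.nodes[node][key] = value
        for other in self.nodes:
            if other != node:
                self.pending.append((other, key, value))

    def propagate(self):
        """Deliver all outstanding updates, as happens 'eventually'."""
        for target, key, value in self.pending:
            self.nodes[target][key] = value
        self.pending.clear()

    def read(self, node, key):
        return self.nodes[node].get(key)
```

Whether client B sees client A's write depends on whether `propagate` has run before the read, which is exactly the "maybe" in the slide.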
Eventual Consistency
• When no updates occur for a long period of time,
eventually all updates will propagate through the
system and all the nodes will be consistent
• For a given accepted update and a given node,
eventually either the update reaches the node or the
node is removed from service
• Known as BASE (Basically Available, Soft state,
Eventual consistency), as opposed to ACID
The CAP Theorem
[Diagram: Availability, Consistency, Partition tolerance]
• Availability: the system is available during
software and hardware upgrades and node failures
Availability
• Traditionally thought of as the server/process being available
"five 9's" of the time (99.999%).
• However, for a system with a large number of nodes, at almost any
point in time there's a good chance that a node is down or
there is a network disruption among the nodes.
– Want a system that is resilient in the face of network disruption
The CAP Theorem
[Diagram: Availability, Consistency, Partition tolerance]
• Partition tolerance: a system can continue to
operate in the presence of network partitions
The CAP Theorem
[Diagram: Availability, Consistency, Partition tolerance]
• Theorem: You can have at most two of these
properties for any shared-data system
What kinds of NoSQL
• NoSQL solutions fall into two major areas:
– Key/Value or ‘the big hash table’.
• Amazon S3 (Dynamo)
• Voldemort
• Scalaris
• Memcached (in-memory key/value store)
• Redis
– Schema-less which comes in multiple flavors, column-based,
document-based or graph-based.
• Cassandra (column-based)
• CouchDB (document-based)
• MongoDB (document-based)
• Neo4J (graph-based)
• HBase (column-based)
Key/Value
Pros:
– very fast
– very scalable
– simple model
– able to distribute horizontally
Cons:
– many data structures (objects) can't be easily modeled as key
value pairs
Schema-Less
Pros:
- Schema-less data model is richer than key/value pairs
- eventual consistency
- many are distributed
- still provide excellent performance and scalability
Cons:
- typically no ACID transactions or joins
Common Advantages
• Cheap, easy to implement (open source)
• Data are replicated to multiple nodes (therefore
identical and fault-tolerant) and can be
partitioned
– Down nodes easily replaced
– No single point of failure
• Easy to distribute
• Don't require a schema
• Can scale up and down
• Relax the data consistency requirement (CAP)
What am I giving up?
• joins
• group by
• order by
• ACID transactions
• SQL as a sometimes frustrating but still powerful query
language
• easy integration with other applications that support SQL
Big Table and Hbase
(C+P)
Data Model
• A table in Bigtable is a sparse, distributed,
persistent multidimensional sorted map
• Map indexed by a row key, column key, and a
timestamp
– (row:string, column:string, time:int64) →
uninterpreted byte array
• Supports lookups, inserts, deletes
– Single row transactions only
Image Source: Chang et al., OSDI 2006
Rows and Columns
• Rows maintained in sorted lexicographic order
– Applications can exploit this property for efficient
row scans
– Row ranges dynamically partitioned into tablets
• Columns grouped into column families
– Column key = family:qualifier
– Column families provide locality hints
– Unbounded number of columns
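The data model can be mimicked with a small in-memory sorted map. This is a toy sketch, not Bigtable's implementation: the `ToyTable` class, its method names, and the example row keys are assumptions. Timestamps are stored negated so that the newest version of a cell sorts first.

```python
import bisect

class ToyTable:
    """Sorted map from (row, column, timestamp) to an uninterpreted byte string."""

    def __init__(self):
        self._keys = []   # sorted list of (row, "family:qualifier", -timestamp)
        self._cells = {}

    def put(self, row, column, timestamp, value: bytes):
        key = (row, column, -timestamp)
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column):
        """Return the most recent value for (row, column), or None."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._cells[self._keys[i]]
        return None

    def scan(self, start_row, end_row):
        """Yield cells with start_row <= row < end_row in lexicographic order."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._cells[self._keys[i]]
            i += 1
```

Because rows are kept in sorted order, a scan over a key prefix touches one contiguous range, which is the property applications exploit for efficient row scans.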
Bigtable Building Blocks
• GFS
• Chubby
• SSTable
SSTable
• Basic building block of Bigtable
• Persistent, ordered immutable map from keys to values
• Sequence of blocks on disk plus an index for block lookup
• Stored in GFS
• Can be completely mapped into memory
• Supported operations:
– Look up value associated with key
– Iterate key/value pairs within a key range
[Figure: an SSTable drawn as a sequence of 64K blocks plus an index]
Source: Graphic from slides by Erik Paulson
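The two supported operations and the role of the block index can be sketched as follows. This is a toy model under stated assumptions: blocks hold a fixed number of entries instead of 64K of bytes, everything lives in memory, and the `ToySSTable` name and methods are hypothetical.

```python
import bisect

class ToySSTable:
    """Immutable sorted key/value map split into blocks, with a first-key index."""

    def __init__(self, items, block_size=2):
        pairs = sorted(items.items())
        self.blocks = [
            pairs[i:i + block_size] for i in range(0, len(pairs), block_size)
        ]
        self.index = [block[0][0] for block in self.blocks]  # first key per block

    def lookup(self, key):
        """Use the index to pick one block, then search only that block."""
        b = bisect.bisect_right(self.index, key) - 1
        if b < 0:
            return None
        for k, v in self.blocks[b]:
            if k == key:
                return v
        return None

    def scan_range(self, start, end):
        """Iterate key/value pairs with start <= key < end."""
        b = max(bisect.bisect_right(self.index, start) - 1, 0)
        for block in self.blocks[b:]:
            for k, v in block:
                if k >= end:
                    return
                if k >= start:
                    yield k, v
```

Only the index needs to be consulted to decide which block to read, so a lookup costs at most one block read once the index is in memory.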
Tablet
• Dynamically partitioned range of rows
• Built from multiple SSTables
[Figure: a tablet covering rows from Start:aardvark to End:apple, built from two SSTables, each a sequence of 64K blocks plus an index]
Source: Graphic from slides by Erik Paulson
Table
• Multiple tablets make up the table
• SSTables can be shared
[Figure: two tablets, one covering aardvark to apple and one covering apple_two_E to boat, each built from SSTables, illustrating that tablets can share SSTables]
Source: Graphic from slides by Erik Paulson
Architecture
• Client library
• Single master server
• Tablet servers
Bigtable Master
• Assigns tablets to tablet servers
• Detects addition and expiration of tablet
servers
• Balances tablet server load
• Handles garbage collection
• Handles schema changes
Bigtable Tablet Servers
• Each tablet server manages a set of tablets
– Typically between ten and a thousand tablets
– Each 100-200 MB by default
• Handles read and write requests to the tablets
• Splits tablets that have grown too large
Tablet Location
Upon discovery, clients cache tablet locations
Image Source: Chang et al., OSDI 2006
Tablet Assignment
• Master keeps track of:
– Set of live tablet servers
– Assignment of tablets to tablet servers
– Unassigned tablets
• Each tablet is assigned to one tablet server at a time
– Tablet server maintains an exclusive lock on a file in
Chubby
– Master monitors tablet servers and handles assignment
• Changes to tablet structure
– Table creation/deletion (master initiated)
– Tablet merging (master initiated)
– Tablet splitting (tablet server initiated)
Tablet Serving
“Log Structured Merge Trees”
Image Source: Chang et al., OSDI 2006
Compactions
• Minor compaction
– Converts the memtable into an SSTable
– Reduces memory usage and log traffic on restart
• Merging compaction
– Reads the contents of a few SSTables and the
memtable, and writes out a new SSTable
– Reduces number of SSTables
• Major compaction
– Merging compaction that results in only one SSTable
– No deletion records, only live data
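The three kinds of compaction can be sketched with plain Python data structures. This is an illustration only: SSTables are modeled as sorted lists of pairs, the memtable as a dict, and the `TOMBSTONE` marker and function names are assumptions.

```python
TOMBSTONE = object()  # stands in for a deletion record

def minor_compaction(memtable):
    """Freeze the in-memory memtable into an immutable sorted SSTable."""
    return sorted(memtable.items(), key=lambda kv: kv[0])

def merging_compaction(sstables, drop_tombstones=False):
    """Merge SSTables (given oldest first) into one; newer entries win."""
    merged = {}
    for table in sstables:
        merged.update(table)
    if drop_tombstones:
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return sorted(merged.items(), key=lambda kv: kv[0])

def major_compaction(sstables):
    """Merging compaction over all SSTables that keeps only live data."""
    return merging_compaction(sstables, drop_tombstones=True)
```

A merging compaction must keep deletion records because an older SSTable outside the merge might still hold the deleted key; a major compaction sees every SSTable, so it can drop them.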
Bigtable Applications
• Data source and data sink for MapReduce
• Google’s web crawl
• Google Earth
• Google Analytics
Lessons Learned
• Fault tolerance is hard
• Don’t add functionality before understanding
its use
– Single-row transactions appear to be sufficient
• Keep it simple!
HBase is an open-source,
distributed, column-oriented
database built on top of HDFS
based on BigTable!
HBase is ..
• A distributed data store that can scale horizontally to
1,000s of commodity servers and petabytes of
indexed storage.
• Designed to operate on top of the Hadoop
distributed file system (HDFS) or Kosmos File System
(KFS, aka Cloudstore) for scalability, fault tolerance,
and high availability.
Benefits
• Distributed storage
• Table-like in data structure
– multi-dimensional map
• High scalability
• High availability
• High performance
Backdrop
• Started by Chad Walters and Jim
• 2006.11
– Google releases paper on BigTable
• 2007.2
– Initial HBase prototype created as Hadoop contrib.
• 2007.10
– First useable HBase
• 2008.1
– Hadoop becomes an Apache top-level project and HBase becomes
a subproject
• 2008.10~
– HBase 0.18, 0.19 released
HBase Is Not …
• Tables have one primary index, the row key.
• No join operators.
• Scans and queries can select a subset of available
columns, perhaps by using a wildcard.
• There are three types of lookups:
– Fast lookup using row key and optional timestamp.
– Full table scan
– Range scan from region start to end.
HBase Is Not …(2)
• Limited atomicity and transaction support.
– HBase supports multiple batched mutations of
single rows only.
– Data is unstructured and untyped.
• Not accessed or manipulated via SQL.
– Programmatic access via Java, REST, or Thrift APIs.
– Scripting via JRuby.
Why Bigtable?
• Performance of RDBMS system is good for
transaction processing but for very large scale
analytic processing, the solutions are
commercial, expensive, and specialized.
• Very large scale analytic processing
– Big queries – typically range or table scans.
– Big databases (100s of TB)
Why Bigtable? (2)
• Map reduce on Bigtable with optionally
Cascading on top to support some relational
algebras may be a cost effective solution.
• Sharding is not a solution to scale open source
RDBMS platforms
– Application specific
– Labor intensive (re)partitioning
Why HBase ?
• HBase is a Bigtable clone.
• It is open source
• It has a good community and promise for the
future
• It is developed on top of and has good
integration for the Hadoop platform, if you are
using Hadoop already.
• It has a Cascading connector.
HBase benefits over RDBMS
• No real indexes
• Automatic partitioning
• Scale linearly and automatically with new
nodes
• Commodity hardware
• Fault tolerance
• Batch processing
Data Model
• Tables are sorted by row
• Table schema only defines its column families
– Each family consists of any number of columns
– Each column consists of any number of versions
– Columns only exist when inserted, NULLs are free.
– Columns within a family are sorted and stored together
• Everything except table names is byte[]
• (Row, Family:Column, Timestamp) → Value
[Figure: a cell is addressed by row key, column family, and timestamp, and holds a value]
Members
• Master
– Responsible for monitoring region servers
– Load balancing for regions
– Redirect client to correct region servers
– The current SPOF
• Region servers (slaves)
– Serve client requests (write/read/scan)
– Send heartbeats to the master
– Throughput and the number of regions scale with the number of
region servers
Architecture
ZooKeeper
• HBase depends on ZooKeeper and by default it manages a
ZooKeeper instance as the authority on cluster state
Operation
• The -ROOT- table holds the list of .META. table regions
• The .META. table holds the list of all user-space regions.
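The two-level lookup through -ROOT- and .META. can be sketched as two range searches. This is a deliberately simplified model: real catalog rows carry table names and region identifiers, while here each level is just a sorted list of (start key, location) pairs, and all names below are hypothetical.

```python
import bisect

def find_region(regions, row_key):
    """regions: list of (start_key, location) sorted by start_key, first start is ""."""
    starts = [start for start, _ in regions]
    return regions[bisect.bisect_right(starts, row_key) - 1][1]

# Hypothetical catalog contents for illustration.
ROOT = [("", "meta-region-1"), ("m", "meta-region-2")]
META = {
    "meta-region-1": [("", "regionserver-a"), ("g", "regionserver-b")],
    "meta-region-2": [("m", "regionserver-c"), ("t", "regionserver-a")],
}

def locate(row_key):
    """Step 1: -ROOT- gives the .META. region; step 2: .META. gives the server."""
    meta_region = find_region(ROOT, row_key)
    return find_region(META[meta_region], row_key)
```

A client would normally cache the result so that later requests for nearby keys skip both lookups.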
Installation (1)
START Hadoop…
$ wget http://ftp.twaren.net/Unix/Web/apache/hadoop/hbase/hbase-0.20.2/hbase-0.20.2.tar.gz
$ sudo tar -zxvf hbase-*.tar.gz -C /opt/
$ sudo ln -sf /opt/hbase-0.20.2 /opt/hbase
$ sudo chown -R $USER:$USER /opt/hbase
$ sudo mkdir /var/hadoop/
$ sudo chmod 777 /var/hadoop
Setup (1)
$ vim /opt/hbase/conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_CONF_DIR=/opt/hadoop/conf
export HBASE_HOME=/opt/hbase
export HBASE_LOG_DIR=/var/hadoop/hbase-logs
export HBASE_PID_DIR=/var/hadoop/hbase-pids
export HBASE_MANAGES_ZK=true
export HBASE_CLASSPATH=$HBASE_CLASSPATH:/opt/hadoop/conf
$ cd /opt/hbase/conf
$ cp /opt/hadoop/conf/core-site.xml ./
$ cp /opt/hadoop/conf/hdfs-site.xml ./
$ cp /opt/hadoop/conf/mapred-site.xml ./
<configuration>
<property>
<name> name </name>
<value> value </value>
</property>
</configuration>
Setup (2)
Name                                   Value
hbase.rootdir                          hdfs://secuse.nchc.org.tw:9000/hbase
hbase.tmp.dir                          /var/hadoop/hbase-${user.name}
hbase.cluster.distributed              true
hbase.zookeeper.property.clientPort    2222
hbase.zookeeper.quorum                 Host1, Host2
hbase.zookeeper.property.dataDir       /var/hadoop/hbase-data
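The name/value pairs above fit the `<configuration>`/`<property>` template shown under Setup (1). As a sketch, the XML can be generated with Python's standard library; the `build_site_xml` helper is an assumption, and writing the result to `conf/hbase-site.xml` is left to the reader.

```python
import xml.etree.ElementTree as ET

# Values copied from the table above; Host1 and Host2 are placeholders.
SETTINGS = {
    "hbase.rootdir": "hdfs://secuse.nchc.org.tw:9000/hbase",
    "hbase.tmp.dir": "/var/hadoop/hbase-${user.name}",
    "hbase.cluster.distributed": "true",
    "hbase.zookeeper.property.clientPort": "2222",
    "hbase.zookeeper.quorum": "Host1, Host2",
    "hbase.zookeeper.property.dataDir": "/var/hadoop/hbase-data",
}

def build_site_xml(settings):
    """Render settings as <configuration><property><name/><value/>...</configuration>."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

xml_text = build_site_xml(SETTINGS)
```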
Startup & Stop
$ start-hbase.sh
$ stop-hbase.sh
Testing (4)
$ hbase shell
> create 'test', 'data'
0 row(s) in 4.3066 seconds
> list
test
1 row(s) in 0.1485 seconds
> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in 0.0090 seconds
> scan 'test'
ROW COLUMN+CELL
row1 column=data:1, timestamp=1240148026198, value=value1
row2 column=data:2, timestamp=1240148040035, value=value2
row3 column=data:3, timestamp=1240148047497, value=value3
3 row(s) in 0.0825 seconds
> disable 'test'
09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test
0 row(s) in 6.0426 seconds
> drop 'test'
09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test
0 row(s) in 0.0210 seconds
> list
0 row(s) in 2.0645 seconds
Connecting to HBase
• Java client
– get(byte [] row, byte [] column, long timestamp, int
versions);
• Non-Java clients
– Thrift server hosting HBase client instance
• Sample ruby, c++, & java (via thrift) clients
– REST server hosts HBase client
• TableInput/OutputFormat for MapReduce
– HBase as MR source or sink
• HBase Shell
– JRuby IRB with “DSL” to add get, scan, and admin
– ./bin/hbase shell YOUR_SCRIPT
Thrift
$ hbase-daemon.sh start thrift
$ hbase-daemon.sh stop thrift
• Thrift is a software framework for scalable cross-language services
development.
• Developed by Facebook
• Works seamlessly between C++, Java, Python, PHP, and Ruby.
• The start command launches the server instance, by default on port 9090
• A similar alternative is the REST server
References
• Introduction to Hbase
trac.nchc.org.tw/cloud/rawattachment/wiki/.../hbase_intro.ppt
ACID
Atomic: Either the whole process of a transaction is
done or none is.
Consistency: Database constraints (application-specific) are preserved.
Isolation: It appears to the user as if only one process
executes at a time. (Two concurrent transactions will
not see one another’s changes while “in flight”.)
Durability: The updates made to the database in a
committed transaction will be visible to future
transactions. (Effects of a process do not get lost if
the system crashes.)
CAP Theorem
Consistency: Every node in the system contains the
same data (e.g. replicas are never out of date)
Availability: Every request to a non-failing node in
the system returns a response
Partition Tolerance: System properties
(consistency and/or availability) hold even when the
system is partitioned (communication lost) and data is
lost (node lost)
Cassandra
Structured Storage System over a P2P Network
Why Cassandra?
• Lots of data
– Copies of messages, reverse indices of messages,
per user data.
• Many incoming requests resulting in a lot of
random reads and random writes.
• No existing production ready solutions in the
market meet these requirements.
Design Goals
• High availability
• Eventual consistency
– trade-off strong consistency in favor of high availability
• Incremental scalability
• Optimistic Replication
• “Knobs” to tune tradeoffs between consistency,
durability and latency
• Low total cost of ownership
• Minimal administration
innovation at scale
• google bigtable (2006)
– consistency model: strong
– data model: sparse map
– clones: hbase, hypertable
• amazon dynamo (2007)
– O(1) dht
– consistency model: client tune-able
– clones: riak, voldemort
cassandra ~= bigtable + dynamo
proven
• Facebook stores 150TB of data on 150 nodes
web 2.0
• used at Twitter, Rackspace, Mahalo, Reddit,
Cloudkick, Cisco, Digg, SimpleGeo, Ooyala, OpenX,
others
Data Model
[Figure: a row identified by KEY spans three column families.
ColumnFamily1 (Name: MailList, Type: Simple, Sort: Name) holds columns tid1 to tid4, each with Value: <Binary> and TimeStamp: t1 to t4.
ColumnFamily2 (Name: WordList, Type: Super, Sort: Time) holds super columns aloha (columns C1 to C4, values V1 to V4, timestamps T1 to T4) and dude (columns C2 and C6, values V2 and V6, timestamps T2 and T6).
ColumnFamily3 (Name: System, Type: Super, Sort: Name) holds super columns hint1 to hint4, each with a <Column List>.]
• Column families are declared upfront
• SuperColumns are added and modified dynamically
• Columns are added and modified dynamically
Write Operations
• A client issues a write request to a random
node in the Cassandra cluster.
• The “Partitioner” determines the nodes
responsible for the data.
• Locally, write operations are logged and then
applied to an in-memory version.
• Commit log is stored on a dedicated disk local
to the machine.
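The local part of the write path (log first, then apply in memory) can be sketched as below. This is a toy model, not Cassandra's code: the commit log is a Python list standing in for an append-only file, and the `ToyNode` class and its methods are hypothetical.

```python
class ToyNode:
    """One node's local write path: append to the commit log, then update a memtable."""

    def __init__(self):
        self.commit_log = []   # stands in for the append-only log on a dedicated disk
        self.memtables = {}    # column family -> {key: {column: value}}

    def write(self, key, column_family, column, value):
        # 1. Log the mutation first so that it survives a crash.
        self.commit_log.append((key, column_family, column, value))
        # 2. Apply it to the in-memory version for that column family.
        rows = self.memtables.setdefault(column_family, {})
        rows.setdefault(key, {})[column] = value

    def recover(self):
        """Rebuild the memtables by replaying the commit log, as after a restart."""
        log, self.commit_log, self.memtables = self.commit_log, [], {}
        for entry in log:
            self.write(*entry)
```

Logging before applying is what lets the node lose its memory contents without losing acknowledged writes.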
Write cont’d
[Figure: a write for a key with column families (CF1, CF2, CF3) is binary serialized and appended to the commit log on a dedicated disk, and applied to one memtable per column family. Memtables are tracked by data size, number of objects, and lifetime, and are written out to a data file on disk. Each data file entry has the layout <Key name><Size of key Data><Index of columns/supercolumns><Serialized column family>. Alongside the data file are a block index of <Key Name> Offset pairs (K128 Offset, K256 Offset, K384 Offset), which is kept in memory, and a Bloom filter.]
Compactions
[Figure: three sorted data files (K1, K2, K3; K2, K10, K30; K4, K5, K10), one containing an entry marked DELETED, are combined by MERGE SORT into a single sorted data file (K1, K2, K3, K4, K5, K10, K30). The output also has an index file loaded in memory (K1 Offset, K5 Offset, K30 Offset) and a Bloom filter.]
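The merge step can be sketched with `heapq.merge`, which performs a k-way merge of already sorted inputs. The assumptions here are loud ones: keys are integers, files are lists of pairs given oldest first, `DELETED` is a stand-in tombstone, and the sparse index records every `index_every`-th key rather than reproducing the figure's exact offsets.

```python
import heapq

DELETED = None  # hypothetical tombstone value

def compact(files, index_every=3):
    """Merge sorted files (oldest first) into one sorted file plus a sparse index."""
    # Tag entries with the negated file age so the newest version of a key sorts first.
    tagged = [
        [(key, -age, value) for key, value in f] for age, f in enumerate(files)
    ]
    data, index = [], []
    last_key = object()
    for key, _, value in heapq.merge(*tagged, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue              # older duplicate of a key already handled
        last_key = key
        if value is DELETED:
            continue              # drop deleted entries from the output
        if len(data) % index_every == 0:
            index.append((key, len(data)))   # (key, position in data file)
        data.append((key, value))
    return data, index
```

Since every input is already sorted, the merge reads each file sequentially, which matches the sequential-disk-access property of the write path.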
Write Properties
• No locks in the critical path
• Sequential disk access
• Behaves like a write back Cache
• Append support without read ahead
• Atomicity guarantee for a key
• “Always Writable”
– accept writes during failure scenarios
Read
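[Figure: the client sends a query to the Cassandra cluster. The closest replica returns the result, while the other replicas (Replica A, Replica B, Replica C are shown) answer digest queries with digest responses. Read repair is performed if the digests differ, and the result is returned to the client.]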
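A rough sketch of the digest comparison and read repair shown in the figure. It simplifies in ways that should be kept in mind: replicas are plain dicts of key to (timestamp, value), the newest timestamp wins, and the repair happens synchronously before returning. The function names are hypothetical.

```python
import hashlib

def digest(version):
    """Short fingerprint of a (timestamp, value) pair, or of a missing entry."""
    return hashlib.md5(repr(version).encode("utf-8")).hexdigest()

def read_with_repair(replicas, key):
    """replicas[0] is the closest replica; the others only contribute digests."""
    closest = replicas[0].get(key)
    other_digests = [digest(r.get(key)) for r in replicas[1:]]
    if all(d == digest(closest) for d in other_digests):
        return closest[1] if closest else None
    # Digests differ: compare full versions, keep the newest, repair stale replicas.
    versions = [r[key] for r in replicas if key in r]
    newest = max(versions, key=lambda v: v[0])
    for r in replicas:
        r[key] = newest
    return newest[1]
```

Sending digests instead of full values keeps the common case cheap, since full data only has to move when replicas disagree.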
Partitioning And Replication
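[Figure: a consistent-hashing ring over the key space from 0 to 1 (with 1/2 marked) holding nodes A to F. Keys are placed on the ring at h(key1) and h(key2), and with N=3 each key is stored on three nodes.]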
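The ring in the figure can be approximated with a few lines of code. This is a sketch under assumptions: MD5 is used only as a convenient stable hash, each node gets a single position on the ring, and replicas are the next N distinct nodes clockwise from the key's position. The `Ring` class is hypothetical.

```python
import bisect
import hashlib

def h(name):
    """Hash a string to a position on the ring in the interval [0, 1]."""
    n = int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:8], "big")
    return n / 2**64

class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        self.tokens = sorted((h(node), node) for node in nodes)

    def replicas_for(self, key):
        """Walk clockwise from h(key) and take the next N nodes."""
        positions = [pos for pos, _ in self.tokens]
        start = bisect.bisect_left(positions, h(key)) % len(self.tokens)
        count = min(self.n, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]
```

Adding or removing one node only changes ownership for keys adjacent to it on the ring, which is what makes incremental scaling practical.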
Cluster Membership and Failure Detection
• Gossip protocol is used for cluster membership.
• Super lightweight with mathematically provable properties.
• State disseminated in O(logN) rounds where N is the number of nodes in
the cluster.
• Every T seconds each member increments its heartbeat counter and
selects one other member to send its list to.
• A member merges the list with its own list .
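The heartbeat and merge steps can be sketched as follows. This is a simplified illustration: each member's view is a dict of member to heartbeat counter, merging keeps the larger counter, and `gossip_round` picks peers with a caller-supplied random generator. Both function names are assumptions.

```python
import random

def merge_views(mine, received):
    """Merge a received membership list into our own, keeping higher heartbeats."""
    merged = dict(mine)
    for member, heartbeat in received.items():
        if heartbeat > merged.get(member, -1):
            merged[member] = heartbeat
    return merged

def gossip_round(views, rng):
    """Every member bumps its own heartbeat and sends its list to one other member."""
    for member in list(views):
        views[member][member] = views[member].get(member, 0) + 1
        peer = rng.choice([m for m in views if m != member])
        views[peer] = merge_views(views[peer], views[member])
```

Repeating `gossip_round` spreads each member's latest heartbeat through the cluster without any central coordinator.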
Accrual Failure Detector
• Valuable for system management, replication, load balancing etc.
• Defined as a failure detector that outputs a value, PHI, associated with
each process.
• Also known as Adaptive Failure detectors - designed to adapt to changing
network conditions.
• The value output, PHI, represents a suspicion level.
• Applications set an appropriate threshold, trigger suspicions and perform
appropriate actions.
• In Cassandra the average time taken to detect a failure is 10-15 seconds
with the PHI threshold set at 5.
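One way to compute a suspicion level is sketched below. It rests on an assumption that is not stated in the slides: heartbeat inter-arrival times are modeled as exponentially distributed, so PHI, the negative base-10 logarithm of the probability that a heartbeat is merely late, reduces to elapsed / (mean × ln 10). The `PhiAccrualDetector` class is hypothetical.

```python
import math

class PhiAccrualDetector:
    """Outputs a suspicion level (phi) instead of a yes/no failure verdict."""

    def __init__(self, threshold=5.0):
        self.threshold = threshold
        self.intervals = []        # observed gaps between heartbeats
        self.last_arrival = None

    def heartbeat(self, now):
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_arrival
        # -log10(exp(-elapsed / mean)) under the exponential assumption
        return elapsed / (mean * math.log(10))

    def suspected(self, now):
        return self.phi(now) > self.threshold
```

Because the mean interval is learned from observed heartbeats, the same threshold adapts to slow and fast networks.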
Information Flow in the Implementation
Performance Benchmark
• Loading of data - limited by network
bandwidth.
• Read performance for Inbox Search in
production:
           Search Interactions   Term Search
Min        7.69 ms               7.78 ms
Median     15.69 ms              18.27 ms
Average    26.13 ms              44.41 ms
MySQL Comparison
• MySQL > 50 GB Data
Writes Average : ~300 ms
Reads Average : ~350 ms
• Cassandra > 50 GB Data
Writes Average : 0.12 ms
Reads Average : 15 ms
Lessons Learnt
• Add fancy features only when absolutely
required.
• Many types of failures are possible.
• Big systems need proper systems-level
monitoring.
• Value simple designs
Future work
• Atomicity guarantees across multiple keys
• Analysis support via Map/Reduce
• Distributed transactions
• Compression support
• Granular security via ACLs
Hive and Pig
Need for High-Level Languages
• Hadoop is great for large-data processing!
– But writing Java programs for everything is
verbose and slow
– Not everyone wants to (or can) write Java code
• Solution: develop higher-level data processing
languages
– Hive: HQL is like SQL
– Pig: Pig Latin is a bit like Perl
Hive and Pig
• Hive: data warehousing application in Hadoop
– Query language is HQL, variant of SQL
– Tables stored on HDFS as flat files
– Developed by Facebook, now open source
• Pig: large-scale data processing system
– Scripts are written in Pig Latin, a dataflow language
– Developed by Yahoo!, now open source
– Roughly 1/3 of all Yahoo! internal jobs
• Common idea:
– Provide higher-level language to facilitate large-data
processing
– Higher-level language “compiles down” to Hadoop jobs
Hive: Background
• Started at Facebook
• Data was collected by nightly cron jobs into
Oracle DB
• “ETL” via hand-coded python
• Grew from 10s of GBs (2006) to 1 TB/day new
data (2007), now 10x that
Source: cc-licensed slide by Cloudera
Hive Components
• Shell: allows interactive queries
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (MR, HDFS,
metadata)
• Metastore: schema, location in HDFS, SerDe
Source: cc-licensed slide by Cloudera
Data Model
• Tables
– Typed columns (int, float, string, boolean)
– Also list and map types (for JSON-like data)
• Partitions
– For example, range-partition tables by date
• Buckets
– Hash partitions within ranges (useful for sampling,
join optimization)
Source: cc-licensed slide by Cloudera
Metastore
• Database: namespace containing a set of
tables
• Holds table definitions (column types, physical
layout)
• Holds partitioning information
• Can be stored in Derby, MySQL, and many
other relational databases
Source: cc-licensed slide by Cloudera
Physical Layout
• Warehouse directory in HDFS
– E.g., /user/hive/warehouse
• Tables stored in subdirectories of warehouse
– Partitions form subdirectories of tables
• Actual data stored in flat files
– Control char-delimited text, or SequenceFiles
– With custom SerDe, can use arbitrary format
Source: cc-licensed slide by Cloudera
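How tables, partitions, and buckets map onto directories can be sketched as below. The table name `page_views` and partition column `dt` are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash function Hive applies when bucketing.

```python
import zlib

WAREHOUSE = "/user/hive/warehouse"

def partition_dir(table, **partition_values):
    """Tables are subdirectories of the warehouse; partitions nest as key=value."""
    parts = [f"{name}={value}" for name, value in partition_values.items()]
    return "/".join([WAREHOUSE, table] + parts)

def bucket_for(value, num_buckets):
    """Assign a row to a bucket by hashing the bucketing column (stand-in hash)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets
```

A query that filters on the partition column only has to read the matching subdirectory, and sampling can read a single bucket instead of the whole table.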
Hive: Example
• Hive looks similar to an SQL database
• Relational join on two tables:
– Table of word counts from Shakespeare collection
– Table of word counts from the bible
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;
word   s.freq   k.freq
the    25848    62394
I      23031    8854
and    19671    38985
to     18038    13526
of     16700    34654
a      14170    8057
you    12702    2720
my     11297    4135
in     10797    12445
is     8882     6884
Source: Material drawn from Cloudera training VM
Hive: Behind the Scenes
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;
(Abstract Syntax Tree)
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF shakespeare s) (TOK_TABREF bible k) (= (. (TOK_TABLE_OR_COL s)
word) (. (TOK_TABLE_OR_COL k) word)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR (. (TOK_TABLE_OR_COL s) word)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) freq)) (TOK_SELEXPR (.
(TOK_TABLE_OR_COL k) freq))) (TOK_WHERE (AND (>= (. (TOK_TABLE_OR_COL s) freq) 1) (>= (. (TOK_TABLE_OR_COL k)
freq) 1))) (TOK_ORDERBY (TOK_TABSORTCOLNAMEDESC (. (TOK_TABLE_OR_COL s) freq))) (TOK_LIMIT 10)))
(one or more MapReduce jobs)
Hive: Behind the Scenes
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        s
          TableScan
            alias: s
            Filter Operator
              predicate:
                expr: (freq >= 1)
                type: boolean
              Reduce Output Operator
                key expressions:
                  expr: word
                  type: string
                sort order: +
                Map-reduce partition columns:
                  expr: word
                  type: string
                tag: 0
                value expressions:
                  expr: freq
                  type: int
                  expr: word
                  type: string
        k
          TableScan
            alias: k
            Filter Operator
              predicate:
                expr: (freq >= 1)
                type: boolean
              Reduce Output Operator
                key expressions:
                  expr: word
                  type: string
                sort order: +
                Map-reduce partition columns:
                  expr: word
                  type: string
                tag: 1
                value expressions:
                  expr: freq
                  type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
            Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          outputColumnNames: _col0, _col1, _col2
          Filter Operator
            predicate:
              expr: ((_col0 >= 1) and (_col2 >= 1))
              type: boolean
            Select Operator
              expressions:
                expr: _col1
                type: string
                expr: _col0
                type: int
                expr: _col2
                type: int
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://localhost:8022/tmp/hive-training/364214370/10002
          Reduce Output Operator
            key expressions:
              expr: _col1
              type: int
            sort order: -
            tag: -1
            value expressions:
              expr: _col0
              type: string
              expr: _col1
              type: int
              expr: _col2
              type: int
      Reduce Operator Tree:
        Extract
          Limit
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: 10
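The plan above can be read as two MapReduce jobs: Stage-1 filters and tags rows from each table and joins them per word on the reduce side, and Stage-2 sorts the joined rows by Shakespeare frequency and applies the limit. The following is only a rough in-memory sketch of that idea in plain Python, not Hive's implementation. The function names are hypothetical, and the "thou" and "lord" rows are made-up values added to show that unmatched words drop out of an inner join.

```python
from collections import defaultdict

# Tiny stand-ins for the two tables: (word, freq) rows.
# The first three rows of each table come from the slide's result listing;
# "thou" and "lord" are invented so that some words have no join partner.
shakespeare = [("the", 25848), ("I", 23031), ("and", 19671), ("thou", 5000)]
bible = [("the", 62394), ("I", 8854), ("and", 38985), ("lord", 7000)]


def stage1_join(left, right):
    """Stage-1: filter and tag rows per table, then join them per key."""
    # word -> (freqs tagged 0, freqs tagged 1), mimicking the shuffle by word
    shuffled = defaultdict(lambda: ([], []))
    for tag, table in enumerate((left, right)):
        for word, freq in table:
            if freq >= 1:  # the map-side Filter Operator
                shuffled[word][tag].append(freq)
    joined = []
    for word, (s_freqs, k_freqs) in shuffled.items():
        for s_freq in s_freqs:  # inner join: pair every tag-0 row with every tag-1 row
            for k_freq in k_freqs:
                joined.append((word, s_freq, k_freq))
    return joined


def stage2_order_limit(rows, limit):
    """Stage-2: sort on s.freq descending, then keep the first `limit` rows."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


result = stage2_order_limit(stage1_join(shakespeare, bible), limit=10)
for row in result:
    print(row)
```

Words that appear in only one table never produce a pair in the nested loop, which is why the inner join needs no explicit "missing key" check.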
Example Data Analysis Task
Find users who tend to visit “good” pages.
Visits
user   url                 time
Amy    www.cnn.com         8:00
Amy    www.crap.com        8:05
Amy    www.myblog.com      10:00
Amy    www.flickr.com      10:05
Fred   cnn.com/index.htm   12:00
...

Pages
url              pagerank
www.cnn.com      0.9
www.flickr.com   0.9
www.myblog.com   0.7
www.crap.com     0.2
...

Pig Slides adapted from Olston et al.
Conceptual Dataflow
Load Visits(user, url, time) -> Canonicalize URLs
Load Pages(url, pagerank)
Join (url = url)
Group by user
Compute Average Pagerank
Filter (avgPR > 0.5)
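As a rough illustration of the conceptual dataflow, the sketch below runs the same steps (canonicalize, join, group, average, filter) in memory over the sample Visits and Pages rows from the example task. It is plain Python, not Pig. The `canonicalize` rule is a hypothetical stand-in, since the slides do not say how URLs are canonicalized.

```python
from collections import defaultdict

# Sample rows from the example task.
visits = [
    ("Amy", "www.cnn.com", "8:00"),
    ("Amy", "www.crap.com", "8:05"),
    ("Amy", "www.myblog.com", "10:00"),
    ("Amy", "www.flickr.com", "10:05"),
    ("Fred", "cnn.com/index.htm", "12:00"),
]
pages = {
    "www.cnn.com": 0.9,
    "www.flickr.com": 0.9,
    "www.myblog.com": 0.7,
    "www.crap.com": 0.2,
}


def canonicalize(url):
    """Hypothetical rule: drop a trailing /index.htm and add a www. prefix."""
    if url.endswith("/index.htm"):
        url = url[: -len("/index.htm")]
    if not url.startswith("www."):
        url = "www." + url
    return url


def good_users(visits, pages, threshold=0.5):
    """Return {user: average pagerank} for users whose average exceeds threshold."""
    ranks_by_user = defaultdict(list)
    for user, url, _time in visits:
        url = canonicalize(url)
        if url in pages:  # inner join on url
            ranks_by_user[user].append(pages[url])  # group by user
    averages = {user: sum(r) / len(r) for user, r in ranks_by_user.items()}
    return {user: avg for user, avg in averages.items() if avg > threshold}


result = good_users(visits, pages)
print(sorted(result))
```

Each step here touches the whole dataset in one process; the point of the system-level dataflow is that the same logical steps are spread across parallel MapReduce jobs.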
System-Level Dataflow
[Diagram: Visits -> load -> canonicalize; Pages -> load; both -> join by url -> group by user -> compute average pagerank -> filter -> the answer]
MapReduce Code
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key,
                Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and
            // store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();

            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1')
                    first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }

            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {

        public void map(
                Text k,
                Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url: the field between the first and second comma.
            // The search for the second comma starts one past the first,
            // otherwise indexOf would return firstComma again.
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma + 1);
            String key = line.substring(firstComma + 1, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
        implements Reducer<Text, LongWritable, WritableComparable,
        Writable> {

        public void reduce(
                Text key,
                Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
        implements Mapper<WritableComparable, Writable, LongWritable,
        Text> {

        public void map(
                WritableComparable key,
                Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable)val, (Text)key);
        }
    }

    public static class LimitClicks extends MapReduceBase
        implements Reducer<LongWritable, Text, LongWritable,
        Text> {

        int count = 0;

        public void reduce(
                LongWritable key,
                Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new
            Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp,
            new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new
            Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu,
            new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new
            Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new
            Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new
            Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new
            Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new
            Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new
            Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new
            Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
Pig Latin Script
-- Canonicalize is a user-defined function (UDF)
Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url) as url, time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate group as user,
    AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;
store GoodUsers into '/data/good_users';
Java vs. Pig Latin
1/20 the lines of code
1/16 the development time
[Bar charts: lines of code and development time (minutes), Hadoop vs. Pig]
Performance on par with raw Hadoop!
Pig
Pig takes care of…
• Schema and type checking
• Translating into efficient physical dataflow
– (i.e., sequence of one or more MapReduce jobs)
• Exploiting data reduction opportunities
– (e.g., early partial aggregation via a combiner)
• Executing the system-level dataflow
– (i.e., running the MapReduce jobs)
• Tracking progress, errors, etc.
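To make "early partial aggregation via a combiner" concrete, here is a minimal plain-Python sketch for an average. It is an illustration only, not Pig's generated code. Each mapper's values are collapsed into a (sum, count) pair before the shuffle, and the reducer merges the pairs. The function names and the two-mapper split are assumptions made for the example.

```python
def combine(pageranks):
    """Map-side combiner: collapse one mapper's values into a (sum, count) pair."""
    return (sum(pageranks), len(pageranks))


def reduce_average(partials):
    """Reducer: merge the (sum, count) pairs and compute the final average."""
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


# Pagerank values for one user, assumed to be split across two mappers.
mapper_outputs = [[0.9, 0.2], [0.7, 0.9]]

# Only one small pair per mapper crosses the network instead of every value.
partials = [combine(values) for values in mapper_outputs]
average = reduce_average(partials)
print(average)
```

Averaging the per-mapper averages directly would give the wrong answer when mappers see different numbers of values, which is why the partial result carries the count alongside the sum.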
Hive + HBase?
Integration
• Reasons to use Hive on HBase:
– A lot of data sits in HBase because of its use in a real-time environment, but is never used for analysis
– Give people who don’t code (business analysts) access to data in HBase that is usually only queried through MapReduce
– When a more flexible storage solution is needed, so that rows can be updated live by either a Hive job or an application and be seen immediately by the other
• Reasons not to do it:
– Running SQL queries on HBase to answer live user requests (it’s still an MR job)
– Hoping to see interoperability with other SQL analytics systems
Integration
• How it works:
– Hive can use tables that already exist in HBase or manage its own ones, but they still all reside in the same HBase instance
[Diagram: Hive table definitions -> HBase; one definition points to an existing table, another manages its table from Hive]
Integration
• How it works:
– When using an already existing table, defined as EXTERNAL, you can create multiple Hive tables that point to it
[Diagram: Hive table definitions -> HBase; one definition points to some columns, another points to other columns under different names]
Integration
• How it works:
– Columns are mapped however you want, changing names and giving types

Hive table definition: persons     HBase table: people
name STRING                        d:fullname
age INT                            d:age
(not mapped)                       d:address
siblings MAP<string, string>       f:
Integration
• Drawbacks (that can be fixed with brain juice):
– Binary keys and values (like integers represented on 4 bytes) aren’t supported since Hive prefers string representations, HIVE-1634
– Compound row keys aren’t supported, there’s no way of using multiple parts of a key as different “fields”
– This means that concatenated binary row keys are completely unusable, which is what people often use for HBase
– Filters are done at Hive level instead of being pushed to the region servers
– Partitions aren’t supported
Data Flows
• Data is being generated all over the place:
– Apache logs
– Application logs
– MySQL clusters
– HBase clusters
Data Flows
• Moving application log files
[Diagram: Wild log file -> read nightly -> transforms format -> dumped into HDFS]
[Diagram: Wild log file -> tail’ed continuously -> parses into HBase format -> inserted into HBase]
Data Flows
• Moving MySQL data
[Diagram: MySQL -> dumped nightly with CSV import -> HDFS]
[Diagram: MySQL -> Tungsten replicator -> parses into HBase format -> inserted into HBase]
Data Flows
• Moving HBase data
[Diagram: HBase Prod -> CopyTable MR job (read in parallel) -> imported in parallel into HBase MR]
* HBase replication currently only works for a single slave cluster, in our case HBase replicates to a backup cluster.
Use Cases
• Front-end engineers
– They need some statistics regarding their latest product
• Research engineers
– Ad-hoc queries on user data to validate some assumptions
– Generating statistics about recommendation quality
• Business analysts
– Statistics on growth and activity
– Effectiveness of advertiser campaigns
– Users’ behavior VS past activities to determine, for example, why certain groups react better to email communications
– Ad-hoc queries on stumbling behaviors of slices of the user base
Use Cases
• Using a simple table in HBase:
CREATE EXTERNAL TABLE blocked_users(
userid INT,
blockee INT,
blocker INT,
created BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,f:blockee,f:blocker,f:created")
TBLPROPERTIES("hbase.table.name" = "m2h_repl-userdb.stumble.blocked_users");
HBase is a special case here: it has a unique row key, which is mapped with :key
Not all the columns in the table need to be mapped
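The mapping string pairs Hive columns with HBase targets by position: the first entry goes with the first Hive column, and so on. The sketch below is a hypothetical plain-Python parser that makes this pairing explicit for the blocked_users definition. It is not the storage handler's code, and `map_columns` is an invented name. It assumes one mapping entry per Hive column and only handles the plain `:key` and `family:qualifier` forms.

```python
def map_columns(hive_columns, mapping):
    """Pair Hive column names with HBase targets, position by position."""
    targets = mapping.split(",")
    if len(targets) != len(hive_columns):
        # Assumption of this sketch: one mapping entry per Hive column.
        raise ValueError("one mapping entry is needed per Hive column")
    result = {}
    for column, target in zip(hive_columns, targets):
        if target == ":key":
            # The HBase row key has no column family or qualifier.
            result[column] = ("row key", None, None)
        else:
            family, _, qualifier = target.partition(":")
            result[column] = ("cell", family, qualifier)
    return result


hive_columns = ["userid", "blockee", "blocker", "created"]
mapping = ":key,f:blockee,f:blocker,f:created"

columns = map_columns(hive_columns, mapping)
for name, target in columns.items():
    print(name, "->", target)
```

Here `userid` ends up bound to the row key and the other three columns to cells in column family `f`.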
Use Cases
• Using a complicated table in HBase:
CREATE EXTERNAL TABLE ratings_hbase(
userid INT,
created BIGINT,
urlid INT,
rating INT,
topic INT,
modified BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key#b@0,:key#b@1,:key#b@2,default:rating#b,default:topic#b,default:modified#b")
TBLPROPERTIES("hbase.table.name" = "ratings_by_userid");
#b means binary, @ means position in composite key (SU-specific hack)
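To see what a composite binary row key looks like, the sketch below packs and unpacks the three key fields of ratings_hbase (userid INT, created BIGINT, urlid INT) with Python's `struct` module. The byte layout is an assumption for illustration: fixed-width big-endian integers concatenated in key-position order, which the slides do not spell out. The sample values are made up.

```python
import struct

# Assumed layout: 4-byte int, 8-byte long, 4-byte int, big-endian, no padding.
KEY_LAYOUT = ">iqi"


def encode_key(userid, created, urlid):
    """Concatenate the three fields into one binary row key."""
    return struct.pack(KEY_LAYOUT, userid, created, urlid)


def decode_key(row_key):
    """Split a binary row key back into (userid, created, urlid)."""
    return struct.unpack(KEY_LAYOUT, row_key)


row_key = encode_key(42, 1300000000, 7)  # made-up sample values
fields = decode_key(row_key)
print(len(row_key), fields)
```

A reader that treats the key as a single string cannot recover the three fields, which is the problem the `#b` and `@` annotations work around.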
Graph Databases
NEO4J (Graphbase)
• A graph is a collection of nodes (things) and edges (relationships) that connect pairs of nodes.
• Attach properties (key-value pairs) on nodes and relationships
• Relationships connect two nodes, and both nodes and relationships can hold an arbitrary number of key-value pairs.
• A graph database can be thought of as a key-value store, with full support for relationships.
• http://neo4j.org/
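The property-graph model described above can be sketched in a few lines of plain Python. This is not the Neo4j API. The class names, the relationship types (KNOWS, VISITED) and the sample properties are all hypothetical, chosen only to show nodes and relationships that each carry key-value pairs.

```python
class Node:
    """A node holding arbitrary key-value properties."""

    def __init__(self, **properties):
        self.properties = properties
        self.outgoing = []  # relationships that start at this node


class Relationship:
    """A typed, directed edge between two nodes, with its own properties."""

    def __init__(self, start, end, kind, **properties):
        self.start = start
        self.end = end
        self.kind = kind
        self.properties = properties
        start.outgoing.append(self)


def neighbors(node, kind):
    """Follow outgoing relationships of one type and return the end nodes."""
    return [rel.end for rel in node.outgoing if rel.kind == kind]


amy = Node(name="Amy")
fred = Node(name="Fred")
page = Node(url="www.cnn.com")

Relationship(amy, fred, "KNOWS", since=2010)
Relationship(amy, page, "VISITED", time="8:00")

friend_names = [n.properties["name"] for n in neighbors(amy, "KNOWS")]
print(friend_names)
```

Following a relationship is a direct reference lookup from the node, with no join against a separate table.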
NEO4J Features
• Dual license: open source and commercial
• Well suited for many web use cases such as tagging, metadata annotations, social networks, wikis and other network-shaped or hierarchical data sets
• Intuitive graph-oriented model for data representation. Instead of static and rigid tables, rows and columns, you work with a flexible graph network consisting of nodes, relationships and properties.
• For traversal-heavy queries on connected data, Neo4j claims performance improvements on the order of 1000x or more compared to relational DBs.
• A disk-based, native storage manager completely optimized for storing graph structures for maximum performance and scalability
• Massive scalability. Neo4j can handle graphs of several billion nodes/relationships/properties on a single machine and can be sharded to scale out across multiple machines
• Fully transactional like a real database
• Neo4j traverses depths of 1000 levels and beyond at millisecond speed (many orders of magnitude faster than relational systems)