NoSQL Databases and Data in the Cloud

Robert Thew
Internet and Web Systems II
Spring 2011
Abstract
With the rapid increase in data being generated by the Web, Relational Database
Management Systems (RDBMS) have had their limitations exposed. These problems
have spurred the development of a class of databases that abandon the standard
SQL approach and embrace many different approaches to dealing with
today’s data issues.
The Problem With Relational Databases
Relational databases have rigid schemas that are difficult to modify once millions of
records have been added to them. They have scalability issues as well when tables
contain millions of records, let alone billions. Joins can become virtually impossible
as the number of relations grows.
Various partitioning schemes have been developed to address the scalability issues.
Vertical Partitioning groups tables by functional area and places them on different
nodes. Horizontal Partitioning, also called sharding, splits tables by keys and
places these shards on different nodes. Both methods can improve performance, but
they have the same downside: links between tables get broken because of the
partitioning, and the databases are no longer truly relational.
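Horizontal partitioning can be illustrated with a minimal Python sketch. The node names, the `shard_for` and `partition` helpers, and the use of an MD5 hash to choose a node are all assumptions made for illustration; real systems use a variety of placement strategies.

```python
import hashlib

# Hypothetical node names; a real deployment would use actual hosts.
NODES = ["node-a", "node-b", "node-c"]

def shard_for(key, nodes=NODES):
    """Pick the node that stores the row with the given key."""
    # A stable hash keeps the same key on the same node across runs.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def partition(rows, key_field, nodes=NODES):
    """Group rows (dicts) by the node their key hashes to."""
    shards = {node: [] for node in nodes}
    for row in rows:
        shards[shard_for(str(row[key_field]), nodes)].append(row)
    return shards

users = [{"id": i, "name": f"user{i}"} for i in range(10)]
shards = partition(users, "id")
```

Rows from two related tables that are sharded on different keys will generally land on different nodes, which is why a join between them can no longer be answered by a single server.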
RDBMSs put a premium on Data Integrity, but that is not the most important issue
to many of the biggest web sites; High Availability and Scalability are. (Bain, 2009)
NoSQL databases have been created to fill the needs that are not being met by
RDBMSs.
NoSQL
NoSQL is a term that describes a data storage system that is not a typical relational
database management system. It is not a defined language like SQL; it is just an
informal term for a group of databases. (NoSQL.org)
Many types of databases fall under the NoSQL label. Being defined by what it is not,
rather than what it is, means that this term can cover a great variety of database
systems:
Document Stores: CouchDB, MongoDB
Key/Value Stores: BigTable, Cassandra
Graph Databases: Neo4j, AllegroGraph
Tabular Databases: HBase, Hypertable
Each type of DBMS has its own strengths and weaknesses. Having different types of
database means that developers aren’t reliant on a one-size-fits-all solution and can
choose a DBMS that is a good match for whatever problems they are facing.
Terminology
In discussing NoSQL databases, some terms are frequently used. They are defined
below.
JSON
JavaScript Object Notation (JSON) is a simple text-based standard for representing
objects. It is based on JavaScript syntax, and a JSON object can be instantiated in
JavaScript using an eval statement, although a dedicated JSON parser is the safer
choice. At heart it is a key/value pair notation, with curly braces to group the
pairs into objects and square brackets to group values into lists. (Introducing JSON)
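As a small illustration of the notation, the sketch below parses a JSON text into native objects and serializes it back, using the `json` module from Python's standard library. The sample record and its field names are made up for the example.

```python
import json

# A made-up JSON text: an object containing a string, a list, and a boolean.
text = '{"name": "Ada", "languages": ["Python", "C"], "active": true}'

# Parse the text into native Python objects (dict, list, str, bool).
obj = json.loads(text)

# Serialize back to JSON text; sort_keys makes the output order deterministic.
round_trip = json.dumps(obj, sort_keys=True)
```

Using a parser rather than evaluating the text as code means that untrusted input is only ever treated as data.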
MapReduce
MapReduce is a software framework developed by Google for efficient and
distributed processing of large data sets. It consists of two steps: Map and Reduce. In
the Map step, the data set is partitioned into a series of smaller sets and distributed to
worker nodes. The worker nodes may perform the Map step themselves, to further
partition and distribute the data, resulting in a tree structure of nodes. In the Reduce
step, each worker node processes its data set and passes the results up to its parent.
This allows for each node to be run in a parallel process or run by different CPUs.
(Williams)
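The two steps can be sketched in a few lines of Python using a word count, a common teaching example. This is a single-process illustration under simplifying assumptions, not a distributed framework; the function names and the number of workers are hypothetical.

```python
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """Map step: count the words in one partition of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right

def word_count(lines, workers=3):
    # Partition the data set into one chunk per (simulated) worker.
    chunks = [lines[i::workers] for i in range(workers)]
    # Each chunk is independent, so a real system could map them in parallel.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce(reduce_counts, partials, Counter())

lines = ["the quick brown fox", "the lazy dog", "the fox"]
totals = word_count(lines)
```

Because no chunk depends on another, the map calls could run on separate processes or machines, with only the small partial results passed back for merging.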
REST
Representational State Transfer (REST) is an alternative to SOAP that has been
designed to be a simpler and more consistent method for transferring resource
representations between clients and servers. When using SOAP, developers define
their own API for handling objects, but with REST, the basic Create, Read, Update,
and Delete (CRUD) operations follow the same pattern for every instance.
A system that implements the REST architecture is said to be RESTful. The
simplicity of REST has made it a popular choice to replace SOAP in web
development. Ruby on Rails implements the REST architecture by default.
(Rodriguez)
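The uniform CRUD pattern is commonly realized by mapping each operation onto an HTTP method. The sketch below shows one conventional mapping; the `rest_request` helper and the `books` and `authors` resources are hypothetical names used only for illustration.

```python
# Conventional mapping from CRUD operations to HTTP methods.
CRUD_TO_HTTP = {
    "create": "POST",
    "read": "GET",
    "update": "PUT",
    "delete": "DELETE",
}

def rest_request(operation, collection, item_id=None):
    """Return the (method, path) pair for a CRUD operation on a resource."""
    method = CRUD_TO_HTTP[operation]
    # Creating targets the collection; the other operations target one item.
    if operation == "create":
        return method, f"/{collection}"
    if item_id is None:
        raise ValueError(f"{operation} needs an item id")
    return method, f"/{collection}/{item_id}"

read_book = rest_request("read", "books", 7)
new_author = rest_request("create", "authors")
```

The same four method and path shapes apply to any resource type, which is the consistency the paragraph above describes.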
Types Of Database Management Systems
There are many different types of NoSQL database management systems. The ones
discussed in this paper can be grouped into four categories.
Document Stores
CouchDB
CouchDB is a non-relational document store whose documents are JSON documents.
It was developed in 2005 by Damien Katz, and in 2008 it became a top-level Apache
project. It was originally written in C++ but was ported to Erlang because of that
language’s concurrency and greater emphasis on fault tolerance. CouchDB has a
RESTful API that passes messages in JSON form. A CouchDB instance can be accessed
using HTTP requests. (CouchDB.org)
Documents in CouchDB are in JSON format, consisting of named fields and values,
like:
{
  "FirstName": "Robert",
  "LastName": "Thew",
  "Date": "4/15/2011",
  "Platforms": ["Linux", "Windows", "Mac OS"]
}
The values can be strings, numbers, lists or hashes, with dates stored as strings. Each document has a
unique ID, but that is the extent of the schema. The document contents have a
format, but no schema. This solves the problem that relational databases have with
data that keeps evolving over time. The longer a relational database remains in use,
the harder it is to make changes to its schema. Oftentimes this is not a problem with
the database itself, but with the various interfaces that depend on the database. A
simple change to the database that can be done with a single Alter command may
end up breaking front-end applications and websites, resulting in the need for
extensive code changes.
Having no schema eliminates these evolving data problems from the database and
can also lessen the front-end dependencies. Knowing that there is no set schema
changes a developer’s approach to creating interfaces to the database.
CouchDB doesn’t use locking on its documents, the way relational databases do on
their records. Given its document model, it uses a system more akin to version
control. If two clients edit the same document, the last client to save its updates
will receive a conflict warning. That client then has the option of loading the
latest update, reapplying its edits, and saving again.
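The save-and-retry cycle described above can be sketched with a tiny in-memory store. This is an illustration of the general technique (optimistic concurrency using revision numbers), not CouchDB's actual implementation or API; the `DocumentStore` class, `ConflictError`, and the integer revisions are assumptions made for the example.

```python
class ConflictError(Exception):
    """Raised when a save is based on a stale revision."""

class DocumentStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (revision number, body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)  # copy so callers cannot mutate the store

    def save(self, doc_id, body, rev=None):
        current_rev = self._docs.get(doc_id, (None, None))[0]
        # Reject the write if someone else saved since this client read.
        if rev != current_rev:
            raise ConflictError(f"expected rev {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev

store = DocumentStore()
store.save("doc1", {"title": "draft"})

# Two clients read the same revision of the document.
rev_a, doc_a = store.get("doc1")
rev_b, doc_b = store.get("doc1")

doc_a["title"] = "edited by A"
store.save("doc1", doc_a, rev_a)      # the first save succeeds

doc_b["author"] = "B"
try:
    store.save("doc1", doc_b, rev_b)  # the second save is stale
    conflict = False
except ConflictError:
    conflict = True
    # Load the latest revision, reapply the edit, and save again.
    rev_latest, latest = store.get("doc1")
    latest["author"] = "B"
    store.save("doc1", latest, rev_latest)
```

No client ever waits on a lock; the cost of a collision is paid only by the writer who saved second.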
To apply some structure to the document data, views can be constructed. Views are
defined by JavaScript functions that serve as the Map step of a MapReduce system.
As should be somewhat obvious from the use of JavaScript, HTTP and a RESTful API,
CouchDB has been designed to be used on the web.
MongoDB
Like CouchDB, MongoDB is a JSON document store. MongoDB is written in C++ and
was released in 2009. (MongoDB.org)
MongoDB is more focused on performance than CouchDB. It doesn’t use a RESTful
API, but has language-specific database drivers for each client language. Its C++
implementation can run faster than CouchDB’s Erlang, but MongoDB can lose data
upon crashes, unlike CouchDB. (MongoDB.org)
MongoDB does query optimization with automatic indexing, unlike CouchDB, where
indexes have to be defined by users. MongoDB also does Horizontal Scaling using
sharding based on keys.
CouchDB uses Map/Reduce exclusively for building views. MongoDB has a more
extensive JavaScript-based query system, with Map/Reduce being just one part of it.
Key/Value Stores
BigTable
BigTable is a proprietary DBMS developed by Google in 2004. It is not available
outside of the Google App Engine Datastore. BigTable is built with scalability in
mind, and the ability to run on cheap hardware. It is built to be scaled across
thousands of server nodes, and to allow for rapid increase in the number of nodes,
for fast increases in scale. (Hitchcock)
The tables in BigTable are multidimensional maps. Tables consist of rows and
columns, and each cell in the table has a time version. There can be many versions of
each cell with different times, so a history of changes can be tracked. The tables are
split on row boundaries into tablets, each about 200 MB.
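The idea of a multidimensional map with timestamped cells can be sketched as follows. This is a toy in-memory model, not BigTable's implementation; the `VersionedTable` class, its method names, and the sample row and column are hypothetical.

```python
class VersionedTable:
    """Map (row, column) to a list of (timestamp, value) versions."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, timestamp, value):
        versions = self._cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        # Keep the versions ordered from oldest to newest.
        versions.sort(key=lambda version: version[0])

    def get(self, row, column, at=None):
        """Return the newest value, or the newest one at or before `at`."""
        versions = self._cells.get((row, column), [])
        if at is not None:
            versions = [v for v in versions if v[0] <= at]
        return versions[-1][1] if versions else None

table = VersionedTable()
table.put("com.example", "contents", 1, "<html>v1</html>")
table.put("com.example", "contents", 5, "<html>v2</html>")

latest = table.get("com.example", "contents")
as_of_3 = table.get("com.example", "contents", at=3)
```

Writing a new value never overwrites the old one, so reading with an earlier timestamp recovers the history of a cell.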
BigTable could also be considered a Tabular DBMS.
Cassandra
Cassandra is an open-source DBMS that was developed by Facebook in 2008 but has
since become a top-level Apache project. Cassandra follows the BigTable data
model, and like BigTable is designed to scale across many cheap servers.
(Cassandra.org)
Data gets automatically replicated across multiple nodes, so it is fault-tolerant, and
any server node that fails can be replaced with no downtime or loss of data.
In its data model, Keys map to Columns, each of which stores a column name, a value,
and a timestamp. Columns are grouped together into Column Families, which are
roughly analogous to tables. Columns can be sorted by name or by timestamp.
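A rough sketch of this data model in Python follows. It is not Cassandra's API; the `Column` and `ColumnFamily` classes and the sample row are hypothetical, and the rule that the write with the newest timestamp wins is an assumption added for the example.

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    value: str
    timestamp: int

class ColumnFamily:
    """Map each row key to its columns, similar in spirit to a table."""

    def __init__(self):
        self._rows = {}  # key -> {column name -> Column}

    def insert(self, key, name, value, timestamp):
        row = self._rows.setdefault(key, {})
        existing = row.get(name)
        # Assumption: the write with the newest timestamp wins.
        if existing is None or timestamp >= existing.timestamp:
            row[name] = Column(name, value, timestamp)

    def columns(self, key, sort_by="name"):
        """Return a row's columns sorted by 'name' or 'timestamp'."""
        return sorted(self._rows.get(key, {}).values(),
                      key=lambda col: getattr(col, sort_by))

users = ColumnFamily()
users.insert("rthew", "last_name", "Thew", 2)
users.insert("rthew", "first_name", "Robert", 3)
users.insert("rthew", "email", "old@example.com", 1)

by_name = [c.name for c in users.columns("rthew")]
by_time = [c.name for c in users.columns("rthew", sort_by="timestamp")]
```

Unlike a relational table, each row here carries its own set of column names, so two keys in the same column family need not share any columns.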
Aside from Facebook, Cassandra is also used by high-volume sites such as
Twitter, Digg, and Reddit.
Graph Databases
Neo4J
Neo4J uses a Graph model for storing its data and is the most widely deployed DBMS
using this model. It is built in Java and was released in 2007. Using a Graph model
means it has nodes and links between nodes, and both the nodes and links can have
properties associated with them. The benefit of such a model is that it results in
high performance when traversing connected datasets. (Neo4J.org)
In a relational database, joins are a time-expensive operation, and when dealing
with massive datasets, they can be crippling. The Graph model, using node
traversals rather than joins, excels at handling these operations.
Social websites, in particular, derive much of their value from the relationships
that connect their users, which makes them a natural fit for the Graph model.
Generating recommendations is another good fit, since recommendations can be
derived from linking patterns.
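A minimal property-graph sketch shows how a recommendation can fall out of a traversal rather than a join. This is plain Python, not Neo4J's API; the `Graph` class, the `FOLLOWS` and `LIKES` relationship names, and the sample data are all hypothetical.

```python
from collections import Counter

class Graph:
    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = {}  # node id -> list of (relationship, target, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, rel, target, **props):
        self.edges[source].append((rel, target, props))

    def neighbors(self, node_id, rel):
        """Follow links of one relationship type out of a node."""
        return [target for r, target, _ in self.edges[node_id] if r == rel]

def recommend(graph, user):
    """Suggest items liked by people the user follows, most popular first."""
    already = set(graph.neighbors(user, "LIKES"))
    scores = Counter()
    for friend in graph.neighbors(user, "FOLLOWS"):
        for item in graph.neighbors(friend, "LIKES"):
            if item not in already:
                scores[item] += 1
    return [item for item, _ in scores.most_common()]

g = Graph()
for name in ("ann", "bob", "cat"):
    g.add_node(name, kind="user")
for item in ("pizza", "sushi", "tacos"):
    g.add_node(item, kind="item")
g.add_edge("ann", "FOLLOWS", "bob", since=2010)
g.add_edge("ann", "FOLLOWS", "cat", since=2011)
g.add_edge("ann", "LIKES", "pizza")
g.add_edge("bob", "LIKES", "pizza")
g.add_edge("bob", "LIKES", "sushi")
g.add_edge("cat", "LIKES", "sushi")
g.add_edge("cat", "LIKES", "tacos")

suggestions = recommend(g, "ann")
```

The traversal touches only the nodes reachable from the starting user, so its cost depends on the size of that neighborhood rather than on the total size of the dataset.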
AllegroGraph
AllegroGraph is a closed-source commercial database created by Franz, Inc. that was
released in 2010. It is a Graph model database that stores Resource Description
Framework (RDF) triples. A DBMS that stores RDF is called a Triplestore. A Triple is
a tuple with exactly three parts, a subject, a predicate, and an object, so it can
record an object having an attribute or a relation that connects two objects. Some
examples would be “Rob is Blonde”, “Bill likes Pizza”, or “John loves Mary”.
In the database, the triples get represented as a set of connected links. Like Neo4J,
AllegroGraph excels at the relationship queries that would require joins in a
relational database. (Franz)
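A triplestore can be approximated in a few lines of Python. The sketch below is not AllegroGraph's API or an RDF implementation; the `TripleStore` class and its `match` method are hypothetical, and the triple "Mary likes Pizza" is added only so that one query returns more than one result.

```python
class TripleStore:
    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return the triples matching a pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple for triple in self._triples
            if all(p is None or p == v for p, v in zip(pattern, triple))
        )

triples = TripleStore()
triples.add("Rob", "is", "Blonde")
triples.add("Bill", "likes", "Pizza")
triples.add("John", "loves", "Mary")
triples.add("Mary", "likes", "Pizza")  # extra triple added for illustration

# Who likes pizza? Leave the subject open and fix the other two parts.
pizza_fans = [s for s, _, _ in triples.match(predicate="likes", obj="Pizza")]
```

Every query has the same shape, a pattern with some positions left open, which is what lets a triplestore answer questions about data whose structure was not planned in advance.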
AllegroGraph is currently being used by Pfizer to enable searching across project
data for drug development. (Marx, 2010)
Tabular Databases
HBase
HBase is modeled after Google’s BigTable, but it is open-source. It is designed to run
on top of the Hadoop Distributed File System (HDFS), providing BigTable-like
capabilities for Hadoop. Both Hadoop and HBase are Apache projects. (HBase.org)
Hadoop is a software framework that is designed for reliable, scalable, distributed
applications. Like Apache itself, there are many subprojects built for the Hadoop
framework, and HBase is the database for Hadoop.
HBase is built to handle extreme table sizes, billions of rows and millions of
columns, which would cripple a relational DBMS. HBase is written in Java and
exposes a RESTful web service.
Hypertable
Like HBase, Hypertable is an open-source DBMS inspired by BigTable. It is written
in C++ for performance reasons. It is a cross-platform DBMS and can run on
distributed file systems such as the Hadoop Distributed File System. (Hypertable)
The Rise of NoSQL
There are many reasons for the rise in the use of NoSQL databases, some
technological and some economic. The growth of data that is being stored in the
cloud has fueled the rise as well, as developers were pushed past the capabilities of
Relational DBMSs and had to look for, or build, alternatives. (North, 2010)
Scalability is an expensive problem to fix. There’s the hardware expense that comes
from adding more server nodes, but when using a commercial DBMS, there is also
the cost of licenses for the added server nodes. These costs have driven some
companies, like Google with BigTable, to create their own DBMS that is both tailored
to their particular set of problems and, for them at least, license free. These
efforts have inspired others in the Open Source community to build free versions of
these proprietary systems.
SQL databases have been used as a general tool for solving any database problem. They
have this ability, but it quickly runs into limits as the amount of data grows. The new
generation of NoSQL databases is much more specialized. Now, developers can
examine what problems they need to have solved, which capabilities are important
to them and which are not, and choose the DBMS that matches their problem set.
For a company like Twitter or Facebook, scalability and high availability are much
more important than data consistency. For others the main problem to be solved is
distributed processing.
Conclusion
All the databases detailed above have a similarity other than NoSQL – they have all
been developed since the rise of the web to deal with the issues that have arisen
because of the vast amounts of data being generated every day on the Internet. They
are all, to some degree, Web databases. What they mean for developers is choice, the
opportunity to choose the right tool for the job rather than trying to force one tool to
do all jobs. There’s a saying, “When all you have is a hammer, all your problems begin
to look like nails”, but that’s not really true. When all you have is a hammer, you
don’t have any choice but to treat all your problems like nails. You know they’re not
nails, but you hammer away regardless. Well now we have some saws and drills and
wrenches to go along with our hammers.
Works Cited
Bain, T. (2009, 02 12). Is the Relational Database Doomed? Retrieved from
ReadWrite Enterprise: http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php
Cassandra.org. (n.d.). Cassandra. Retrieved from Cassandra:
http://cassandra.apache.org/
CouchDB.org. (n.d.). The CouchDB Project. Retrieved from The CouchDB Project:
http://couchdb.apache.org/
Franz. (n.d.). AllegroGraph. Retrieved from AllegroGraph:
http://www.franz.com/agraph/allegrograph/
HBase.org. (n.d.). HBase. Retrieved from HBase: http://hbase.apache.org/
Hitchcock, A. (n.d.). Google's BigTable. Retrieved from Google Blogoscoped:
http://blogoscoped.com/archive/2005-10-23-n61.html
Hypertable. (n.d.). Hypertable. Retrieved from Hypertable:
http://www.hypertable.org/
Introducing JSON. (n.d.). Retrieved from JSON: http://www.json.org/
Marx, V. (2010, May). Pfizer Partners with IO, Franz on Semantic Proof of Concept to
Build Bridges Between Data Resources. Retrieved from GenomeWeb:
http://www.genomeweb.com/informatics/pfizer-partners-io-franz-semanticproof-concept-build-bridges-between-data-resou
MongoDB.org. (n.d.). Comparing MongoDB and CouchDB. Retrieved from MongoDB:
http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
MongoDB.org. (n.d.). MongoDB. Retrieved from MongoDB:
http://www.mongodb.org/
Neo4J.org. (n.d.). Neo4J. Retrieved from Neo4J: http://neo4j.org/
North, K. (2010, May 22). The NoSQL Alternative. Retrieved April 24, 2011, from
Information Week:
http://www.informationweek.com/news/development/architecturedesign/showArticle.jhtml?articleID=224900559
NoSQL.org. (n.d.). NoSQL. Retrieved from NoSQL: http://nosql-databases.org/
Rodriguez, A. (n.d.). RESTful Web Services: The Basics. Retrieved from IBM
developerWorks: https://www.ibm.com/developerworks/webservices/library/ws-restful/
Williams, M. (n.d.). Understanding MapReduce. Retrieved from Matt Williams:
http://wordflows.com/matt/2009/01/18/understanding-mapreduce/