NoSQL Databases and Data in the Cloud
Robert Thew
Internet and Web Systems II
Spring 2011

Abstract

With the rapid increase in data being generated by the Web, Relational Database Management Systems (RDBMSs) have had their limitations exposed. These problems have spurred the development of a class of databases that abandon the standard SQL approach and embrace many different approaches to dealing with today’s data issues.

The Problem With Relational Databases

Relational databases have rigid schemas that are difficult to modify once millions of records have been added to them. They also have scalability issues when tables contain millions of records, let alone billions. Joins can become virtually impossible as the number of relations grows.

Various partitioning schemes have been developed to address the scalability issues. Vertical Partitioning groups tables by functional area and places them on different nodes. Horizontal Partitioning, also called sharding, splits tables by key and places the resulting shards on different nodes. Both methods can improve performance, but they have the same downside: links between tables are broken by the partitioning, and the databases are no longer truly relational.

RDBMSs put a premium on Data Integrity, but that is not the most important issue for many of the biggest web sites; High Availability and Scalability are. (Bain, 2009) NoSQL databases have been created to fill the needs that are not being met by RDBMSs.

NoSQL

NoSQL is a term that describes a data storage system that is not a typical relational database management system. It is not a defined language like SQL; it is just an informal term for a group of databases. (NoSQL.org) Many types of databases fall under the NoSQL label.
Being defined by what it is not, rather than what it is, means that this term can cover a great variety of database systems:

Document Stores: CouchDB, MongoDB
Key/Value Stores: BigTable, Cassandra
Graph Databases: Neo4j, AllegroGraph
Tabular Databases: HBase, Hypertable

Each type of DBMS has its own strengths and weaknesses. Having different types of database means that developers aren’t reliant on a one-size-fits-all solution and can choose a DBMS that is a good match for whatever problems they are facing.

Terminology

Some terms come up frequently when discussing NoSQL databases. They are defined below.

JSON

JavaScript Object Notation (JSON) is a simple text-based standard for representing objects. It is based on JavaScript, and JSON text can be instantiated as a JavaScript object using an eval() call. At heart it is a key/value pair notation, with braces to group the pairs into objects. (Introducing JSON)

MapReduce

MapReduce is a software framework developed by Google for efficient, distributed processing of large data sets. It consists of two steps: Map and Reduce. In the Map step, the data set is partitioned into a series of smaller sets that are distributed to worker nodes. A worker node may repeat the Map step itself, further partitioning and distributing its data, which results in a tree structure of nodes. Each worker node processes its own data set and passes the results up to its parent. In the Reduce step, the results passed up from the workers are combined into the final output. Because each node works only on its own partition, the nodes can run as parallel processes or on different CPUs. (Williams)

REST

Representational State Transfer (REST) is an alternative to SOAP that has been designed to be a simpler and more consistent method for transferring resource representations between clients and servers. When using SOAP, developers define their own API for handling objects, but with REST, the basic Create, Read, Update, and Delete (CRUD) operations follow the same pattern for every resource.
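The uniform CRUD pattern can be illustrated with a minimal Python sketch. It assumes the common convention of mapping Create, Read, Update, and Delete to the HTTP methods POST, GET, PUT, and DELETE; the helper name rest_request and the resource names are hypothetical and only build a method and path, without contacting any server.

```python
# Conventional (not mandated) mapping of CRUD operations to HTTP methods.
CRUD_TO_HTTP = {
    "create": "POST",
    "read": "GET",
    "update": "PUT",
    "delete": "DELETE",
}


def rest_request(operation, collection, item_id=None):
    """Return the (HTTP method, path) pair for a CRUD operation."""
    method = CRUD_TO_HTTP[operation]
    # Creating targets the collection; the other operations target one item.
    if operation == "create":
        return method, f"/{collection}"
    if item_id is None:
        raise ValueError(f"{operation} needs an item_id")
    return method, f"/{collection}/{item_id}"


# The same pattern applies to any resource type (names are hypothetical).
for resource in ("articles", "users"):
    for op in ("create", "read", "update", "delete"):
        print(op, rest_request(op, resource, None if op == "create" else 42))
```

Because the mapping never mentions a particular resource, adding a new resource type requires no new API vocabulary, which is the consistency the REST approach is aiming for.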
A system that implements the REST architecture is said to be RESTful. The simplicity of REST has made it a popular choice to replace SOAP in web development. Ruby on Rails implements the REST architecture by default. (Rodriguez)

Types Of Database Management Systems

There are many different types of NoSQL database management systems. The ones discussed in this paper can be grouped into four categories.

Document Stores

CouchDB

CouchDB is a non-relational document store whose documents are JSON documents. It was developed in 2005 by Damien Katz, and in 2008 it became a top-level Apache project. It was originally written in C++ but was ported to Erlang because of that language’s support for concurrency and its greater emphasis on fault tolerance. CouchDB has a RESTful API that passes messages in JSON form, so a CouchDB instance can be accessed using HTTP requests. (CouchDB.org)

Documents in CouchDB are in JSON format, consisting of named fields and values, like:

    "FirstName": "Robert"
    "LastName": "Thew"
    "Date": "4/15/2011"
    "Platforms": ["Linux", "Windows", "Mac OS"]

The values can be strings, numbers, lists or hashes; dates, as in the example, are stored as strings. Each document has a unique ID, but that is the extent of the schema. The document contents have a format, but no schema.

This solves the problem that relational databases have with data that keeps evolving over time. The longer a relational database remains in use, the harder it is to make changes to its schema. Oftentimes this is not a problem with the database itself, but with the various interfaces that depend on the database. A simple change to the database that can be done with a single ALTER command may end up breaking front-end applications and websites, resulting in the need for extensive code changes. Having no schema removes these evolving-data problems from the database and can also lessen the front-end dependencies. Knowing that there is no set schema changes a developer’s approach to creating interfaces to the database.
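The schema-free document model described above can be illustrated with a small Python sketch. It uses only the standard library; the in-memory dictionary and the save_document and load_document helpers are hypothetical stand-ins for a document store and are not CouchDB’s API. The second document and its Email field are invented for illustration.

```python
import json
import uuid

# Hypothetical in-memory stand-in for a document store: maps ID -> JSON text.
store = {}


def save_document(fields):
    """Serialize a dict of fields to JSON and file it under a new unique ID."""
    doc_id = uuid.uuid4().hex
    store[doc_id] = json.dumps(fields)
    return doc_id


def load_document(doc_id):
    """Parse the stored JSON text back into a dict."""
    return json.loads(store[doc_id])


first_id = save_document({
    "FirstName": "Robert",
    "LastName": "Thew",
    "Date": "4/15/2011",
    "Platforms": ["Linux", "Windows", "Mac OS"],
})

# A later document adds a field and omits others; no schema change is needed.
second_id = save_document({"FirstName": "Bill", "Email": "bill@example.com"})

for doc_id in (first_id, second_id):
    print(doc_id, sorted(load_document(doc_id)))
```

The only thing the two documents share is that each has a unique ID, so code that reads them has to check for the fields it needs rather than assume a fixed set of columns.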
CouchDB doesn’t lock its documents the way relational databases lock their records. Given its document model, it uses a system more akin to version control. If two clients edit the same document, the client that saves last receives a conflict warning. That client then has the option of loading the latest version, reapplying its edits, and saving again.

To apply some structure to the document data, views can be constructed. Views are defined using JavaScript as the Map part of a Map/Reduce system.

As should be somewhat obvious from the use of JavaScript, HTTP and a RESTful API, CouchDB has been designed to be used on the web.

MongoDB

Like CouchDB, MongoDB is a JSON document store. MongoDB is written in C++ and was released in 2009. (MongoDB.org)

MongoDB is more focused on performance than CouchDB. It doesn’t use a RESTful API, but has a language-specific database driver for each client language. C++ can run faster than Erlang, but MongoDB can lose data in a crash, unlike CouchDB. (MongoDB.org)

MongoDB does query optimization with automatic indexing, unlike CouchDB, where indexes have to be defined by users. MongoDB also does Horizontal Scaling, using sharding based on keys.

CouchDB uses Map/Reduce exclusively for building views. MongoDB has a more extensive JavaScript-based query system, with Map/Reduce being just one part of it.

Key/Value Stores

BigTable

BigTable is a proprietary DBMS developed by Google in 2004. It is not available outside of the Google App Engine Datastore. BigTable is built with scalability in mind and with the ability to run on cheap hardware. It is built to be scaled across thousands of server nodes and to allow nodes to be added rapidly when a fast increase in scale is needed. (Hitchcock)

The tables in BigTable are multidimensional maps. Tables consist of rows and columns, and each cell in the table has a time version.
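The idea of a table as a map from row, column, and time to a value can be sketched in a few lines of Python. This is a toy, single-process illustration and not BigTable’s API; the class name VersionedTable, its methods, and the sample row and column names are all hypothetical.

```python
from collections import defaultdict


class VersionedTable:
    """Toy map of (row, column) -> list of (timestamp, value) versions."""

    def __init__(self):
        self._cells = defaultdict(list)

    def put(self, row, column, timestamp, value):
        """Store a new version of a cell, keeping versions newest first."""
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0], reverse=True)

    def get(self, row, column, as_of=None):
        """Return the newest value, or the newest at or before `as_of`."""
        for timestamp, value in self._cells.get((row, column), []):
            if as_of is None or timestamp <= as_of:
                return value
        return None


table = VersionedTable()
table.put("com.example/index", "contents", 1, "<html>v1</html>")
table.put("com.example/index", "contents", 5, "<html>v2</html>")
print(table.get("com.example/index", "contents"))           # newest version
print(table.get("com.example/index", "contents", as_of=3))  # version as of time 3
```

Keeping every version under the same row and column key is what makes reading a cell as it stood at an earlier time a simple lookup rather than a separate audit table.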
There can be many versions of each cell with different times, so a history of changes can be tracked. The tables are split on row boundaries into tablets, each about 200 MB. BigTable could also be considered a Tabular DBMS.

Cassandra

Cassandra is an open-source DBMS that was developed by Facebook in 2008 but has since become a top-level Apache project. Cassandra follows the BigTable data model and, like BigTable, is designed to scale across many cheap servers. (Cassandra.org) Data is automatically replicated across multiple nodes, so the system is fault-tolerant, and any server node that fails can be replaced with no downtime or loss of data.

In its data model, Keys map to Columns, each of which stores a column name, a value, and a timestamp. Columns are grouped together into Column Families, which are roughly analogous to tables. Columns can be sorted by name or by timestamp.

Aside from Facebook, Cassandra is also used by high-volume sites such as Twitter, Digg, and Reddit.

Graph Databases

Neo4j

Neo4j uses a Graph model for storing its data and is the most widely deployed DBMS using this model. It is built in Java and was released in 2007. In the Graph model, data is stored as nodes and links between nodes, and both the nodes and the links can have properties associated with them. The benefit of such a model is high performance when traversing connected datasets. (Neo4J.org)

In a relational database, joins are a time-expensive operation, and when dealing with massive datasets they can be crippling. The Graph model, which uses node traversals rather than joins, excels at handling these operations. Social websites, in particular, derive much of their value from the relationships that connect their users. Generating recommendations is another good fit for the Graph model, since recommendations can be derived from linking patterns.

AllegroGraph

AllegroGraph is a commercial DBMS created by Franz, Inc. that was released in 2010.
It is a Graph model database that stores Resource Description Framework (RDF) triples. A DBMS that stores RDF is called a Triplestore. A Triple is like a tuple, but it stores exactly three pieces of information, such as an object having an attribute or a relation that connects two objects. Some examples would be “Rob is Blonde”, “Bill likes Pizza”, or “John loves Mary”. In the database, the triples are represented as a set of connected links. Like Neo4j, AllegroGraph excels at queries that would require joins in a relational database. (Franz)

AllegroGraph is currently being used by Pfizer to enable searching across project data for drug development. (Marx, 2010)

Tabular Databases

HBase

HBase is modeled after Google’s BigTable, but it is open-source. It is designed to run on top of the Hadoop Distributed File System, providing BigTable-like capabilities for Hadoop. Both Hadoop and HBase are Apache projects. (HBase.org)

Hadoop is a software framework that is designed for reliable, scalable, distributed applications. As with Apache itself, there are many subprojects built for the Hadoop framework, and HBase is the database for Hadoop. HBase is built to handle extreme table sizes, billions of rows and millions of columns, which would cripple a relational DBMS. HBase is written in Java and exposes a RESTful web service.

Hypertable

Like HBase, Hypertable is an open-source DBMS inspired by BigTable. It is written in C++ for performance reasons. It is a cross-platform DBMS and can run on distributed file systems such as the Hadoop Distributed File System. (Hypertable)

The Rise of NoSQL

There are many reasons for the rise in the use of NoSQL databases, some technological and some economic. The growth of data being stored in the cloud has fueled the rise as well, as developers were pushed past the capabilities of relational DBMSs and had to look for, or build, alternatives. (North, 2010)

Scalability is an expensive problem to fix.
There’s the hardware expense that comes from adding more server nodes, but when using a commercial DBMS, there is also the cost of licenses for the added server nodes. These costs have driven some companies, like Google with BigTable, to create their own DBMS that is both tailored to their particular set of problems and, for them at least, license-free. These efforts have inspired others in the Open Source community to build free versions of these proprietary systems.

SQL databases have been used as a general tool for solving any database problem. They have this ability, but it quickly runs into limits as the amount of data grows. The new generation of NoSQL databases is much more specialized. Now developers can examine what problems they need to have solved, which capabilities are important to them and which are not, and choose the DBMS that matches their problem set. For a company like Twitter or Facebook, scalability and high availability are much more important than data consistency. For others, the main problem to be solved is distributed processing.

Conclusion

All the databases detailed above have a similarity other than being NoSQL: they have all been developed since the rise of the web to deal with the issues that have arisen because of the vast amounts of data being generated every day on the Internet. They are all, to some degree, Web databases. What they mean for developers is choice, the opportunity to choose the right tool for the job rather than trying to force one tool to do all jobs.

There’s a saying, “When all you have is a hammer, all your problems begin to look like nails,” but that’s not really true. When all you have is a hammer, you don’t have any choice but to treat all your problems like nails. You know they’re not nails, but you hammer away regardless. Well, now we have some saws and drills and wrenches to go along with our hammers.

Works Cited

Bain, T. (2009, February 12). Is the Relational Database Doomed? Retrieved from ReadWrite Enterprise: http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php
Cassandra.org. (n.d.). Cassandra. Retrieved from Cassandra: http://cassandra.apache.org/
CouchDB.org. (n.d.). The CouchDB Project. Retrieved from The CouchDB Project: http://couchdb.apache.org/
Franz. (n.d.). AllegroGraph. Retrieved from AllegroGraph: http://www.franz.com/agraph/allegrograph/
HBase.org. (n.d.). HBase. Retrieved from HBase: http://hbase.apache.org/
Hitchcock, A. (n.d.). Google's BigTable. Retrieved from Google Blogoscoped: http://blogoscoped.com/archive/2005-10-23-n61.html
Hypertable. (n.d.). Hypertable. Retrieved from Hypertable: http://www.hypertable.org/
Introducing JSON. (n.d.). Retrieved from JSON: http://www.json.org/
Marx, V. (2010, May). Pfizer Partners with IO, Franz on Semantic Proof of Concept to Build Bridges Between Data Resources. Retrieved from GenomeWeb: http://www.genomeweb.com/informatics/pfizer-partners-io-franz-semantic-proof-concept-build-bridges-between-data-resou
MongoDB.org. (n.d.). Comparing MongoDB and CouchDB. Retrieved from MongoDB: http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
MongoDB.org. (n.d.). MongoDB. Retrieved from MongoDB: http://www.mongodb.org/
Neo4J.org. (n.d.). Neo4J. Retrieved from Neo4J: http://neo4j.org/
North, K. (2010, May 22). The NoSQL Alternative. Retrieved April 24, 2011, from Information Week: http://www.informationweek.com/news/development/architecture-design/showArticle.jhtml?articleID=224900559
NoSQL.org. (n.d.). NoSQL. Retrieved from NoSQL: http://nosql-databases.org/
Rodriguez, A. (n.d.). RESTful Web Services: The Basics. Retrieved from IBM Developer Works: https://www.ibm.com/developerworks/webservices/library/ws-restful/
Williams, M. (n.d.). Understanding MapReduce. Retrieved from Matt Williams: http://wordflows.com/matt/2009/01/18/understanding-mapreduce/