From SQL to NoSQL Xiao Yu Mar 2012 1 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign RDBMS • The predominant choice in storing data › Not so true for data mining PhDs since we put everything in txt files. • First formulated in 1969 by Codd › We are using RDBMS everywhere 2 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Slide from neo technology, “A NoSQL Overview and the Benefits of Graph Databases" 3 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign When RDBMS met Web 2.0 Slide from Lorenzo Alberton, "NoSQL Databases: Why, what and when" 4 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign What’s Wrong with Relational DB? • Nothing is wrong. You just need to use the right tool. • Relational is hard to scale. › Easy to scale reads › Hard to scale writes 5 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign The Death of RDBMS? 6 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign What’s NoSQL? • The misleading term “NoSQL” is short for “Not Only SQL”. • non-relational, schema-free, non-(quite)acid • horizontally scalable, distributed, easy replication support • simple API 7 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Four (emerging) NoSQL Categories • Key-value stores › Based on DHTs/ Amazon’s Dynamo paper * › Data model: (global) collection of K-V pairs › Example: Voldemort • Column Families › BigTable clones ** › Data model: big table, column families › Example: HBase, Cassandra, Hypertable *G DeCandia et al, Dynamo: Amazon's Highly Available Key-value Store, SOSP 07 ** F Chang et al, Bigtable: A Distributed Storage System for Structured Data, OSDI 06 8 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Four (emerging) NoSQL Categories • Document databases › Inspired by Lotus Notes › Data model: collections of K-V Collections › Example: CouchDB, MongoDB • Graph databases › Inspired by Euler & graph theory › Data model: nodes, rels, K-V on both › Example: AllegroGraph, VertexDB, Neo4j 9 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Focus of Different Data Models Slide from neo technology, “A NoSQL Overview and the Benefits of Graph Databases" 10 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign CAP theorem RDBMS NoSQL (most) Partition Tolerance Consistency Availability 11 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign When to use NoSQL? • Bigness • Massive write performance › Twitter generates 7TB / per day (2010) • • • • Fast key-value access Flexible schema or data types Schema migration Write availability › Writes need to succeed no matter what (CAP, partitioning) • • • • • • • • 12 Easier maintainability, administration and operations No single point of failure Generally available parallel computing Programmer ease of use Use the right data model for the right problem Avoid hitting the wall Distributed systems support from http://highscalability.com/ Tunable CAP tradeoffs Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Key-Value Stores id hair_color age height 1923 Red 18 6’0” 3371 Blue 34 NA … … … … Table in relational db Store/Domain in Key-Value db Find users whose age is above 18? Find all attributes of user 1923? Find users whose hair color is Red and age is 19? (Join operation) Calculate average age of all grad students? 13 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Example of Voldemort 14 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Voldemort in LinkedIn Sid Anand, LinkedIn Data Infrastructure (QCon London 2012) 15 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign RO Store Usage Pattern Sid Anand, LinkedIn Data Infrastructure (QCon London 2012) 16 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Voldemort vs MySQL Sid Anand, LinkedIn Data Infrastructure (QCon London 2012) 17 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Column Families – BigTable Alike F Chang, et al, Bigtable: A Distributed Storage System for Structured Data, osdi 06 18 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign BigTable Data Model The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. 19 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign More on Row and Column • • • • • • Rows stored in lexicographic order by row key Table dynamically split into “Tablets” Each tablet contains key [startKey, endKey) Tablets are distributed on different nodes All date in the same CF are usually same type Data in same CF are compressed and stored together • CF in a specific row is sorted 20 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign BigTable API Examples adds one anchor to www.cnn.com and deletes a different anchor uses a Scanner abstraction to iterate over all anchors in a particular row 21 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign BigTable Performance 22 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Document Database - mongoDB Table in relational db Documents in a collection Open source, document db Json-like document with dynamic schema Initial release 2009 23 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign mongoDB Product Deployment And much more… 24 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign mongoDB Features • • • • • • • • • 25 Document-oriented storage Full Index Support Replication & High Availability Auto-Sharding Querying Fast In-Place Updates ? Map/Reduce GridFS Commercial Support From http://www.mongodb.org/ Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign sum(checkout) From Gabriele Lana, CouchDB Vs MongoDB 26 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign And mongoDB is fast 100000 Indexed Queries avg med dev total mongoDB 0.00025 0.00021 0.00019 24.99955 mySQL 0.00199 0.00032 0.00518 199.32546 100 Non-Indexed Queries avg med dev total mongoDB 0.05662 0.03704 0.19267 5.66164 mySQL 1.99975 1.68468 1.99266 199.97469 http://www.idiotsabound.com/did-i-mention-mongodb-is-fast-way-to-go-mongo 27 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Graph Database Data Model Abstraction: •Nodes •Relations •Properties 28 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Neo4j - Build a Graph Slide from neo technology, “A NoSQL Overview and the Benefits of Graph Databases" 29 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Neo4j – Traverse a Graph Slide from neo technology, “A NoSQL Overview and the Benefits of Graph Databases" 30 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign A Debatable Performance Evaluation Comparing Apple to Orange 31 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Conclusion • Use the right data model for the right problem 32 Data Mining Meeting Mar, 2012 Data and Information Systems Laboratory University of Illinois Urbana-Champaign