Big Data & NoSQL Technologies
Friday, June 15, 2012
By: Dale T. Anderson, Principal Consultant, DB Best Technologies, LLC

As a follow-up to my last blog (NoSQL .vs. Row .vs. Column) let’s take a closer look at Big Data and the emerging ‘NoSQL’ technologies in today’s marketplace. As a refresher, two points to reiterate first:

- NoSQL means ‘Not Only SQL’, as opposed to ‘Not SQL’, as many perceive
- There are three main variations of NoSQL out there:
  - Key Value
  - Document Store
  - Column Store

It is these variants that we can now examine in more detail, plus have a first look at some of the vendors playing in this field. As organizations increasingly capture large amounts of both structured and unstructured data, the NoSQL phenomenon has come at a perfect time. Database architects today are wise to consider several factors in choosing the right tool for the job, as Big Data is found everywhere:

- Daily Weather
- Energy geophysical
- Pharmaceutical drug testing
- Telecom traffic
- Social media messaging
- Online gaming
- Stock Market
- Court documents
- Email/text messaging

So what is Big Data? This new industry ‘catch phrase’ currently has about as many definitions as there are people talking about it. Perhaps the best definition may be:

“Massive datasets that are organized, manipulated, and managed by tools, processes, procedures, and storage facilities”

Realistically, today’s Big Data will be tomorrow’s little data, as it is growing at an increasing pace. Big Data offers a competitive advantage and a formidable opportunity, yet as a transformative technology it also presents a daunting challenge to IT experts. Adding to the complexity of Big Data are the many commonly used document/data storage formats. Read about them in another blog (Episode 1- Introduction and definitions) posted by my associate, Julius Gabby. Big Data datasets are generally larger than 100GB, yet are expected to easily reach the terabyte to petabyte range and beyond (exabytes & zettabytes).
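To put those sizes in perspective, here is a quick back-of-the-envelope sketch. It assumes decimal (SI) units, where 1GB is 10^9 bytes, and uses the 100GB figure as a baseline. The names `UNITS`, `BASELINE_BYTES` and `multiples_of_baseline` are made up for this illustration.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption; binary
# units such as GiB would give slightly different numbers).
UNITS = {
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}

# The 100GB starting point for "Big Data" mentioned in the text.
BASELINE_BYTES = 100 * UNITS["GB"]


def multiples_of_baseline(unit):
    """Return how many 100GB datasets fit into one of the given unit."""
    return UNITS[unit] // BASELINE_BYTES


for unit in ("TB", "PB", "EB", "ZB"):
    print(f"1 {unit} holds {multiples_of_baseline(unit):,} datasets of 100GB")
```

Each step up the scale is a factor of one thousand, so a tool that copes comfortably with 100GB may face a very different problem at the petabyte level.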
So when a traditional RDBMS just isn’t the answer, NoSQL may in fact be. In processing Big Data one should consider what each NoSQL technology provides and then choose the right vendor. Let’s take a look:

Key Value

This variant is best suited for fast transactions where append/remove operations are essential, like an online shopping cart. These products are usually persistent, in-memory data stores, so web applications that need heavy I/O operations are good candidates for Key Value vendors like:

REDIS (www.redis.io) An open source, advanced key-value store, often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets. Written in C, this product is blazingly fast, which makes it great for real-time data collection.

Riak (wiki.basho.com) A powerful open-source, distributed database that scales capacity predictably and simplifies development through features that help you quickly prototype, test, and deploy applications. It offers ‘master-less’ clustering (no node is special) with no sharding, and is the most boring database you’ll ever run in production (their words, not mine). Written in Erlang and C, this product provides transparent fault-tolerant/fail-over capability and a robust and flexible API, which makes it great for point-of-sale and factory control systems.

VoltDB (www.voltdb.com) A scalable in-memory database providing full ACID transactional consistency and ultra-high throughput, self-described as NewSQL (setting itself apart from the NoSQL vendors). Using Java stored procedures, this product relies upon partitioning and replication to achieve high availability, with data snapshots and durable command logging (for crash recovery), which makes it great for capital markets, digital networks, network services, and online games.

Document Store

This variant is best suited for highly unstructured data where more functionality is needed. Data is restructured into document objects using name-value pairs, thus supporting more detailed information that is as yet undefined.
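To make the difference between the Key Value and Document Store variants concrete, here is a minimal sketch. Plain Python dictionaries stand in for the real stores. It is not any vendor's API, and every name in it (`kv_store`, `cart_add`, `cart_remove`, `fields_in_common`, the sample keys and customers) is hypothetical.

```python
# Key Value: a value is looked up by its key, and a shopping cart only
# needs fast append/remove operations on that value.
kv_store = {}


def cart_add(cart_key, item):
    """Append an item to the cart stored under cart_key."""
    kv_store.setdefault(cart_key, []).append(item)


def cart_remove(cart_key, item):
    """Remove one occurrence of an item from the cart under cart_key."""
    kv_store[cart_key].remove(item)


cart_add("cart:1001", "book")
cart_add("cart:1001", "headphones")
cart_remove("cart:1001", "book")

# Document Store: each document is a set of name-value pairs, and two
# documents of the same kind do not have to share the same fields.
customer_a = {"_id": 1, "name": "Ann", "email": "ann@example.com"}
customer_b = {
    "_id": 2,
    "name": "Bob",
    "phones": ["555-0100"],
    "loyalty": {"tier": "gold"},
}


def fields_in_common(doc1, doc2):
    """Return the field names that both documents happen to define."""
    return set(doc1) & set(doc2)


print(kv_store["cart:1001"])
print(sorted(fields_in_common(customer_a, customer_b)))
```

The second customer carries fields the first one never declared. That is the flexibility a document object buys you: no schema change is needed before new information can be stored.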
Usually data is grouped into collections with simple query mechanisms, so web applications that need performance and have frequently changing data structures are good candidates for Document Store vendors like:

MongoDB (www.mongodb.org) From “humongous”, this scalable, high-performance, open-source NoSQL database features document-oriented storage (JSON-like), full index support, replication, and fast in-place updates. Written in C++, this product is best for dynamic queries, dynamic data structures, and if you prefer indexes over Map/Reduce.

CouchDB (couchdb.apache.org) An open-source database that focuses on ease of use, storing data in a collection of JSON documents, each maintaining its own schema definition. It provides ACID semantics on each node using multi-version concurrency control, which avoids locking database files during writes, and eventual consistency across replicated nodes. Written in Erlang, this product is ideal for web based applications that handle huge amounts of loosely structured data.

Column Store

This variant is focused upon massive amounts of unstructured data across distributed systems (think Facebook and Google), trading ACID semantics for advantages in performance, availability, and operational manageability. This variant is also called ‘Extensible Record Store’, and applications that write data more than they read it are good candidates for Column Store vendors like:

Cassandra (cassandra.apache.org) An open-source distributed database management system designed to handle very large amounts of data spread across many servers while providing a highly available service with no single point of failure. Written in Java, with linear scalability and proven fault-tolerance coupled with column indexes, this product is best for non-transactional real-time data analysis.

HBase (hbase.apache.org) HBase is the Hadoop database: a distributed, scalable, Big Data store modeled after Google’s BigTable technology.
It runs on top of HDFS (Hadoop Distributed File System) and can be accessed with Map/Reduce; compression, in-memory operation, and space-efficient probabilistic data structures (Bloom filters) are some key features. Written in Java, this product is great when you need random, real-time read/write access to your Big Data.

Remember that NoSQL is not competitive with traditional RDBMS technologies (either row or column based); it is complementary. And as such, it has both strengths and weaknesses. Let’s look at these too:

NoSQL Strengths

- The clear winner when you need the ability to store and look up Big Data
- Is Application focused
- Supports HUGE data capacity
- Fast Data Ingestion (loads)
- Fast Lookup Speeds (across clusters)
- Streaming Data

NoSQL Weakness

- Conceivably an expensive infrastructure
- Is very complex
- Engineering talent still hard to find
- Generally there is no SQL interface
- Limited programmatic interfaces
- Inadequate for Analytic Queries (aggregations, metrics, BI)

There you have it; in a nutshell, right? Well, I for one believe there will be tremendous growth in this marketplace for several years to come, misinformation will likely proliferate, and many Big Data projects may unknowingly risk failure depending upon how well informed the architects are and how well business expectations are set. Having the right tool is admittedly important, but having the right data modeling and methodology is also critical. Think Data Vault; more on that in a future blog! Other future blogs will include a deep dive into Hadoop/Hive/HBase and Column based databases. So come back for more…