Big Data & NoSQL Technologies

Friday, June 15, 2012
By: Dale T. Anderson
Principal Consultant
DB Best Technologies, LLC
As a follow-up to my last blog (NoSQL .vs. Row .vs. Column) let’s take a closer look at Big Data and the
emerging ‘NoSQL’ technologies in today’s marketplace. As a refresher, two points to reiterate first:


NoSQL means ‘Not Only SQL’, as opposed to ‘Not SQL’, as many perceive
There are three main variations of NoSQL out there:
 Key Value
 Document Store
 Column Store
It is these variants that we can now examine in more detail, plus have a first look at some of the vendors
playing in this field. As organizations increasingly capture large amounts of both structured and
non-structured data, the NoSQL phenomenon has come at a perfect time. Database architects today are wise
to consider several factors in choosing the right tool for the job, as Big Data is found everywhere:
o Daily Weather
o Energy geophysical
o Pharmaceutical drug testing
o Telecom traffic
o Social media messaging
o Online gaming
o Stock Market
o Court documents
o Email/text messaging
So what is Big Data? This new industry ‘catch phrase’ currently has about as many definitions as there
are people talking about it. Perhaps the best definition may be:
“Massive datasets that are organized, manipulated, and managed by tools, processes,
procedures, and storage facilities”
Realistically today’s Big Data will be tomorrow’s little data, as it is growing at an increasing pace. Big
Data offers a competitive advantage presenting a formidable opportunity, yet it also presents a daunting
challenge to IT experts as a transformative technology. Adding to the complexity of Big Data are the
many commonly used document/data storage formats. Read about them in another blog
(Episode 1- Introduction and definitions) posted by my associate, Julius Gabby.
Big Data datasets are typically larger than 100GB and are expected to easily reach the terabyte to
petabyte range and beyond (exabytes & zettabytes). So when a traditional RDBMS just isn’t the answer,
NoSQL may in fact be. In processing Big Data one should consider what each NoSQL technology provides
and then choose the right vendor. Let’s take a look:

Key Value
This variant is best suited for fast transactions where append/remove
operations are essential, like an online shopping cart. Key Value products are
usually persistent, in-memory data stores, so web applications that need heavy
I/O operations are good candidates for Key Value vendors like:
 REDIS (www.redis.io)
An open-source, advanced key-value store, often referred to as a data
structure server since values can be strings, hashes, lists, sets, and
sorted sets. Written in C, this product is blazingly fast, which makes it
great for real-time data collection.
 Riak (wiki.basho.com)
A powerful open-source, distributed database that scales capacity
predictably and simplifies development through features that help you
quickly prototype, test, and deploy applications. It uses ‘master-less’
clustering (no node is special), needs no sharding, and aims to be the
most boring database you’ll ever run in production (their words, not
mine). Written in Erlang and C, this product provides transparent
fault-tolerant/fail-over capability and a robust and flexible API, which
makes it great for point-of-sale and factory control systems.
 VoltDB (www.voltdb.com)
A scalable in-memory database providing full ACID transactional
consistency and ultra-high throughput. It refers to itself as NewSQL
(setting itself apart from other NoSQL vendors). Using Java stored
procedures, this product relies upon partitioning and replication for
high availability, and upon data snapshots and durable command logging
for crash recovery, which makes it great for capital markets, digital
networks, network services, and online games.
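
The shopping-cart access pattern mentioned for this variant can be sketched in a few lines of Python. This is only an illustration of the idea, not the API of Redis, Riak, or VoltDB: the `KeyValueStore` class, its method names, and the `cart:user:1001` key naming scheme are all assumptions made up for this example.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical; not a vendor API)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def set(self, key: str, value: Any) -> None:
        """Store any value under the key, replacing what was there."""
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        """Return the value for the key, or None if the key is absent."""
        return self._data.get(key)

    def append(self, key: str, item: str) -> int:
        """Append item to the list at key (creating it); return the new length."""
        items = self._data.setdefault(key, [])
        items.append(item)
        return len(items)

    def remove(self, key: str, item: str) -> bool:
        """Remove the first occurrence of item from the list at key."""
        items = self._data.get(key, [])
        if item in items:
            items.remove(item)
            return True
        return False


store = KeyValueStore()
cart_key = "cart:user:1001"  # assumed key naming scheme, for illustration only
store.append(cart_key, "sku-123")
store.append(cart_key, "sku-456")
store.remove(cart_key, "sku-123")
print(store.get(cart_key))  # ['sku-456']
```

Every operation touches exactly one key, which is why this style of store tends to be fast and easy to spread across machines.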

Document Store
This variant is best suited for highly unstructured data where more functionality
is needed. Data is restructured into document objects using name-value pairs,
so a document can carry detailed information whose structure has not yet been
defined. Usually data is grouped into collections with simple query mechanisms,
so web applications that need performance and have frequently changing data
structures are good candidates for Document Store vendors like:
 MongoDB (www.mongodb.org)
From “humongous”, this scalable, high-performance, open-source NoSQL
database features document-oriented storage (JSON-like), full index
support, replication, and fast in-place updates. Written in C++, this
product is best for dynamic queries and dynamic data structures, and for
those who prefer indexes over Map/Reduce.
 CouchDB (couchdb.apache.org)
An open-source database that focuses on ease of use, storing data in a
collection of JSON documents, each maintaining its own schema
definitions. It provides ACID semantics for individual documents through
multi-version concurrency control, which avoids locking database files
during writes, and eventual consistency between replicas. Written in
Erlang, this product is ideal for web based applications that handle huge
amounts of loosely structured data.
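
To make the idea of schema-free documents grouped into a collection concrete, here is a minimal Python sketch. It does not use the MongoDB or CouchDB APIs; the `Collection` class, its `insert` and `find` methods, and the sample product documents are assumptions invented for illustration.

```python
from typing import Any, Dict, List

Document = Dict[str, Any]


class Collection:
    """Toy document collection (hypothetical; not a vendor API)."""

    def __init__(self) -> None:
        self._docs: List[Document] = []
        self._next_id = 1

    def insert(self, doc: Document) -> int:
        """Store a copy of the document and return its generated _id."""
        stored = dict(doc)
        stored["_id"] = self._next_id
        self._next_id += 1
        self._docs.append(stored)
        return stored["_id"]

    def find(self, **criteria: Any) -> List[Document]:
        """Return documents whose fields equal every given criterion."""
        return [
            d for d in self._docs
            if all(k in d and d[k] == v for k, v in criteria.items())
        ]


products = Collection()
# Each document carries its own structure; no shared schema is declared.
products.insert({"name": "kettle", "price": 25})
products.insert({"name": "laptop", "price": 900, "specs": {"ram_gb": 8}})
products.insert({"name": "gift card", "price": 25, "expires": "2013-01-01"})

cheap = products.find(price=25)
print([d["name"] for d in cheap])  # ['kettle', 'gift card']
```

Documents that lack a queried field are simply skipped, so new fields can be added to some documents without touching the others.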

Column Store
This variant is focused upon massive amounts of unstructured data across
distributed systems (think Facebook and Google), trading ACID semantics for
advantages in performance, availability, and operational manageability. Also
called ‘Extensible Record Store’, this variant suits applications that write
more data than they read, which makes them good candidates for Column Store
vendors like:
 Cassandra (cassandra.apache.org)
An open-source distributed database management system designed to
handle very large amounts of data spread across many servers while
providing a highly available service with no single point of failure.
Written in Java, with linear scalability and proven fault-tolerance
coupled with column indexes, this product is best for non-transactional
real-time data analysis.
 HBase (hbase.apache.org)
HBase is the Hadoop database: a distributed, scalable Big Data store
modeled after Google’s BigTable technology. It runs on top of HDFS
(Hadoop Distributed File System) and can be accessed with Map/Reduce;
compression, in-memory operation, and space-efficient probabilistic
data structures (Bloom filters) are some key features. Written in Java,
this product is great when you need random, real-time read/write access
to your Big Data.
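
The ‘Extensible Record Store’ layout behind this variant can be pictured as a nested map: row key, then column family, then column name. The Python sketch below illustrates only that data model, under the assumption that rows are sparse and columns can be added per row; `ColumnFamilyTable` and its methods are hypothetical names, not the Cassandra or HBase client API.

```python
from collections import defaultdict
from typing import Dict, Optional


class ColumnFamilyTable:
    """Toy sparse table: row key -> column family -> column -> value."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Dict[str, str]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def put(self, row_key: str, family: str, column: str, value: str) -> None:
        """Write one cell; rows, families, and columns are created on demand."""
        self._rows[row_key][family][column] = value

    def get(self, row_key: str, family: str, column: str) -> Optional[str]:
        """Read one cell, or None if absent (without creating empty entries)."""
        return self._rows.get(row_key, {}).get(family, {}).get(column)

    def row(self, row_key: str) -> Dict[str, Dict[str, str]]:
        """Return a plain-dict copy of every family and column in a row."""
        return {fam: dict(cols) for fam, cols in self._rows.get(row_key, {}).items()}


table = ColumnFamilyTable()
table.put("user:1", "profile", "name", "Ada")
table.put("user:1", "activity", "2012-06-15", "login")
table.put("user:2", "profile", "email", "bob@example.com")

print(table.row("user:1"))
# {'profile': {'name': 'Ada'}, 'activity': {'2012-06-15': 'login'}}
print(table.get("user:2", "profile", "name"))  # None
```

Note that the two rows hold different columns; nothing is stored for cells that were never written, which is what keeps very wide, sparse tables affordable.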
Remember that NoSQL does not compete with traditional RDBMS technologies (either row or column
based); it complements them. As such, it has both strengths and weaknesses. Let’s look at these too:

NoSQL Strengths
 The clear winner when you need the ability to store and look up Big Data
 Is Application focused
 Supports HUGE data capacity
 Fast Data Ingestion (loads)
 Fast Lookup Speeds (across clusters)
 Streaming Data

NoSQL Weaknesses
 Conceivably an expensive infrastructure
 Is very complex
 Engineering talent still hard to find
 Generally there is no SQL interface
 Limited programmatic interfaces
 Inadequate for Analytic Queries (aggregations, metrics, BI)
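
The last weakness is easier to see with a small example. The Python sketch below assumes a store that only offers lookup by key; the `orders` data, the `lookup` helper, and `revenue_by_region` are all made up for illustration. A single-key lookup is one operation, while an aggregation has to scan every record in application code.

```python
from collections import defaultdict
from typing import Dict, Optional

# Hypothetical order records, keyed the way a key-value store would hold them.
orders: Dict[str, dict] = {
    "order:1": {"region": "east", "amount": 120.0},
    "order:2": {"region": "west", "amount": 80.0},
    "order:3": {"region": "east", "amount": 40.0},
}


def lookup(order_id: str) -> Optional[dict]:
    """Fetch one order by key: the operation these stores are built for."""
    return orders.get(order_id)


def revenue_by_region() -> Dict[str, float]:
    """Aggregate by scanning every record, since there is no GROUP BY."""
    totals: Dict[str, float] = defaultdict(float)
    for order in orders.values():
        totals[order["region"]] += order["amount"]
    return dict(totals)


print(lookup("order:2"))      # {'region': 'west', 'amount': 80.0}
print(revenue_by_region())    # {'east': 160.0, 'west': 80.0}
```

In SQL the second function would be a one-line `GROUP BY`; here the scan, the grouping, and any indexing to speed it up are left to the application.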
There you have it; in a nutshell, right?? Well, I for one believe there will
be tremendous growth in this marketplace for several years to come,
misinformation will likely proliferate, and many Big Data projects may
unknowingly risk failure depending upon how well informed the
architects are and how well business expectations are set.
Having the right tool is certainly important, but having the right data modeling and
methodology is also critical. Think Data Vault; more on that in a future blog!
Other future blogs will include a deep dive into Hadoop/Hive/HBase and Column based databases. So
come back for more…