NoSQL for the SQL Server Pro
Lynn Langit
Feb 2013 – SDC, Sweden

Is NoSQL just Hadoop?
• HUGE hype factor over the last few years

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license
• enables applications to work with thousands of nodes and petabytes of data
• was inspired by Google's MapReduce and Google File System (GFS) papers

Hadoop in the Enterprise

Working with Hadoop
Common Tools / Languages
• Java (JDK) / Eclipse
• MapReduce
  – Map (query/format)
  – Reduce (aggregate)
  – plug-in for Eclipse (Java)
• Pig (ETL -- Java)
• Hive (HQL Query)
• HBase tables
• Others
  – Mahout (analyze)
  – Karmasphere (analyze)
  – R (analyze)

Demo - HDInsight - Cluster Allocation

What is the relationship?
NoSQL and BigData

BigData = Exponentially More Data
• Retail Example -> ‘Feedback Economy’
  – Number of transactions
  – Number of behaviors (collected every minute)
[Chart: Purchases, Locations and Phone data; vertical axis 0 to 2500, time axis 12:00 to 2:30]

BigData = ‘Next State’ Questions
Collecting Behavioral data
• What could happen?
• Why didn’t this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?

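The Map (query/format) and Reduce (aggregate) steps listed under "Working with Hadoop" can be sketched in a few lines of plain Python. This is only an in-memory illustration of the idea, not Hadoop or HDInsight code: the record format ("time,event_type"), the sample events and the function names (map_phase, shuffle, reduce_phase, count_events) are all hypothetical, chosen to echo the retail behavior data above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(records: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: parse each raw record and emit an (event_type, 1) pair."""
    for record in records:
        # Assumed record layout: "time,event_type" (hypothetical format).
        _timestamp, event_type = record.split(",", 1)
        yield event_type.strip(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, the step a framework performs between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_events(records: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on a single machine."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    # Made-up sample events, for illustration only.
    sample = [
        "12:00,purchase",
        "12:00,location",
        "12:30,purchase",
        "12:30,phone",
    ]
    print(count_events(sample))
```

On a real cluster the map and reduce functions run in parallel on many nodes and the framework handles the grouping step; here everything runs in one process so the data flow is easy to follow.
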
Demo - HDInsight - MapReduce

Hitting (Relational) Walls
• CA – Highly-available consistency
• CP – Enforced consistency
• AP – Eventual consistency

So many NoSQL options
• More than just the Elephant in the room
• More than 120 types of NoSQL databases

Flavors of NoSQL

Key / Value Database
• Schema-less
• State (Persistent or Volatile)
• Examples
  – AWS DynamoDB
  – Riak

Column Database
• Wide, sparse column sets
• Examples:
  – Cassandra
  – HBase
  – BigTable
  – GAE HR DS
  – Azure Tables
  – SQL 2012 Tabular Model

More about Column Databases
• Type A
  – Column-families
  – Non-relational
  – Sparse
  – Examples: HBase, Cassandra, xVelocity (SQL 2012 Tabular)
• Type B
  – Column-stores
  – Relational
  – Dense
  – Example: SQL Server 2012 Columnstore index

Demo - Document Database (MongoDB)
• document-oriented (collection of JSON documents) w/semi-structured data
  – Encodings include BSON, JSON, XML…
• binary forms – PDF, Microsoft Office documents (Word, Excel…)

Demo - Graph Database (Neo4j)
• a lot of many-to-many relationships
• recursive self-joins
• when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data

So which type of NoSQL? Back to CAP…
Consistency, Availability, Partitioning
• CP = NoSQL/column – Hadoop, BigTable, HBase, MemCacheDB
• CA = SQL/RDBMS – SQL Server, Oracle, MySQL
• AP = NoSQL/document or key/value – DynamoDB, CouchDB, Cassandra, Voldemort

Which type of NoSQL for which type of data?
Type of Data               Type of NoSQL solution   Example
Log files                  Wide Column              HBase
Product Catalogs           Key Value on disk        DynamoDB
User profiles              Key Value in memory      Redis
Startups                   Document                 MongoDB
Social media connections   Graph                    Neo4j
LOB w/Transactions         NONE! Use RDBMS          SQL Server

Cloud-hosted NoSQL up to 50x CHEAPER

The reality…two pivots
Storage Methods
• SQL (RDBMS)
• NoSQL
Storage Locations
• On premises
• Cloud-hosted

NoSQL (Cloud) BLOB Storage Buckets
• Amazon – S3 or Glacier
  – The gold standard
• Google – Cloud Storage
  – Free for developers
• Microsoft Azure BLOBS
• Dropbox, Box…

Cloud-hosted RDBMS
• AWS RDS – SQL Server, MySQL, Oracle
  – Medium cost
  – Solid feature set, e.g. backup, snapshot
  – Use existing tooling
• Google – MySQL
  – Lowest cost
  – Most limited RDBMS functionality
• Microsoft – SQL Azure
  – Highest cost

Demo - AWS RDS
• SQL Server, MySQL or Oracle
• Essential to understand pricing models

Cloud Offerings – RDBMS AND NoSQL
                     AWS                                Google                                Microsoft
Cloud RDBMS          RDS – all major                    MySQL                                 SQL Azure
NoSQL buckets        S3 or Glacier                      Cloud Storage                         Azure Blobs
NoSQL databases      DynamoDB                           H/R Data on GAE                       Azure Tables
Streaming or ML      (Mahout) Custom EC2                Prospective Search & Prediction API   StreamInsight
Document or Graph    MongoDB on EC2                     Freebase                              MongoDB on Windows Azure
Hadoop               Elastic MapReduce using S3 & EC2   none                                  HDInsight
Dremel/Warehousing   Redshift                           BigQuery                              none

Data Scientists…

Comparing…
• Karmasphere Studio for AWS
• Hadoop Connector to Excel

Google BigQuery
• Hadoop-like (Dremel) based service
• For massive amounts of data
• SQL-like query language

Dremel Realized => Impala
• Interactive Hadoop?

Other types of cloud data services
Hosting public datasets
• Pay to read
• Earn revenue by offering for read
Cleaning / matching (your) data
• ETL – Microsoft Data Explorer, Google Refine
• Data Quality – Windows Azure Data Market, InfoChimps, DataMarket.com

NoSQL To-Do List
Understand CAP & types of NoSQL databases
• Use NoSQL when business needs dictate
• Use the right type of NoSQL for your business problem
Try out NoSQL on the cloud
• Quick and cheap for behavioral data
• Mashup cloud datasets
• Good for specialized use cases, e.g. dev, test, training environments
Learn NoSQL access technologies
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc…

The Changing Data Landscape

www.TeachingKidsProgramming.org
• Free Courseware (recipes)
• Do a Recipe
• Teach a Kid (Ages 10++)
• Java or Microsoft SmallBasic

Toward Data Craftsmanship…
Follow me @LynnLangit
RSS my blog www.LynnLangit.com
Hire me
• To help build your BI/Big Data solution
• To teach your team next gen BI
• To learn more about using NoSQL solutions