BI in the Cloud

NoSQL for the SQL Server Pro
Lynn Langit
Feb 2013 – SDC, Sweden
Is NoSQL just Hadoop?
• HUGE Hype factor over last few years
Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license
• enables applications to work with thousands of nodes and petabytes of data
• was inspired by Google's MapReduce and Google File System (GFS) papers
Hadoop in the Enterprise
Working with Hadoop
Common Tools / Languages
• Java (JDK) / Eclipse
• MapReduce
• Map (query/format)
• Reduce (aggregate)
• plug-in for Eclipse (Java)
• Pig (ETL -- Java)
• Hive (HQL Query)
• HBase tables
• Others
• Mahout (analyze)
• Karmasphere (analyze)
• R (analyze)
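The Map (query/format) and Reduce (aggregate) roles listed above can be sketched in plain Python. This is only an illustration of the pattern on a single machine, not Hadoop's Java API; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample lines are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate all values collected for one key."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input, standing in for files spread across a cluster.
counts = word_count(["big data big hype", "Big cluster"])
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and the framework handles the grouping step.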
Demo - HDInsight - Cluster Allocation
What is the relationship?
NoSQL
BigData
BigData = Exponentially More Data
• Retail Example -> ‘Feedback Economy’
– Number of transactions
– Number of behaviors (collected every minute)
[Chart: Purchases, Locations, and Phone data; counts from 0 to 2500; time from 12:00 to 2:30]
BigData = ‘Next State’ Questions
Collecting behavioral data
• What could happen?
• Why didn’t this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?
Demo - HDInsight - MapReduce
Hitting (Relational) Walls
• CA
– Highly-available consistency
• CP
– Enforced consistency
• AP
– Eventual consistency
So many NoSQL options
• More than just the Elephant in the room
• More than 120 types of NoSQL databases
Flavors of NoSQL
Key / Value Database
• Schema-less
• State (Persistent or Volatile)
• Examples
– AWS DynamoDB
– Riak
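A minimal sketch of the two properties above, schema-less values and persistent versus volatile state, using only the Python standard library. It does not use the DynamoDB or Riak APIs; the keys, values, and file name are made up for illustration.

```python
import os
import shelve
import tempfile

# Volatile store: a plain dict lives only as long as the process.
volatile = {}
volatile["session:42"] = {"user": "lynn", "cart": ["book"]}
volatile["counter"] = 7  # values do not have to share a schema

# Persistent store: shelve keeps key/value pairs in a file on disk.
path = os.path.join(tempfile.mkdtemp(), "kv_demo")  # hypothetical location
with shelve.open(path) as db:
    db["session:42"] = volatile["session:42"]

# Reopen the file to show the value survived the close.
with shelve.open(path) as db:
    restored = db["session:42"]

print(restored)
```

The only access path is the key; there are no joins or secondary queries in this model.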
Column Database
• Wide, sparse column sets
• Examples:
– Cassandra
– HBase
– BigTable
– GAE HR DS
– Azure Tables
– SQL 2012 Tabular Model
More about Column Databases
• Type A
– Column-families
– Non-relational
– Sparse
– Examples: HBase, Cassandra, xVelocity (SQL 2012 Tabular)
• Type B
– Column-stores
– Relational
– Dense
– Example: SQL Server 2012 Columnstore index
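The Type A / Type B distinction can be illustrated with ordinary Python data structures. This is a sketch of the two layouts only, not of how any of the named products store data internally; all keys and values are invented for the example.

```python
# Type A (column-family): each row holds only the columns it has, so rows are sparse.
column_family = {
    "user:1": {"name": "Ann", "email": "ann@example.com"},
    "user:2": {"name": "Bo", "last_login": "2013-02-01", "phone": "555-0100"},
}

# Type B (column-store): a fixed relational schema stored column by column,
# so every column has exactly one value per row (dense).
column_store = {
    "order_id": [1, 2, 3, 4],
    "region": ["EU", "US", "EU", "US"],
    "amount": [10.0, 25.0, 5.0, 40.0],
}

# A sparse lookup touches one row and whichever columns that row happens to have.
bo_columns = sorted(column_family["user:2"])

# An aggregate over a column-store reads only the columns it needs.
eu_total = sum(
    amount
    for region, amount in zip(column_store["region"], column_store["amount"])
    if region == "EU"
)

print(bo_columns, eu_total)
```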
Demo - Document Database (MongoDB)
• document-oriented (collection of JSON documents) w/semi-structured data
– Encodings include BSON, JSON, XML…
• binary forms
– PDF, Microsoft Office documents (Word, Excel…)
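To make "collection of JSON documents with semi-structured data" concrete, here is a small sketch in plain Python. It is not the MongoDB API or query language; the helper names `find` and `find_by_path`, the dotted-path convention, and the sample documents are assumptions for illustration.

```python
import json

# Documents in one collection do not need the same fields.
collection = [
    {"_id": 1, "name": "Ann", "tags": ["sql", "bi"]},
    {"_id": 2, "name": "Bo", "address": {"city": "Stockholm"}},
    {"_id": 3, "name": "Cy", "tags": ["nosql"], "address": {"city": "Oslo"}},
]

def find(docs, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

def find_by_path(docs, path, value):
    """Match on a nested field given as a dotted path such as 'address.city'."""
    matches = []
    for doc in docs:
        node = doc
        for part in path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
        if node == value:
            matches.append(doc)
    return matches

in_oslo = find_by_path(collection, "address.city", "Oslo")
serialized = json.dumps(collection[1])  # each document round-trips as JSON text

print(in_oslo, serialized)
```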
Demo - Graph Database (Neo4j)
• a lot of many-to-many relationships
• recursive self-joins
• when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data
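Finding a connection between two objects, which would take recursive self-joins in a relational schema, is a direct traversal when the relationships are stored as a graph. The sketch below uses breadth-first search over an in-memory adjacency map; it is not Neo4j's API or query language, and the `follows` data is hypothetical.

```python
from collections import deque

# Who follows whom: a small many-to-many relationship stored as adjacency sets.
follows = {
    "ann": {"bo", "cy"},
    "bo": {"dee"},
    "cy": {"dee"},
    "dee": {"eve"},
    "eve": set(),
}

def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of connections, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sort neighbors so the result is deterministic.
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

path = shortest_path(follows, "ann", "eve")
print(path)
```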
So which type of NoSQL? Back to CAP…
Consistency / Availability / Partitioning
• CP = NoSQL/column
– Hadoop, BigTable, HBase, MemCacheDB
• CA = SQL/RDBMS
– SQL Server / Oracle, MySQL
• AP = NoSQL/document or key/value
– DynamoDB, CouchDB, Cassandra, Voldemort
Which type of NoSQL for which type of data?
Type of Data             | Type of NoSQL solution | Example
Log files                | Wide Column            | HBase
Product Catalogs         | Key Value on disk      | DynamoDB
User profiles            | Key Value in memory    | Redis
Startups                 | Document               | MongoDB
Social media connections | Graph                  | Neo4j
LOB w/Transactions       | NONE! Use RDBMS        | SQL Server
Cloud-hosted NoSQL up to 50x CHEAPER
The reality…two pivots
Storage Methods
• SQL (RDBMS)
• NoSQL
Storage Locations
• On premises
• Cloud-hosted
NoSQL (Cloud) BLOB Storage Buckets
• Amazon – S3 or Glacier
– The gold standard
• Google – Cloud Storage
– Free for developers
• Microsoft Azure BLOBS
• DropBox, Box…
Cloud-hosted RDBMS
• AWS RDS – SQL Server, MySQL, Oracle
– Medium cost
– Solid feature set, e.g. backup, snapshot
– Use existing tooling
• Google – MySQL
– Lowest cost
– Most limited RDBMS functionality
• Microsoft – SQL Azure
– Highest cost
Demo - AWS RDS
• SQL Server, MySQL or Oracle
• Essential to understand pricing models
Cloud Offerings – RDBMS AND NoSQL
                         | AWS                             | Google                              | Microsoft
Cloud RDBMS              | RDS – all major                 | MySQL                               | SQL Azure
NoSQL buckets            | S3 or Glacier                   | Cloud Storage                       | Azure Blobs
NoSQL databases          | DynamoDB                        | H/R Data on GAE                     | Azure Tables
Streaming ML or (Mahout) | Custom EC2                      | Prospective Search & Prediction API | StreamInsight
Document or Graph        | MongoDB on EC2                  | Freebase                            | MongoDB on Windows Azure
Hadoop                   | Elastic MapReduce using S3 & EC2 | none                               | HDInsight
Dremel/Warehousing       | RedShift                        | BigQuery                            | none
Data Scientists…
Comparing…
Karmasphere Studio for AWS
Hadoop Connector to Excel
Google BigQuery
• Hadoop-like (Dremel) based service
• For massive amounts of data
• SQL-like query language
Dremel Realized => Impala
• Interactive Hadoop?
Other types of cloud data services
Hosting public datasets
• Pay to read
• Earn revenue by offering for read
Cleaning / matching (your) data
• ETL – Microsoft Data Explorer, Google Refine
• Data Quality – Windows Azure Data Market, InfoChimps, DataMarket.com
NoSQL To-Do List
Understand CAP & types of NoSQL databases
• Use NoSQL when business needs designate
• Use the right type of NoSQL for your business problem
Try out NoSQL on the cloud
• Quick and cheap for behavioral data
• Mashup cloud datasets
• Good for specialized use cases, e.g. dev, test, training environments
Learn NoSQL access technologies
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc…
The Changing Data Landscape
www.TeachingKidsProgramming.org
• Free Courseware (recipes)
• Do a Recipe, Teach a Kid (Ages 10+)
• Java or Microsoft SmallBasic
Toward Data Craftsmanship…
Follow me @LynnLangit
RSS my blog
www.LynnLangit.com
Hire me
• To help build your BI/Big Data solution
• To teach your team next gen BI
• To learn more about using NoSQL solutions