NoSQL_AND_Big_Data - GlobalsDB

Big Data, NoSQL . . . So What?
Iran Hutchinson
Me
• I work for InterSystems, which:
– Drives the http://globalsdb.org NoSQL project.
– Has 20+ years of NoSQL production deployments
– Has 20+ years of Big Data production deployments
– Built a ~250 million Euro business on the above
• Email: iran.hutchinson@intersystems.com
• Twitter: #iranic
Big Data is …
• Important data in varying formats and volumes that is being
generated across all areas affecting your business that is
generally not centrally correlated or managed.
• Examples include:
– Word Files, PowerPoint, PDFs
– Emails, Instant Messaging, Texts
– Blogs and Social Media
– Automated data from machine activities
– Stream data from financial stock markets
Some Big Data Numbers
• Source: McKinsey Global Institute
• 5 Billion mobile phones used in 2010
• 30 Billion pieces of info shared on Facebook each month
• 40% projected growth in global data generated per year
• 235 Terabytes collected by US Library of Congress by 04/11
– 15 out of 17 sectors in US have more data stored per company than this.
Some Big Data Numbers …
• Source: McKinsey Global Institute
• $300 Billion in potential value in US Healthcare system
• €250 Billion in Europe’s public sector administration
• $600 Billion in annual consumer surplus using location data
• 60% Potential increase in retail operating margins
• 140,000 – 190,000 more deep analytical talent positions needed in US
• 1.5 Million data-savvy managers needed in US
Case Study: Credit Suisse
• Key Challenges:
– Revamp order routing architecture
– Revamp order management architecture
– Serve current demand and scale to new levels
– Address downtime challenges
Case Study: Credit Suisse …
• Big Data in the form of volumes of transactions
• Leveraged Caché’s:
– In-memory architecture for performance
– On-disk resiliency for availability
– Distributed architecture for data coherency
• Can easily process 1,000,000,000 transactions
– During business hours
Case Study: European Space Agency (ESA)
• Key Challenges
– Make the largest, most precise 3-D map of our Galaxy
– Monitor 1,000,000,000 stars over 5 years, precisely
charting position, movement, and brightness
– Along the way discover hundreds of thousands of new
celestial objects
Case Study: ESA Continued …
• Challenge Calculation:
– Capture data for 1 Billion Celestial Objects
– 1,000,000,000 objects x 100 observations per object x 600 bytes per observation = 60,000,000,000,000 bytes (60TB)
• Solution:
– Caché/XEP, delivering 100,000+ sustained inserts per second per server, stored as real objects with SQL access
• http://www.intersystems.com/cache/whitepapers/pdf/Charting_the_Galaxy.pdf
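The sizing arithmetic on this slide can be checked in a few lines. This is a back-of-the-envelope sketch in Python. The single-server load time is a derived figure, assuming one insert per observation and one server sustaining exactly 100,000 inserts per second with no downtime. It is not a number from the white paper.

```python
# Figures taken from the slide.
OBJECTS = 1_000_000_000          # celestial objects
OBS_PER_OBJECT = 100             # observations per object
BYTES_PER_OBS = 600              # bytes per observation
INSERTS_PER_SEC = 100_000        # sustained inserts per second per server

observations = OBJECTS * OBS_PER_OBJECT
total_bytes = observations * BYTES_PER_OBS
total_tb = total_bytes / 10**12  # decimal terabytes

# Assumption (not from the slide): one insert per observation, one server.
seconds_one_server = observations / INSERTS_PER_SEC
days_one_server = seconds_one_server / 86_400

print(f"{observations:,} observations, {total_bytes:,} bytes ({total_tb:.0f} TB)")
print(f"about {days_one_server:.1f} days of inserts on one server")
```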
Enabling Technology
• Focus on Caché
• A quick look at the architecture
Enabling Technology …
• Java + C database kernel run in same process
Enabling Technology …
• ECP, Distributed Computing
Enabling Technology …
• Multiple, simultaneous data to disk writers
Who is this Guy?
• Edgar Frank “Ted” Codd
• Known for his 12 Rules (actually 13, numbered 0 to 12) for Relational Data Systems
NoSQL … Breaking the Rules
• Rule 1: The information Rule
– All information is represented in 1 and only 1 way,
namely by values in column positions within rows of
tables
• Rule 12: The no subversion Rule
– If the system provides a low-level (record-at-a-time)
interface, then that interface cannot be used to subvert
the system, e.g. to bypass relational security or integrity constraints.
Why NoSQL?
• No to ACID transactions
• No to the impedance mismatch with SQL
• Dealing with Big Data and Web Scale
• High prices from RDBMS vendors
• Use commodity hardware
• Flexible data models
• It’s a cool movement ….
Is NoSQL a new Concept?
• No
• Remember MUMPS?
– SET ^Car("Door","Color")="BLUE"
• Remember Multi-value/PICK
– MATWRITE array.variable ON file.variable,id. ….
• Ever heard of the NoSQL RDB?
– Carlo Strozzi
– http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page
CAP Theorem
• Consistency
– All nodes see the same data: a read returns the most recent
write, or the operation fails.
• Availability
– Every request to a working node gets a response.
• Partition Tolerance
– The system keeps operating when the network between
nodes drops or delays messages.
• Significant as you can only guarantee 2 of these 3: when a
partition occurs you must choose between consistency and
availability.
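The trade-off can be made concrete with a toy model. The following is a minimal sketch in Python, not any product's behavior: `TwoNodeStore`, its `mode` flag, and the `Unavailable` exception are hypothetical names introduced only for illustration. During a simulated partition, the "CP" store rejects writes to stay consistent, while the "AP" store accepts them and lets the replicas diverge.

```python
class Unavailable(Exception):
    """Raised when a CP-style store refuses a request during a partition."""


class TwoNodeStore:
    """Toy two-replica store; `mode` is "CP" or "AP" (hypothetical names)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = {"n1": None, "n2": None}
        self.partitioned = False  # True when n1 and n2 cannot talk

    def write(self, node, value):
        if not self.partitioned:
            # Healthy network: replicate to every node.
            for name in self.replicas:
                self.replicas[name] = value
        elif self.mode == "CP":
            # Keep consistency by giving up availability.
            raise Unavailable("cannot replicate during partition")
        else:
            # Keep availability by accepting a local-only write.
            self.replicas[node] = value

    def read(self, node):
        return self.replicas[node]


cp = TwoNodeStore("CP")
ap = TwoNodeStore("AP")
for store in (cp, ap):
    store.write("n1", "v1")   # replicated to both nodes
    store.partitioned = True  # then the network splits

try:
    cp.write("n1", "v2")
    cp_result = "accepted"
except Unavailable:
    cp_result = "rejected"

ap.write("n1", "v2")
print(cp_result, cp.read("n1"), cp.read("n2"))  # consistent, but write refused
print(ap.read("n1"), ap.read("n2"))             # available, but replicas differ
```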
CAP Theorem …
• Arguments and links
– http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
– http://ksat.me/a-plain-english-introduction-to-cap-theorem/
– http://voltdb.com/company/blog/clarifications-cap-theorem-and-data-related-errors
CAP Theorem …: Consistency
(Diagram: seven database nodes, DB1 to DB7)
CAP Theorem …: Consistency
(Diagram: hub-and-spoke topology, one Hub with four Spokes: DB1, DB2, DB3, DB4)
CAP Theorem …: Consistency
(Diagram: three database nodes, DB1 to DB3)
Distributed computing
• Fallacies (Peter Deutsch)
– The network is reliable
– Latency is zero
– Bandwidth is infinite
– The network is secure
– Topology doesn’t change
– There is one administrator
– Transport cost is zero
– The network is homogeneous
• Remember JINI? (See Apache River project)
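Code that takes the first two fallacies seriously treats every remote call as something that can fail or stall. The following is a minimal sketch in Python of retrying with exponential backoff. `call_with_retry` and `flaky_fetch` are hypothetical names, and the failure is simulated rather than coming from a real network.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=0.01):
    """Call `operation`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            time.sleep(base_delay * 2 ** attempt)


# Hypothetical flaky remote call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "payload"


result = call_with_retry(flaky_fetch)
print(result, calls["count"])
```

Retries only help when the operation is safe to repeat, which is one reason the topology and administration fallacies still matter.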
NoSQL: Which Model to Use?
(Diagram: Data, with four models: Key-Value, Graph, Column, Document)
NoSQL: Which project?
• http://nosql-database.org/ lists 122 today.
• Depends on your model selection.
• You will most likely choose a well-known project.
• Don’t forget about shared risk!
NoSQL: Querying
• Some solutions have no querying
• When available, query languages differ between solutions
• Lack of general ad hoc querying – “no” SQL
• Have you heard of UnQL?
– http://www.unqlspec.org/display/UnQL/Home
• NOTE: Toad for Cloud
NoSQL: How to Succeed?
• Know your application
• Don’t forget the lessons of the past
• Consider a hybrid approach
• Fight the desire to Roll-Your-Own-DB
• Start small but significant
NoSQL: Hybrid Approach 1
• Two Systems
– NoSQL System
– SQL/RDBMS
(Diagram: NoSQL system connected to SQL/RDBMS through a Data Mapper / Translator)
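The mapper / translator in this approach can be sketched as follows. This is an illustrative Python example only: the NoSQL side is a plain dictionary standing in for a key-value store, the SQL side is an in-memory SQLite database, and the `car` schema, the sample documents, and `map_to_row` are all hypothetical.

```python
import sqlite3

# Hypothetical NoSQL side: a key-value store of documents.
nosql_store = {
    "car:1": {"make": "Fiat", "color": "BLUE"},
    "car:2": {"make": "Saab", "color": "RED", "doors": 4},
}

# SQL/RDBMS side: an in-memory SQLite table with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE car (id TEXT PRIMARY KEY, make TEXT, color TEXT, doors INTEGER)"
)


def map_to_row(key, doc):
    """Translate one document into a row; missing fields become NULL."""
    return (key, doc.get("make"), doc.get("color"), doc.get("doors"))


rows = [map_to_row(k, d) for k, d in nosql_store.items()]
conn.executemany("INSERT INTO car VALUES (?, ?, ?, ?)", rows)

# Ad hoc SQL over data that originated in the key-value store.
blue = conn.execute("SELECT id FROM car WHERE color = ?", ("BLUE",)).fetchall()
print(blue)
```

The cost of this design is visible even in the sketch: the mapper must know both schemas, and the two copies of the data have to be kept in sync.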
NoSQL: Hybrid Approach 2
• One system does both NoSQL and SQL
(Diagram: Data, with models Relational, Key-Value, Graph, Document, Column, and an open “?”)
GlobalsDB.org Project
• Name comes from the underlying data structure
– Multi-dimensional array
– Basis for commercial Caché data system
• Free for development and production deployment
• NoSQL DB with Java and Node.js APIs
• Code base is same as commercial product
• APIs are open sourced or being open sourced
• Database kernel is not open source
A “Global” Definition
• A Global is a persistent, sparse, multi-dimensional
array, which consists of one or more storage
elements or "nodes". Each node is identified by a
node reference (which is, essentially, its logical
address)
– simple == "some data"
– complex["subscript-1", "subscript-2"] == "some data"
• Example
– product[item,type,os,processor] == quantity
– product["computer","laptop","Mac","i7"] == 3
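The node-reference idea can be sketched with a dictionary keyed by tuples of subscripts. This is a minimal in-memory Python stand-in, not the GlobalsDB Java or Node.js API: `SparseArray` and its methods are hypothetical names, the second product entry is made-up sample data, and a real global is persistent on disk where this sketch is not.

```python
class SparseArray:
    """In-memory stand-in for a global: nodes keyed by a tuple of subscripts."""

    def __init__(self):
        self._nodes = {}

    def set(self, value, *subscripts):
        # Only nodes that are explicitly set take up space (sparse).
        self._nodes[subscripts] = value

    def get(self, *subscripts):
        return self._nodes.get(subscripts)  # None when the node has no data

    def children(self, *prefix):
        """Distinct next-level subscripts under `prefix`, sorted."""
        n = len(prefix)
        return sorted(
            {key[n] for key in self._nodes if len(key) > n and key[:n] == prefix}
        )


product = SparseArray()
product.set(3, "computer", "laptop", "Mac", "i7")
product.set(5, "computer", "desktop", "Linux", "i5")  # hypothetical sample data

print(product.get("computer", "laptop", "Mac", "i7"))
print(product.get("computer", "laptop"))
print(product.children("computer"))
```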
GlobalsDB Architecture
• Current Architecture
GlobalsDB, NoSQL, Big Data
• http://nosql.mypopescu.com/
• http://highscalability.com/
• http://nosqltapes.com/
• http://globalsdb.wordpress.com
Download