DAMA-Big Data (R)evolution Presentation

“Big Data” - Technical Architecture
Roni Schuling - Enterprise Architecture
Tom Scroggins – IS Domain Architecture
Principal Financial Group
AGENDA
• Foundational Definitions & where these technologies came from
  • Big Data
  • NoSQL
  • Hadoop
• Business & Technical Drivers
• How they are being used in many companies
• Predictions for the future
• Challenges & Obstacles
• Questions
Foundational Definition – Big Data
• Big data is an evolving term that describes any voluminous amount of structured, semistructured and unstructured data that has the potential to be mined for information.
• Big data can be characterized by the 3Vs: the extreme volume of data, the wide variety of types of data and the velocity at which the data must be processed. There are many other aspects as well, such as viscosity, complexity and ambiguity.
Data in a corporation that cannot be processed using traditional
data management techniques and technologies can be broadly
classified as Big Data.
Big Data ≠ Hadoop
Big Data ≠ NoSQL
Hadoop ≠ NoSQL
Hadoop & NoSQL are key technologies for working with Big Data effectively.
Foundational Definition - NoSQL
• NoSQL database, also called Not Only SQL, is
an approach to data management and
database design that's useful for very large
sets of distributed data.
• NoSQL seeks to solve the scalability and big
data performance issues that relational
databases weren’t designed to address.
• NoSQL is especially useful when an
enterprise needs to access and analyze
massive amounts of unstructured data or
data that's stored remotely on multiple
virtual servers in the cloud
• However - NoSQL is not just about Big Data
Where this technology came from - NoSQL
[Timeline figure; axis marks: 1970, 1980, 1990, 2000, 2005, 2007, 2010, 2014+]
• Document DB inspired by Lotus Notes
• Key Value Store
• Replicate data during 24x7 availability
• Need to store tabular data in a distributed system
• Many innovators in the 2005 to 2010 timeframe
• Polyglot Persistence: the enterprise will have a variety of different data storage technologies for different kinds of data & application needs
Market view of what’s out there – we do NOT have all of these at PFG today.
There are over 150 NoSQL databases in the market – these are just a few of the top ones.
“Big Data” - Data Architecture at PFG
Foundational Definition - Hadoop
• Hadoop is an open-source, Java-based programming
framework that supports the processing of large data
sets in a distributed computing environment.
• It is part of the Apache project sponsored by the
Apache Software Foundation.
• Hadoop makes it possible to run applications on
systems with thousands of nodes involving thousands
of terabytes.
• Its distributed file system facilitates rapid data
transfer rates among nodes and allows the system to
continue operating uninterrupted in case of a node
failure. This approach lowers the risk of catastrophic
system failure, even if a significant number of nodes
become inoperative.
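A toy model can illustrate the node-failure point above. This is only a sketch, not how HDFS actually places data: the round-robin placement, node names and block names are hypothetical, and real HDFS placement is rack-aware. The replication factor of 3 matches the HDFS default.

```python
import itertools

REPLICATION = 3  # HDFS's default replication factor


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin (hypothetical policy)."""
    placement = {}
    ring = itertools.cycle(range(len(nodes)))
    for block in blocks:
        start = next(ring)
        placement[block] = {
            nodes[(start + i) % len(nodes)] for i in range(replication)
        }
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}


nodes = [f"node{i}" for i in range(6)]
blocks = [f"blk_{i}" for i in range(12)]
placement = place_blocks(blocks, nodes)

# Two of six nodes fail, yet every block keeps a live replica.
survivors = readable_blocks(placement, failed_nodes={"node0", "node1"})
```

With three replicas per block, data only becomes unreadable when every node holding a given block fails at once, which is why the cluster can keep operating through individual node failures.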
Where this technology came from - Hadoop
[Timeline figure]
• 1995 – 2005: Yahoo! Search team builds 4+ generations of systems to crawl & index the WWW. 20 billion pages!
• 2004: Google publishes Google File System & MapReduce papers
• 2005: Yahoo! staffs ‘Juggernaut’, open source DFS & MapReduce; Doug Cutting builds Nutch DFS & MapReduce, joins Yahoo!
• 2006: Juggernaut & Nutch join forces – Hadoop is born!
• 2010: Other Internet companies add tools / frameworks to enhance Hadoop
• 2014+: Service providers step into the market – provide training, support, & hosting
The Hadoop Vendor Landscape - 2014
Business Drivers
• Provide access to all data needed for
analytics (internal or external)
• Provide the ability to realistically interact
with greater ‘depths’ of data – i.e., tens of
years instead of a couple of months
• Provide a greater “speed to insight” for all
types of requests
• Lower the total cost of ownership across
the enterprise for analytics
• Allow for exploration of our data in ways
we never anticipated to identify
differentiating understanding of
customers and markets
There’s an Imbalance today….
Technical Drivers
Current technical capabilities don’t align with changing expectations
How they are being used today
NoSQL
Not focused on Big Data….yet
• Many companies are using, or at least
experimenting with, MongoDB as a
document store for web
applications that only need to
persist the content for the lifespan
of that interaction.
• Using NoSQL stores for user
preferences to personalize what is
presented on a web page for their
interaction.
• Beginning to organize social
streams of data
Hadoop
• Interrogating our web logs to better
understand the behavior of people
interacting with a website.
• Merging that semi-structured web
activity with other structured legacy
data.
• Massive storage of data for
exploration and discovery – often
using interoperability with analytic
consumption tools.
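The web log use case above follows the MapReduce pattern. The sketch below runs that pattern in a single Python process to count page hits. The log format and sample lines are hypothetical, and a real Hadoop job would distribute the map and reduce phases across the cluster.

```python
from collections import defaultdict

# Hypothetical log format: timestamp, user, method, page
LOG_LINES = [
    "2014-03-01T10:00:01 user1 GET /home",
    "2014-03-01T10:00:05 user2 GET /rates",
    "2014-03-01T10:00:09 user1 GET /rates",
    "2014-03-01T10:00:12 user3 GET /home",
    "2014-03-01T10:00:20 user2 GET /home",
]


def map_phase(line):
    """Emit one (page, 1) pair per log line."""
    _timestamp, _user, _method, page = line.split()
    yield page, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one page."""
    return key, sum(values)


mapped = (pair for line in LOG_LINES for pair in map_phase(line))
page_hits = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread over many nodes without coordination.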
Plans for the future
NoSQL
• Database for web applications that
need that speed of development
and nimbleness.
• Layering of NoSQL solutions on top
of Hadoop to improve searchability
and performance.
• Exploration of Graph NoSQL
solutions for analytics on
hierarchical type data.
Hadoop
• Expansion of web activity data
(more logs, more data in logs, more
use cases.)
• Speech-to-text translation of Call
Recordings and text analysis/Natural
Language processing to determine
call topics and caller sentiment.
• Extraction of text from documents
to aid in analysis.
• ‘Data Lake’ solutioning – both for
ingestion and archive.
Lake of Data
Data Refinery
Many Kinds of data in our organization
Conceptually for illustration – not a vetted/approved picture of the PFG environment
Conceptual Workload Isolation Today…
Conceptually for illustration – not a vetted/approved picture of the PFG environment
Conceptual Workload Isolation in the Future…
Conceptually for illustration – not a vetted/approved picture of the PFG environment
Big Data technologies are broader than just Hadoop & NoSQL – but those are the key starting points for us.
Market view of what’s out there – we do NOT have all of these at PFG today.
Challenges and Obstacles to overcome
• Security
• Governance
• Clear Use Cases
• Integration Points
• Hosting models
Q&A
Kapur.Gurwinder@principal.com
NoSQL Data Architecture & Best Practices
Data View - Overview
We are in a Database Revolution
• Existing paradigms are being challenged
  o Models
  o Hardware
  o Software
  o Languages
• Will tweaking current data solutions be enough?
Data View – Five Data Paradigms
Relational Model
PROs
• Most flexible queries & updates
• Reuse data structures in any context
• Great DB-to-DB integration
• Mature tools
• Standard query language
• Easy to hire expertise
CONs
• Design-time, static relationships
• Design-time, static structures: design first then load data
• Hard to normalize model
• Requires code to integrate relational data with object-oriented code
• Cannot query for relevance
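As a minimal sketch of the trade-off listed above, the example below uses Python's built-in sqlite3 module. The `customer` and `account` tables and their rows are hypothetical. The structure must be declared before any data is loaded, but once it exists an ad hoc join answers a question that was not planned at design time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Design first: tables and relationships are fixed before loading data.
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE account (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        balance REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO account VALUES (10, 1, 250.0), (11, 1, 50.0), (12, 2, 900.0);
""")

# Then query flexibly: total balance per customer, using standard SQL.
totals = conn.execute("""
    SELECT c.name, SUM(a.balance)
    FROM customer c JOIN account a ON a.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```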
Dimensional Model
PROs
• Queries facts in context
• Self-service, ad hoc queries
• High-performance platforms
• Mature tools and integration
• Standard query language
• Turns data into information
CONs
• Expensive platforms
• Design-time, static relationships
• Design-time, static structures: design first then load data
• Cannot query for relevance
• Cannot query for answers that are not built into the model
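"Queries facts in context" can be sketched with a small star schema: one fact table surrounded by dimension tables. The table names, columns and figures below are hypothetical, and sqlite3 stands in for a real warehouse platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, line TEXT);
    CREATE TABLE fact_premium (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        amount REAL
    );
    INSERT INTO dim_date VALUES (1, 2013, 'Q4'), (2, 2014, 'Q1');
    INSERT INTO dim_product VALUES (1, 'Life'), (2, 'Retirement');
    INSERT INTO fact_premium VALUES
        (1, 1, 100.0), (2, 1, 120.0), (2, 2, 300.0), (2, 2, 80.0);
""")

# Facts (amounts) are summarized in the context of their dimensions.
by_year_and_line = conn.execute("""
    SELECT d.year, p.line, SUM(f.amount)
    FROM fact_premium f
    JOIN dim_date d ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, p.line
    ORDER BY d.year, p.line
""").fetchall()
```

The last CON shows up directly: this model has no customer dimension, so a question such as "premium by customer age" cannot be answered until the model is redesigned and reloaded.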
What’s wrong (aka challenging) with SQL DB’s?
Key Value / Column Family Models
PROs
• Fast puts and gets
• Massive scalability
• Easy to shard & replicate
• Data colocation
• Simple to model
• Inexpensive
• Data in transactional context
• Developer in control
CONs
• Carefully design key
• Shred JSON into flat columns
• Secondary indexes required to query outside of hierarchical key
• No standard query API or language
• Hand code all joins in app
• Immature tools and platform
• Hard to integrate and hire
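The sketch below illustrates three of the points above: fast puts and gets, easy sharding, and joins hand coded in the application. `ShardedKV`, the key naming scheme and the sample data are hypothetical and do not reflect any particular product's API.

```python
import hashlib


class ShardedKV:
    """Toy key-value store that hashes each key to one of several shards."""

    def __init__(self, shard_count=4):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash decides which shard owns the key.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedKV()
# The key is the whole design: it must encode everything you will look up by.
store.put("customer:42", {"name": "Ada"})
store.put("customer:42:orders", ["order:7", "order:9"])
store.put("order:7", {"total": 30.0})
store.put("order:9", {"total": 12.5})


def customer_order_total(kv, customer_id):
    """A 'join' written by hand: follow keys from the customer to each order."""
    order_keys = kv.get(f"customer:{customer_id}:orders", [])
    return sum(kv.get(key)["total"] for key in order_keys)


total = customer_order_total(store, 42)
```

Hashing the full key scatters related records across shards. Stores that offer data colocation typically shard on a key prefix instead, which this sketch does not attempt.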
Document Model
PROs
• Fast development
• “Schemaless”, run-time designed, rich, JSON and/or XML data structures
• Queries everything in context
• Self-service, ad hoc queries
• Turns data into information
• Can query for relevance
CONs
• Defensive programming for unexpected data structures
• Expensive platforms, immature tools, and hard to integrate
• Non-standard Query Languages, and hard to hire expertise
• Not as fast as Column-Family / Key-Value databases
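"Defensive programming for unexpected data structures" is easiest to see with a few documents that disagree on shape. The documents and the `extract_email` helper below are hypothetical; the point is only that, without a schema, the reading code has to cope with every variant.

```python
import json

# Three documents in one collection, each with a different shape.
RAW_DOCS = [
    '{"name": "Ada", "contact": {"email": "ada@example.com"}}',
    '{"name": "Grace", "email": "grace@example.com"}',
    '{"name": "Linus"}',
]


def extract_email(doc):
    """Return the email wherever it lives, or None if the document has none."""
    contact = doc.get("contact")
    if isinstance(contact, dict) and "email" in contact:
        return contact["email"]
    return doc.get("email")


emails = [extract_email(json.loads(raw)) for raw in RAW_DOCS]
```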
Graph Model
PROs
• Unlimited flexibility – model any structure
• Run time definition of types & relationships
• Relate anything to anything in any way
• Query relationship patterns
• Standard Query Language (SPARQL)
• Creates maximum context around data
CONs
• Hard to model at such a low level
• Hard to integrate with other systems
• Immature tools
• Hard to hire expertise
• Cannot query for relevance because original document context is not preserved
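A graph store can be approximated as a set of (subject, predicate, object) triples. The sketch below is plain Python, not SPARQL, and the entities and relationship names are hypothetical. It shows pattern matching over relationships and the addition of a new relationship type at run time.

```python
TRIPLES = {
    ("ada", "manages", "grace"),
    ("grace", "manages", "linus"),
    ("ada", "owns", "policy-1"),
    ("linus", "owns", "policy-2"),
}


def match(pattern, triples=TRIPLES):
    """Return triples matching a (subject, predicate, object) pattern; None is a wildcard."""
    return {
        triple for triple in triples
        if all(want is None or want == have for want, have in zip(pattern, triple))
    }


def reports_of(manager):
    """Everyone below `manager`, following 'manages' edges transitively."""
    found, frontier = set(), [manager]
    while frontier:
        current = frontier.pop()
        for _subject, _predicate, report in match((current, "manages", None)):
            if report not in found:
                found.add(report)
                frontier.append(report)
    return found


# A new relationship type is just another triple; no schema change is needed.
TRIPLES.add(("grace", "mentors", "ada"))
```

Working at the level of individual triples is also what makes the model "hard to model at such a low level": every structure has to be expressed one edge at a time.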
What’s wrong (aka challenging) with NoSQL DB’s?
Data View
Modeling Takeaways
Each model has a specialized purpose
• Dimensional: Business intelligence reporting and analytics
• Relational: Flexible queries, joins, updates, mature, standard
• Column / Key-Value: Simple, fast puts and gets, massively scalable
• Document: Fast Development, “schemaless” JSON/XML, searchable
• Graph / RDF: Modeling anything at runtime including relationships
Data View – How do you choose?
How much Durability do you need?
o Durable data survives system failures & can be recovered after unwanted deletion.
How much Atomicity do you need?
o An atomic transaction is all or nothing, whether it covers sets of data and/or sets of commands.
How much Isolation do you need?
o Isolation prevents concurrent transactions from affecting each other.
How much Consistency do you need (or when do you need it)?
o Consistency exists when data is committed and consistent with all data rules at a point in time.
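Atomicity and consistency can be sketched with sqlite3, which provides both. The `account` table, the balances and the `transfer` helper are hypothetical. A transfer that would break the data rule (no negative balance) is rolled back as a whole, so the earlier credit does not survive on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 20.0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically; return True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30.0)       # succeeds: both updates commit together
failed = transfer(conn, 1, 2, 500.0)  # violates the CHECK rule: both updates roll back
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

In a store without atomic multi-record transactions, the application would have to detect the failed debit and undo the credit itself.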
Durability
• Can you live with writing advanced code to compensate?
o Trusting all developers to properly check for partial transaction failures, current physical layout of the data cluster, and write code to propagate data across the cluster.
• Can you live with lost data?
o No logs, archives, mirroring, etc.
• Can you live with accidental deletion of data?
o No point in time recovery feature
• Can you live with scripting your own backup & recovery solutions?
Atomicity
• Can you live with modifying single documents at a time?
• Can you live with partially successful transactions?
o You can achieve higher availability because transactions
can partially succeed.
• Can you live with inconsistent and incomplete data?
o Is it OK to not know when data anomalies are caused
by bugs in your code or are temporarily inconsistent
because they haven’t been synchronized yet?
• Can you live with writing advanced code to compensate?
o Custom solutions for atomic rollback, handling of
transactions that fail, find & fix inconsistent data.
Isolation
• Can you live with modifying single documents at a time?
• Can you live with inaccurate queries?
o Without isolation, query results are inaccurate because
concurrent transactions can change data while
processing it.
• Can you live with race conditions and deadlocks?
• Can you live with writing advanced code to compensate?
o Your own versioning system, code to hide concurrent
updates, inserts and deletes from queries, handle race
conditions and deadlocks.
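"Your own versioning system" usually means optimistic concurrency control: each write states which version it was based on and is rejected if the data has moved on. The `VersionedStore` class and its method names below are hypothetical. The sketch is single-threaded, and a real implementation would also need the check-and-write step itself to be atomic.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale read."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); version 0 means the key does not exist yet."""
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Write only if nobody else has written since `expected_version` was read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._data[key] = (current_version + 1, value)


store = VersionedStore()
store.write("prefs:42", {"theme": "light"}, expected_version=0)

# Two clients read the same version, then both try to write.
version_a, _ = store.read("prefs:42")
version_b, _ = store.read("prefs:42")
store.write("prefs:42", {"theme": "dark"}, expected_version=version_a)

try:
    store.write("prefs:42", {"theme": "blue"}, expected_version=version_b)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True  # the second client must re-read and retry
```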
Consistency - Do you need complete consistency?
Not necessarily – instead, you may prefer:
• Absolute fastest performance at lowest hardware cost
• Highest global data availability at lowest hardware cost
• Working with one document at a time
• Writing advanced code to create your own consistency model
• Eventually consistent data
• Some inconsistent data that can’t be reconciled
• Some missing data that can’t be recovered
• Some inconsistent query results
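Eventual consistency can be sketched with two replicas that accept writes independently and reconcile later. The `Replica` class, the last-write-wins rule, the timestamps and the sample values are assumptions for illustration; real systems use a variety of reconciliation strategies.

```python
class Replica:
    """One copy of the data; keeps the newest write per key (last write wins)."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Anti-entropy pass: exchange entries so both replicas keep the newest write."""
    for source, target in ((a, b), (b, a)):
        for key, (timestamp, value) in list(source.data.items()):
            target.write(key, value, timestamp)


east, west = Replica("east"), Replica("west")
east.write("address:42", "Des Moines", timestamp=1)
west.write("address:42", "Chicago", timestamp=2)

before = (east.read("address:42"), west.read("address:42"))  # replicas disagree
sync(east, west)
after = (east.read("address:42"), west.read("address:42"))   # replicas converge
```

Between the two writes and the sync, a query gets a different answer depending on which replica it reaches. After the sync the earlier write is gone for good, which corresponds to the "missing data that can’t be recovered" item above.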
What do you need most?
• Highest performance for queries and transactions
• Highest data availability across multiple data centers
• Less data loss (e.g. Durability)
• More query accuracy & fewer deadlocks (e.g. Isolation)
• More data integrity (e.g. Atomicity)
• Less code to compensate for lack of ACID compliance
Key Points
RDBMSs will always have an important place in our architecture.
NoSQL implementations have a benefit to our future. Once you have a list of NoSQL
databases that meet your modeling needs, choose the one that best meets your need
for velocity and volume.
It is not a one-or-the-other ‘all in’ choice to make.