Distributed Databases and Big Data
Big Data Management
September 2013
Alberto Abelló & Oscar Romero
Knowledge objectives
1. Give a definition of Big Data
2. Name eight features of cloud databases
3. Give a definition of Distributed Database
4. Recognize the problem of impedance mismatch
5. Name different kinds of NOSQL databases
6. Recognize the main problems of NOSQL databases
Understanding Objectives
1. Estimate the cost of a distributed query
2. Transform the value in a schemaless database into a relational one
Motivation
“Without data you are just another person with an opinion.”
William Edwards Deming
“It is a capital mistake to theorize before one has data.”
Sherlock Holmes (A Scandal in Bohemia)
[Figure: descriptive, predictive, and prescriptive analytics]
Bain & Company
Big Data definition
• Velocity
• Volume
• Variety
• …
• Variability
• Validity/Veracity
• Value
From IBM “Understanding Big Data”
[Figure: data volume in millions of terabytes, by year]
BigBench
Big Data sources
• Structured
  • Created (i.e., business data)
  • Provoked (e.g., customer feedback)
  • Transacted
  • Compiled (e.g., demographics)
  • Experimental (e.g., sampling customers)
• Unstructured
  • Captured (e.g., search words)
  • User-generated (e.g., social networks)
Types of Big Data Analyzed in Industry
Big Data facets
• The Original
• as Technology
• as Data Distinctions
• as Signals
• as Opportunity
• as Metaphor
• as New Term for Old Stuff
Timo Elliott
Big Data related areas
• Volume and Velocity
  • Declarative querying
  • Query optimization
• Variety and Variability
  • Data quality
  • Data integration
  • Web and text mining
  • Information retrieval
• Validity/Veracity
  • Data consistency
  • Uncertainty
  • Statistical reasoning
  • Data linkage (provenance)
• Value
  • Analytics
    • Data mining
    • Algorithmics
    • Automatic learning
    • Simulation
    • Privacy
    • Biologists
    • Linguistics
    • Chemists
    • Sociologists
    • Engineers
Key features of cloud databases
a) Quick/Cheap set up
b) Ability to horizontally scale
c) Ability to replicate & distribute (fragmentation)
d) Simple call level interface or protocol
   • No declarative query language
e) Weaker concurrency model than ACID
f) Efficient use of distributed indexes and RAM
g) Flexible schema
   • Ability to dynamically add new attributes
h) Multi-tenancy
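
Features b) to d) can be illustrated with a minimal sketch, assuming hash-based fragmentation and a fixed number of copies per key. The names `node_for`, `replicas_for`, and `TinyCluster` are hypothetical and do not correspond to any particular product; real systems use more elaborate placement schemes.

```python
import hashlib


def node_for(key, num_nodes):
    """Map a key to a node index with a stable hash (assumed placement rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def replicas_for(key, num_nodes, copies=2):
    """Primary node plus the following nodes, wrapping around the node list."""
    first = node_for(key, num_nodes)
    return [(first + i) % num_nodes for i in range(min(copies, num_nodes))]


class TinyCluster:
    """In-memory stand-in for a set of nodes, each holding one fragment."""

    def __init__(self, num_nodes, copies=2):
        self.nodes = [dict() for _ in range(num_nodes)]
        self.copies = copies

    def put(self, key, value):
        # Simple call-level interface: write the value to every replica.
        for n in replicas_for(key, len(self.nodes), self.copies):
            self.nodes[n][key] = value

    def get(self, key):
        # Read from the first replica that holds the key.
        for n in replicas_for(key, len(self.nodes), self.copies):
            if key in self.nodes[n]:
                return self.nodes[n][key]
        return None
```

The interface is only `put` and `get` by key, with no declarative query language. In this sketch, changing the number of nodes changes where existing keys are placed, which is one reason real systems tend to use more careful partitioning.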
Distributed Database
• A distributed database (DDB) is a database where data management is distributed over several nodes in a network.
• Each node is a database itself
  • Potential heterogeneity
• Nodes communicate through the network
Parallel database architectures
D. DeWitt & J. Gray, “Parallel Database Systems: The Future of High Performance Database Processing”, 1992
Figure from D. Abadi
Activity
• Objective: Recognize the benefits of distributing data
• Tasks:
  1. (5’) Individually solve one exercise
  2. (10’) Explain the solution to the others
  3. Hand in the three solutions
• Roles for the team-mates during task 2:
  a) Explains his/her material
  b) Asks for clarification of unclear concepts
  c) Mediates and controls time
Impedance Mismatch
Petra Selmer, Advances in Data Management 2012
Schemaless Databases
CREATE TABLE Student (
  id int,
  name varchar2(50),
  surname varchar2(50),
  enrolment date);

INSERT INTO Student VALUES (1, 'Oscar', 'Romero', '01/01/2012', 'Lleida');  -- WRONG
INSERT INTO Student VALUES (1, 'Oscar', 'Romero', NULL);  -- OK
INSERT INTO Student VALUES (1, 'Oscar', 'Romero', '01/01/2012');  -- OK

In a schemaless database (pseudo-syntax):
Insert into Student (1, {'Oscar', 'Romero', '01/01/2012'});

• Consequences (?): 2 mins to think of them
  • Gain flexibility
  • Lose semantics (also consistency)
  • May reduce the impedance mismatch
    • Coupled with HLLs (e.g., Java)
  • The data independence principle is lost (!)
    • The ANSI / SPARC architecture is not followed
    • Applications can access and manipulate the database internal structures
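
The contrast between the fixed-schema inserts and the schemaless one can be tried out with a minimal sketch. It assumes SQLite (through Python's built-in `sqlite3` module) as the relational side and a plain dictionary as the schemaless side. SQLite's column types are looser than `varchar2` and `date`, and the attribute names inside the schemaless value (`name`, `surname`, `enrolment`, `city`) are assumptions, since the slide's value carries no names. `try_insert` and `to_relational_row` are hypothetical helpers.

```python
import sqlite3

# Relational side: the number of columns is fixed by the schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student (id INTEGER, name TEXT, surname TEXT, enrolment TEXT)"
)


def try_insert(values):
    """Return 'OK' if the fixed schema accepts the row, 'WRONG' otherwise."""
    placeholders = ", ".join("?" for _ in values)
    try:
        conn.execute(f"INSERT INTO Student VALUES ({placeholders})", values)
        return "OK"
    except sqlite3.OperationalError:
        # Raised, for instance, when the value count does not match the columns.
        return "WRONG"


# Schemaless side: the value under a key is opaque to the store.
schemaless = {
    1: {
        "name": "Oscar",
        "surname": "Romero",
        "enrolment": "01/01/2012",
        "city": "Lleida",  # extra attribute, accepted without any schema change
    }
}


def to_relational_row(key, value, columns=("name", "surname", "enrolment")):
    """Project a schemaless value onto relational columns.

    Missing attributes become None (NULL); attributes outside the
    column list are dropped.
    """
    return (key,) + tuple(value.get(c) for c in columns)
```

Calling `try_insert` with five values is rejected, whereas the dictionary accepts the extra `city` attribute. Passing the result of `to_relational_row(1, schemaless[1])` to `try_insert` stores the row, at the price of losing `city`, which is where the application rather than the database has to decide on the semantics.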
Different applications
Not Only SQL (different problems entail different solutions)
• OLTP
  • Object-Relational
• Scientific databases and other Big Data repositories
  • Key-value stores
  • Distributed databases
  • Parallel databases
• Data Warehousing & OLAP
  • MOLAP
  • Column stores
  • Multidimensional features
• Text / documents
  • Document databases
  • XML/JSON databases
• Stream processing
  • Stream processor
• Semantic Web and Open Data
  • Graph databases
Schemaless Databases
• NOSQL solution for the impedance mismatch
• Several new data models were introduced
  • Graph data model
  • Document-oriented databases
  • Key-value (~ hash tables)
  • Streams (~ vectors and matrices)
• These new models lack an explicit schema (defined by the user)
  • However, an implicit schema remains
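
The idea of an implicit schema can be made concrete with a small sketch, assuming documents are represented as Python dictionaries. The functions `implicit_schema` and `optional_attributes` are hypothetical helpers that recover the attribute names and value types that the stored data follows in practice, even though no schema was ever declared.

```python
from collections import defaultdict


def implicit_schema(documents):
    """Collect, per attribute, the type names observed across all documents."""
    schema = defaultdict(set)
    for doc in documents:
        for attribute, value in doc.items():
            schema[attribute].add(type(value).__name__)
    return {attribute: sorted(types) for attribute, types in schema.items()}


def optional_attributes(documents):
    """Attributes that are missing from at least one document."""
    all_attributes = set()
    for doc in documents:
        all_attributes.update(doc.keys())
    return sorted(
        a for a in all_attributes if any(a not in doc for doc in documents)
    )
```

An attribute that shows up with more than one type, or only in some documents, is exactly the kind of semantics that an explicit schema would have enforced and that applications now have to handle themselves.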
Databases landscape
Internal Structures
Ben Stopford, Progscon & JAX Finance 2015
Polyglot Systems
• Federate different kinds of storage systems
Martin Fowler
http://martinfowler.com/bliki/PolyglotPersistence.html
NOSQL drawbacks
• No ACID
• No standard
• Low-level query
Michael Stonebraker
The Problem is Not SQL
• Relational systems are too generic…
  • OLTP: stored procedures and simple queries
  • OLAP: ad-hoc complex queries
  • Documents: large objects
  • Streams: time windows with volatile data
  • Scientific: uncertainty and heterogeneity
• … But the overhead of RDBMS has nothing to do with SQL
  • Low-level, record-at-a-time interface is not the solution
Michael Stonebraker, “SQL Databases v. NoSQL Databases”, Communications of the ACM, 53(4), 2010
Brewery or bottled beer?
Do It Yourself
• Expensive
• Ad hoc development
Off the Shelf
• Economies of scale
• Concrete functionalities
Florian Waas analogy
Specific platforms
• Google BigTable
  • Published in 2006
  • Implemented by HBase
• Google MapReduce
  • Published in 2004
  • Implemented by Hadoop
• Neo4J/Sparksee
  • Published in 2007
• MongoDB
  • Also Dynamo and Cassandra
  • Published in 2010/2008
• SAP HANA
  • Published in 2011
  • Prototyped in SanssouciDB
Summary
• Big Data definition
• Key features of cloud software (i.e., DBMS)
• Distributed Database definition
• Impedance Mismatch
• NOSQL main goals and features
Bibliography
• M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems, 3rd Ed. Springer, 2011
• A. Ghazal et al. BigBench: towards an industry standard benchmark for big data analytics. SIGMOD Conference, 2013
• R. Cattell. Scalable SQL and NoSQL Data Stores. SIGMOD Record 39(4), 2010
• L. Liu, M. T. Özsu (Eds.). Encyclopedia of Database Systems. Springer, 2009