Distributed Databases and Big Data
Big Data Management
September 2013
Alberto Abelló & Oscar Romero

Knowledge objectives
1. Give a definition of Big Data
2. Name eight features of cloud databases
3. Give a definition of Distributed Database
4. Recognize the problem of impedance mismatch
5. Name different kinds of NOSQL databases
6. Recognize the main problems of NOSQL databases

Understanding objectives
1. Estimate the cost of a distributed query
2. Transform the values in a schemaless database into a relational one

Motivation
"Without data you are just another person with an opinion."
    William Edwards Deming
"It is a capital mistake to theorize before one has data."
    Sherlock Holmes (A Scandal in Bohemia)
• Prescriptive
• Predictive
• Descriptive

Bain & Company (figure)

Big Data definition
• Velocity
• Volume
• Variety
• …
• Variability
• Validity/Veracity
• Value
From IBM, "Understanding Big Data"
(figure: millions of terabytes per year)

BigBench (figure)

Big Data sources
Structured
• Created (i.e., business data)
• Provoked (e.g., customer feedback)
• Transacted
• Compiled (e.g., demographics)
• Experimental (e.g., sampling customers)
Unstructured
• Captured (e.g., search words)
• User-generated (e.g., social networks)

Types of Big Data Analyzed in Industry (figure)

Big Data facets
• The Original
• as Technology
• as Data Distinctions
• as Signals
• as Opportunity
• as Metaphor
• as New Term for Old Stuff
Timo Elliott

Big Data related areas
Volume and Velocity
Variety and Variability
• Data quality
• Data integration
• Web and text mining
• Information retrieval
Validity/Veracity
• Declarative querying
• Query optimization
• Data consistency
• Uncertainty
• Statistical reasoning
• Data linkage (provenance)
Value
• Analytics
• Data mining
• Algorithmics
• Automatic learning
• Simulation
• Privacy
Biologists, Linguists, Chemists, Sociologists, Engineers

Key features of cloud databases
a) Quick/cheap set up
b) Ability to horizontally scale
c) Ability to replicate & distribute (fragmentation)
d) Simple call level interface or protocol
    - No declarative query language
e) Weaker concurrency model than ACID
f) Efficient use of distributed indexes and RAM
g) Flexible schema
    - Ability to dynamically add new attributes
h) Multi-tenancy

Distributed Database
A distributed database (DDB) is a database where data management is distributed over several nodes in a network.
• Each node is a database itself
    - Potential heterogeneity
• Nodes communicate through the network

Parallel database architectures
D. DeWitt & J. Gray, "Parallel Database Systems: The Future of High Performance Database Processing", 1992
Figure from D. Abadi

Activity
Objective: Recognize the benefits of distributing data
Tasks:
1. (5') Individually solve one exercise
2. (10') Explain the solution to the others
3. Hand in the three solutions
Roles for the team-mates during task 2:
a) Explains his/her material
b) Asks for clarification of blurry concepts
c) Mediates and controls time

Impedance Mismatch (figures)
Petra Selmer, Advances in Data Management 2012

Schemaless Databases

    CREATE TABLE Student (
        id int,
        name varchar2(50),
        surname varchar2(50),
        enrolment date);

    INSERT INTO Student VALUES (1, 'Oscar', 'Romero', '01/01/2012', 'Lleida');  -- WRONG (five values for four columns)
    INSERT INTO Student VALUES (1, 'Oscar', 'Romero', NULL);                    -- OK
    INSERT INTO Student VALUES (1, 'Oscar', 'Romero', '01/01/2012');            -- OK

Consequences (?) (2 mins to think of them)
• Gain flexibility
• Lose semantics (also consistency)

    INSERT INTO Student VALUES (1, {'Oscar', 'Romero', '01/01/2012'});

• May reduce the impedance mismatch
    - Coupled with HLLs (e.g., Java)
• The data independence principle is lost (!)
    - The ANSI/SPARC architecture is not followed
    - Applications can access and manipulate the database internal structures

Not Only SQL (different problems entail different solutions)
Different applications
• OLTP
• Object-Relational
• Scientific databases and other Big Data repositories
• Key-value stores
• Data Warehousing & OLAP
• MOLAP
• Column stores
• Multidimensional features
• Text / documents
• Document databases
• XML/JSON databases
• Stream processing
• Distributed databases
• Parallel databases
• Stream processor
• Semantic Web and Open Data
• Graph databases

Schemaless Databases
• NOSQL solution for the impedance mismatch
• Several new data models were introduced
    - Graph data model
    - Document-oriented databases
    - Key-value (~ hash tables)
    - Streams (~ vectors and matrices)
• These new models lack an explicit schema (defined by the user)
    - However, an implicit schema remains

Databases landscape (figure)

Internal Structures (figure)
Ben Stopford, Progscon & JAX Finance 2015

Polyglot Systems
• Federate different kinds of storage systems
Martin Fowler, http://martinfowler.com/bliki/PolyglotPersistence.html

NOSQL drawbacks
• No ACID
• No standard
• Low-level query
Michael Stonebraker

The Problem is Not SQL
• Relational systems are too generic…
    - OLTP: stored procedures and simple queries
    - OLAP: ad-hoc complex queries
    - Documents: large objects
    - Streams: time windows with volatile data
    - Scientific: uncertainty and heterogeneity
• … But the overhead of RDBMS has nothing to do with SQL
    - A low-level, record-at-a-time interface is not the solution
Michael Stonebraker, "SQL Databases v. NoSQL Databases", Communications of the ACM, 53(4), 2010

Brewery or bottled beer?
Do It Yourself
• Expensive
• Ad hoc development
Off the Shelf
• Economies of scale
• Concrete functionalities
Florian Waas analogy

Specific platforms
Google BigTable
• Published in 2006
• Implemented by HBase
Google MapReduce
• Published in 2004
• Implemented by Hadoop
Neo4J/Sparksee
• Published in 2007
MongoDB
• Also Dynamo and Cassandra
• Published in 2010/2008
SAP HANA
• Published in 2011
• Prototyped in SanssouciDB

Summary
• Big Data definition
• Key features of cloud software (i.e., DBMS)
• Distributed Database definition
• Impedance Mismatch
• NOSQL main goals and features

Bibliography
• M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems, 3rd Ed. Springer, 2011
• A. Ghazal et al. BigBench: towards an industry standard benchmark for big data analytics. SIGMOD Conference, 2013
• R. Cattell. Scalable SQL and NoSQL Data Stores. SIGMOD Record 39(4), 2010
• L. Liu, M. T. Özsu (Eds.). Encyclopedia of Database Systems. Springer, 2009
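
Worked sketch: from schemaless records to a relational table
The Schemaless Databases slides note that an implicit schema remains even when the user declares none, and the second understanding objective asks to transform schemaless values into relational ones. The sketch below is one possible way to do that, not a method taken from the slides: it assumes records arrive as Python dictionaries, the attribute names and the second student are hypothetical, and `infer_schema` and `to_rows` are illustrative helper names.

```python
# Hypothetical schemaless records: each one carries its own set of attributes.
students = [
    {"id": 1, "name": "Oscar", "surname": "Romero", "enrolment": "01/01/2012"},
    {"id": 2, "name": "Anna", "surname": "Puig", "city": "Lleida"},
]


def infer_schema(records):
    """Recover the implicit schema: every attribute seen, in first-seen order."""
    columns = []
    for record in records:
        for attribute in record:
            if attribute not in columns:
                columns.append(attribute)
    return columns


def to_rows(records, columns):
    """Build one fixed-arity tuple per record; missing attributes become None (NULL)."""
    return [tuple(record.get(column) for column in columns) for record in records]


columns = infer_schema(students)
rows = to_rows(students, columns)

print(columns)
for row in rows:
    print(row)
```

Every row ends up with the same arity as the inferred column list, which is what a relational table requires; the price is that attributes a record never had show up as NULLs, one face of the lost semantics the slides mention.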