An Introduction to Big Data
Ken Smith, April 10th, 2013
© 2014 The MITRE Corporation. All rights reserved

Big Data … Its Technologies & Analytic Ecosystem
For Internal MITRE Use

Course Goal
Hype curve … tethered to reality.

Outline
– Background: What is "Big Data"? … Why is it big?
– Parallel Technologies for Big Data Problems
– Big Data Ecosystem
– Ongoing Challenges

What is "Big Data"?
O'Reilly:
– "Big data is when the size of the data itself becomes part of the problem"
EMC/IDC:
– "Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis."
IBM (the famous 3-V's definition):
– Volume (gigabytes -> exabytes)
– Velocity (batch -> streaming data)
– Variety (structured, semi-structured, & unstructured)
Credit: Big Data Now, Current Perspectives from O'Reilly Radar (O'Reilly definition); Extracting Value from Chaos, Gantz et al. (IDC definition); Understanding Big Data, Eaton et al. (IBM definition)

Data Size Terminology

A Simple Data Structure Taxonomy
Structured data
– Data adheres to a strict template/schema
– Spreadsheets, relational databases, sensor feeds, …
Semi-structured data
– Data adheres to a flexible (grammar-based) format: optional fields, repeating fields
– Web pages / forms, documents, XML, JSON, …
Unstructured data
– Data adheres to an unknown format: no schema or grammar; you discover what each byte is and means by examining the data
– Unparsed text, raw disks, raw video & images, …
"Variety": constantly coping with structure variations, multiple types, and changing types
Why Are Volume & Velocity Increasing?
1) Internet-Scale Datasets
– Activity logfiles (e.g., clickstreams, network logs)
– Internet indices
– Relationship data / social networks
– Velocity note: Bin Laden's death resulted in 5,106 tweets/second

2) Sensor Proliferation
– Weather satellites; flight recorders; GPS feeds; medical and scientific instruments; cameras
– Government agencies want a sensor on every potentially mad cow, in every cave in Afghanistan, on every cargo container, etc. What if their wish is granted?
– Velocity notes: the Large Hadron Collider generates 40 TB/sec; high-definition UAVs collect 1.4 PB/mission
– Variety note: an increasing number of sensor feeds means increasing variety

3) Because, with modern cloud parallelism, you can…
– Problem: "frequent close encounters" are suspicious
– Given: 73,241 ships reporting {id, lat, long} every 5 minutes for 2 weeks; resulting dataset = 15 GB (uncompressed and indexed)
– How do you detect all pairs of ships within X meters of each other? Many solutions generate intermediate "big data".

What Good is Big Data? Some Examples!
1) As a basis for analysis
– As a human behavior sensor
– Supporting new approaches to science
2) To create a useful service

Outline
– Background: What is "Big Data"? … Why is it big?
– Parallel Technologies for Big Data Problems
– Big Data Ecosystem
– Ongoing Challenges

Traditional Scaling "Up": Improve the Components of One System
– OS: multiple threads / VMs
– CPU: increase clock speed, bus speed, cache size
– RAM: increase capacity
– Disk: increase capacity, decrease seek time, RAID
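The ship close-encounter problem above can be sketched with a standard grid-bucketing approach: hash each position into a cell of side X, then compare each ship only against ships in its own and neighboring cells, avoiding the naive all-pairs comparison. This is a toy single-process sketch with hypothetical planar (x, y) coordinates in meters, not a real lat/long solution.

```python
from collections import defaultdict
from math import hypot

def close_pairs(ships, threshold):
    """Find all pairs of ships within `threshold` meters of each other.

    ships: iterable of (ship_id, x, y) with planar coordinates in meters.
    Buckets positions into a grid of threshold-sized cells, then checks
    only same-cell and neighboring-cell candidates.
    """
    grid = defaultdict(list)
    for ship_id, x, y in ships:
        grid[(int(x // threshold), int(y // threshold))].append((ship_id, x, y))

    pairs = set()
    for (cx, cy), members in grid.items():
        # Gather this cell plus its 8 neighbors as candidates.
        candidates = [s for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                      for s in grid.get((cx + dx, cy + dy), [])]
        for sid, sx, sy in members:
            for tid, tx, ty in candidates:
                # sid < tid reports each pair once.
                if sid < tid and hypot(sx - tx, sy - ty) <= threshold:
                    pairs.add((sid, tid))
    return pairs

# Hypothetical positions: a and b are ~71 m apart, c is far away.
ships = [("a", 0, 0), ("b", 50, 50), ("c", 500, 500)]
print(close_pairs(ships, 100.0))  # {('a', 'b')}
```

Note how the grid itself is an intermediate dataset whose size scales with the input, illustrating the slide's point that even modest inputs can generate intermediate "big data".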
Scaling "Out": From Component Speedup to Aggregation
– Multicore: cores on a chip (2, 4, 6, 8, …)
– Multiserver racks ("shared nothing" – only the interconnect is shared)
– Multi-rack data centers
– If you are Google or a few others: multiple data centers

The Resulting "Computer" & Its Applications
This massively parallel architecture (OS, CPU, RAM, disk, …) can be treated as a single computer. Applications for this "computer":
– Can exploit computational parallelism (near-linear speedup)
– Can have a vastly larger effective address space
– Google and Facebook field applications whose user base is measured as a reasonable fraction of the human race

The Power of Parallelism: Divide & Conquer
– Partition the "work" into w1, w2, w3, …
– Each "worker" produces a result r1, r2, r3, …
– Combine the results
Source: a slide by Jimmy Lin, cc-licensed

Some Important Software Realities in a Massively Parallel Architecture
– Communication costs
– Fault-tolerance
– Programming abstractions

"Numbers Everyone Should Know" (from the SoCC 2010 keynote – Jeffrey Dean, Google)
– L1 cache reference: 0.5 ns
– Branch mispredict: 5 ns
– L2 cache reference: 7 ns
– Mutex lock/unlock: 25 ns
– Main memory reference: 100 ns
– Compress 1 KB w/cheap algorithm: 3,000 ns
– Send 2 KB over 1 Gbps network: 20,000 ns
– Read 1 MB sequentially from memory: 250,000 ns
– Round trip within same datacenter: 500,000 ns
– Disk seek: 10,000,000 ns
– Read 1 MB sequentially from disk: 20,000,000 ns
– Send packet CA -> Netherlands -> CA: 150,000,000 ns
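These latency figures drive parallel-system design. A quick sanity check, using only the numbers from the table above, shows why frameworks like MapReduce favor large sequential scans over random disk access:

```python
# Latency figures (nanoseconds) from Jeffrey Dean's "Numbers Everyone
# Should Know" (SoCC 2010 keynote), as listed in the table above.
LATENCY_NS = {
    "main_memory_reference": 100,
    "read_1mb_from_memory": 250_000,
    "disk_seek": 10_000_000,
    "read_1mb_from_disk": 20_000_000,
}

# One disk seek costs as much as 100,000 main memory references...
seeks_vs_memory = LATENCY_NS["disk_seek"] // LATENCY_NS["main_memory_reference"]

# ...while a sequential 1 MB disk read is only 80x slower than the same
# read from memory -- so streaming through data beats seeking around in it.
disk_vs_memory_scan = LATENCY_NS["read_1mb_from_disk"] // LATENCY_NS["read_1mb_from_memory"]

print(seeks_vs_memory)      # 100000
print(disk_vs_memory_scan)  # 80
```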
Fault Tolerance
Frequency of faults in massively parallel architectures:
– Google reports an average of 1.2 failures per analysis job
– We assume our laptop will last through the week, but you lose that assumption when you compute with 1000s of commodity machines
– What if the result waits because 499 of 500 worker tasks have completed – but #500 will never finish?
Strategy:
– Redundancy and checkpointing

How Do You Program a Massively Parallel Computer?
Parallel programming without help can be very painful!
– Parallelize: translate your application into a set of parallel tasks
– Task management: assigning tasks to processors, inter-task communication, restarting tasks when they crash
– Task synchronization: avoiding extended waits and deadlocks
Programmers need simplifying abstractions to be productive
– Pioneers Google & Facebook were forced to invent these
– Hadoop now provides a tremendous suite
Analogy: RDBMSs provide the atomic transaction abstraction – no programmer wants to worry about the details of who else is reading & writing data while they do!
– Just use "begin transaction" and "end transaction" to insulate your code from others using the system
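The transaction analogy can be made concrete with Python's built-in sqlite3 module, where a connection used as a context manager supplies the "begin/end transaction" bracketing. The table and values are hypothetical; the point is that the programmer only brackets the work, and the database guarantees all-or-nothing behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # implicit "begin transaction" ... "end transaction"
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
        raise RuntimeError("simulated crash mid-transaction")
except RuntimeError:
    pass  # the 'with' block rolled the partial transfer back automatically

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0} -- no half-finished transfer visible
```

The programmer never reasons about locks or concurrent readers; the abstraction insulates the code exactly as the slide describes.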
Apache Hadoop
Open source framework for developing & running parallel applications on hardware clusters
– Cloudera & Hortonworks sell "premium" versions & support
– Adapted from Google's internal programming model
– Available at: hadoop.apache.org
Key components:
– HDFS (Hadoop Distributed File System)
– MapReduce (parallel programming pattern)
– Hive, Pig (higher-level languages which compile into MapReduce)
– HBase (key-value store)
– Mahout (data mining library)
Some non-Hadoop parallel frameworks also exist:
– Aster Data & Greenplum sell {RDBMS + MapReduce + analytics}

HDFS (Hadoop Distributed File System)
HDFS:
– Provides a single unified file system, abstracting away the many underlying machines' file systems
– Load-balances file fragments and maintains replication levels
HDFS components:
– The NameNode manages overall file system metadata
– DataNodes (one per machine) manage the actual data
– DataNodes are easy to add, expanding the file system
– Both DataNodes and the NameNode include a webserver, so node status can be easily checked
Example commands:
– "/bin/hdfs dfs -ls" lists files in an HDFS directory (corresponds to Linux "ls")
– "/bin/hdfs dfs -rm xx" removes HDFS file xx (corresponds to Linux "rm xx")

HDFS Architecture
Adapted from (Ghemawat et al., SOSP 2003)

MapReduce
■ Iterate over a large number of records
■ Extract something of interest from each
■ Shuffle and sort intermediate results
■ Aggregate intermediate results
■ Generate final output
Build a sequence of MR steps. Key idea: provide a functional abstraction for the map and reduce operations.
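The five steps above can be sketched in a few lines of plain Python. This is a toy, single-process illustration of the MapReduce pattern, not Hadoop itself: map emits key-value pairs, the framework shuffles and sorts them by key, and reduce aggregates each key's values. The corpus is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    # Extract something of interest from each record: emit (word, 1) pairs.
    for word in text.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Aggregate intermediate results for one key.
    return (word, sum(counts))

corpus = {"doc1": "war and peace", "doc2": "war and war"}

# Map over all records.
intermediate = [kv for doc_id, text in corpus.items()
                for kv in map_phase(doc_id, text)]
# Shuffle & sort by key (Hadoop does this between the map and reduce phases).
intermediate.sort(key=itemgetter(0))
# Reduce each key's group of values.
result = dict(reduce_phase(word, (count for _, count in group))
              for word, group in groupby(intermediate, key=itemgetter(0)))
print(result)  # {'and': 2, 'peace': 1, 'war': 3}
```

In real Hadoop each phase runs in parallel across many machines, with the intermediate pairs partitioned to reducers by key.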
Ideal MapReducable Problems
1) Input data can be naturally split into "chunks" and distributed
2) Large amounts of data
– If smaller than the HDFS block size, don't bother
3) Data independence
– Ideally, the map operation does not depend on data at other nodes
4) A good redistribution key exists
– The output of a map job is key-value pairs
– The key is used to shuffle/sort the output to the reducers
Example: build a word-count index for a huge document corpus
– Map: emit a {docid, word, 1} tuple for each occurrence
– Reduce: sum similar tuples, like {"War And Peace", *, 1}
Not all problems are "ideal", but MR can still work: www.adjointfunctors.net/su/web/354/references/graph-processing-w-mapreduce.pdf

MapReduce/HDFS Architecture
From Wikipedia Commons: http://en.wikipedia.org/wiki/File:Hadoop_1.png

Higher Level Languages: Hive
Hive is a system for managing and querying structured data
– Used extensively to provide SQL-like functionality
– Compiles into MapReduce jobs
– Includes an optimizer*
Developed by Facebook
– Almost 99.9% of Hadoop jobs at Facebook are generated by a Hive front-end system
*Hive optimizations at: citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.151.2637

Apache Pig
Open source scripting language
– Provides SQL-like primitives in a scripting language
– Developed by Yahoo!
Almost 30% of their analytic jobs are written in "Pig Latin"
Execution model
– Compiles into MapReduce (over HDFS files, HBase tables)
– Approximately 30% overhead
– Optimizes multi-query scripts; filter and limit optimizations reduce the size of intermediate results
Example commands
– FILTER: hour00 = FILTER hour_frequency2 BY hour eq '00';
– ORDER: ordered_uniq_frequency = ORDER filtered_uniq_frequency BY (hour, score);
– GROUP: hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
– COUNT: hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;

The Human Approach
Massively parallel human beings – "crowdsourcing"
A good list of projects:
– en.wikipedia.org/wiki/List_of_crowdsourcing_projects

Outline
– Background: What is "Big Data"? … Why is it big?
– Parallel Technologies for Big Data Problems
– Big Data "Ecosystems"
– Ongoing Challenges in Big Data Ecosystems

General "Funnel" Model of Big Data Analytic Workflows
1) Ingest diverse raw data sources: text, sensor feeds, semi-structured data (e.g., web content, email)
2) Transform, clean, subset, integrate, and index new datasets. Enrich: extract entities, compute features & metrics.
3) Generate more structured datasets as needed: RDBMS tables, objects, triple stores
4) Generate & explore user-facing analytic models (data cubes, graphs, clusters). Drill down to details.
Data science teams work across the entire spectrum.
Some examples & technology stacks:
– Clickstream analysis; stock "tick" analysis; social network analysis
– Google's Tenzing stack: SQL/OLAP over Hadoop
– Cloudera's stack: Hive/Pig compiling into Hadoop
– Greenplum's stack: SQL compiling directly onto servers, OR into MapReduce via "external tables"
Ecosystem Overview
A frequent workflow is emerging:
– 1) Ingest data from diverse sources
– 2) ETL / enrichment
– 3) Intermediate data management
– 4) Refined data management (graphs, parsed triples from text, OLAP/relational data)
– 5) Analytics & visualization tools to build/test models and support decisions
– 6) Reachback into earlier steps by "data scientists"
Common to diverse types of organizations:
– Marketing, financial research, scientists, intelligence agencies, …
– (Social media providers are a bit different: they host the big data)
Many technologies working together
– MapReduce, semi-structured ("NoSQL") databases, graph databases, RDBMSs, machine learning/data mining algorithms, analytic tools, visualization techniques
We will touch on some of these through the rest of today
– Many are new and evolving; this is a rapidly moving train!

Emergence of the Data Scientist

Spectrum of Big Data Ecosystem Classes
Big Data ecosystems differ along several key questions:
1) Is there a hypothesis being tested?
– Testing a hypothesis requires a more sophisticated analysis process
2) Is external data being gathered?
– Versus all internally generated data; external data requires more ETL effort
3) Does it make sense to evolve and expand this ecosystem?
– The greater the up-front investment, the more important it is to address serendipitous new hypotheses by reusing/augmenting existing data resources

Spectrum of Big Data Ecosystem Classes
1) Non-experiment (no hypothesis exists, external data)
– No hypothesis or learning experiment
– The ecosystem reports aspects of external data; little analysis / new truth
– Example: CNN "trending now" alerts
– (Note: subject to being "gamed" by manipulation of external data!)
2) Evolving experimental ecosystem (hypotheses & external data)
3) Self-contained experiment (hypothesis exists, no external data)

The Non-Experiment (Example: "Trending Now")
1) External data is ingested
2) Basic processing is applied to "add value" for consumers (but no rigorous model learning or hypothesis testing)

A Spectrum of Big Data Ecosystems
3) Self-contained experiment (a hypothesis exists, no external data)
– A pre-existing (scientific) hypothesis to test
– All necessary data is generated to spec within the ecosystem
– Example: Argonne National Labs

The Self-Contained Experiment (Example: Argonne National Labs)
1) A scientific hypothesis H exists, along with a plan to test H by analyzing large datasets
2) Any data needed to test H is generated "internally"
3) Data analysis, perhaps requiring a predictive model to be learned & refined
4) The plan/model is applied to the data to validate or invalidate H

Spectrum of Big Data Ecosystem Classes
2) Evolving experimental ecosystem (potential hypotheses, external data)
– Massive external datasets suggest new insights / competitive advantage
– A hypothesis is formed and external data gathered
– An experiment/ecosystem is designed to test the hypothesis and provide insight
– Once in place, the ecosystem is reused & evolves: new data & hypotheses, cost amortized
– The sweet spot … (consumer analysis, intelligence analysis …)
Evolving Experimental Ecosystem: E3 (Example: Google AdWords)
1) Massive external data suggests new insights / competitive advantages
1b) Incremental data suggests incremental insights …
2) An initial hypothesis H is formed & data gathered to test it
3) Data analysis, perhaps requiring a predictive model to be learned & refined
4) The plan/model is applied to the data to validate or invalidate H

Questions?

Outline
– Background: What is "Big Data"? … Why is it big?
– Parallel Technologies for Big Data Problems
– Big Data "Ecosystems"
– Ongoing Challenges in Big Data Ecosystems

Some General & Ongoing Challenges
Ecosystems are mature to the extent that they work now, but this is definitely not a fully "solved problem"! Some outstanding issues to keep an eye on:
– Sampling: what if two sources are sampled differently?
– Security
– Privacy
– Metadata: e.g., how do we deal with evolution of processing?
– Moving/loading big data
– People: finding, retaining, assigning to roles, training/growing, paying
– Outsourcing options: disk growth beyond your budget, need for services you can't provide

Normal "Funnel" Model of Big Data Analytic Workflows
– Assumption: all data "melts together" within the funnel

Security-Partitioned "Funnel" Model of Big Data Analytic Workflows
– Assumption: certain data must not be mixed
– How do you implement separation?
– Issues: what does this mean for the ability to aggregate and infer?

Other Security Issues
Parallel HW is often managed by 3rd parties for economic reasons:
– Should I expose my sensitive data to DBAs who don't work for me?
– What about other unknown/untrusted tenants of a rented HW infrastructure?
Standard encryption only addresses data at rest
– When a query hits the DBMS, the data becomes plaintext in RAM. A rogue cloud DBA can see all my "encrypted" data.
It's hard to map high-level policies onto detailed implementations; big data makes this worse
– E.g., "Books about the stock market cannot be checked out to freshmen"

Accumulo Data Sensitivity Labels
Key structure: Row ID; Column (Family, Qualifier, Visibility); Timestamp; Value
Label definition
– Labels (e.g., SECRET, NOFORN) are defined and applied on ingest
– Cryptographically bound to data
– Applied at the key level (i.e., to every value individually)
– See: accumulo.apache.org/1.4/user_manual/Security.html
Access:
– Database users are assigned labels; these are used to gain access when a user authenticates
Issues to consider:
– Admin overhead of defining and applying labels to every value
– Aligning heterogeneous label sets to realize possible sharing
– Label assurance

Lack of Metadata As Harmful

Metadata Challenges in Sponsor Ecosystems
1) Exploiting myriads of datasets with agility
– What columns link voice recordings to radar? When do they simultaneously exist in this table? Where are temperature readings?
2) Dealing with "shape-changing" data sources
– When the data format continually changes, how does my reader interpret serialized data instances without schema information?
3) Accurately matching analytics to datasets
– Analytic A requires column C1, derived by f8(). Does C1 exist for May? If C1 exists but was derived from f7(), it would be bad if A "fails silently"!
4) Rapidly incorporating unknown data sources
– Can I reuse the ingest & transformation code from other data sources?
5) Reasoning about the data (data scientist needs)
– Where are value distributions & trends over time (e.g., to test a hypothesis, to infer semantics, for process optimization)?
Theme: poorly understood datasets result in high overhead & degraded analytics

More Use Cases for Metadata
Our Big Data sponsors are obligated to know:
What data should be retained?
– Given the size of the data, all information can't be retained forever. Decisions about which data to retain, and which to let go, are currently made "off the cuff". Can we characterize data's use to support retention decisions?
Where did this data come from?
– Analysts writing reports need to know the source of the data so they can determine trustworthiness, legality, and dissemination restrictions, and potentially reference the original data object
Where does a class of data reside?
– This is largely a compliance and auditing function. A redacted use case: "Which of my systems currently house PII data? Do any systems house this data that aren't approved for it? Are my security controls working?" With increasing reliance on both public and private clouds, this grows increasingly challenging.
Where does a specific data item reside?
– If the lawyers call and say I need to get rid of a certain piece of intelligence, can I locate all copies of it? Who else did I send it to? If there is a breach at a cloud provider or partner, do I know which data items landed within their perimeter? This would enable more granular breach notifications.

What is Provenance?
A "family tree" of relationships
– Ovals = data, rectangles = processes
– Shows how data is used and reused
Basic metadata
– Timestamp
– Owner
– Name/description
Can also include annotations
– E.g., quality info
Provenance is not the actual data object

How is it Done Today?
The general approach is: "The developers just kinda know."
– This does not scale! (with variety … the under-served "V")
Some large companies are now developing point solutions as vast numbers of different data formats accumulate:
– Protobuf schema repository from Google
– Avro schema repository from LinkedIn
– Hive metacatalog (basis of HCatalog)
But these are not general & powerful "first principles" solutions
– Format-specific data model (e.g., HCatalog favors Hive)
– Typically focus only on the "SerDe" issue – a "poor man's metadata repository"
– https://issues.apache.org/jira/si/jira.issueviews:issue-html/AVRO-1124/AVRO-1124.html

Questions?

Next Topic in the Outline
Intro to Big Data and Scalable Databases
– Part 1: Big Data … Its Technologies & Analytic Ecosystem
– Part 2: An Introduction to Parallel Databases
– Part 3: Technological Innovations and MPP RDBMS

An Introduction to Parallel Databases
For Internal MITRE Use

Purpose of This Talk
Let's say you have a problem involving lots of data, and you can apply multiple processors:
– What can a database do for me?
– What databases are available? How do I pick?

Outline
– Taxonomy
– Software realities for parallel databases
– Systems engineering strategies

A Simple Taxonomy of Parallel Databases
"Clouds" are increasingly attractive computational platforms
– Traditional solutions don't automatically scale well to clouds; innovation is occurring rapidly
[Chart: databases plotted by maximum number of processors (1 to 1000+) versus data model structure (structured/relational, semi-structured, triples/key-value). Traditional RDBMSs sit at the structured, low-parallelism corner; parallel relational systems (Greenplum, Aster Data) scale structured data to ~100 processors; non-relational ("NoSQL") systems such as BigTable/HBase/Accumulo, MongoDB (document-oriented), and FlockDB occupy the less-structured, highly parallel region.]
Market trends
– Consolidation
– Hybrids
– Movement toward the "upper left"

A More Complex Taxonomy (451 Group)
Oh my!

Taxonomy Used in This Talk
– Key-value stores
– Semi-structured databases
– Parallel relational
– Graph databases & triplestores

A Short History of Key-Value Stores
– 2004: Google invents BigTable (now being replaced by Spanner: distributed transactions, SQL)
– 2007: HBase (open source BigTable): hbase.apache.org – large & growing user community; HDFS file system
– 2008: Facebook invents Cassandra – HBase data model, but a P2P file system; released open source
– 2010: Facebook enhances & adopts HBase internally
– 2011: NSA releases Accumulo open source: accumulo.apache.org – similar to HBase; includes data sensitivity labels
– 2012: Basho releases Riak: wiki.basho.com – web friendly; based on Amazon's Dynamo paper

Key-Value Store Data Model
Datasets are typically modeled as one very large table
Key: <row id, column id, version>
– Row id (canonical Google row id: a reversed URL)
– Column id: a static number of carefully designed column "families"; each family can have an unbounded number of columns
– Version: a timestamp; the database keeps a record of all previous values (update = append)
Query examples:
– Given a full key, return the value
– Given a column ID and a value, return all matching rows
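The key-value data model above can be sketched as a dictionary keyed by (row id, column id) whose entries accumulate (version, value) pairs, so an update appends a new version rather than overwriting. This is a toy in-memory illustration; the class, row ids, and data are hypothetical.

```python
class ToyKeyValueStore:
    """A toy BigTable-style store: one big table, versioned cells."""

    def __init__(self):
        self.cells = {}  # (row_id, column_id) -> list of (version, value)

    def put(self, row_id, column_id, version, value):
        # Update = append: previous versions are retained.
        self.cells.setdefault((row_id, column_id), []).append((version, value))

    def get(self, row_id, column_id):
        """Given a full key, return the latest value."""
        versions = self.cells.get((row_id, column_id), [])
        return max(versions)[1] if versions else None

    def match(self, column_id, value):
        """Given a column id and a value, return all matching row ids."""
        return sorted(row for (row, col) in self.cells
                      if col == column_id and self.get(row, col) == value)

store = ToyKeyValueStore()
store.put("com.example/www", "anchor:home", 1, "old title")
store.put("com.example/www", "anchor:home", 2, "new title")  # append, not overwrite
print(store.get("com.example/www", "anchor:home"))  # new title
```

Both query styles from the slide are covered: `get` resolves a full key to its most recent value, and `match` scans a column for a value, which is why real systems keep the one index on the key.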
Other Characteristics of Key-Value Stores
Performance: designed for scale-out
– One index on the key (faster than an HDFS scan); no optimizer
Cost: typically open source; need Hadoop / programming skills
– Cloudera support is ~$4K/node
Roles:
– Great fit for data you don't understand well yet (e.g., ETL): massive, rapidly arriving, highly non-homogeneous datasets; need for query by key; enriching by adding arbitrary columns
– Poor fit if you know exactly what your data looks like (you lose the schema)

HBase Table Creation Example
Create a table named test with a single column family named cf. Verify its creation by listing all tables, then insert some values.

hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'test'
..
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds

HBase Example
Verify the data insert by running a scan of the table:

hbase(main):007:0> scan 'test'
ROW     COLUMN+CELL
row1    column=cf:a, timestamp=1288380727188, value=value1
row2    column=cf:b, timestamp=1288380738440, value=value2
row3    column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds

Get a single row:

hbase(main):008:0> get 'test', 'row1'
COLUMN  CELL
cf:a    timestamp=1288380727188, value=value1
1 row(s) in 0.0400 seconds

Taxonomy Used in This Talk
– Key-value stores
– Semi-structured databases
– Parallel relational
– Graph databases & triplestores

A Short History of Semi-structured Databases
– 1980s: "object-oriented" DBs invented; didn't take off. Addressed the gap between relations & programming languages; good for data that is hard for RDBMSs: aircraft & chip designs
– 1995: the Stanford LORE project induces XML schema from data; coined the term "semi-structured" due to the flexible schema
– 2000s: "sharding" gave semi-structured databases new life. Now often called "document-oriented" (but not "Documentum"); a great list at en.wikipedia.org/wiki/Document-oriented_database
– 2009: open source MongoDB; 10gen support; JSON data model
– 2012: UCI Asterix project: www.cs.ucsb.edu/common/wordpress/?p=1533 – goal: an open source "Postgres-quality" flexible-schema DBMS

Semi-structured Database Data Model
Objects defined by a grammar (XML, JSON)
– One table per object type; optional attributes
– Tight programming language interface
– A good compromise between key-value and RDBMS
JSON example (JavaScript Object Notation):
– JSON provides syntax for storing and exchanging text information; JSON is smaller than XML and easier to parse
– Looks much like C, Java, etc. data structures

{ "employees": [
    { "firstName": "John",  "lastName": "Doe" },
    { "firstName": "Anna",  "lastName": "Smith" },
    { "firstName": "Peter", "lastName": "Jones" }
] }

– The employees object is an array of 3 employee records (objects)

Other Features of Semi-structured Databases
Speed: shards for scale-out; often a limited optimizer
Cost: some are free with few features; some cost $500K with many features
Killer app(s):
– Good fit for "like-but-varying" objects accessed similarly: would have used a relational database, but the objects aren't regular
– Rapid prototyping in a scientific lab
– "Cloud server" – serving objects used as web content

MongoDB Table Creation Example
Create a collection named library with a maximum of 50,000 entries.

> db.createCollection("library", { capped : true, size : 536870912, max : 50000 } )

Insert a book (a JSON object):

> p = { author: "F. Scott Fitzgerald",
        acquisitiondate: new Date(),
        title: "The Great Gatsby",
        tags: ["Crash", "Reckless", "1920s"] }
> db.library.save(p)

Retrieve the book:

> db.library.find( { title: "The Great Gatsby" } )
{ "_id" : ObjectId("50634d86be4617f17bb159cd"),
  "author" : "F. Scott Fitzgerald",
  "acquisitiondate" : "10/28/2012",
  "title" : "The Great Gatsby",
  "tags" : ["Crash", "Reckless", "1920s"] }

Taxonomy Used in This Talk
– Key-value stores
– Semi-structured databases
– Parallel relational (this is Irina's talk)
– Graph databases & triplestores

Example Systems
Key-value stores
– BigTable; HBase, Accumulo, Cassandra, Riak, … Many are "NoSQL" systems
Semi-structured
– MongoDB, CouchDB (JSON-like); GemFire (OQL); MarkLogic (XQuery, SQL); Asterix, …
Parallel relational
– Vertica, Greenplum, Aster Data, ParAccel, Teradata, Netezza, …
Graph databases & triplestores
– FlockDB (simple), "Big Linked Data", Titan (Gremlin/TinkerPop), Neo4j (Gremlin/TinkerPop, SPARQL), AllegroGraph (SPARQL)
(Legend in original: commercially available / proprietary / open source or research / open source with commercial support / open source, GOTS)

Outline
– Taxonomy
– Some important software realities for parallel databases: sharding, optimizers, data consistency
– Systems engineering strategies

A Simple Comparison of Properties
[Table: key-value stores, semi-structured databases, and parallel RDBMSs compared on five properties: sharding, optimizer, programming language integration, flexible data model, and data consistency.]
The Asterix system being developed at UCI intends to score highly on all 5 properties.
Sharding
"Sharding" maps one table into a set of distributed fragments
– Each fragment is located at a single compute node
Horizontal partitioning
– Shards are typically defined by key-range partitioning, but various hashing strategies are possible
– Speeds up parallel operations (e.g., search, summation)
Replication
– Multiple copies can be generated for each partition
– Speeds read access; improves availability
Issue: how do you shard graph data?
– Facebook does it randomly! (No good split exists.)
All parallel DBMSs shard data somehow.

Sharding Illustration
– Horizontal partitions by key range (0..30, 31..60, 61..90, 91..100), each with a primary and two secondary copies

Software Realities for Parallel Databases
– Sharding
– Optimizers
– Transactions & data consistency

Optimizers & Efficient Queries
Optimizers automatically rewrite user queries into an equivalent, more efficiently executable form
– Invented in the 70's to make SQL possible
– The crown jewels of commercial (one-node) RDBMSs!
Parallel databases can "scale out" to improve performance
– Want an order of magnitude speedup? 100 -> 1000 nodes!
– Many use a far simpler query language, if one at all (e.g., search by key): less need for, and benefit from, an optimizer
– Example: HBase provides one index, Bloom filters, and caching, but no optimizer
Parallel relational databases
– Can scale out, and also provide optimizers to get more done with fewer nodes
– Very sophisticated data migration primitives (moving shards to the computation if cheaper; managing solid state & disk, …)
Scale out and/or optimizers? It depends!
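The key-range sharding scheme illustrated above can be sketched in a few lines: each shard owns a contiguous key range and holds a primary plus replica copies, and writes are routed by key. The ranges match the illustration; the replication factor and data are hypothetical.

```python
import bisect

SHARD_UPPER_BOUNDS = [30, 60, 90, 100]  # shard i owns keys up to bound i
REPLICATION_FACTOR = 3                  # 1 primary + 2 secondaries

shards = [{"bound": bound, "replicas": [{} for _ in range(REPLICATION_FACTOR)]}
          for bound in SHARD_UPPER_BOUNDS]

def shard_for(key):
    """Route a key (0..100) to the shard owning its key range."""
    return shards[bisect.bisect_left(SHARD_UPPER_BOUNDS, key)]

def put(key, value):
    # The write goes to every replica, so any copy can serve reads
    # and the shard survives the loss of a node.
    for replica in shard_for(key)["replicas"]:
        replica[key] = value

put(42, "ship-position")
print(shard_for(42)["bound"])            # 60 -> key 42 lives in the 31..60 shard
print(shard_for(42)["replicas"][1][42])  # ship-position (read from a secondary)
```

A hash-partitioned variant would replace `bisect` routing with `hash(key) % len(shards)`; key-range partitioning is what makes range scans cheap.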
All rights reserved 85 Optimizing a Single-Node RDBMS Given 3 relations (tables) of data: Pilots, Flights, and Aircraft, with join conditions Pilots.name = Flights.pilot_name and Flights.aircraft_id = Aircraft.id Which pilots have flown prop-jets? (In SQL) SELECT DISTINCT Pilots.name FROM Pilots, Flights, Aircraft WHERE Pilots.name = Flights.pilot_name AND Flights.aircraft_id = Aircraft.id AND Aircraft.type = 'prop-jets' 86 Initial Query Execution Plan (total tuples processed: 30,012,060): scan Pilots (50) and Flights (10,000,000); join → 10,000,000 tuples; scan Aircraft (2,000); join → 10,000,000 tuples; select only prop-jets (0.1%) → 10,000; project the distinct pilot names → 10 (the answer) 87 Query Optimization: Improved Plan (total tuples processed: 30,062): select the prop-jet aircraft (0.1%) from Aircraft → 2; indexed retrieval of their Flights → 10,000; join → 10,000; scan Pilots (50); join → 10,000; project the distinct pilot names → 10 (the answer) Parallel DBMS Optimizer Comparison Key-value stores – typically do not optimize queries; rely on scale-out Semi-structured DBMSs – Typically a simple approach, also relying on scale-out – MongoDB tries to determine the best index when two are available Parallel RDBMSs – Typically provide sophisticated optimizers Migration; reasoning about the storage hierarchy – Greenplum migration primitives (www.greenplum.com/technology/optimizer): 1) Broadcast Motion (N:N) – Every segment sends target data to all others 2) Redistribute Motion (N:N) – Every segment rehashes the target data (by join column) and redistributes each row to the appropriate segment 3) Gather Motion (N:1) – Every segment sends the target data to a single node (usually the master) 88 Software Realities for Parallel Databases Realities: – Sharding – Optimizers – Transactions & Data Consistency 89 © 2012 The MITRE Corporation.
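The tuple counts on the two plan slides can be reproduced with simple arithmetic; the sketch below sums the tuples flowing through each operator (the per-operator breakdown is my reconstruction from the slides' cardinalities):

```python
# Cardinalities from the slides.
PILOTS, FLIGHTS, AIRCRAFT = 50, 10_000_000, 2_000
PROP_JET_FLIGHTS = FLIGHTS // 1000   # only 0.1% involve prop-jets
PROP_JET_AIRCRAFT = 2
ANSWER = 10                          # distinct pilot names

# Initial plan: join everything first, filter prop-jets last.
naive = sum([
    PILOTS,             # scan Pilots
    FLIGHTS,            # scan Flights
    FLIGHTS,            # Pilots-Flights join output
    AIRCRAFT,           # scan Aircraft
    FLIGHTS,            # second join output
    PROP_JET_FLIGHTS,   # select prop-jets
    ANSWER,             # project distinct names
])

# Improved plan: push the selection down and use indexes.
optimized = sum([
    PROP_JET_AIRCRAFT,  # select finds the 2 prop-jet aircraft
    PROP_JET_FLIGHTS,   # indexed retrieval of their flights
    PROP_JET_FLIGHTS,   # join output
    PILOTS,             # scan Pilots
    PROP_JET_FLIGHTS,   # second join output
    ANSWER,             # project distinct names
])

print(naive, optimized)   # 30012060 30062
```

Pushing one selection below the joins cuts the work by roughly three orders of magnitude — this is the kind of rewrite an optimizer finds automatically.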
All rights reserved Global Data Consistency (Illustration: an update of +1 arrives at one copy of a shard, leaving its other replicas with stale values until they are synchronized.) Given updates to replicated data shards, how do you keep them all consistent? Classic DB theory solution: – Two-phase commit (2PC): all vote; if all say yes, then all commit – Nice, but communication is costly in a global data-center network! – Thus, Amazon has been happy to sometimes sell a book it doesn't have. Eventual consistency (a hallmark of early "NoSQL") – No guarantee of "snapshot isolation" – Over time, replicas converge despite node failures & network partitions – Many different flavors / implementations (e.g., HBase, Cassandra) – See also: www.cs.kent.edu/~jin/Cloud12Spring/HbaseHivePig.pptx Google just invented "Spanner" (~2PC!) – Global consistency via atomic clocks/GPS (not everyone has these); reduces communication 90 Outline Taxonomy Software realities for parallel databases Systems engineering strategies 91 Systems Engineering Strategy You can often get by with just one parallel database – a key-value store for ETL, and some BI – a parallel RDBMS for BI, and as a cloud server – or no DBMS (e.g., just use HDFS) … But one size is NOT the best fit for all – Sweet spots exist for each type – This is different from the relational era! 92 Roles In The Funnel Workflow Model 1) Ingest diverse raw data sources: text, sensor feeds, semi-structured (e.g., web content, email) 2) Transform, clean, subset, integrate, and index new datasets. Enrich: extract entities, compute features & metrics. 3) Generate more structured datasets as needed: RDBMS tables, objects, triple stores 4) Generate & explore user-facing analytic models (data cubes, graphs, clusters). Drill down to details.
1) Key-value stores: Manage & query ETL datasets, compute metrics 2) Semi-structured DBs: Persist / query generated objects 3) Parallel RDBMSs, Graph DBs: Support BI queries, graph exploration, … Some Systems Engineering Strategies 1) Tunnel vision: – Use one type of DBMS & just live with its shortcomings if/when you encounter them 2) Optimal assignment: – Pick the best one for each type of workload you will encounter – It takes skill to know how to pick, mix, and match up front! 3) Keep your eye on it: – Look at user experiences (forums) and best practices – Pick initial system(s) that look right & be ready to learn as you go – May migrate to a more "final" system over time – Google and Facebook are doing this all the time! BigTable to Caffeine to Spanner; Cassandra to (customized) HBase 94 Questions? 95 © 2012 The MITRE Corporation. All rights reserved