The Sibdata Revolution
Nick Roussopoulos
DCS & UMIACS & Univ. of Maryland
September 2009

Data Management: Past to Current
• Structured Data
  Emp
    ename: Gary, Shirley, Christos, Robin, Uma, Tim
    sal: 30K, 35K, 37K, 22K, 30K, 12K
    dept: toy, candy, shoe, toy, shoe
  Dept
    dept: candy, toy, men, shoe
    floor: 1, 2, 2, 1
    mgr: Irene, Jim, John, George
• Structured architectures
  [Figure: clients, processors, memory]

Data Management: Huh???

The Landscape
Bell’s Law: Every decade, a new, lower cost, class of computers emerges, defined by platform, interface, and interconnect
• Mainframes: 1960s
• Minicomputers: 1970s
• Microcomputers/PCs: 1980s
• Web-based computing: 1990s
• Devices (smart phones, PDAs, wireless sensors, RFID): 2000s
Enabling a new generation of applications that mandate new data management methods & tools.

Data Then and Now
• The “Data Industrial Revolution”: Data used to be “hand-crafted”, now it’s generated by computers!!!
• The Data Integration quagmire: 40 years of continuous successes (sic) and still a long way to the end.
• Structure provides crucial understanding for making data usable and leads to discovery/innovation.

Data Streaming Data Explosion
• Exponential data growth
• New challenges: continuous, interconnected, distributed, physical
• Shrinking business cycles
• More complex decisions
[Figure labels: PoS System, Barcodes, Phones, RFID, Inventory, Transactional Systems, Clickstream, Telematics, Sensors]

The Structure Spectrum
• Structured data (schema-first)
  • Regular, known, conforming, …
  • e.g., relational database
• Unstructured data (schema-never)
  • Freeform, irregular
  • e.g., plain text, images, audio, …
• Semi-structured data (schema-later)
  • Provides structural information, but less constrained
  • e.g., XML, tagged text/media

Data Integration
• Integration is the ultimate schema-first problem.
• Requires complete understanding & disambiguation
• Structure (semantics) is both a key enabler and a key impediment here.

Structured Data: How Much?
• Conventional wisdom: ~20% of data is structured currently.
• Consumer apps, enterprise search, and multimedia apps are placing downward pressure on this.

State of the Art: Integration-in-the-large
• Team work, huge & expensive effort, excruciating pain
• Extremely long time lag between data generation and availability
• Custom-coded implementations that are often unsuccessful
• Clearing house of already discovered knowledge (the high overhead is for disambiguating the semantics of the heterogeneous data)

Future: Integration-in-the-small
• End-user, limited in scope, requires training
• Continuous as the data sources and equipment evolve
• End-user tools are needed
• Small cost, enormous opportunity for discovery and innovation

Sibling Data
• Aggregation and naming of disparate data regardless of location
• Includes actual data, references to external data, queries that generate data, & programs to process data
• May include other sibdata
• Open vs Closed
  • Open: continuous accumulation
  • Closed: fixed snapshot (archival)
• Location-independent semantics

Web search results

Content vs URL
• Content
• http://www.michaelmoore.com/

Deep-Web Queries
    SELECT m.title
    FROM Yahoo_Movies m
    WHERE m.title LIKE '%Moore%';

Result vs. Query
• Results are associated with the time the query was run
• Queries can be captured in sibdata and executed at will; such a sibdata is open and captures a different result each time the query executes

Queries to Relational Databases
[Figure: Yahoo_Actors]

Sibdata
• Deal with all the data from everywhere & in whatever form they come
• Data co-existence: no integrated schema, no single warehouse
• Expand-as-you-go
• Integrate little by little as you need
• ETL: data mapping and integration as you add more data

Sibdata Properties
• Lightweight
  • Metadata captures the encapsulation, name, and provenance data
• Location-independent
  • Accessible from anywhere
• Isolated
  • Generated with no interference
• Durable
  • Persists until dropped
• Secure
  • Guarantees security defined by the creators and sources
  • Composes multiple levels of security over its components

Comparison to Transactions
• Transactions
  • Grouping of many actions into an atomic transaction (ACID properties)
  • Substrate: database
• Sibdata
  • Grouping of data into an atomic sibdata (LLADS)
  • Substrate: actions/transactions/data generators

Sibdata Infrastructure

Sibdata Servers
• Establish a global sibdata ID and name
• Create and maintain metadata with provenance, users, security, etc.
• Provide a searchable catalog
• Provide storage for non-sib-compliant data sources
• Fault tolerance (replication)

Sib Protocols
• Establish Sibdata protocol
• Concurrency-Consistency issues (?)
• Sharing of data
• Name conventions
• Dispute resolution
• Distributed logging
• Security using chits
• Group and multi-valued ownership and visibility

User Interface
• Simple OS support
• Query languages
• Graphical languages
• ETL tools
• Extra functionality
  • High-dimensional indexing
  • Mining

Conclusions
• Need to build Sib Infrastructure
• Refine the sibdata semantics
• Refine the security protocols
  • For data aggregates
  • User groups
• Great opportunities for innovation

Presentations & Project
• 3 x 7 students = 21 presentations, ~2 per lecture
• Lecture dates
  • Sep: 15, 22, 29
  • Oct: 6, 13, 20, 27
  • Nov: 3, 10, 17, 24
  • Dec: 1, 8
• Project: Proposal due Sep 29
• Discussion: Every lecture, be prepared to give a 2-3 min progress report, papers found, etc.

Network Data Independence (Hellerstein, Berkeley)
• Physical Data Independence
  • Decoupling data from layout (not hard-coded applications)
  • Permits reorganization of data w/o affecting the apps
• Declarative query languages
  • Using the schema
• Distributed Databases
  • Transparency hides location from the user, who acts as if he is accessing a centralized database
  • Limited number of sites; cannot expand to handle mobility and constant change of the configuration

Pillars of Data Independence
• Indexes offer indirection, allowing modification of the underlying structure
  [Figure: table R, occurrence file]
• Schema-based and declarative query languages & optimization

Sibdata Independence
• Encapsulation of dissimilar data
• Data can be moved, rearranged, altered
• Additional indices on top of sibdata become part of the sibdata
• Naming and provenance data are fixed
  • Do not change to the outside world
• Containment information (sibdata encapsulation within other sibdata) is guaranteed

DHT (Chord)
• Data-centric distribution
  • According to content: total data independence
  • Very large number of distributed servers
• Configuration changes rapidly (although this may not be really that important)
• Fault-tolerance (extra machines)
• Limited to single-key searches (not range or join queries)

Network Names & Services
• Internet Indirection Infrastructure (i3)
  • Triggers (id, r), where id = global ID and r is an address to forward packets
  • When a mobile user moves to r’, he modifies his trigger to (id, r’)
  • It also supports 1-to-n mappings (anycast)
• Content Distribution Networks (Akamai)
  • Replicates heavy data (images, videos) to multiple sites and redirects user accesses to those that are closer (indirection via location independence)

Relevant DB Technologies
• Distributed Aggregation
  • Monitor networks (collecting stats)
  • Compute synopses and pass them along
• Adaptive execution plans
  • Feedback to the execution
  • Commutative tasks to avoid extended delays
• Range search over DHT
  • Trie hashing
  • Still limited
• P2P & Mobile Databases

Pier: A P2P in situ Query Engine
Goals
• Massively distributed processing
• Scalability
• Relaxed consistency (best effort)
Architecture
• P2P built on top of DHT
• Multicast to all related nodes (lscan)
• Pipelining the intermediate results

Pier Joins
• Stored in DHT
  • namespace = relation (NR, NS)
  • resourceID = primary key (PK)
  • instanceID = tuple # if not a PK
  • Assume R and S are already DHT-hashed using <NR, PKR, 1> and <NS, PKS, 1>
• Symmetric Join building phase
  • lscan NR and NS; eliminate unqualified tuples and unneeded attributes
  • Rehash all remaining tuples using
    • namespace NQ
    • resourceID = R.pkey*S.pkey
    • Tuples are tagged with relation name
• Symmetric Join probing phase
  • Probing in parallel with building (with callbacks) locally
  • Satisfying tuples are either sent to the Qsite or DHT-ed for the pipelined op
  • Consumes a lot of bandwidth

Better Joins
• Fetch Matches
  • Hash only S
  • lscan R and fetch NS tuples
• Rewriting Join using 2-way semijoin
  • Project R & S on their PK and joining attribute
  • Do symmetric join on these projections
• Rewriting Join using Bloom filters
  • Create and DHT the Bloom filters
  • Do lscan and access the Bloom filter to eliminate non-joinable tuples

Conclusions for Pier
• P2P brings massive parallelism
• Repetitive data comparison over DHT brings along massive waste of bandwidth
• Smarter in situ distillation (2-way semijoins, Bloom filters) works better
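
Example: Bloom-filter join rewrite (sketch)
The Bloom-filter rewrite listed under Better Joins can be illustrated with a small single-process sketch. This is not Pier's code: the class BloomFilter, the function bloom_join, the filter sizes, and the sample rows are all hypothetical, and the DHT is replaced by ordinary in-memory lists. It shows only the core idea: build a filter over the join values of S, drop the R tuples that cannot join during the local scan, and run the exact join on the survivors.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative assumptions, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def bloom_join(r_rows, s_rows, key):
    """Join r_rows and s_rows on `key`, pre-filtering R with a Bloom filter.

    Returns (r_candidates, joined_pairs). In a distributed setting the
    filter would be shipped instead of the S tuples; here everything is local.
    """
    # Step 1: summarize the join values of S.
    s_filter = BloomFilter()
    for s in s_rows:
        s_filter.add(s[key])

    # Step 2: local scan of R, discarding tuples that cannot join.
    r_candidates = [r for r in r_rows if s_filter.might_contain(r[key])]

    # Step 3: exact hash join on the survivors removes false positives.
    s_index = {}
    for s in s_rows:
        s_index.setdefault(s[key], []).append(s)
    joined = [(r, s) for r in r_candidates for s in s_index.get(r[key], [])]
    return r_candidates, joined


# Made-up sample rows, for illustration only.
r_rows = [{"pk": 1, "k": "a"}, {"pk": 2, "k": "b"},
          {"pk": 3, "k": "c"}, {"pk": 4, "k": "a"}]
s_rows = [{"pk": 10, "k": "a"}, {"pk": 11, "k": "d"}]

candidates, pairs = bloom_join(r_rows, s_rows, "k")
```

A Bloom filter never produces false negatives, so no joinable tuple is lost in step 2; false positives only cost extra candidates that step 3 discards. The bandwidth saving claimed on the slide comes from sending the compact filter rather than the tuples themselves.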