New Bases for New Data
Omar Benjelloun
Stanford University
Omar Benjelloun - New Bases for New Data
January 27th, 2006

Relational databases are great
A simple, understandable model for data

    Boss
    Emp    Manager
    Joe    Bill
    Bill   Steve

High-level, declarative language for queries and updates: SQL
Efficient optimization techniques
Relational databases are the cornerstone of the management of homogeneous, regular, exact, centralized information

… but data has changed
• Data is distributed, behind applications, dynamically changing
• Data is heterogeneous
• Data may be uncertain
Today
• Data is stored in relational databases (or XML)
• Techniques for data integration, data exchange …
• Lots of code
Traditional Database Management Systems (DBMS’s) are too rigid
New characteristics should be represented in the data
New bases are needed
• Foundations (models and languages)
• Processing and optimization techniques

Applications
Information integration
• Data is distributed on multiple heterogeneous, independent sources
• Conflicting information from the sources: inconsistency, uncertainty
• Varying and evolving reliability of sources
• Where data came from can be critical information
Scientific data management
Receptor (e.g., sensor) data management
Data cleaning (entity resolution)
And many others…

Agenda
Distributed and dynamic data: Active XML
• A “glue” language to connect data and programs
• XML documents with embedded calls to Web services
• Distributed interactions through the exchange of AXML data
• Techniques to query and control the exchange of AXML data
Uncertain data: ULDB’s
• An extension of the relational model with uncertainty and lineage
• Efficient query evaluation
• Computing probabilities
Conclusion

Active XML

Distributed data management
Information
is everywhere
[Figure: data warehouses, databases, Web sites, PCs, PDAs, cell phones, home appliances, cars… connected through the Internet, exchanging XML via Web services]

The golden triangle of distributed data management
XML: a standard for data representation & exchange
• Extensible Markup Language
• Labeled ordered trees
• Rich types: XML Schema
Query languages
• XPath, XQuery
Web services
• Standards for distributed computing
[Figure: triangle linking XML, query languages (XPath, XQuery), and Web services (SOAP, WSDL)]

What is Active XML (AXML)?
AXML is a declarative language for distributed information management and an infrastructure to support this language, in a peer-to-peer framework.

Active XML documents
XML documents with embedded calls to Web services
Intensional
• Some of the data is given explicitly
• Some is given intensionally (i.e., the means to acquire data when needed are given)
Dynamic
• If the external sources change, the same document will provide different information
• Reaction to world changes

Not a new idea in databases, nor on the Web
Mixing calls and data is an old idea
• Procedural attributes in relational systems
• Basis of object-oriented databases
In Web programming
• Sun’s JSP, PHP+MySQL
Calls to Web services inside documents
• Macromedia FLEX, Apache Jelly, Microsoft XAML
What is new is the exploitation of the idea…

Web services in brief
A number of standards
• XML
• SOAP: Exchange of messages between applications
• WSDL: Description of service interfaces (e.g., input/output types)
• UDDI: Advertisement and discovery of services
• … other proposed standards (choreography, security, etc.)
For us: means to provide, invoke and describe remote functions with XML input/output.
They make AXML documents universally understandable.
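A minimal sketch of the intensional-data idea described above, under stated assumptions: the service registry, the service name `Weather.GetTemp`, and the function names are hypothetical, and a local Python function stands in for a remote Web service (a real AXML peer would send a SOAP request). Here a known call is replaced by its result, which is only one of the possible materialization policies.

```python
import xml.etree.ElementTree as ET

def get_temp(call):
    """Hypothetical local stand-in for a remote temperature service."""
    temp = ET.Element("temp")
    temp.text = {"Paris": "16°C"}.get(call.findtext("city"), "unknown")
    return [temp]

# Hypothetical registry: service name -> function returning a list of elements.
SERVICES = {"Weather.GetTemp": get_temp}

def materialize(node, services):
    """Replace every <call> to a known service by the elements it returns."""
    new_children = []
    for child in list(node):
        handler = services.get(child.get("svc")) if child.tag == "call" else None
        if handler is not None:
            new_children.extend(handler(child))  # explicit data replaces the call
        else:
            materialize(child, services)         # unknown calls stay intensional
            new_children.append(child)
    for child in list(node):
        node.remove(child)
    node.extend(new_children)
    return node

doc = ET.fromstring(
    '<newspaper><title>Le Monde</title>'
    '<call svc="Weather.GetTemp"><city>Paris</city></call>'
    '<call svc="Events.GetEvents">exhibits</call></newspaper>'
)
materialize(doc, SERVICES)
```

After the call to `materialize`, the document holds an explicit `temp` element, while the call to the unregistered events service remains embedded and can still be invoked later, by this peer or by whoever receives the document.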
A sample AXML document
<?xml version="1.0" ?>
<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp">
    <city>Paris</city>
  </call>
  <call svc="TimeOut.GetEvents">
    exhibits
  </call>
</newspaper>
[Figure: tree view of the document: newspaper with children title (“Le Monde”), date (“06/10/2003”), GetTemp (city: “Paris”), GetEvents (“Exhibits”)]
AXML documents may contain calls:
• to any existing Web services (e-bay.net, google.com…)
• to any AXML Web services (to be defined)

Materialization
<?xml version="1.0" ?>
<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <temp>16°C</temp>
  <call svc="Yahoo.GetTemp">
    <city>Paris</city>
  </call>
  <call svc="TimeOut.GetEvents">
    exhibits
  </call>
</newspaper>
[Figure: tree view of the document after a SOAP call to Yahoo!: a temp node (“16°C”) appears next to GetTemp]
• Replacing the call by its result is not the only option
• Calls are not necessarily RPC-style synchronous invocations

AXML Web services
Parameters: AXML data
Result: AXML data
Great flexibility
Distribute computations: by sending as parameters data containing service calls, one can delegate some work to other peers.
Partial computations: by returning data containing service calls, one can give the receiver control of these calls.

Distributed interactions

Exchanging Active XML

To call or not to call?
[Figure: tree view of the newspaper document with temp (“16°C”), GetTemp, and GetEvents, and a call to Yahoo!]
Materialization can be performed by the sender, before sending a document… or by the receiver, after receiving it.

Why control the materialization of calls?
For added functionality, e.g.
• Intensional data makes it possible to get up-to-date information.
For security reasons or capabilities, e.g.
• I don’t trust this Web service/domain,
• I don’t have the right credentials to invoke it,
• It costs money,
• Maybe the receiver doesn’t know Active XML!
For performance reasons, e.g.
• A proxy can invoke all the services on behalf of a PDA.
… and many more reasons you can think of!

How to control it? Using types
We extend XML Schema with intensional types: XMLSchema_int
[Figure: a sender and a receiver, each with capabilities, ACL, and cost, agree on a data exchange schema; the exchanged documents contain calls to services f, g, q, r]
Static analysis algorithms use signatures of services: WSDL_int

The extended schema language
To simplify, we use here a DTD-like syntax
Data:
  newspaper = title.date.(GetTemp|temp).(GetEvents|exhibit*)
  title = data
  date = data
  temp = data
  city = data
  exhibit = title.(GetDate|date)
Functions:
  GetTemp(city) -> temp
  GetEvents(data) -> (exhibit|performance)*
  GetDate(title) -> date
[Figure: tree view of the newspaper document with title, date, GetTemp, and GetEvents]
Rewriting: replace call(s) by an arbitrary output of the service.

Rewritings
The Goal: Given
• an AXML document d
• a schema s,
Can we rewrite d so that it matches s?
Safe rewriting: one that for sure leads to s (we know without making any call)
Possible rewriting: one that may lead to s (depending on the answers of services)

Difficulties
Infinite search space
• Vertical
• Horizontal
Main problem
• The result of a Web service call is unknown
• We just know a signature (input/output types)
We want a very efficient solution
Foundations of the problem
• String & tree automata, with existential and universal transitions.
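To make the distinction between safe and possible rewritings concrete, here is a brute-force sketch for the newspaper example. It is not the automata-based algorithm of the talk: it only considers depth-1 rewritings, and it enumerates service outputs up to a small bound (`MAX_OUTPUT_LEN`, an assumption), so "safe" is only checked against that bounded sample. Types are encoded as Python regular expressions over space-terminated symbols; all function names are hypothetical.

```python
import re
from itertools import product

# Output types from the "Functions" part of the schema, as regexes over
# space-terminated symbols (an encoding chosen for this sketch).
OUTPUT_TYPES = {
    "GetTemp": r"temp ",
    "GetEvents": r"((exhibit|performance) )*",
}
ALPHABET = ["temp", "exhibit", "performance"]
MAX_OUTPUT_LEN = 2  # assumption: outputs are only sampled up to this length

def encode(word):
    return "".join(symbol + " " for symbol in word)

def outputs(fn):
    """All words of length <= MAX_OUTPUT_LEN in the output type of fn."""
    pattern = re.compile(OUTPUT_TYPES[fn])
    return [list(w)
            for n in range(MAX_OUTPUT_LEN + 1)
            for w in product(ALPHABET, repeat=n)
            if pattern.fullmatch(encode(w))]

def classify(word, target):
    """Return 'safe', 'possible' or 'none' for depth-1 rewritings of word."""
    target_re = re.compile(target)
    calls = [i for i, s in enumerate(word) if s in OUTPUT_TYPES]
    best = "none"
    # Try every decision: for each call, keep it or invoke it.
    for invoke in product([False, True], repeat=len(calls)):
        options = []
        for i, symbol in enumerate(word):
            if i in calls and invoke[calls.index(i)]:
                options.append(outputs(symbol))  # any output may come back
            else:
                options.append([[symbol]])       # symbol kept as is
        results = [
            bool(target_re.fullmatch(encode([s for part in combo for s in part])))
            for combo in product(*options)
        ]
        if all(results):
            return "safe"       # matches whatever the services answer
        if any(results):
            best = "possible"   # matches for some answers only
    return best

word = ["title", "date", "GetTemp", "GetEvents"]
no_calls = r"title date temp (exhibit )*"
keeps_events = r"title date temp (GetEvents |(exhibit )*)"
```

With these definitions, `classify(word, keeps_events)` finds a safe rewriting (invoke GetTemp, keep GetEvents), whereas `classify(word, no_calls)` only finds a possible one, because GetEvents may return a performance.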
Results
The general problem is undecidable [MSS03]
Restrictions on the considered rewritings
• Left-to-right: no “going back and forth”
• K-depth: bound on the nesting of function calls
(Search space still infinite but finitely representable)
Under these restrictions
• We have algorithms to find safe/possible rewritings.
• They are PTIME (for deterministic schemas).
• We can also do it between schemas.
Implementation
• Demo at VLDB 2003 (customizable news syndication)

Safe rewriting algorithm (flavor)
Build an FSA that accepts all k-depth rewritings of the initial word.
[Figure: automaton with states q0 to q7 and transitions labeled title, date, GetTemp, temp, GetEvents, exhibit, performance]
Build an FSA that recognizes the complement of the target type.
[Figure: automaton with states p0 to p6 and transitions labeled title, date, temp, GetEvents, exhibit, and * (any other symbol)]

Safe rewriting algorithm
Compute the intersection of these languages:
[Figure: product automaton over pairs of states such as (q0,p0), (q1,p1), (q2,p2), (q3,p3), …, with transitions labeled title, date, GetTemp, temp, GetEvents, exhibit, performance]
A smart marking determines whether a safe rewriting exists.
Then run the word on the marked automaton to find an actual rewriting.
Optimizations:
• lazy construction of the automata
• parallel evaluation of calls

Querying Active XML

Querying AXML Data
Given a (tree pattern) query:
/newspaper[temp > 18°C]/exhibits//exhibit[location=“Le Louvre”]
Materialize the document?
Call only the services that may contribute data to the query answer.
[Figure: tree view of an AXML newspaper document with title (“Le Monde”), temp (“19°C”), exhibits, and embedded calls GetTemp, GetEvents, GetExhibits, getDate]
The problem: Lazy evaluation of service calls
To call or not to call, this time when evaluating a query

Lazy evaluation
Difficulties:
• Calls can be found everywhere in the document
• May appear dynamically (as a result of previous calls)
• May become (ir)relevant due to previous invocations
• Need to take signatures of calls into consideration
A possible approach: modify the query processor
• Top-down evaluation
• Trigger the calls found on the way
• Not so great:
  – Computation is blocked
  – Optimization opportunities are lost

NFQ’s
Given a query to evaluate:
[Figure: tree pattern: newspaper with branches temp > 18°C and exhibits // exhibit / location “Le Louvre”]
Derive a set of “node-focused” queries (NFQ), that find the relevant calls when evaluated on the document.
[Figure: example NFQ’s, tree patterns in which * nodes stand for calls, etc.]
Need to be reevaluated, as the document evolves!

Optimizations
Service calls sequencing
• Analysis of the relationship between calls (through the NFQ’s)
• Layering, and parallelization inside each layer.
Filtering by type analysis
• Match output types of services to the data expected by queries
“Pushing” queries to capable services
Acceleration:
• Via relaxation:
  – NFQ approximation
  – Superset of the relevant calls
• Via a special access structure, similar to a DataGuide:
  – Restricted to paths that lead to service calls
  – Indexes the calls
Experimental assessment
• 10x speed-up when combining optimizations

There is more…
The AXML peer system
• Manages persistent AXML documents
• Provides AXML services
• Open source
Language extensions to control the activation of calls
Continuous services
Theoretical foundations
…check out http://www.activexml.net

Uncertain data

Basic Premise
Traditional relational DB
• Every data item’s value must be exact
• Every data item is in the database or not
• Where data came from and how it evolves is not important
ULDB’s relax these constraints by making
1. Data
2. Uncertainty
3. Lineage
all first-class interrelated concepts

Previous work
Models for uncertainty
• Labeled nulls, c-tables, probabilistic models,...
Trade-off between
• Expressiveness
• Simplicity of representation, complexity of operations
• We investigated this space in [DBHM06]
Models for lineage
• In relational databases, data warehouses
• Definition of lineage can be tricky for complex queries
First to consider lineage together with uncertainty

Uncertainty
[Figure: relation SAW(Witness, Car), annotated with the terms “alternate” and “x-tuple”; entries include (Granny, VW), (Granny, BMW), and (Cop, Ford)]
[Figure: the Cop x-tuple, with alternates Ford and VW and a ‘?’ (“maybe”) annotation]
Possible worlds (as shown on the slide):
  {(Granny, VW), (Cop, Ford)}
  {(Granny, BMW), (Cop, Ford)}
  {(Granny, VW), (Cop, VW)}
  {(Granny, BMW), (Cop, VW)}
Simple formalism
• not complete
• not closed under joins

Lineage

    SAW                 OWNS
    Witness  Car        Suspect  Car
    Granny   VW         Chris    VW
    Cop      Ford       Chris    BMW
                        Mike     VW
                        Mike     Ford

    ACCUSES (witness, suspect)
    Witness  Suspect
    Granny   Chris
    Granny   Mike
    Cop      Mike

ULDB’s

    SAW                 OWNS
    Witness  Car        Suspect  Car
    Granny   VW         Chris    VW
    Granny   BMW        Chris    BMW
    Cop      Ford       Mike     VW
                        Mike     Ford

    ACCUSES
    Witness  Suspect
    Granny   Chris
    Granny   Mike
    Cop      Mike
    Granny   Chris

Several entries carry a ‘?’ annotation on the slide.

Properties
ULDB’s are simple
• x-tuples: set of alternate tuples, with or without ‘?’
• lineage: associates with each alternate a set of alternates / external symbols
ULDB’s are expressive
• Complete: can represent any finite set of possible worlds (with lineage)
• Simple implementation of monotonic queries, with correct lineages
• Natural probabilistic extension
ULDB’s are efficient
• Query processing can use existing query optimizers
• Tuple certainty/membership can be tested in polynomial time

Query processing

Querying ULDB’s
[Figure: diagram relating a ULDB D and its query result Q(D), computed by the algorithm, to the possible worlds D1, D2, …, Dn (relational databases with lineage) and their results Q(D1), Q(D2), …, Q(Dn), defined by the query semantics]
Q(Di): add query result as new relation and lineage to Di

Algorithm
[Figure: the algorithm illustrated on SAW (x-tuples for Granny, Cop, and Kid, some marked ‘?’) and OWNS]
[Figure: the resulting ACCUSES(Witness, Suspect) relation, derived from SAW and OWNS on witness, suspect, with ‘?’ annotations on some x-tuples]

Properties
Efficient algorithm
• Query processing phase can use standard query optimizer
• Lineages are easy to propagate
• “Grouping” phase requires a single pass on the result
Initial prototype
• represents a ULDB as a relational DB
• uses simple query rewriting techniques
Algorithm works for any monotonic query (including SPJU queries)

Probabilities

Probabilistic ULDB’s
[Figure: relation SAW(Witness, Car) with a probability attached to each alternate]
Semantics: As before, with a probability for each possible world
Without lineages
• Alternates of the same x-tuple correspond to disjoint events
• Alternates of different x-tuples correspond to independent events
Lineages
• Capture correlations
• Help propagate probabilities for query results

Probabilistic query answering
Compute queries as before
Compute probabilities on demand
• Traverse lineages transitively to the leaves
• Combine probabilities of reached alternates
[Figure: lineage graph whose leaf alternates are annotated with probabilities and whose derived alternates are marked ‘?’]
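The two steps above (traverse lineage to the leaves, then combine the probabilities of the reached alternates) can be sketched as follows. This is an illustration under stated assumptions, not the prototype's code: lineage is taken to be purely conjunctive, base x-tuples are independent, alternates of one x-tuple are disjoint, and all identifiers and probability values are hypothetical.

```python
from math import prod

# Base alternates, identified by (x-tuple name, alternate index).
# The probabilities are hypothetical.
BASE = {
    ("saw1", 1): 0.3, ("saw1", 2): 0.7,    # Granny saw a VW | a BMW
    ("owns1", 1): 0.6, ("owns1", 2): 0.4,  # Chris owns a VW | a BMW
}

# Lineage: derived alternate -> the alternates it was derived from.
LINEAGE = {
    ("accuses1", 1): {("saw1", 1), ("owns1", 1)},
    ("accuses2", 1): {("saw1", 2), ("owns1", 2)},
    ("both", 1): {("accuses1", 1), ("accuses2", 1)},  # hypothetical second level
}

def leaves(alt, lineage):
    """Follow lineage transitively down to base alternates."""
    if alt not in lineage:
        return {alt}
    found = set()
    for parent in lineage[alt]:
        found |= leaves(parent, lineage)
    return found

def probability(alt, lineage, base):
    """Probability that all base alternates behind `alt` hold together."""
    chosen = {}
    for xtuple, index in leaves(alt, lineage):
        if chosen.setdefault(xtuple, index) != index:
            return 0.0  # two alternates of one x-tuple are disjoint events
    # Distinct x-tuples are independent, so their probabilities multiply.
    return prod(base[(xtuple, index)] for xtuple, index in chosen.items())
```

For example, `probability(("accuses1", 1), LINEAGE, BASE)` multiplies 0.3 and 0.6, while `probability(("both", 1), LINEAGE, BASE)` is 0 because its lineage requires two different alternates of the same x-tuple. Collecting leaves into a set is what keeps a shared ancestor from being counted twice.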
Optimizations: memoize probabilities, efficiently detect ‘closest independent ancestors’

Future work
Richer queries
• Duplicate elimination, difference, aggregation
• Supported through new kinds of lineages (e.g., disjunctive, negative)
• Querying the uncertainty and the lineage
More operations
• Updates (and their lineage), close to versioning
• “Uncertain operations”, e.g., entity resolution, inconsistency repairs
More optimization techniques
More theory

Conclusion

New “Bases” for new data
The database way
• Simple models
• Declarative languages
• Optimization techniques
… for new features of data
• Distribution and decentralization: Active XML
• Uncertainty and lineage: ULDB’s
There are more challenges
• Real-world side effects, semantic reasoning and strong requirements
• Security, privacy, personalization
Big challenge: Doing it all in a coherent way
• One “big” model?
• Integration of models?

Thank you