CSE 636 Data Integration
XML Distributed Query Processing
Slides by Yannis Papakonstantinou

Overview
• The Virtual XML View Approach towards Data Integration
• Query Processing in XML Mediators
  – Issues Overview
  – An Algebra-Based Architecture
  – Navigation-driven Evaluation

Data Integration Requirements in eBusiness Applications
• It starts with… “Provide to customers, partners, employees Application X”, where X may be in Business Intelligence, Customer Support, …
• Then the problem comes up… “The application uses information assets widely distributed across my enterprise”
• If only… “Give to the application a single place to go to access all the information required. Requirements are evolving so make sure the system can be easily maintained and upgraded”

View-Based Approach: Wrappers Export Basic Source Views
[Figure: a Client Application queries an Integrated (XML) View exported by the Mediator. The Mediator sits on top of two (XML) Views, each exported by a Wrapper: one over the Customers Rel. DB, one over the Orders Rel. DB. The customer_table view is shown both as a tree and as XML text.]
  <customer_table>
    <customer>
      <name>John</name>
      <id>56</id>
      <city>Chicago</city>
    </customer>
    <customer>
      <name>George</name>
      <id>58</id>
      <city>Chicago</city>
    </customer>
    …
  </customer_table>

Wrappers Export Basic Source Views
[Figure: the same architecture, with the order_table view shown as a tree.]
  order_table
    order: id 1034, cid 56, item chips
    order: id 1567, cid 56, item salsa
    …

Mediators Export Integrated Views, Tailored to Application Needs
[Figure: the same architecture, with the integrated view shown above the source views customer_table (customer: name John, id 56, city …) and order_table (order: id 1034, cid 56, item …).]
  customers
    customer: name John, id 56, city Chicago
      orders
        order: id 1034, item chips
        order …
    customer …
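The wrapper and mediator views described on these slides can be illustrated with a small Python sketch. It is not the system's implementation: the row data mirrors the slide examples, while the function names `wrap_table` and `integrated_view` are hypothetical. The sketch also builds the integrated view eagerly, only to show its shape; in the virtual view approach the mediator would compute it on demand.

```python
import xml.etree.ElementTree as ET

# Hypothetical relational rows standing in for the two source databases.
CUSTOMER_ROWS = [
    {"name": "John", "id": "56", "city": "Chicago"},
    {"name": "George", "id": "58", "city": "Chicago"},
]
ORDER_ROWS = [
    {"id": "1034", "cid": "56", "item": "chips"},
    {"id": "1567", "cid": "56", "item": "salsa"},
]


def wrap_table(table_name, row_name, rows):
    """Wrapper (assumed behavior): export a table as a basic XML source view."""
    table = ET.Element(table_name)
    for row in rows:
        row_el = ET.SubElement(table, row_name)
        for column, value in row.items():
            ET.SubElement(row_el, column).text = value
    return table


def integrated_view(customer_table, order_table):
    """Mediator (assumed behavior): nest each customer's orders under it."""
    customers = ET.Element("customers")
    for cust in customer_table.findall("customer"):
        out = ET.SubElement(customers, "customer")
        for tag in ("name", "id", "city"):
            ET.SubElement(out, tag).text = cust.findtext(tag)
        orders = ET.SubElement(out, "orders")
        for order in order_table.findall("order"):
            # Join condition: the order's cid equals the customer's id.
            if order.findtext("cid") == cust.findtext("id"):
                order_out = ET.SubElement(orders, "order")
                for tag in ("id", "item"):
                    ET.SubElement(order_out, tag).text = order.findtext(tag)
    return customers


customer_view = wrap_table("customer_table", "customer", CUSTOMER_ROWS)
order_view = wrap_table("order_table", "order", ORDER_ROWS)
view = integrated_view(customer_view, order_view)
print(ET.tostring(view, encoding="unicode"))
```

The integrated view drops the `cid` column and exposes orders nested under their customer, which is the "tailored to application needs" reshaping the mediator slide shows.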
Virtual Views: Query-Driven Mediator Operation
[Figure: query flow from the Application through the Mediator to the two Wrappers.]
• Application to Mediator: Find all Chicago customer names, along with their ordered items
• Mediator to the Wrapper of the Customers Database: Retrieve Chicago customer names and id’s
• Mediator to the Wrapper of the Orders Database: Retrieve all cid’s and item names of orders

On-Demand (Query-Driven) Mediator Operation
[Figure: result flow from the two Wrappers through the Mediator to the Application.]
• From the Wrapper of the Customers Database: customers with customer (name John, id 56), …
• From the Wrapper of the Orders Database: order (cid 56, item chips), order (cid 56, item salsa), …
• From the Mediator to the Application: customer (name John, ordered_items: item chips, item salsa), customer …

Multiple Plans are Possible
• Retrieve customers
• For each customer find matching orders

A New Kind of Query Processing Problem
• Build and Run “Optimal” Plan
  – Consisting of operators that
    – Collect source info using supported queries and commands
    – Combine info into XML result

Challenges in Query Processing & Optimization
• Operate within the Limited and Different Capabilities of the Sources
  – Describe sets of supported queries
  – Use most efficient supported queries
• Optimize plans/queries sent to sources
  – Estimate Costs of Plans
  – Adapt Plans Along the Way
  – Beyond Conjunctive Queries
  – Compose Queries/Views Efficiently
• Schema inference & optimization
• Combine navigation & querying

From Limited Wrappers to Efficient Plans for Extended Query Sets
• Answering Queries Using Views
• But with Infinite Sets of Views
• Increasing Relevance due to Web Services
[Figure labels: all queries over schema; Queries supported by mediator; Queries supported by wrapper; Source Data & Schema (for each of two sources).]

Challenges in Query Processing & Optimization
• Operate within the Limited and Different Capabilities of the Sources
  – Describe sets of supported queries
  – Use most efficient supported queries
• Optimize plans/queries sent to sources
  – Estimate Costs of Plans
  – Adapt Plans Along the Way
  – Beyond Conjunctive Queries
  – XQuery processing
• Schema inference & optimization
• Combine navigation & querying
  – Build iterator models for low
    memory footprint

Navigation-Driven Evaluation of Query Result
[Figure: the source views customer_table (customer: name John, id 56, city Chicago; customer: name George, id 58; …) and order_table (order: id 1034, cid 56, item chips; order: id 1567, cid 56; …) next to the integrated customers view (customer: name John, id 56, city Chicago, orders with order: id 1034, item chips and order …; customer …).]

Navigation-Driven Evaluation
[Figure: the Client navigates the result through a pointer p, using the commands down(p) and right(p). The Lazy Mediator holds the view definition ans = q( s1 … sn ). Its input is the client navigations; its output is the source navigations it sends to the XML sources s1 ... sn.]
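The lazy mediator idea, client navigations in and source navigations out, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the algorithm of the cited papers: the view q is a hypothetical one that simply lists the top-level children of every source under one `ans` root, and the classes `XMLSource` and `LazyMediator`, the pointer layout, and the `navigations` counter are all invented for the example.

```python
class XMLSource:
    """An XML source that is explored only through navigation commands."""

    def __init__(self, name, tree):
        self.name = name
        self.tree = tree        # nested (label, [children]) tuples
        self.navigations = 0    # how many down/right commands were served

    def root(self):
        return (self.tree, None, 0)   # pointer = (node, parent, index)

    def down(self, p):
        """Pointer to the first child of p, or None if p is a leaf."""
        self.navigations += 1
        node = p[0]
        return (node[1][0], node, 0) if node[1] else None

    def right(self, p):
        """Pointer to the next sibling of p, or None if there is none."""
        self.navigations += 1
        _, parent, index = p
        if parent is None or index + 1 >= len(parent[1]):
            return None
        return (parent[1][index + 1], parent, index + 1)

    def label(self, p):
        return p[0][0]


class LazyMediator:
    """Virtual view ans = q(s1 ... sn) for the hypothetical concatenation q."""

    ROOT = "ans-root"

    def __init__(self, sources):
        self.sources = sources

    def root(self):
        return self.ROOT

    def label(self, p):
        if p == self.ROOT:
            return "ans"
        i, sp, _depth = p
        return self.sources[i].label(sp)

    def _first_child_from(self, i):
        # Probe sources i, i+1, ... until one has a top-level child.
        for j in range(i, len(self.sources)):
            src = self.sources[j]
            sp = src.down(src.root())
            if sp is not None:
                return (j, sp, 1)
        return None

    def down(self, p):
        if p == self.ROOT:
            return self._first_child_from(0)
        i, sp, depth = p
        child = self.sources[i].down(sp)
        return None if child is None else (i, child, depth + 1)

    def right(self, p):
        if p == self.ROOT:
            return None
        i, sp, depth = p
        sibling = self.sources[i].right(sp)
        if sibling is not None:
            return (i, sibling, depth)
        # Past the last top-level child of source i: continue in source i+1.
        return self._first_child_from(i + 1) if depth == 1 else None


s1 = XMLSource("s1", ("customer_table", [("customer", []), ("customer", [])]))
s2 = XMLSource("s2", ("order_table", [("order", [])]))
mediator = LazyMediator([s1, s2])
p = mediator.down(mediator.root())
print(mediator.label(p), s1.navigations, s2.navigations)
```

After the single `down` call only `s1` has been contacted; `s2` receives its first navigation command when the client moves right past the last child contributed by `s1`.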
Mixing Querying & Navigation
[Figure: the integrated customers view (customer: name John, id 56, city Chicago, orders with order: id 1034, item chips and order …; customer …), with a query issued at a visited node: “Find details of all salsa orders below visited node”.]

Challenges in Mixing Querying & Navigation
• Two-dimensional navigation
  – Reminiscent of cursors, but there are multiple continuation points
• Controlling size + shape
• Contextualizing queries by navigation

An Algebra-Based Query Processor Architecture
[Figure: components and flows, in the order shown: the Client sends XQuery and Navigation Requests and receives Results; XQuery and XQuery Views; Translation to Algebra; Algebra Plan; Rewriter/Optimizer, with Source Description and Function Description as inputs; Physical Algebra Plan; Plan Execution Engine, with Source Schemas & Types and Functions; Queries & Fetch Requests to Sources.]

Query Processing on Tuple-Oriented Algebra Enables…
• Well-known efficient physical implementations of the operators
• Join optimization
• Nested data by nested plans or group-by
• Efficient iterator model

XQuery: Queries & Views for XML
  <customers> {
    for $cust in document("db")/customer
    return
      <customer> {
        $cust/id,
        for $order in document("db")/order
        where $order/cid = $cust/id
        return <order> { $order/id } </order>
      } </customer>
  } </customers>

Access and Navigation
[Figure: a plan read bottom-up, with the tuples each operator produces.]
• source db, [$db1]: binds $db1 to the root ct of source db (customer_table with customer: name John, id 56 and customer: name George, id 58)
• getD $db1, customer → $cust: tuples ($db1, $cust) = (ct, c1), (ct, c2)
• getD $cust, id → $cust_id: tuples ($db1, $cust, $cust_id) = (ct, c1, i1), (ct, c2, i2)

Simplification Using Schema Inference
Since $cust_id $cust and $cust is “useless” otherwise
[Figure: the simplified plan, read bottom-up.]
• source db, [$db1]: binds $db1 to ct
• getD $db1, customer/id → $cust_id: tuples ($db1, $cust_id) = (ct, i1), (ct, i2)

Nested Plans
[Figure: an apply operator with parameters $part, p and output $orders runs the nested plan p once per partition $part; inside p, nestedSrc $part supplies the partition's tuples. The input tuples are ($db1, $cust_id) = (ct, i1), (ct, i2); the output adds $orders = [o11…], [o21…].]
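The `source` and `getD` operators of the Access and Navigation slide, and the schema-inference simplification that merges two `getD` steps into one, can be approximated with Python generators over tuples of variable bindings. This is a sketch under assumptions: the operator semantics are inferred from the binding tables on the slides, and the helpers `source`, `get_d`, `plan_two_step` and `plan_simplified` are hypothetical names, not the platform's API.

```python
import xml.etree.ElementTree as ET

# Stand-in for source db, using the customer data from the slides.
DB = ET.fromstring(
    "<customer_table>"
    "<customer><name>John</name><id>56</id></customer>"
    "<customer><name>George</name><id>58</id></customer>"
    "</customer_table>"
)


def source(root, var):
    """source db, [$var]: emit a single tuple binding var to the source root."""
    yield {var: root}


def get_d(tuples, in_var, path, out_var):
    """getD $in, path -> $out: one output tuple per node reached via path."""
    for t in tuples:
        for node in t[in_var].findall(path):
            yield {**t, out_var: node}


def plan_two_step():
    """Plan of the Access and Navigation slide: binds $cust, then $cust_id."""
    tuples = source(DB, "$db1")
    tuples = get_d(tuples, "$db1", "customer", "$cust")
    return get_d(tuples, "$cust", "id", "$cust_id")


def plan_simplified():
    """Simplified plan: one getD with the composed path customer/id."""
    tuples = source(DB, "$db1")
    return get_d(tuples, "$db1", "customer/id", "$cust_id")


for t in plan_simplified():
    print(t["$cust_id"].text)
```

Both plans yield the same `$cust_id` bindings; the simplified one never materializes `$cust`, which is safe only when nothing else in the plan uses it. Because the operators are generators, each tuple is produced only when the consumer asks for it, in the spirit of the iterator model mentioned above.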
Joins and Selections
[Figure: plan fragment with the operators source db, [$db2]; getD $db2, order → $order; getD $order, cid → $cust_id2; getD $order, id → $order_id; nestedSrc $part supplying ($db1, $cust_id) = (ct, i1); and the condition labeled “$cust_id2=?” combining the two inputs into tuples over $cust_id, $db1, $db2, $order, $cust_id2, $order_id.]

Constructors
[Figure: a chain of constructor operators, read bottom-up, with example tuples.]
• crList $order_id → $oidL: $order_id = o1, o2 gives $oidL = [o1], [o2]
• crEl order, $oidL → $oidE: gives $oidE = e1, e2, where e1 is an order element containing o1 and e2 an order element containing o2
• listify $oidE → $orders: gives $orders = [e1, e2]

Algebra Example
[Figure: algebra plan; no text recoverable.]

Plan Decomposition
• Within Rewriting Optimizer
• Rules replacing “leaf” trees
• May move commutable parts
• Catch: No projection limitation

Plan After Decomposition
[Figure: algebra plan; no text recoverable.]

Replacing Nested Plans with GroupBy/Outerjoin Combinations
[Figure: two plans side by side, each topped by apply with parameters $part, p and output $R over a nested plan p3 that reads nestedSrc $part. Other labels shown: for, groupBy, S(p1), p1, p2.]

Multiple Possible Plans
[Figure: algebra plans; no text recoverable.]

Building Navigation-Driven Evaluation on the Algebra
[Figure: the Client on top of an operator plan whose leaves are two Source access operators, each over a Source.]

Think of Each Operator as a Lazy Mediator
[Figure: the operator getD $cust, id → $cust_id. Its input tuples ($db1, $cust) = (ct, c1), (ct, c2) and output tuples ($db1, $cust, $cust_id) = (ct, c1, i1), (ct, c2, i2) are each drawn as a tree: a root with tuple children whose attributes $db1, $cust, $cust_id point into customer_table (customer: name John, id 56; customer: name George, id 58).]

Navigation-Driven Evaluation of Operators
Augmented with
• nextTuple(p)
• p.attr
[Figure: a Lazy Operator. Its input is the client navigations on its result; its output is the source navigations it issues on s1 ... sn, each the Result of Operator below.]
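A lazy operator that answers `nextTuple(p)` and `p.attr` by navigating the operator below it might look like the following Python sketch. It is an illustration under assumptions: the classes `ListSource` and `LazySelect`, the integer tuple pointers, and the `fetches` counter are hypothetical, and the order with id 2001 is made-up data added so that the selection has something to skip.

```python
class ListSource:
    """Bottom operator: serves tuples from a list and counts each fetch."""

    def __init__(self, rows):
        self.rows = rows
        self.fetches = 0

    def _at(self, i):
        if i >= len(self.rows):
            return None
        self.fetches += 1
        return i                      # a tuple pointer is just an index here

    def first_tuple(self):
        return self._at(0)

    def next_tuple(self, p):          # plays the role of nextTuple(p)
        return self._at(p + 1)

    def attr(self, p, name):          # plays the role of p.attr
        return self.rows[p][name]


class LazySelect:
    """Selection attr_name = value, evaluated only as far as the client navigates."""

    def __init__(self, below, attr_name, value):
        self.below = below
        self.attr_name = attr_name
        self.value = value

    def _skip(self, p):
        # Advance in the operator below until a qualifying tuple is found.
        while p is not None and self.below.attr(p, self.attr_name) != self.value:
            p = self.below.next_tuple(p)
        return p

    def first_tuple(self):
        return self._skip(self.below.first_tuple())

    def next_tuple(self, p):
        return self._skip(self.below.next_tuple(p))

    def attr(self, p, name):
        return self.below.attr(p, name)


orders = ListSource([
    {"$order_id": "1034", "$cust_id2": "56"},
    {"$order_id": "2001", "$cust_id2": "58"},   # hypothetical extra order
    {"$order_id": "1567", "$cust_id2": "56"},
])
select = LazySelect(orders, "$cust_id2", "56")
p = select.first_tuple()
print(select.attr(p, "$order_id"), orders.fetches)
```

Asking for the first result tuple fetches a single tuple from the operator below; the non-matching order is fetched and skipped only if the client later calls `next_tuple`.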
Use of Semantic Id’s in Navigation-Driven Evaluation
[Figure: a navigation command r/d(<f1, f2, …, fn>) sets the Operator State to V1: f1, V2: f2, …, Vn: fn, Other: …. The operator then proceeds down/right and produces <f’1, f’2, …, f’n>, with Operator State V1: f’1, V2: f’2, …, Vn: f’n, Other: ….]

Fragments Reduce the “Set State” – “Produce State” Overhead
[Figure: a fragment of the result tree: root, a customer with name, “John”, and an order with oid, 123 and three lineitem children. Hole 1, Hole 2 and Hole 3 mark the parts not included in the fragment.]
[Figure: the tree after a further fragment: a second order with ordnum=16 and two lineitem children appears next to the first order; the holes shown are Hole 1, Hole 3, Hole 4 and Hole 5.]

Controlling the Size and Shape of Fragments
[Figure: the Client with a listify operator, a Client-Server Interaction Controller, and the server-side plan topped by listify over two Source access operators, each over a Source.]

Fragment Size causes Memory Footprint causes Performance

Fragmentation Strategies
• Fixed Fragment Size
  – Ideal for depth-first, left-to-right navigation
• Adaptive Fragment Size
  – Assign larger pieces to those who use them

Response Performance for Breadth-First and Depth-First
[Plots: Depth First traversal; Breadth First traversal.]

References
• Navigation-Driven Evaluation of Virtual Mediated Views
  – Bertram Ludäscher, Yannis Papakonstantinou, Pavel Velikhov
  – EDBT 2000
• Architecture and Implementation of an XQuery-based Information Integration Platform
  – Yannis Papakonstantinou, Vasilis Vassalos
  – IEEE Data Eng. Bull. 25(1), 2002
• XML queries and algebra in the Enosys integration platform
  – Yannis Papakonstantinou, Vinayak R. Borkar, Maxim Orgiyan, Konstantinos Stathatos, Lucian Suta, Vasilis Vassalos, Pavel Velikhov
  – Data Knowl. Eng. 44(3), 2003