Large-Scale Data Integration Systems Exporting and Interactively Querying Web Service-Accessed Sources: The CLIDE System Developer Application Domain Application Integration Engineer Integration Domain • CNET • PCWorld Application Compatible Combinations of Computers, Routers and Printers Integrated Schema Mediator Michalis Petropoulos Source Domain Web Service Web Web Service Service Data Source Source Data Schema Source • Dell Computers by CPU • Cisco Routers by Rate • HP Printers by Speed … Source Owner Source Schema Computer Portals … • Dell Computers • Cisco Routers • HP Printers Database Seminar, February 2010 2 Large-Scale Data Integration Systems Developer Parameterized Views Application Domain Application Integration Engineer Running Example What queries can the mediator answer for me? CLIDE Application Schema Schema Computers(cid, cpu, ram, price) NetCards(cid, rate, standard, interface) Integration Domain Source Domain Mediator Views Integrated Schema V1 ComByCpu(cpu) → (Computer)* SELECT DISTINCT Com1.* FROM Computers Com1 WHERE Com1.cpu=cpu Web Service Web Web Service Service Data Source Source Data Schema Source Computers V3 RouWired() → (Router)* for a given cpu SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wired' Computers & NetCards → (Router)* for a given V4 cpuRouWireless() & rate V2 ComNetByCpuRate(cpu, rate) → (Computer, NetCard)* … Source Owner Source Schema Routers(rate, standard, price, type) Views SELECT DISTINCT Com1.*, Net1.* FROM Computers Com1, Network Net1 WHERE Com1.cid=Net1.cid AND Com1.cpu=cpu AND Net1.rate=rate … 3 Wired Routers Wireless Routers SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wireless' Conjunctive Queries CQ • • Equality & Comparison Conditions Parameters 4 1 Sophisticated Mediators Make Feasible Queries Hard to Predict Running Example Integrated Schema Feasible Queries FQ • Equivalent CQ query rewritings using the views • Might involve more than one views • Order might matter Developer Application Mediator V1 V2 Integrated Schema V3 V4 Query: Infeasible Feasible Get all Computers ‘P4’ Computers, together with their NetCards and their compatible ‘Wireless’ Routers • Integrated schema puts together the Dell and Cisco schemas Computers.* NetCards.* Routers.* A123 P4 512 400 A123 10 .11b USB 10 .11b 50 Wireless B123 P4 1024 550 B123 54 .11g USB 54 .11g 120 Wireless Attribute Associations • (Computers.cid, NetCards.cid) • (NetCards.rate, Routers.rate) • (NetCards.standard, Routers.standard) B Routers.* 10 .11b 50 Wireless 54 .11g 120 Wireless Computers.* NetCards.* 512 400 A123 10 .11b USB B123 P4 1024 550 B123 54 .11g USB MediatorA123 P4 A C Cisco ComNetByCpuRate(‘P4’, ‘10’) V1 ComNetByCpuRate(‘P4’, ‘54’) V4 V2 5 Problem D Mediator RouWireless() Dell E 6 The CLIDE Solution CLIDE 1. Large number of sources 2. Large number of views (web-services) 3. Mediator capabilities Developer Application Developer formulates an application query Is an application query feasible? If not, how do I know which ones are feasible? Mediator Integrated Schema A query formulation interface, which interactively guides the developer toward feasible queries by employing a coloring scheme Previous options: – The developer had to browse the view definitions and somehow formulate a feasible query – Or formulate queries until a feasible one is found (trial-and-error) V1 V2 Dell No system-provided guidance 7 V3 V4 Cisco 8 2 QBE-Like Interfaces CLIDE Interface Microsoft SQL-Server Last/Next Step Table Alias Selection Boxes Feasibility Flag Table Boxes Projection Box • Table, selection, projection and join actions • Feasibility Flag • Color-based suggestions 9 10 Example Interaction Example Interaction Snapshot 1 Snapshot 2 Yellow required action Blue required choice of action price C – At least one Mediator feasible query cannot beram formulated 512 400 1024 550 unless this action is performed – All feasible queries require this action White optional action – Feasible queries can be formulated w/ or w/o these actions A cid cpu ram price A123 P4 512 400 B123 P4 1024 550 ComByCpu(‘P4’) B V1 11 12 3 Example Interaction Example Interaction Snapshot 3 Snapshot 4 • * any other constant • Red prohibited action Join Lines: • Only yellow and blue are displayed • Must appear in Attribute Associations – Does not appear in any feasible query – Lead to “Dead End” state 13 Example Interaction 14 Demo Snapshot 5 ram price rate interface price 512 400 10 USB 50 1024 550 54 USB 120 F Mediator A Routers.* 10 .11b 512 Wireless 54 .11g 1024 Wireless RouWireless() V4 B Computers.* NetCards.* A123 P4 512 400 A123 10 .11b 50 B123 P4 1024 550 B123 54 .11g 120 D ComNetByCpuRate(‘P4’, rate) E V2 15 16 4 CLIDE Properties Interaction Graph • Completeness of Suggestions Com1 Com1.ram Com1.price Table Action Join Action Com1.cpu=‘P4’ Net1 Com1.cid=Net1.cid Rou1 … … … … … … … … … … – Every feasible query can be formulated by performing yellow and blue actions at every step Selection Action • Summarization of Suggestions – At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness • Rapid Convergence By Following Suggestions • • • • – The shortest sequence of actions from a query to any feasible query consists of suggested actions Nodes are queries: One for each q∈CQ Edges are actions: Table, selection, projection and join actions Green nodes are feasible queries Infinitely big structure – All CQ queries – All possible combinations of actions formulating them 17 Interaction Graph: Colorable Actions 18 Interaction Graph: Colors • Yellow action α – Every path from current node n to a feasible node contains α • Blue action α Com1.cid Com1.cpu … Current Node – At least one feasible query cannot be formulated unless this action is performed (summarization) … • Red action α Com1.cid – No path to a feasible node contains α … Current Node Com1.cpu … Com1.cid=* … Com1.cid=* … Com1.cpu=* … Net1 Net1.rate=’54Mbps’ … … … Net1.rate=’54Mbps’ Com1.cid=Net1.cid … … … Com1.price=* Net1 Com1.cid=Net1.cid … … Com1.ram=* … … Com1.price=* … • Colorable actions AC label outgoing edges of the current node Current Node … … Com1.cpu=* Com1.ram=* Rou1 Com1.cid=Net1.cid Net1.rate=Rou1.rate … … … … … Com1.cpu=* … Net1 Rou1 … … Rou1 Com2 Com2 … Com2 Com2.cid=Net1.cid Com2.cpu=‘P4’ Net1.rate=‘54Mbps’ … … … … … 19 20 5 Color Determined By a Finite Set of Feasible Queries CLIDE Architecture Challenge: Infinitely Many Feasible Queries Actions Front-End Current Query Colored Actions + Feasibility Flag … User ius Rad Back-End Color Algorithm Closest Feasible Queries FQC Seed Queries SQ Parameters Algorithm Schemas Views … … … … Closest Feasible Queries Algorithm Aliases Collapse Rule … n Closest Feasible Queries FQC Maximally-Contained Rewriter ? … Solution: Closest Feasible Queries FQC Minimal Feasible Extension Queries • FQC is sufficient to color actions in AC • Theorem: Set of Closest Feasible Queries is Finite Column Associations Challenge: How far can the Closest Feasible Queries FQC be? • Back-End invoked every time the user performs an action Solution: Based on Maximally Contained Queries FQMC – i.e., the user arrives at a new node in the interactions graph 21 Maximally Contained Queries FQMC 22 Closest Feasible Queries FQC Algorithm Challenge: How far can the Closest Feasible Queries FQC be? Solution: Maximally Contained Queries FQMC Maximally Contained Query Query: Q1 Get all Computers Query: Q2 Get all Computers with a given cpu s adiu pL R Closest… Feasible Queries FQC Not Maximally Maximally Contained Contained Query Query: Q4 Q3 Get all Computers with a given ram cpu & ram … n … Maximally Contained Queries FQMC … … … … • Compute maximally contained queries FQMC • Theorem: All FQC queries are reachable via a path of length p ≤ pL • The radius pL is the longest path to a maximally contained query • Assuming fixed SELECT clause (projection list) • Covered extensively in literature – MiniCon, Bucket, InverseRules Algorithms • FQMC is finite 23 24 6 Closest Feasible Queries FQC Algorithm Closest Feasible Queries FQC Algorithm Challenge: Find the Closest Feasible Queries Solution: Collapse Aliases … Closest Feasible Queries FQC n … Maximally Contained Feasible Queries FQMC … Closest Feasible Queries FQC … … n … Maximally Contained Feasible Queries FQMC … … … … … … More feasible nodes • Theorem: All queries in FQMC are in FQC • But not all queries in FQC are in FQMC • Collapse Aliases to compute FQC \ FQMC • Check satisfiability 25 Color Algorithm 26 CLIDE Implementation & Optimizations Maximally-Contained Rewriter Front-End Yellow and Blue • An action α is colored based on which closest feasible queries it appear in • Yellow, if α appears in all queries in FQC • Blue, if α appears in at least one (but not all) query in FQC Current Query Minimal Feasible ExtensionColored QueriesActions FQME + Feasibility Flag + Maximum Projection Lists Back-End Redundant Actions Removal Color Algorithm Maximally-Contained Feasible Extension Queries + Maximum Projection Lists Seed Queries SQ Redundant Queries Removal Parameters Algorithm Feasible Extension Queries Closest Feasible Queries FQC + Maximum Projection Lists White and Red • Attach Maximum Projection Lists to Closest Feasible Queries Views Expansion Closest Feasible Queries Algorithm Aliases Collapse Rule Maximally-Contained Feasible Queries over Views Minimal Feasible + Containment Mappings Extension Queries Maximally-Contained Rewriter MiniCon – Projections that can be added to a feasible query, without compromising feasibility Containment Mappings Logging Column Schemas Views Associations • Projection α is white if in the maximum projection list • Color selections based on projections • Views expansion introduce redundancy – Affects CLIDE’s rapid convergence and summarization • Efficient containment test crucial to redundancy removal 27 28 7 CLIDE Performance CLIDE Performance Chains of Stars – No Parameters Chains of Stars – No Parameters 4000 4000 112 Views 168 Views 224 Views 280 Views 3000 Time (ms) 2500 Rest of Algorithms Containment Testing 3500 3000 Time (ms) 3500 2000 1500 1000 500 2500 2000 1500 1000 500 0 (2,1) (2,2) (4,3) (4,4) (6,5) (6,6) (8,7) (8,8) 0 (10,9) (10,10) (12,11) (12,12) (14,13) (14,14) 7720 142 7820 143 7920 (# of Joins, # of Selections) In Current Query 144 8020 145 8120 # of Containment Tests B B2Bi1 i BiM A 30 CLIDE Performance Chains of Stars – With Parameters Chains of Stars – With Parameters 4000 1600 1600 112 Views 140 Views 168Views Views 196 224 Views 252 Views 280Views Views 308 Rest of Algorithms Containment Testing 1400 1200 1000 Time (ms) Time Time(ms) (ms) 148 29 CLIDE Performance 1500 600 1000 400 500 200 00 8320 C C1C i1 A-span = 7 i CiM B-span = 3 C1CL Selections = 4,6,8,10 BK 2500 1000 2000 800 147 … … … … … … …… B B11 3000 1200 8220 C1 • Schema Queries Views 3500 1400 146 800 600 400 (2,1) (2,1) (2,2) (2,2) (4,3) (4,3) (4,4) (4,4) (6,5) (6,6) (6,6) (8,7) (6,5) (8,7) (8,8) (8,8) (10,9) (10,9) (10,10) (10,10) (12,11) (12,11) (12,12) (12,12)(14,13) (14,13)(14,14) (14,14) (# of Joins, # of Selections) In Current Query (# of Joins, # of Selections) In Current Query 200 0 1696 C1 A … … …… B B11 B B2Bi1 i BiM BK 156 1697 157 1698 158 1699 159 1700 # of Containment Tests 160 1701 161 1702 162 … … … … • Schema Queries Views C C1C i1 A-span = 7 i CiM B-span = 3 C1CL Selections = 4,6,8,10 31 32 8 CLIDE Summary First interactive query formulation interface based on source and mediator capabilities Applicability • Service-Oriented Architectures • Privacy-Preserving Services Contributions • Interaction Guarantees: Rapid Convergence, Completeness, Summarization of Suggestions • Interaction Graph • Back-End Algorithms – Closest Feasible Queries, Colors, Parameters • Modular, Customizable Architecture http://www.clide.info 33 9