  Exporting and Interactively Querying

advertisement
Large-Scale Data Integration Systems

Exporting and Interactively Querying
Web Service-Accessed Sources:
The CLIDE System
Developer
Application
Domain


Application

Integration
Engineer
Integration
Domain
•  CNET
•  PCWorld
Application
Compatible Combinations
of Computers, Routers
and Printers
Integrated
Schema
Mediator
Michalis Petropoulos

Source
Domain
Web
Service
Web
Web
Service Service
Data
Source
Source
Data
Schema
Source
•  Dell Computers by CPU
•  Cisco Routers by Rate
•  HP Printers by Speed
…
Source
Owner
Source
Schema
Computer
Portals
…
•  Dell Computers
•  Cisco Routers
•  HP Printers
Database Seminar, February 2010
2
Large-Scale Data Integration Systems

Developer
Parameterized Views
Application
Domain

Application

Integration
Engineer

Running Example

What queries can the
mediator answer for me?
CLIDE
Application
Schema
Schema
Computers(cid, cpu, ram, price)
NetCards(cid, rate, standard, interface)
Integration
Domain
Source
Domain
Mediator
Views
Integrated
Schema
V1 ComByCpu(cpu) → (Computer)*
SELECT DISTINCT Com1.*
FROM Computers Com1
WHERE Com1.cpu=cpu
Web
Service
Web
Web
Service Service
Data
Source
Source
Data
Schema
Source
Computers
V3 RouWired() → (Router)*
for a given cpu
SELECT DISTINCT Rou1.*
FROM Routers Rou1
WHERE Rou1.type='Wired'
Computers & NetCards
→ (Router)*
for a given V4
cpuRouWireless()
& rate
V2 ComNetByCpuRate(cpu, rate) →
(Computer, NetCard)*
…
Source
Owner
Source
Schema
Routers(rate, standard, price, type)
Views
SELECT DISTINCT Com1.*, Net1.*
FROM Computers Com1, Network Net1
WHERE Com1.cid=Net1.cid
AND Com1.cpu=cpu
AND Net1.rate=rate
…
3
Wired
Routers
Wireless
Routers
SELECT DISTINCT Rou1.*
FROM Routers Rou1
WHERE Rou1.type='Wireless'
Conjunctive Queries CQ
• 
• 
Equality & Comparison Conditions
Parameters
4
1
Sophisticated Mediators Make
Feasible Queries Hard to Predict
Running Example
Integrated Schema

Feasible Queries FQ
•  Equivalent CQ query rewritings using the views
•  Might involve more than one views
•  Order might matter

Developer
Application
Mediator
V1
V2
Integrated
Schema
V3
V4
Query: Infeasible
Feasible
Get all Computers
‘P4’ Computers, together with their NetCards
and their compatible ‘Wireless’ Routers
•  Integrated schema puts together
the Dell and Cisco schemas
Computers.*
NetCards.*
Routers.*
A123 P4 512 400 A123 10 .11b USB 10 .11b 50 Wireless
B123 P4 1024 550 B123 54 .11g USB 54 .11g 120 Wireless
Attribute Associations
•  (Computers.cid, NetCards.cid)
•  (NetCards.rate, Routers.rate)
•  (NetCards.standard, Routers.standard)
B
Routers.*
10 .11b 50 Wireless
54 .11g 120 Wireless
Computers.*
NetCards.*
512 400 A123 10 .11b USB
B123 P4 1024 550 B123 54 .11g USB
MediatorA123 P4
A
C
Cisco
ComNetByCpuRate(‘P4’, ‘10’)
V1
ComNetByCpuRate(‘P4’, ‘54’)
V4
V2
5
Problem
D
Mediator
RouWireless()
Dell
E
6
The CLIDE Solution
CLIDE
1.  Large number of sources
2.  Large number of views (web-services)
3.  Mediator capabilities


Developer
Application
Developer formulates an application query
  Is an application query feasible?
  If not, how do I know which ones are feasible?
Mediator
Integrated
Schema
 A query formulation interface,
which interactively guides the
developer toward feasible queries
by employing a coloring scheme
Previous options:
–  The developer had to browse the view definitions and
somehow formulate a feasible query
–  Or formulate queries until a feasible one is found
(trial-and-error)
V1
V2
Dell
No system-provided guidance
7
V3
V4
Cisco
8
2
QBE-Like Interfaces
CLIDE Interface
Microsoft SQL-Server
Last/Next Step
Table Alias
Selection Boxes
Feasibility Flag
Table Boxes
Projection Box
•  Table, selection, projection and join actions
•  Feasibility Flag
•  Color-based suggestions
9
10
Example Interaction
Example Interaction
Snapshot 1
Snapshot 2
Yellow  required action
Blue  required choice of action
price
C
–  At least one Mediator
feasible query cannot
beram
formulated
512
400
1024 550
unless this action is performed
–  All feasible queries require this action
White  optional action
–  Feasible queries can be formulated
w/ or w/o these actions
A
cid cpu ram price
A123 P4 512 400
B123 P4 1024 550
ComByCpu(‘P4’)
B
V1
11
12
3
Example Interaction
Example Interaction
Snapshot 3
Snapshot 4
•  *  any other constant
•  Red  prohibited action
Join Lines:
•  Only yellow and blue are displayed
•  Must appear in Attribute Associations
–  Does not appear in any feasible query
–  Lead to “Dead End” state
13
Example Interaction
14
Demo
Snapshot 5
ram price rate interface price
512 400 10
USB
50
1024 550 54
USB
120
F
Mediator
A
Routers.*
10 .11b 512 Wireless
54 .11g 1024 Wireless
RouWireless()
V4
B
Computers.*
NetCards.*
A123 P4 512 400 A123 10 .11b 50
B123 P4 1024 550 B123 54 .11g 120
D
ComNetByCpuRate(‘P4’, rate)
E
V2
15
16
4
CLIDE Properties
Interaction Graph
•  Completeness of Suggestions
Com1
Com1.ram
Com1.price
Table
Action
Join Action
Com1.cpu=‘P4’
Net1
Com1.cid=Net1.cid
Rou1
…
…
…
…
…
…
…
…
…
…
–  Every feasible query can be formulated by
performing yellow and blue actions at every step
Selection
Action
•  Summarization of Suggestions
–  At every step, only a minimal number of actions is
suggested, i.e., the ones that are needed to
preserve completeness
•  Rapid Convergence By Following
Suggestions
• 
• 
• 
• 
–  The shortest sequence of actions from a query to
any feasible query consists of suggested actions
Nodes are queries: One for each q∈CQ
Edges are actions: Table, selection, projection and join actions
Green nodes are feasible queries
Infinitely big structure
–  All CQ queries
–  All possible combinations of actions formulating them
17
Interaction Graph: Colorable Actions
18
Interaction Graph: Colors
•  Yellow action α
–  Every path from current node n to a
feasible node contains α
•  Blue action α
Com1.cid
Com1.cpu
…
Current Node
–  At least one feasible query cannot be
formulated unless this action is
performed (summarization)
…
•  Red action α
Com1.cid
–  No path to a feasible node contains α
…
Current
Node
Com1.cpu
…
Com1.cid=*
…
Com1.cid=*
…
Com1.cpu=*
…
Net1
Net1.rate=’54Mbps’
…
…
…
Net1.rate=’54Mbps’
Com1.cid=Net1.cid
…
…
…
Com1.price=*
Net1
Com1.cid=Net1.cid
…
…
Com1.ram=*
…
…
Com1.price=*
…
•  Colorable actions AC label
outgoing edges of the current node
Current
Node
…
…
Com1.cpu=*
Com1.ram=*
Rou1
Com1.cid=Net1.cid
Net1.rate=Rou1.rate
…
…
…
…
…
Com1.cpu=*
…
Net1
Rou1
…
…
Rou1
Com2
Com2
…
Com2
Com2.cid=Net1.cid
Com2.cpu=‘P4’
Net1.rate=‘54Mbps’
…
…
…
…
…
19
20
5
Color Determined
By a Finite Set of Feasible Queries
CLIDE Architecture
Challenge: Infinitely Many Feasible Queries

Actions
Front-End
Current Query
Colored Actions + Feasibility Flag
…
User
ius
Rad
Back-End
Color Algorithm
Closest
Feasible
Queries FQC
Seed Queries SQ
Parameters Algorithm
Schemas
Views
…
…
…
…
Closest Feasible Queries Algorithm
Aliases Collapse Rule
…
n
Closest Feasible Queries FQC
Maximally-Contained Rewriter
?
…
Solution: Closest Feasible Queries FQC
Minimal Feasible
Extension Queries
•  FQC is sufficient to color actions in AC
•  Theorem: Set of Closest Feasible Queries is Finite
Column
Associations
Challenge: How far can the Closest Feasible Queries FQC be?
•  Back-End invoked every time the user performs an action
Solution: Based on Maximally Contained Queries FQMC
–  i.e., the user arrives at a new node in the interactions graph
21
Maximally Contained Queries FQMC
22
Closest Feasible Queries FQC Algorithm
Challenge: How far can the Closest Feasible Queries FQC be?
Solution: Maximally Contained Queries FQMC
Maximally Contained Query
Query: Q1
Get all Computers
Query: Q2
Get all Computers
with a given cpu
s
adiu
pL R
Closest…
Feasible
Queries FQC
Not Maximally
Maximally
Contained
Contained
Query
Query: Q4
Q3
Get all Computers
with a given ram
cpu & ram
…
n
…
Maximally
Contained
Queries FQMC
…
…
…
…
•  Compute maximally contained queries FQMC
•  Theorem: All FQC queries are reachable
via a path of length p ≤ pL
•  The radius pL is the longest path to a maximally contained
query
•  Assuming fixed SELECT clause (projection list)
•  Covered extensively in literature
–  MiniCon, Bucket, InverseRules Algorithms
•  FQMC is finite
23
24
6
Closest Feasible Queries FQC Algorithm
Closest Feasible Queries FQC Algorithm
Challenge: Find the Closest Feasible Queries
Solution: Collapse Aliases
…
Closest
Feasible
Queries FQC
n
…
Maximally
Contained
Feasible
Queries FQMC
…
Closest
Feasible
Queries FQC
…
…
n
…
Maximally
Contained
Feasible
Queries FQMC
…
…
…
…
…
…
More feasible nodes
•  Theorem: All queries in FQMC are in FQC
•  But not all queries in FQC are in FQMC
•  Collapse Aliases to compute FQC \ FQMC
•  Check satisfiability
25
Color Algorithm
26
CLIDE Implementation & Optimizations
Maximally-Contained Rewriter
Front-End
Yellow and Blue
•  An action α is colored based on which closest feasible
queries it appear in
•  Yellow, if α appears in all queries in FQC
•  Blue, if α appears in at least one (but not all) query in FQC
Current Query
Minimal Feasible ExtensionColored
QueriesActions
FQME + Feasibility Flag
+ Maximum Projection Lists
Back-End
Redundant Actions Removal
Color Algorithm
Maximally-Contained Feasible Extension Queries
+
Maximum
Projection
Lists
Seed
Queries
SQ
Redundant Queries Removal
Parameters Algorithm
Feasible Extension Queries
Closest
Feasible
Queries
FQC
+ Maximum
Projection
Lists
White and Red
•  Attach Maximum Projection Lists to Closest Feasible
Queries
Views Expansion
Closest Feasible
Queries Algorithm
Aliases Collapse Rule
Maximally-Contained
Feasible Queries over Views
Minimal Feasible
+ Containment Mappings
Extension Queries
Maximally-Contained Rewriter
MiniCon
–  Projections that can be added to a feasible query, without
compromising feasibility
Containment Mappings Logging
Column
Schemas
Views
Associations
•  Projection α is white if in the maximum projection list
•  Color selections based on projections
•  Views expansion introduce redundancy
–  Affects CLIDE’s rapid convergence and summarization
•  Efficient containment test crucial to redundancy removal
27
28
7
CLIDE Performance
CLIDE Performance
Chains of Stars – No Parameters
Chains of Stars – No Parameters
4000
4000
112 Views
168 Views
224 Views
280 Views
3000
Time (ms)
2500
Rest of Algorithms
Containment Testing
3500
3000
Time (ms)
3500
2000
1500
1000
500
2500
2000
1500
1000
500
0
(2,1)
(2,2)
(4,3)
(4,4)
(6,5)
(6,6)
(8,7)
(8,8)
0
(10,9) (10,10) (12,11) (12,12) (14,13) (14,14)
7720
142
7820
143
7920
(# of Joins, # of Selections) In Current Query
144
8020
145
8120
# of Containment Tests
B
B2Bi1
i
BiM
A
30
CLIDE Performance
Chains of Stars – With Parameters
Chains of Stars – With Parameters
4000
1600
1600
112 Views
140 Views
168Views
Views
196
224 Views
252 Views
280Views
Views
308
Rest of Algorithms
Containment Testing
1400
1200
1000
Time (ms)
Time
Time(ms)
(ms)
148
29
CLIDE Performance
1500
600
1000
400
500
200
00
8320
C
C1C i1 A-span = 7
i
CiM
B-span = 3
C1CL
Selections = 4,6,8,10
BK
2500
1000
2000
800
147
… … …
…
… … ……
B
B11
3000
1200
8220
C1
•  Schema
Queries
Views
3500
1400
146
800
600
400
(2,1)
(2,1)
(2,2)
(2,2)
(4,3)
(4,3)
(4,4)
(4,4)
(6,5) (6,6)
(6,6) (8,7)
(6,5)
(8,7) (8,8)
(8,8) (10,9)
(10,9) (10,10)
(10,10) (12,11)
(12,11) (12,12)
(12,12)(14,13)
(14,13)(14,14)
(14,14)
(# of Joins, # of Selections) In Current Query
(# of Joins, # of Selections) In Current Query
200
0
1696
C1
A
… … ……
B
B11
B
B2Bi1
i
BiM
BK
156
1697
157
1698
158
1699
159
1700
# of Containment Tests
160
1701
161
1702
162
… … …
…
•  Schema
Queries
Views
C
C1C i1 A-span = 7
i
CiM
B-span = 3
C1CL
Selections = 4,6,8,10
31
32
8
CLIDE Summary
First interactive query formulation interface based on source
and mediator capabilities
Applicability
•  Service-Oriented Architectures
•  Privacy-Preserving Services
Contributions
•  Interaction Guarantees: Rapid Convergence, Completeness,
Summarization of Suggestions
•  Interaction Graph
•  Back-End Algorithms
–  Closest Feasible Queries, Colors, Parameters
•  Modular, Customizable Architecture
http://www.clide.info
33
9
Download