Implementing Mapping Composition
Todd J. Green*
University of Pennsylvania
with Philip A. Bernstein (Microsoft Research),
Sergey Melnik (Microsoft Research),
Alan Nash (UC San Diego)
VLDB 2006
*Work partially supported by NSF grants IIS0513778 and IIS0415810
Seoul, Korea
Schema mappings
• Mapping: a correspondence between instances of different schemas
S1: Students(Name, Address)
S2: Names(SID, Name), Addresses(SID, Address)
m: Students = π_{Name,Address}(Names ⋈ Addresses)
Applications of mappings
• Schema evolution
S1: Students(Name, Address, Country)
S2: Names(SID, Name), Addresses(SID, Address, Country)
S3: Names(SID, Name), Local(SID, Address), Foreign(SID, Address, Country)
m12: Students = π_{Name,Address,Country}(Names ⋈ Addresses)
m23: Names = Names
     σ_{Country=KR}(Addresses) = π_{SID,Address}(Local) × {KR}
     σ_{Country≠KR}(Addresses) = Foreign
...
Applications of mappings
• Data integration, data exchange
[Diagram: schemas S1, ..., Sn−1, Sn connected by mappings m1, ..., mn, over the relations
 Students(Name, Address, Country);
 Names(SID, Name), Addresses(SID, Address, Country);
 Names(SID, Name), Local(SID, Address), Foreign(SID, Address, Country)]
m1: Students = π_{Name,Address}(Names ⋈ Addresses)
mn: Names = Names
    Local = π_{SID,Address}(σ_{Country=KR}(Addresses))
    Foreign = σ_{Country≠KR}(Addresses)
Requirements for constraints
• “First attribute in R is a key for R”
  π_{2,4}(R ⋈_{1=3} R) ⊆ π_{2,2}(R)
• “View V equals R joined with S”
  V ⊆ R ⋈ S, V ⊇ R ⋈ S
• “Second attribute of R is a foreign key in S”
  π_{2}(R) ⊆ π_{1}(S)
  π_{2,4}(S ⋈_{1=3} S) ⊆ π_{2,2}(S)
• Data integration, data exchange – GLAV
  R ⋈ S ⊆ T ⋈ U
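To make the first requirement concrete, here is a minimal sketch that evaluates the key constraint π_{2,4}(R ⋈_{1=3} R) ⊆ π_{2,2}(R) on a small binary relation. The function name and the sample data are illustrative assumptions, not part of the talk.

```python
def key_constraint_holds(R):
    """Check pi_{2,4}(R join_{1=3} R) <= pi_{2,2}(R) for a set of pairs R."""
    # The self-join on the first attribute gives 4-tuples (a, b, c, d) with a == c;
    # keeping positions 2 and 4 leaves the pair (b, d).
    lhs = {(b, d) for (a, b) in R for (c, d) in R if a == c}
    # pi_{2,2}(R) repeats the second attribute of every tuple.
    rhs = {(b, b) for (_, b) in R}
    return lhs <= rhs

good = {(1, "Ann"), (2, "Bob")}   # first attribute is a key
bad = {(1, "Ann"), (1, "Bob")}    # key value 1 maps to two names

print(key_constraint_holds(good), key_constraint_holds(bad))  # True False
```

The containment fails exactly when two tuples agree on the first attribute but differ on the second, which is the usual meaning of a key violation.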
Mapping composition
S1: Students(Name, Address, Country)
S2: Names(SID, Name), Addresses(SID, Address, Country)
S3: Names(SID, Name), Local(SID, Address), Foreign(SID, Address, Country)
m12: Students = π_{Name,Address,Country}(Names ⋈ Addresses)
m23: Names = Names
     σ_{Country=KR}(Addresses) = π_{SID,Address}(Local) × {KR}
     σ_{Country≠KR}(Addresses) = Foreign
m12 ∘ m23: Students = π_{Name,Address,Country}(Names ⋈ (π_{SID,Address}(Local) × {KR} ∪ Foreign))
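The composed mapping can be tried out on data. The sketch below assumes the constraints are read as equalities, so that Addresses can be rebuilt from Local and Foreign and Students computed directly from S3. The sample tuples and function names are made up for illustration.

```python
def addresses_from_s3(local, foreign):
    """Addresses = pi_{SID,Address}(Local) x {KR}  union  Foreign."""
    return {(sid, addr, "KR") for (sid, addr) in local} | set(foreign)

def students_from(names, addresses):
    """Students = pi_{Name,Address,Country}(Names join Addresses), joined on SID."""
    return {
        (name, addr, country)
        for (sid, name) in names
        for (sid2, addr, country) in addresses
        if sid == sid2
    }

# Hypothetical S3 instance.
names = {(1, "Kim"), (2, "Lee")}
local = {(1, "Seoul")}
foreign = {(2, "Davis", "US")}

students = students_from(names, addresses_from_s3(local, foreign))
print(sorted(students))  # [('Kim', 'Seoul', 'KR'), ('Lee', 'Davis', 'US')]
```

No S2 relation has to be stored: Addresses appears only as an intermediate value, which is the point of composing m12 with m23.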
Composition is hard
• Hard part: write the composition in the same language as the input mappings. Depending on the language:
  - Not always possible
  - Not even decidable whether possible
• Strategy 1: use a powerful (second-order) mapping language closed under composition [FKPT04]
  - Not supported by DBMSs today
  - Expensive to check
  - Source-target restriction
• Strategy 2: settle for partial solutions [NBM05]
  - Containment mappings → easier integration with DBMS
  - The strategy we adopt in this work
Our contributions
New algorithm for composition problem
Incorporates view unfolding and left composition (new technique)
Makes best effort in failure cases
Algebraic rather than logic-based mappings
Use of monotonicity to handle more operators
Modular and extensible factoring of algorithm
First implementation of composition
Experimental evaluation
Formal definition of composition
• Mapping: set of pairs of instances of db schemas
• The composition m12 ∘ m23 is the mapping
  {⟨A,C⟩ : (∃B)(⟨A,B⟩ ∈ m12 and ⟨B,C⟩ ∈ m23)}
  where A, B, C are instances of S1, S2, S3
• Composition problem: find constraints in the same language as the input mappings giving the composition of the input mappings
• Example:
  S1 = {R}, S2 = {S, T}, S3 = {U, V, W}
  R(∙,∙,∙); S(∙,∙), T(∙,∙); U(∙,∙,∙), V(∙,∙), W(∙,∙)
  m12: R ⊆ S ⋈ T
  m23: S ⊆ π(U), T = V – W
  ⟹ R ⊆ π(U) ⋈ (V – W)
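The set-of-pairs definition can be checked by brute force on a tiny universe. This sketch assumes three schemas with one unary relation each over the domain {0, 1}, with m12 given by R ⊆ S and m23 by S ⊆ T. It is an illustration of the definition, not the algorithm from the talk.

```python
from itertools import combinations

def instances(domain):
    """All instances of a single unary relation over a finite domain."""
    items = sorted(domain)
    return [frozenset(c) for n in range(len(items) + 1) for c in combinations(items, n)]

def compose(m12, m23):
    """{(A, C) : there is a B with (A, B) in m12 and (B, C) in m23}."""
    return {(a, c) for (a, b) in m12 for (b2, c) in m23 if b == b2}

insts = instances({0, 1})
m12 = {(a, b) for a in insts for b in insts if a <= b}   # R <= S
m23 = {(b, c) for b in insts for c in insts if b <= c}   # S <= T
expected = {(a, c) for a in insts for c in insts if a <= c}  # R <= T

print(compose(m12, m23) == expected)  # True
```

Here the composition happens to be expressible by the single constraint R ⊆ T, so the composition problem has an answer in the same language as the inputs.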
Best-effort composition problem
• Composition not always possible
• “Best-effort” composition problem: compute a set of constraints equivalent to the input constraints, but with as many symbols from S2 eliminated as possible
  R ⊆ U,                         R ⊆ V,
  π_{1,4}(σ_{2=3}(U × U)) ⊆ U,   π_{1,4}(σ_{2=3}(V × V)) ⊆ V,
  U ⊆ T,                         V ⊆ T
• Can eliminate U (cross out left column) or V (right column), but not both [NBM05]
Composition algorithm overview
For each relation R in S2
  Try to eliminate R via (1) view unfolding
Replace = by pairs of ⊆, ⊇
For each relation R in S2 not yet eliminated
  Try to eliminate R via (2) left compose
  Else, try to eliminate R via (3) right compose
Output:
  New constraints and list of relations successfully eliminated
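The control flow above can be sketched as a small driver. The step functions are passed in as callbacks because their real implementations are not shown on the slides. `compose_best_effort` and `chain_step` are hypothetical names, and `chain_step` handles only a toy fragment where every constraint is a containment between two relation names.

```python
def compose_best_effort(constraints, s2_symbols, unfold, split_equalities, left, right):
    """Driver loop only. unfold/left/right take (constraints, symbol) and
    return new constraints on success or None on failure."""
    eliminated = []
    for r in s2_symbols:                          # pass 1: view unfolding
        result = unfold(constraints, r)
        if result is not None:
            constraints = result
            eliminated.append(r)
    constraints = split_equalities(constraints)   # replace = by a pair of containments
    for r in s2_symbols:                          # pass 2: left compose, else right compose
        if r in eliminated:
            continue
        for step in (left, right):
            result = step(constraints, r)
            if result is not None:
                constraints = result
                eliminated.append(r)
                break
    return constraints, eliminated

def chain_step(constraints, r):
    """Toy elimination for atomic containments (lhs, rhs): A <= r and r <= B give A <= B."""
    below = [a for (a, b) in constraints if b == r and a != r]
    above = [b for (a, b) in constraints if a == r and b != r]
    rest = [c for c in constraints if r not in c]
    return rest + [(a, b) for a in below for b in above]

never = lambda constraints, r: None   # a step that always fails
result = compose_best_effort(
    [("R", "S"), ("S", "T")], ["S"], never, lambda cs: cs, chain_step, never
)
print(result)  # ([('R', 'T')], ['S'])
```

A symbol for which every step returns None simply stays in the output constraints, which is the best-effort behaviour described earlier.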
(1) View unfolding
• Idea: exploit equality constraints (if we have any)
• Standard technique: substitute view definition for occurrences of view relation in mappings
  T = V – W, R ⊆ S ⋈ T, T  X ⊆ π(U)

  R ⊆ S ⋈ (V – W), (V – W)  X ⊆ π(U)
• Body must not mention view relation itself
• Doesn’t matter what else is in body
• Can substitute everywhere
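View unfolding amounts to substitution on expression trees. In this sketch an expression is either a relation name (a string) or a tuple `(operator, argument, ...)`, and a constraint is a triple `(lhs, relation_symbol, rhs)`. That encoding and the function names are assumptions made for illustration.

```python
def mentions(expr, name):
    """True if relation `name` occurs anywhere in `expr`."""
    if isinstance(expr, str):
        return expr == name
    return any(mentions(arg, name) for arg in expr[1:])

def substitute(expr, name, body):
    """Replace every occurrence of relation `name` in `expr` by `body`."""
    if isinstance(expr, str):
        return body if expr == name else expr
    op, *args = expr
    return (op, *(substitute(arg, name, body) for arg in args))

def unfold_view(constraints, name):
    """Eliminate `name` using an equality `name = body`; return None if there is none."""
    for c in constraints:
        lhs, rel, rhs = c
        # The body must not mention the view relation itself.
        if rel == "=" and lhs == name and not mentions(rhs, name):
            rest = [d for d in constraints if d is not c]
            return [(substitute(l, name, rhs), r, substitute(x, name, rhs))
                    for (l, r, x) in rest]
    return None

constraints = [
    ("T", "=", ("diff", "V", "W")),       # T = V - W
    ("R", "<=", ("join", "S", "T")),      # R <= S join T
]
print(unfold_view(constraints, "T"))
# [('R', '<=', ('join', 'S', ('diff', 'V', 'W')))]
```

Because the defining constraint is an equality, the substitution is applied on both sides of every remaining constraint without any monotonicity check.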
(2) Left compose
• “View unfolding” for containment constraints
  π(V) ⊆ R – U, R ⊆ S ⋈ T

  π(V) ⊆ (S ⋈ T) – U
• Needs monotonicity of expressions in R.
  E1 ⊆ E2(R), R ⊆ E3
  ≡
  E1 ⊆ E2(E3)
  if E2(R) is monotone in R (and R not in E3)
• Partial check for monotonicity
  “Is S – (T – R) monotone in R?”
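A partial monotonicity check can be done syntactically by tracking how often R sits under the right-hand side of a difference. The sketch below uses nested tuples for expressions and treats every operator except `diff` as monotone in all arguments. It is one plausible reading of the slide, not the implementation from the talk, and it rejects the case with no constraint of the form R ⊆ E3 instead of handling it.

```python
def substitute(expr, name, body):
    """Replace every occurrence of relation `name` in `expr` by `body`."""
    if isinstance(expr, str):
        return body if expr == name else expr
    op, *args = expr
    return (op, *(substitute(arg, name, body) for arg in args))

def polarity(expr, name):
    """'i' independent of name, 'm' monotone, 'a' antimonotone, 'u' unknown."""
    if isinstance(expr, str):
        return "m" if expr == name else "i"
    op, *args = expr
    parts = [polarity(arg, name) for arg in args]
    if op == "diff":                       # E1 - E2 flips the polarity of E2
        parts[1] = {"m": "a", "a": "m"}.get(parts[1], parts[1])
    kinds = set(parts) - {"i"}
    if not kinds:
        return "i"
    return kinds.pop() if len(kinds) == 1 else "u"

def left_compose(constraints, name):
    """Constraints are (lhs, rhs) pairs meaning lhs <= rhs. Returns None on failure."""
    bounds = [rhs for (lhs, rhs) in constraints if lhs == name]
    others = [(lhs, rhs) for (lhs, rhs) in constraints if lhs != name]
    if len(bounds) != 1 or polarity(bounds[0], name) != "i":
        return None                        # need exactly one name <= E3, name not in E3
    for lhs, rhs in others:
        if polarity(lhs, name) != "i" or polarity(rhs, name) not in ("i", "m"):
            return None                    # name on a left side, or not monotone
    return [(lhs, substitute(rhs, name, bounds[0])) for (lhs, rhs) in others]

print(polarity(("diff", "S", ("diff", "T", "R")), "R"))  # m
example = [(("proj", "V"), ("diff", "R", "U")), ("R", ("join", "S", "T"))]
print(left_compose(example, "R"))
# [(('proj', 'V'), ('diff', ('join', 'S', 'T'), 'U'))]
```

The check is only partial: an answer of 'u' means the syntactic test could not decide, not that the expression fails to be monotone.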
Normalization for left compose
Need one constraint of form R ⊆ E1
Use identities to normalize, e.g.:
R ⊆ E1 and R ⊆ E2 iff R ⊆ E1 ∩ E2
E1 ∪ E2 ⊆ E3 iff E1 ⊆ E3 and E2 ⊆ E3
π(E1) ⊆ E2 iff E1 ⊆ E2 × D^r
More identities in paper
After left compose, try to eliminate D
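Two of these identities are enough to show how normalization produces a single upper bound for R. The sketch below splits unions on the left and then merges all bounds of R with an intersection. The tuple encoding and the function name are assumptions, and the projection identity involving D is left out.

```python
def normalize_for_left(constraints, name):
    """Rewrite (lhs, rhs) containments so that `name` has at most one upper bound.

    Uses:  E1 union E2 <= E3   iff  E1 <= E3 and E2 <= E3
           R <= E1 and R <= E2 iff  R <= E1 intersect E2
    """
    split = []
    todo = list(constraints)
    while todo:
        lhs, rhs = todo.pop(0)
        if isinstance(lhs, tuple) and lhs[0] == "union":
            # Put both halves back at the front so nested unions are split too.
            todo[:0] = [(lhs[1], rhs), (lhs[2], rhs)]
        else:
            split.append((lhs, rhs))
    bounds = [rhs for (lhs, rhs) in split if lhs == name]
    others = [c for c in split if c[0] != name]
    if not bounds:
        return others
    merged = bounds[0]
    for bound in bounds[1:]:
        merged = ("intersect", merged, bound)
    return others + [(name, merged)]

normalized = normalize_for_left([(("union", "R", "X"), "A"), ("R", "B")], "R")
print(normalized)  # [('X', 'A'), ('R', ('intersect', 'A', 'B'))]
```

After this rewriting there is exactly one constraint of the form R ⊆ E, which is the shape the left compose step expects.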
(3) Right compose
• Dual to left compose, from [NBM05]
• Example:
  S ⋈ T ⊆ R, R – U ⊆ π(V)

  (S ⋈ T) – U ⊆ π(V)
• Monotonicity check needed here too
• Normalization may introduce Skolem functions
  - E1 ⊆ π(E2) iff f(E1) ⊆ E2
• Must eliminate Skolem functions after composition
  - Lots of effort coding this step!
User-defined operators
• User specifies:
  - Monotonicity of operator in its arguments
    “If E1 monotone in R and E2 antimonotone in R or independent of R, then E1 * E2 monotone in R”
    “If E1 monotone in R or independent of R and E2 antimonotone in R, then E1 * E2 monotone in R”
  - Identities for normalization
    “E1 * E2  E3 iff E1  E2  E3”
• User-defined operators and standard relational operators treated uniformly
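One way to treat user-defined and built-in operators uniformly is a table that records, per operator, whether it is monotone or antimonotone in each argument. The table contents, the operator name `star` and the function names below are assumptions for illustration. The talk does not show how its tool stores these rules.

```python
# +1: monotone in that argument, -1: antimonotone in that argument.
SIGNS = {
    "union": (+1, +1), "intersect": (+1, +1), "join": (+1, +1),
    "cross": (+1, +1), "proj": (+1,), "select": (+1,), "diff": (+1, -1),
}

def register_operator(op, signs):
    """Declare a user-defined operator by the sign of each argument."""
    SIGNS[op] = tuple(signs)

def polarity(expr, name):
    """+1 monotone in name, -1 antimonotone, 0 independent, None unknown."""
    if isinstance(expr, str):
        return 1 if expr == name else 0
    op, *args = expr
    parts = [polarity(arg, name) for arg in args]
    signs = SIGNS.get(op)
    seen = set()
    for i, p in enumerate(parts):
        if p == 0:
            continue                     # this argument does not mention name
        if p is None or signs is None:
            return None                  # undecided argument or undeclared operator
        seen.add(p * signs[i])
    if not seen:
        return 0
    return seen.pop() if len(seen) == 1 else None

# A user-defined operator that behaves like the E1 * E2 rules quoted above.
register_operator("star", (+1, -1))
print(polarity(("star", "R", "T"), "R"))                 # 1
print(polarity(("star", "S", ("star", "T", "R")), "R"))  # 1
print(polarity(("mystery", "R"), "R"))                   # None
```

Built-in difference and the user-defined `star` go through the same code path, so adding an operator needs no change to the checker itself.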
Implementation
• 12K lines of C# code, command-line tool
# Test case 13: PODS05 example 2
SCHEMA
R(2), S(2), T(2)
CONSTRAINTS
R <= S,
P_{0,2} J_{0,1:1,2} (S S) <= R,
S <= T
ELIMINATE
S;
Output:
P_{0,2} J_{0,1:1,2}(R R) <= R,
R <= T
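The output of this test case can be sanity-checked by brute force on a two-element domain. The sketch assumes that `P_{0,2} J_{0,1:1,2}(X Y)` denotes the relational composition {(a, c) : (a, b) ∈ X, (b, c) ∈ Y}. That reading of the tool's syntax is an assumption, as are the function names.

```python
from itertools import combinations, product

DOMAIN = (0, 1)
PAIRS = list(product(DOMAIN, repeat=2))
# All 16 binary relations over the domain.
RELATIONS = [frozenset(c) for n in range(len(PAIRS) + 1) for c in combinations(PAIRS, n)]

def chain(x, y):
    """Assumed meaning of P_{0,2} J_{0,1:1,2}(X Y)."""
    return {(a, c) for (a, b) in x for (b2, c) in y if b == b2}

def input_holds(r, s, t):
    return r <= s and chain(s, s) <= r and s <= t

def output_holds(r, t):
    return chain(r, r) <= r and r <= t

def outputs_match_on_domain():
    """For every R and T: some S satisfies the input iff the output holds."""
    return all(
        any(input_holds(r, s, t) for s in RELATIONS) == output_holds(r, t)
        for r in RELATIONS
        for t in RELATIONS
    )

print(outputs_match_on_domain())  # True
```

The check covers one small domain only, so it supports the output without proving it. The general argument is that S = R witnesses one direction, and R ⊆ S with monotonicity of the join gives the other.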
Experimental evaluation
• First attempt at a composition benchmark
• Schema editing and schema reconciliation scenarios
  - “Add a column to R to produce S”: (R) = S
• Measure
  - % of symbols eliminated
  - Running time
• As a function of
  - Editing primitives allowed, length of edit sequence, presence/absence of keys, starting schema size, …
• Synthetic data
Summary of results
• Algorithm often effective in eliminating most or even all relation symbols from S2
• Running time in subsecond range even for large problems containing hundreds of constraints
• Certain schema editing primitives problematic
• Key constraints did not reduce effectiveness, although they did increase running time (and output size)
Schema editing
[Chart: execution time (sec), 0 to 3.5, versus run number, 0 to 100]
• Random starting schema (30 relations of 2-10 attributes)
• 100 random edits
• 100 different runs, sorted by execution time
Schema reconciliation (1)
[Chart: fraction of symbols eliminated and execution time (sec), y-axis 0 to 1, versus number of edits, 10 to 210]
• Random schema (30 relations of 2-10 attributes), random edits
• Point represents median time of reconciliation step of 500 runs
Schema reconciliation (2)
[Chart: fraction of symbols eliminated, 0 to 1, versus schema size, 10 to 100; series: complete, no view unfolding, no right compose]
• Random schema (variable # relations of 2-10 attributes)
• 100 random edits
• 100 different runs, sorted by execution time
Related work
• [MH03] J. Madhavan, A. Y. Halevy. Composing mappings among data sources. VLDB, 2003.
• [FKPT04] R. Fagin, Ph. G. Kolaitis, L. Popa, W. C. Tan. Composing schema mappings: second-order dependencies to the rescue. PODS, 2004.
• [NBM05] A. Nash, P. A. Bernstein, S. Melnik. Composition of mappings given by embedded dependencies. PODS, 2005.
Conclusion and future work
• We motivated and described the mapping composition problem
• We presented an implementation of a practical new algorithm for the composition problem
• We also presented an experimental evaluation
• To do: theoretical analysis of impact of user-defined operators
• To do: output constraints from algorithm can be a mess! How to clean up?