Something completely different…

Platon would like to host a BIT meeting (in Århus) this autumn
– SOA or MDM
– Fine with me, but what do you all say?

Platon invites everyone to www.bi2006.dk
– 7-8 June
– Special price for BIT members: 2995 kr.
– Registration via Jørgen Davidsen, jda@platon.net
1
Lineage Tracing in Data Warehouses
Torben Bach Pedersen
Based on work by Yingwei Cui and Jennifer Widom
Stanford University Database Group
Motivation: Data Warehousing
Wow?!
[Figure: the warehouse view "Lucrative Fields" (Theory $320K, Databases $8800K, Networks $800K) is derived from three source tables: Courses (Source 1), Enrollments (Source 2), and Students (Source 3).]
3
Oh, I see...
[Figure: the same warehouse, now with a Lineage Tracer between the "Lucrative Fields" view (Theory $320K, Databases $8800K, Networks $800K) and the sources. Source rows shown:
Courses (Source 1): CS154 Theory, CS145 Databases, CS244 Networks, CS245 Databases
Enrollments (Source 2): CS154 Joe, CS145 Ted, CS244 Bob, CS145 Ann, CS245 Jane, …
Students (Source 3): Ann, Bob, Jane, Joe, Ted, … with values BS $1K, MS $1K, Web $5K, BS $1K, Web $5K, …]
4
The Data Lineage Problem

Data warehouses integrate data from multiple
sources for analysis and mining

Data lineage: given data item o in the warehouse,
which data items in the sources were used to
derive o?

Sometimes called “drill-through” in industry
– “Drill-through” often limited
5
Challenges

Warehouse of relational views over relational sources
– What is a good formal definition for lineage?
– How do we trace data lineage for arbitrary views?
– How do we make it efficient?

Warehouse defined by graph of data transformations
– No fixed, well-defined relational operators
– Large transformation sequences and graphs
6
Outline of Talk

Part 1: Lineage tracing for relational views

Part 2: Lineage tracing for general data transformations
7
Part 1:
Lineage Tracing for Relational Views

Declarative definition of data lineage

Lineage tracing algorithms

Using auxiliary views for efficient lineage tracing

Experimental results (small sample)
8
Views We Consider

Relational algebra: σ, π, ⋈

Arbitrary use of aggregation α

Set semantics

Also in thesis
– Set operators ∪, ∩, −
– Bag semantics

[Figure: example view V as an operator tree built from α, π, and σ nodes over base tables R, S, T]
9
Simple Lineage Example
V = α_{Y,sum(Z)}(σ_{X>Z}(R ⋈ S))

select Y, sum(Z)
from R natural join S
where X > Z
group by Y

R:
X  Y
3  a
8  b

S:
Y  Z
a  2
b  0
b  9
b  6

T = R ⋈ S:
X  Y  Z
3  a  2
8  b  0
8  b  9
8  b  6

U = σ_{X>Z}(T):
X  Y  Z
3  a  2
8  b  0
8  b  6

V = α_{Y,sum(Z)}(U):
Y  sum
a  2
b  6
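The evaluation of this example view can be replayed step by step. The following is a minimal Python sketch, not part of the original talk; the function names (`natural_join`, `select_x_gt_z`, `sum_z_by_y`) are made up for illustration, and relations are modelled as plain lists of tuples.

```python
from collections import defaultdict

# Base tables from the slide: R(X, Y) and S(Y, Z)
R = [(3, "a"), (8, "b")]
S = [("a", 2), ("b", 0), ("b", 9), ("b", 6)]

def natural_join(r_rows, s_rows):
    """T = R natural join S on Y, producing (X, Y, Z) tuples."""
    return [(x, y, z) for (x, y) in r_rows for (y2, z) in s_rows if y == y2]

def select_x_gt_z(rows):
    """U = sigma_{X > Z}(T)."""
    return [(x, y, z) for (x, y, z) in rows if x > z]

def sum_z_by_y(rows):
    """V = alpha_{Y, sum(Z)}(U): group by Y and sum Z."""
    totals = defaultdict(int)
    for _, y, z in rows:
        totals[y] += z
    return sorted(totals.items())

T = natural_join(R, S)
U = select_x_gt_z(T)
V = sum_z_by_y(U)
print(V)  # [('a', 2), ('b', 6)]
```

The intermediate lists `T` and `U` correspond to the intermediate tables on the slide; they are exactly the results a naive tracer would have to recompute or store.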
10
Lineage for Relational Operators

Unary relational operators (σ, π, α): definition took a long time

[Figure: op maps relation R to a result containing tuple t; R* is the subset of R that t derives from]

Lineage of t according to op is the maximal subset R* ⊆ R such that
(1) op(R*) = {t}
- output of R* through op is t
(2) ∀t* ∈ R*: op({t*}) ≠ ∅
- op used on t* is nonempty
11
Lineage for Relational Operators

Example 1 – the two conditions ensure that only tuples contributing to t are included in lineage

R:
X  Y  Z
3  a  2
8  b  0
8  b  9
8  b  6

σ_{X>Z}(R):
X  Y  Z
3  a  2
8  b  0
8  b  6

Lineage of t according to op is the maximal subset R* ⊆ R such that
(1) op(R*) = {t}
(2) ∀t* ∈ R*: op({t*}) ≠ ∅
12
Lineage for Relational Operators

Example 2 – the "maximal" requirement ensures that the (8,b,0) tuple is included in the lineage of (b,6)

R:
X  Y  Z
3  a  2
8  b  0
8  b  6

α_{Y,sum(Z)}(R):
Y  sum
a  2
b  6

Lineage of t according to op is the maximal subset R* ⊆ R such that
(1) op(R*) = {t}
(2) ∀t* ∈ R*: op({t*}) ≠ ∅
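The declarative definition can be checked directly on the two examples by brute force. The sketch below is illustrative only: `lineage`, `select_x_gt_z`, and `sum_z_by_y` are hypothetical names, relations are Python sets of tuples, and the search over all subsets is exponential, so it is a way to test the definition rather than a tracing algorithm.

```python
from itertools import combinations

def lineage(op, R, t):
    """Largest R* within R with op(R*) == {t} and op({t*}) nonempty for all t* in R*."""
    rows = sorted(R)
    for size in range(len(rows), 0, -1):          # try the largest subsets first
        for subset in combinations(rows, size):
            candidate = set(subset)
            if op(candidate) == {t} and all(op({r}) for r in candidate):
                return candidate
    return set()

def select_x_gt_z(rows):
    """sigma_{X > Z} over (X, Y, Z) tuples."""
    return {(x, y, z) for (x, y, z) in rows if x > z}

def sum_z_by_y(rows):
    """alpha_{Y, sum(Z)} over (X, Y, Z) tuples."""
    totals = {}
    for _, y, z in rows:
        totals[y] = totals.get(y, 0) + z
    return set(totals.items())

R1 = {(3, "a", 2), (8, "b", 0), (8, "b", 9), (8, "b", 6)}   # Example 1
R2 = {(3, "a", 2), (8, "b", 0), (8, "b", 6)}                # Example 2

print(sorted(lineage(select_x_gt_z, R1, (8, "b", 6))))  # [(8, 'b', 6)]
print(sorted(lineage(sum_z_by_y, R2, ("b", 6))))        # [(8, 'b', 0), (8, 'b', 6)]
```

In Example 1, condition (2) is what rejects the candidate {(8,b,6), (8,b,9)}: it satisfies condition (1), but the selection applied to (8,b,9) alone is empty. In Example 2, {(8,b,6)} alone also satisfies both conditions, so only the maximality requirement brings (8,b,0) into the lineage.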
13
Lineage for Relational Operators

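N-ary relational operators (such as the join ⋈) – lineage unique

[Figure: op maps R1 and R2 to a result containing t; R1* and R2* are the subsets of R1 and R2 that t derives from]

Lineage of t according to op is the maximal subsets Ri* ⊆ Ri for i = 1..n such that
(1) op(R1*, …, Rn*) = {t}
(2) ∀ti* ∈ Ri*: op(R1, …, {ti*}, …, Rn) ≠ ∅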
14
Lineage for Relational Views

Lineage of a tuple set is union of lineage of each tuple in the set

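Lineage for views is defined recursively => naive, but inefficient, algorithm (need to recompute/store all intermediate results)

[Figure: view V is computed by op2 over the intermediate result U, which is computed by op1 over R1 and R2; tuple t in V traces back through U* in U to R1* and R2*]

Lineage of t is R1*, R2*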
15
Lineage Tracing

Convert view into segmented normal form (SPJ+agg)

Each segment: α(π(σ(E1 ⋈ … ⋈ En)))

Generate one tracing query for each segment

Apply tracing queries recursively
– # non-top α + 1

Proof: lineage result is unaffected by normalization and segment-level tracing
16
Tracing Query for One Segment
R:
X  Y
3  a
8  b

S:
Y  Z
a  2
b  0
b  9
b  6

V = α_{Y,sum(Z)}(σ_{X>Z}(R ⋈ S))

V:
Y  sum
a  2
b  6

TQ = Split_{R,S}(σ_{X>Z ∧ Y=b}(R ⋈ S))

R* = {(8,b)}, S* = {(b,0),(b,6)}

Split = "unjoin" – project over R+S schemas
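The tracing query for this segment can be written out in a few lines. This is a minimal sketch under the slide's data, not the authors' implementation; `tracing_query` is a hypothetical name, and Split is modelled as two projections of the filtered join, one onto R's schema and one onto S's schema.

```python
R = [(3, "a"), (8, "b")]
S = [("a", 2), ("b", 0), ("b", 9), ("b", 6)]

def tracing_query(r_rows, s_rows, y_value):
    """TQ = Split_{R,S}(sigma_{X>Z and Y=y_value}(R join S))."""
    joined = [(x, y, z)
              for (x, y) in r_rows
              for (y2, z) in s_rows
              if y == y2 and x > z and y == y_value]
    r_star = sorted({(x, y) for (x, y, _) in joined})   # project onto R's schema
    s_star = sorted({(y, z) for (_, y, z) in joined})   # project onto S's schema
    return r_star, s_star

r_star, s_star = tracing_query(R, S, "b")
print(r_star)  # [(8, 'b')]
print(s_star)  # [('b', 0), ('b', 6)]
```

The extra predicate Y = b is the group-by value of the traced tuple (b, 6); the aggregate value itself is not needed to find the contributing rows.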
17
Recursive Tracing Procedure
R:
X  Y
3  a
8  b

S:
Y  Z
a  2
b  0
b  9
b  6

T:
Y  W
a  p
b  p
b  q

V = α_{W,avg(sum)}(α_{Y,sum(Z)}(σ_{X>Z}(R ⋈ S)) ⋈ T)

U = α_{Y,sum(Z)}(σ_{X>Z}(R ⋈ S)):
Y  sum
a  2
b  6

V:
W  avg
p  4
q  6

TQ1 = Split_{U,T}(σ_{W=q}(U ⋈ T))
TQ2 = Split_{R,S}(σ_{X>Z ∧ Y=b}(R ⋈ S))

R* = {(8,b)},
S* = {(b,0),(b,6)},
T* = {(b,q)}
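The two tracing queries can be chained in code to reproduce the result on this slide. The sketch below is illustrative: all function names are hypothetical, the intermediate result U is simply recomputed, and the lower segment is traced once per tuple of U*, with the per-tuple results unioned.

```python
R = [(3, "a"), (8, "b")]
S = [("a", 2), ("b", 0), ("b", 9), ("b", 6)]
T = [("a", "p"), ("b", "p"), ("b", "q")]

def compute_u(r_rows, s_rows):
    """U = alpha_{Y,sum(Z)}(sigma_{X>Z}(R join S))."""
    totals = {}
    for x, y in r_rows:
        for y2, z in s_rows:
            if y == y2 and x > z:
                totals[y] = totals.get(y, 0) + z
    return sorted(totals.items())

def trace_top_segment(u_rows, t_rows, w_value):
    """TQ1 = Split_{U,T}(sigma_{W=w_value}(U join T))."""
    joined = [(y, s, w) for (y, s) in u_rows for (y2, w) in t_rows
              if y == y2 and w == w_value]
    u_star = sorted({(y, s) for (y, s, _) in joined})
    t_star = sorted({(y, w) for (y, _, w) in joined})
    return u_star, t_star

def trace_lower_segment(r_rows, s_rows, y_value):
    """TQ2 = Split_{R,S}(sigma_{X>Z and Y=y_value}(R join S))."""
    joined = [(x, y, z) for (x, y) in r_rows for (y2, z) in s_rows
              if y == y2 and x > z and y == y_value]
    return ({(x, y) for (x, y, _) in joined},
            {(y, z) for (_, y, z) in joined})

def trace(w_value):
    """Trace the V tuple with W = w_value back to R*, S*, T*."""
    u_rows = compute_u(R, S)                 # intermediate result, recomputed here
    u_star, t_star = trace_top_segment(u_rows, T, w_value)
    r_star, s_star = set(), set()
    for y, _ in u_star:                      # union of the lineage of each U* tuple
        r_part, s_part = trace_lower_segment(R, S, y)
        r_star |= r_part
        s_star |= s_part
    return sorted(r_star), sorted(s_star), t_star

print(trace("q"))  # ([(8, 'b')], [('b', 0), ('b', 6)], [('b', 'q')])
```

The call to `compute_u` is the cost that the auxiliary views discussed later aim to avoid.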
18
Making It Efficient

Source accesses are usually expensive or impossible

Need some intermediate results for lineage tracing
Store auxiliary views at the warehouse
– Reduce or eliminate source accesses
– Reduce recomputation of intermediate results
19
Aux View Example
20
Aux View Example
21
Auxiliary Views

There are many possible auxiliary views

For single-segment views α(π(σ(R1 ⋈ … ⋈ Rn)))
– Identified 10 possible auxiliary view schemes
– Studied performance tradeoffs

For arbitrary views
– Hard optimization problem
– Exhaustive and heuristic algorithms
– Performance study
22
Single Segment Schemes

Store nothing (NO)

Store Base Tables (BT)

Store Lineage Views (LV)

Store Split Lineage Tables (SLT)

Store Partial Base Tables (PBT)

Store Base Table Projections (BP)

Store Lineage View Projections (LP)

Self-maintainable variations: LV-S, SLT-S, PBT-S
23
Auxiliary Views: Performance Tradeoffs
+ Always improve lineage tracing
– Must be maintained when sources change
+ Can also help with maintenance of original user views
24
Auxiliary View Schemes for Single-Segment Views
Parameters:
- 3-way SPJ view
- sources: 10MB each
- disk: 1Mbps
- network: 50kbps
- 1000 operations
- q/u ratio = 4
Measurements:
- tracing time
- maintenance time
25
Auxiliary View Selection Algorithms for Arbitrary Views
26
Part 2:
Transformation Graphs

Lineage definition

Tracing algorithms

Combining transformations for lineage tracing

Experimental results (tiny sample)

[Figure: transformation graph in which transformations T1..T6 connect Source 1, Source 2, and Source 3 to the Data Warehouse]
27
Transformation Example
Order:
id  cust  date     prod-list
1   A     2/8/99   1(10),2(10)
2   C     4/5/99   2(5),3(10)
3   D     6/1/99   1(20),2(10)
4   B     8/6/99   1(10),3(5)
5   D     10/8/99  1(5),3(10)
6   B     12/1/99  2(10),3(10)

Product:
id  name  price  valid
1   imac  1200   10/1/98-
2   vaio  2400   6/1/98-9/1/99
2   vaio  1800   9/2/99-
3   palm  500    2/1/98-7/1/98
3   palm  400    7/2/98-9/1/99
3   palm  300    9/2/99-

[Figure: Order and Product flow through transformations T1..T7 to produce SalesJump; operator labels on the figure: split, "join", pivot, projection, selection]

SalesJump:
name  avg3  Q4
palm  2K    6K
28
Lineage for General Transformations

A transformation can be an arbitrary program
• select … from … where …
• main(int argc, char** argv) {…}
• sed "s/string1/string2/g" …
– One extreme: relational operators
– Another extreme: we know nothing about T
– Middle ground: based on transformation properties
29
Transformation Properties

Transformation classes

Additional properties
– Transformation subclasses
– Schema information
– Provided inverse or tracing procedure
30
Transformation Classes
dispatcher
∀I: T(I) = ⋃_{i∈I} T({i})
– Produces 0 or more output items per input item
– Applying T on the complete set is the same as applying it on each input item separately
T*(o) = {i | o ∈ T({i})}
31
Dispatcher Example
Order:
id  cust  date     prod-list
1   A     2/8/99   1(10),2(10)
2   C     4/5/99   2(5),3(10)
3   D     6/1/99   1(20),2(10)
4   B     8/6/99   1(10),3(5)
5   D     10/8/99  1(5),3(10)
6   B     12/1/99  2(10),3(10)

O1 = T1(Order):
id  cust  date     pid  quant
1   A     2/8/99   1    10
1   A     2/8/99   2    10
:   :     :        :    :
5   D     10/8/99  1    5
5   D     10/8/99  3    10
6   B     12/1/99  2    10
6   B     12/1/99  3    10
A non-relational operator, but a typical dispatcher
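Tracing through a dispatcher follows directly from T*(o) = {i | o ∈ T({i})}. The sketch below is an illustration rather than the TraceDS procedure itself; `split_order`, `T1`, and `trace_dispatcher` are hypothetical names, and the prod-list format `pid(quant),pid(quant)` is parsed as it appears on the slide.

```python
ORDERS = [
    (1, "A", "2/8/99", "1(10),2(10)"),
    (2, "C", "4/5/99", "2(5),3(10)"),
    (3, "D", "6/1/99", "1(20),2(10)"),
    (4, "B", "8/6/99", "1(10),3(5)"),
    (5, "D", "10/8/99", "1(5),3(10)"),
    (6, "B", "12/1/99", "2(10),3(10)"),
]

def split_order(order):
    """Split one Order row into one output row per prod-list entry."""
    oid, cust, date, prod_list = order
    rows = []
    for entry in prod_list.split(","):
        pid, quant = entry.rstrip(")").split("(")
        rows.append((oid, cust, date, int(pid), int(quant)))
    return rows

def T1(orders):
    """The split transformation applied to a set of orders."""
    return [row for order in orders for row in split_order(order)]

def trace_dispatcher(T, inputs, o):
    """T*(o) = {i | o in T({i})}: one call to T per input item."""
    return [i for i in inputs if o in T([i])]

print(trace_dispatcher(T1, ORDERS, (5, "D", "10/8/99", 3, 10)))
# [(5, 'D', '10/8/99', '1(5),3(10)')]
```

The tracer never looks inside T1; it only relies on the dispatcher property that applying T to each input item separately gives the same result as applying it to the whole set.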
32
Transformation Classes
dispatcher
∀I: T(I) = ⋃_{i∈I} T({i})
T*(o) = {i | o ∈ T({i})}

aggregator
∀I and T(I) = {o1…on}: ∃ unique partition I1..In of I s.t. T(Ik) = {ok}
T*(ok) = Ik
33
Aggregator Example
O3:
oid  name  date     price  quant
1    imac  2/8/99   1200   10
1    vaio  2/8/99   2400   10
2    vaio  4/5/99   2400   5
2    palm  4/5/99   400    10
3    imac  6/1/99   1200   20
3    vaio  6/1/99   2400   10
4    imac  8/6/99   1200   10
4    palm  8/6/99   400    5
5    imac  10/8/99  1200   5
5    palm  10/8/99  300    10
6    vaio  12/1/99  1800   10
6    palm  12/1/99  300    10

O4 = T4(O3):
name  Q1   Q2   Q3   Q4
imac  12K  24K  12K  6K
vaio  24K  12K  24K  18K
palm  0K   4K   2K   6K
T4 computes quarterly sales per product by ”pivoting”
Again, a non-relational operator, but a typical aggregator
34
Transformation Classes
dispatcher
∀I: T(I) = ⋃_{i∈I} T({i})
T*(o) = {i | o ∈ T({i})}

aggregator
∀I and T(I) = {o1…on}: ∃ unique partition I1..In of I s.t. T(Ik) = {ok}
T*(ok) = Ik

black-box
All others
T*(o) = I
35
Transformation Classes

Most transformations are dispatchers, aggregators, or
their compositions

A transformation can be both dispatcher and aggregator
– Proof: Lineage definitions are then equivalent

Transformations can be relational operators
– Lineage definitions same as relational definitions
36
Transformation Properties

Transformation classes

Additional properties
– Transformation subclasses
– Schema information
– Provided inverse or tracing procedure
37
Transformation Subclasses

Permit more efficient lineage tracing

Filter is a special dispatcher
– Each input data item produces itself or nothing

Context-free aggregator
– Whether two input data items are in the same partition
is independent of other items

Key-preserving aggregator
– Any subset of an input partition always produces the
same output key
38
Tracing Example: Aggregators

Consider T(I) = {o1…on}

Tracing the lineage of o for aggregator
– Partition input I into I1…In such that T(Ik) = {ok}
– Return Ik such that T(Ik) = {o}

Tracing the lineage of o for context-free aggregator
– Partition input I into I1…In such that |T(Ik)| = 1
– Return Ik such that T(Ik) = {o}
– 2^n versus n^2 running time !
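The difference between the two tracing strategies can be illustrated in code. The sketch below is one possible reading of the slide, not the TraceAG or TraceCF procedures themselves: `total_quantity` is a made-up example aggregator, the input rows are invented for illustration, and the general tracer accepts a candidate block only if the remaining items still produce the other outputs.

```python
from itertools import combinations

def total_quantity(rows):
    """Example aggregator T: total quantity per product name."""
    totals = {}
    for name, quant in rows:
        totals[name] = totals.get(name, 0) + quant
    return sorted(totals.items())

def trace_aggregator(T, inputs, o):
    """General aggregator: search subsets of the input for the block that produces o.
    The number of candidate subsets is exponential in len(inputs)."""
    expected_rest = [x for x in T(inputs) if x != o]
    indexes = range(len(inputs))
    for size in range(1, len(inputs) + 1):
        for chosen in combinations(indexes, size):
            block = [inputs[k] for k in chosen]
            rest = [inputs[k] for k in indexes if k not in chosen]
            if T(block) == [o] and T(rest) == expected_rest:
                return block
    return []

def trace_context_free(T, inputs, o):
    """Context-free aggregator: two items share a block iff T maps the pair to a
    single output, so the partition can be built with pairwise tests."""
    blocks = []
    for item in inputs:
        for block in blocks:
            if len(T([block[0], item])) == 1:   # same block as its representative
                block.append(item)
                break
        else:
            blocks.append([item])
    for block in blocks:
        if T(block) == [o]:
            return block
    return []

# Hypothetical input rows (name, quantity)
ROWS = [("imac", 10), ("vaio", 10), ("palm", 10), ("palm", 5), ("vaio", 5)]

print(trace_aggregator(total_quantity, ROWS, ("palm", 15)))    # [('palm', 10), ('palm', 5)]
print(trace_context_free(total_quantity, ROWS, ("palm", 15)))  # [('palm', 10), ('palm', 5)]
```

Both tracers return the same block here; the context-free version gets there with at most a quadratic number of calls to T because block membership never depends on the other items.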
39
Schema Information

Input schema A=(A1…An) and key Akey

Output schema B=(B1…Bn) and key Bkey

Schema mappings: f(A) → B and A ← g(B)

Transformations with special schema mappings
– Forward key-map: f(A) → Bkey
– Backward key-map: Akey ← g(B)
– Backward total-map: A ← g(B)
– More efficient tracing for these
40
Tracing Example: Forward Key-Maps
O3:
oid  name  date     price  quant
1    imac  2/8/99   1200   10
1    vaio  2/8/99   2400   10
2    vaio  4/5/99   2400   5
2    palm  4/5/99   400    10
3    imac  6/1/99   1200   20
3    vaio  6/1/99   2400   10
4    imac  8/6/99   1200   10
4    palm  8/6/99   400    5
5    imac  10/8/99  1200   5
5    palm  10/8/99  300    10
6    vaio  12/1/99  1800   10
6    palm  12/1/99  300    10

O4 = T4(O3):
name  Q1   Q2   Q3   Q4
imac  12K  24K  12K  6K
vaio  24K  12K  24K  18K
palm  0K   4K   2K   6K

"name" is carried over as key, so the trace of "palm" is easy:
the O3 tuples with name = 'palm'
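With a forward key-map, tracing reduces to a filter on the input. The sketch below is illustrative and is not the TraceFM procedure itself; `trace_forward_key_map` is a hypothetical name, and the mapping f is assumed to be "take the name column".

```python
# O3 rows from the slide: (oid, name, date, price, quant)
O3 = [
    (1, "imac", "2/8/99", 1200, 10), (1, "vaio", "2/8/99", 2400, 10),
    (2, "vaio", "4/5/99", 2400, 5),  (2, "palm", "4/5/99", 400, 10),
    (3, "imac", "6/1/99", 1200, 20), (3, "vaio", "6/1/99", 2400, 10),
    (4, "imac", "8/6/99", 1200, 10), (4, "palm", "8/6/99", 400, 5),
    (5, "imac", "10/8/99", 1200, 5), (5, "palm", "10/8/99", 300, 10),
    (6, "vaio", "12/1/99", 1800, 10), (6, "palm", "12/1/99", 300, 10),
]

def trace_forward_key_map(inputs, f, output_key):
    """Return the input items whose mapped key f(i) equals the output key.
    The transformation itself is never called; one scan of the input suffices."""
    return [i for i in inputs if f(i) == output_key]

palm_lineage = trace_forward_key_map(O3, lambda row: row[1], "palm")
print([row[0] for row in palm_lineage])  # [2, 4, 5, 6]
```

This matches the cost profile listed for forward key-maps: zero calls to T and a number of accesses linear in the input size.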
41
Other Properties

Transformation author provides Tracing Procedure

Provided Transformation Inverse T⁻¹
– If T is an aggregator, then o's lineage is T⁻¹({o})
– Not always true for dispatchers or black-boxes
42
Tracing Procedures
Property                 Procedure   # T Calls    # Accesses
dispatcher               TraceDS     O(|I|)       O(|I|)
aggregator               TraceAG     O(2^|I|)     O(2^|I|)
black-box                return I;   0            O(|I|)
filter                   return o;   0            0
context-free aggr.       TraceCF     O(|I|^2)     O(|I|^2)
key-preserving aggr.     TraceKP     O(|I|)       O(|I|)
forward key-map          TraceFM     0            O(|I|)
backward key-map         TraceBM     0            O(|I|)
backward total-map       TraceTM     0            0
provided tracing-proc.   provided    ?            ?
43
Property Hierarchy
[Figure: hierarchy of transformation properties rooted at ANY; nodes: black-box, aggregator, context-free aggr., dispatcher, key-preserving aggr., forward key-map, backward key-map, total-map, filter, provided tracing-proc. or inverse]
44
Summary of Our Approach for One Transformation

Properties are provided with transformations
– Specified by the transformation author
– Declared in prepackaged transformations
– Derived using recent techniques [Clio01, RB01]

The best property of a transformation is selected
based on the hierarchy

The tracing procedure using the best property is
called at tracing time

Indexing techniques
45
Transformation Sequences
I → T1 → T2 → T3 → … → Tn → O
Naive algorithm traces backwards one transformation
at a time
– Need all intermediate results
– Poor performance for long sequences
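The naive algorithm can be sketched as a forward pass that materializes every intermediate result, followed by a backward pass through Tn, …, T1. This is an illustration only: `trace_sequence`, `trace_dispatcher`, `split_words`, and `keep_long` are hypothetical names, and the two example transformations are simple dispatchers chosen to keep the sketch short.

```python
def trace_dispatcher(T, inputs, o):
    """T*(o) = {i | o in T({i})}."""
    return [i for i in inputs if o in T([i])]

def trace_sequence(transformations, tracers, source, outputs):
    """Trace `outputs` back through T1..Tn one transformation at a time."""
    # Forward pass: every intermediate result is needed later.
    stages = [source]
    for T in transformations:
        stages.append(T(stages[-1]))
    # Backward pass: trace through Tn first, then Tn-1, and so on.
    current = list(outputs)
    steps = zip(reversed(transformations), reversed(tracers), reversed(stages[:-1]))
    for T, tracer, stage_input in steps:
        traced = []
        for o in current:
            for i in tracer(T, stage_input, o):
                if i not in traced:          # union of per-item lineages
                    traced.append(i)
        current = traced
    return current

def split_words(lines):
    """Example T1: split each line into words."""
    return [w for line in lines for w in line.split()]

def keep_long(words):
    """Example T2: keep words longer than four characters."""
    return [w for w in words if len(w) > 4]

LINES = ["data lineage", "view tracing", "data cube"]
result = trace_sequence([split_words, keep_long],
                        [trace_dispatcher, trace_dispatcher],
                        LINES, ["lineage"])
print(result)  # ['data lineage']
```

The `stages` list is the cost of the naive approach: one stored intermediate result per transformation, which is what combining transformations aims to reduce.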
46
Transformation Sequences
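[Figure: the sequence I → T1 → T2 → T3 → … → Tn → O, shown again with adjacent transformations combined into a single transformation T': I → T' → … → Tn → O]

Combine transformations and trace as one
– Reduces number of intermediate results
– By combining judiciously
  • Reduces tracing cost
  • Doesn't lose accuracy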
47