A Warehousing Approach to Data and Knowledge Integration

Lineage Tracing in Data Warehouses
Yingwei Cui
Stanford University Database Group
Motivation: Data Warehousing
[Figure: the warehouse table "Lucrative Fields" (Theory $320K, Databases $8800K, Networks $800K) is computed from three sources: Courses (Source 1), Enrollments (Source 2), and Students (Source 3). The viewer's reaction to the bare table: "Wow?!"]
[Figure: a lineage tracer takes the "Databases" row and returns the source rows behind it: the Databases courses (CS145, CS245) in Courses, plus the matching rows in Enrollments and Students. The viewer's reaction: "Oh, I see..."]
The Data Lineage Problem

Data warehouses integrate data from multiple
sources for analysis and mining

Data lineage: given data item o in the warehouse,
which data items in the sources were used to
derive o?

Sometimes called “drill-through” in industry
Challenges

Warehouse of relational views over relational sources
– What is a good formal definition for lineage?
– How do we trace data lineage for arbitrary views?
– How do we make it efficient?

Warehouse defined by graph of data transformations
– No fixed, well-defined relational operators
– Large transformation sequences and graphs
Contributions

Thesis contributions
– Basics of lineage tracing for relational views [TODS’00]
– Lineage tracing system prototype [ICDE’00 demo]
– Performance study and optimizations [ICDE’00, DMDW’00]
– Lineage tracing for general data transformations [VLDB’01]
– View update for deletions using data lineage [TechReport’01]

Other contributions (joint with others)
– Data warehousing performance issue [VLDB’00]
– Data management for wireless networks [Infocom’98, Globecom’97]
Outline of Talk

Part 1: Lineage tracing for relational views

Part 2: Lineage tracing for general data transformations

Part 3: View update for deletions using data lineage
(time permitting)
Part 1:
Lineage Tracing for Relational Views

Declarative definition of data lineage

Lineage tracing algorithms

Using auxiliary views for efficient lineage tracing

Experimental results (small sample)
Views We Consider

Relational algebra: σ, π, ⋈

Arbitrary use of aggregation α

Set semantics

Also in thesis
– Set operators ∪, ∩, −
– Bag semantics
[Figure: example view V as an operator tree using π, α, σ, and ⋈ over base tables R, S, T]
Simple Lineage Example
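V = α_{Y,sum(Z)}(σ_{X>Z}(R ⋈ S))

R(X, Y) = {(3, a), (8, b)}
S(Y, Z) = {(a, 2), (b, 0), (b, 9), (b, 6)}
T = R ⋈ S, with schema (X, Y, Z) = {(3, a, 2), (8, b, 0), (8, b, 9), (8, b, 6)}
U = σ_{X>Z}(T) = {(3, a, 2), (8, b, 0), (8, b, 6)}
V = α_{Y,sum(Z)}(U), with schema (Y, sum) = {(a, 2), (b, 6)}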
Lineage for Relational Operators

Unary relational operators (σ, π, α)
[Figure: op maps table R to its output; output tuple t traces back to R* ⊆ R]
Lineage of t according to op is the maximal subset R* ⊆ R such that
(1) op(R*) = {t}
(2) ∀t* ∈ R*: op({t*}) ≠ ∅
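The definition for unary operators can be checked on a small table. The following is a minimal sketch, assuming tuples are Python tuples and tables are sets; the helper names (`select_lineage`, `aggregate_sum`, `aggregate_lineage`) are illustrative and do not come from the talk.

```python
from collections import defaultdict

def select_lineage(rows, predicate, t):
    """Lineage of t under a selection: the input row equal to t, if it passes."""
    return {r for r in rows if predicate(r) and r == t}

def aggregate_sum(rows, group_idx, sum_idx):
    """alpha_{group, sum(value)} over a set of tuples."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[group_idx]] += r[sum_idx]
    return set(totals.items())

def aggregate_lineage(rows, group_idx, t):
    """Lineage of t under aggregation: every input row in t's group."""
    return {r for r in rows if r[group_idx] == t[0]}

# R(X, Y, Z), the input of the aggregation in the simple lineage example
R = {(3, "a", 2), (8, "b", 0), (8, "b", 6)}
U = aggregate_sum(R, group_idx=1, sum_idx=2)
lineage_b = aggregate_lineage(R, group_idx=1, t=("b", 6))
```

Here `lineage_b` contains (8, b, 0) even though it adds nothing to the sum, because the definition asks for the maximal subset that still produces exactly {t}.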
Lineage for Relational Operators

Example 1
R(X, Y, Z) = {(3, a, 2), (8, b, 0), (8, b, 9), (8, b, 6)}
σ_{X>Z}(R) = {(3, a, 2), (8, b, 0), (8, b, 6)}
Lineage of t according to op is the maximal subset R* ⊆ R such that
(1) op(R*) = {t}
(2) ∀t* ∈ R*: op({t*}) ≠ ∅
Lineage for Relational Operators

Example 2
R(X, Y, Z) = {(3, a, 2), (8, b, 0), (8, b, 6)}
α_{Y,sum(Z)}(R), with schema (Y, sum) = {(a, 2), (b, 6)}
Lineage of t according to op is the maximal subset R* ⊆ R such that
(1) op(R*) = {t}
(2) ∀t* ∈ R*: op({t*}) ≠ ∅
Lineage for Relational Operators

N-ary relational operators (e.g., ⋈)
[Figure: op maps R1 and R2 to its output; output tuple t traces back to R1* ⊆ R1 and R2* ⊆ R2]
Lineage of t according to op is the maximal subsets Ri* ⊆ Ri for i = 1..n such that
(1) op(R1*, …, Rn*) = {t}
(2) ∀ti* ∈ Ri*: op(R1, …, {ti*}, …, Rn) ≠ ∅
Lineage for Relational Views

Lineage of a tuple set is union of lineage of each tuple in the set

Lineage for views is defined recursively
[Figure: U = op1(R1, R2) and V = op2(U); tuple t in V traces to U* ⊆ U, which in turn traces to R1* ⊆ R1 and R2* ⊆ R2]
Lineage of t is R1*, R2*
Lineage Tracing

Convert view into a segmented normal form

Each segment: α(π(σ(E1 ⋈ … ⋈ En)))

Generate one tracing query for each segment

Apply tracing queries recursively
– # non-top α + 1

Lineage result is unaffected by normalization and segment-level tracing
Tracing Query for One Segment
R(X, Y) = {(3, a), (8, b)}
S(Y, Z) = {(a, 2), (b, 0), (b, 9), (b, 6)}
V = α_{Y,sum(Z)}(σ_{X>Z}(R ⋈ S)), with schema (Y, sum) = {(a, 2), (b, 6)}
Tracing query for t = (b, 6):
TQ = Split_{R,S}(σ_{X>Z ∧ Y=b}(R ⋈ S))
R* = {(8, b)}, S* = {(b, 0), (b, 6)}
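The tracing query for this segment can be sketched directly. This is an illustrative reading of the slide, assuming Split simply projects the selected join result back onto each source schema; `natural_join_rs` and `trace_segment` are hypothetical names.

```python
def natural_join_rs(R, S):
    """R(X, Y) joined with S(Y, Z) on Y, giving (X, Y, Z) tuples."""
    return {(x, y, z) for (x, y) in R for (y2, z) in S if y == y2}

def trace_segment(R, S, t):
    """TQ = Split_{R,S}(sigma_{X>Z and Y=t.Y}(R join S)) for view tuple t = (Y, sum)."""
    y_val = t[0]
    selected = {(x, y, z) for (x, y, z) in natural_join_rs(R, S)
                if x > z and y == y_val}
    r_star = {(x, y) for (x, y, z) in selected}  # Split: project onto R's schema
    s_star = {(y, z) for (x, y, z) in selected}  # Split: project onto S's schema
    return r_star, s_star

R = {(3, "a"), (8, "b")}
S = {("a", 2), ("b", 0), ("b", 9), ("b", 6)}
r_star, s_star = trace_segment(R, S, ("b", 6))
```

With the slide's data this yields R* = {(8, b)} and S* = {(b, 0), (b, 6)}; the row (b, 9) is excluded because it fails X > Z.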
Recursive Tracing Procedure
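R(X, Y) = {(3, a), (8, b)}
S(Y, Z) = {(a, 2), (b, 0), (b, 9), (b, 6)}
T(Y, W) = {(a, p), (b, p), (b, q)}
V = α_{W,avg(sum)}(α_{Y,sum(Z)}(σ_{X>Z}(R ⋈ S)) ⋈ T)
U = α_{Y,sum(Z)}(σ_{X>Z}(R ⋈ S)), with schema (Y, sum) = {(a, 2), (b, 6)}
V, with schema (W, avg) = {(p, 4), (q, 6)}
Tracing t = (q, 6):
TQ1 = Split_{U,T}(σ_{W=q}(U ⋈ T))
TQ2 = Split_{R,S}(σ_{X>Z ∧ Y=b}(R ⋈ S))
R* = {(8, b)}, S* = {(b, 0), (b, 6)}, T* = {(b, q)}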
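The two-step trace can be followed in code. This is a sketch under the assumption that each tracing query is evaluated against the stored intermediate result U; the function names are illustrative, not the prototype's.

```python
R = {(3, "a"), (8, "b")}
S = {("a", 2), ("b", 0), ("b", 9), ("b", 6)}
T = {("a", "p"), ("b", "p"), ("b", "q")}

def compute_u(R, S):
    """U = alpha_{Y,sum(Z)}(sigma_{X>Z}(R join S))."""
    sums = {}
    for (x, y) in R:
        for (y2, z) in S:
            if y == y2 and x > z:
                sums[y] = sums.get(y, 0) + z
    return set(sums.items())

def trace_top(U, T, t):
    """TQ1 = Split_{U,T}(sigma_{W=t.W}(U join T))."""
    w_val = t[0]
    joined = {(y, s, w) for (y, s) in U for (y2, w) in T
              if y == y2 and w == w_val}
    return {(y, s) for (y, s, w) in joined}, {(y, w) for (y, s, w) in joined}

def trace_bottom(R, S, u_star):
    """TQ2 for every tuple in U*; the lineage of a set is the union over its tuples."""
    ys = {y for (y, _) in u_star}
    sel = {(x, y, z) for (x, y) in R for (y2, z) in S
           if y == y2 and x > z and y in ys}
    return {(x, y) for (x, y, z) in sel}, {(y, z) for (x, y, z) in sel}

U = compute_u(R, S)
u_star, t_star = trace_top(U, T, ("q", 6))    # first segment
r_star, s_star = trace_bottom(R, S, u_star)   # second segment
```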
Making It Efficient

Source accesses are usually expensive or impossible

Need some intermediate results for lineage tracing
Store auxiliary views at the warehouse
– Reduce or eliminate source accesses
– Reduce recomputation of intermediate results
Auxiliary Views

There are many possible auxiliary views

For single-segment views α(π(σ(R1 ⋈ … ⋈ Rn)))
– Identified 10 possible auxiliary view schemes
– Studied performance tradeoffs

For arbitrary views
– Hard optimization problem
– Exhaustive and heuristic algorithms
– Performance study
Auxiliary Views: Performance Tradeoffs
+ Always improve lineage tracing
– Must be maintained when sources change
+ Can also help with maintenance of original user views
Auxiliary View Schemes for
Single-Segment Views
Parameters:
- 3-way SPJ view
- sources: 10MB each
- disk: 1Mbps
- network: 50kbps
- 1000 operations
- q/u ratio = 4
Measurements:
- tracing time
- maintenance time
Auxiliary View Selection
Algorithms for Arbitrary Views
Part 2:
Transformation Graphs

Lineage definition

Tracing algorithms

Combining transformations for lineage tracing

Experimental results (tiny sample)
[Figure: a graph of transformations T1 to T6 that loads Source 1, Source 2, and Source 3 into the data warehouse]
Transformation Example
Order
id  cust  date     prod-list
1   A     2/8/99   1(10),2(10)
2   C     4/5/99   2(5),3(10)
3   D     6/1/99   1(20),2(10)
4   B     8/6/99   1(10),3(5)
5   D     10/8/99  1(5),3(10)
6   B     12/1/99  2(10),3(10)

Product
id  name  price  valid
1   imac  1200   10/1/98-
2   vaio  2400   6/1/98-9/1/99
2   vaio  1800   9/2/99-
3   palm  500    2/1/98-7/1/98
3   palm  400    7/2/98-9/1/99
3   palm  300    9/2/99-

[Figure: transformations T1 to T7 turn Order and Product into SalesJump; the labels on the transformations are split, selection, "join", pivot, and projection]

SalesJump
name  avg3  Q4
palm  2K    6K
Lineage for General Transformations

A transformation can be an arbitrary program
[Figure: a transformation T with unknown internals]
Examples:
  select … from … where …
  main(int argc, char** argv) {…}
  sed “s/string1/string2/g” …
– One extreme: relational operators
– Another extreme: we know nothing about T
– Middle ground: based on transformation properties
Transformation Properties

Transformation classes

Additional properties
– Transformation subclasses
– Schema information
– Provided inverse or tracing procedure
Transformation Classes
dispatcher
  ∀I: T(I) = ∪_{i∈I} T({i})
  T*(o) = {i | o ∈ T({i})}
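The dispatcher definition translates almost literally into a tracing routine that calls T once per input item. The sketch below assumes a transformation is a Python function from a set of items to a set of items; `split_order` is a hypothetical stand-in for a split transformation, using a reduced Order schema (id, cust, prod-list).

```python
def split_order(items):
    """Hypothetical dispatcher: one output row per product in each order's prod-list."""
    out = set()
    for (oid, cust, prods) in items:
        for entry in prods.split(","):
            pid, quant = entry.rstrip(")").split("(")
            out.add((oid, cust, int(pid), int(quant)))
    return out

def trace_dispatcher(T, I, o):
    """T*(o) = {i in I | o in T({i})}, using one call to T per input item."""
    return {i for i in I if o in T({i})}

orders = {
    (1, "A", "1(10),2(10)"),
    (2, "C", "2(5),3(10)"),
    (3, "D", "1(20),2(10)"),
}
lineage = trace_dispatcher(split_order, orders, (2, "C", 3, 10))
```

The routine never inspects how `split_order` works; it relies only on the dispatcher property that each input item produces its outputs independently.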
Dispatcher Example
Order
id  cust  date     prod-list
1   A     2/8/99   1(10),2(10)
2   C     4/5/99   2(5),3(10)
3   D     6/1/99   1(20),2(10)
4   B     8/6/99   1(10),3(5)
5   D     10/8/99  1(5),3(10)
6   B     12/1/99  2(10),3(10)

O1 = T1(Order)
id  cust  date     pid  quant
1   A     2/8/99   1    10
1   A     2/8/99   2    10
:   :     :        :    :
5   D     10/8/99  1    5
5   D     10/8/99  3    10
6   B     12/1/99  2    10
6   B     12/1/99  3    10
Transformation Classes
dispatcher
  ∀I: T(I) = ∪_{i∈I} T({i})
  T*(o) = {i | o ∈ T({i})}
aggregator
  ∀I and T(I) = {o1…on}: ∃ unique partition I1..In of I s.t. T(Ik) = {ok}
  T*(ok) = Ik
Aggregator Example
O3
oid  name  date     price  quant
1    imac  2/8/99   1200   10
1    vaio  2/8/99   2400   10
2    vaio  4/5/99   2400   5
2    palm  4/5/99   400    10
3    imac  6/1/99   1200   20
3    vaio  6/1/99   2400   10
4    imac  8/6/99   1200   10
4    palm  8/6/99   400    5
5    imac  10/8/99  1200   5
5    palm  10/8/99  300    10
6    vaio  12/1/99  1800   10
6    palm  12/1/99  300    10

O4 = T4(O3)
name  Q1   Q2   Q3   Q4
imac  12K  24K  12K  6K
vaio  24K  12K  24K  18K
palm  0K   4K   2K   6K
Transformation Classes
dispatcher
  ∀I: T(I) = ∪_{i∈I} T({i})
  T*(o) = {i | o ∈ T({i})}
aggregator
  ∀I and T(I) = {o1…on}: ∃ unique partition I1..In of I s.t. T(Ik) = {ok}
  T*(ok) = Ik
black-box
  All others
  T*(o) = I
Transformation Classes

Most transformations are dispatchers, aggregators, or
their compositions

A transformation can be both dispatcher and aggregator
– Lineage definitions are equivalent

Transformations can be relational operators
– Lineage definitions same as relational definitions
Transformation Properties

Transformation classes

Additional properties
– Transformation subclasses
– Schema information
– Provided inverse or tracing procedure
Transformation Subclasses

Permit more efficient lineage tracing

Filter is a special dispatcher
– Each input data item produces itself or nothing

Context-free aggregator
– Whether two input data items are in the same partition
is independent of other items

Key-preserving aggregator
– Any subset of an input partition always produces the
same output key
Tracing Example: Aggregators

Consider T(I) = {o1…on}

Tracing the lineage of o for aggregator
– Partition input I into I1…In such that T(Ik) = {ok}
– Return Ik such that T(Ik) = {o}

Tracing the lineage of o for context-free aggregator
– Partition input I into I1…In such that |T(Ik)| = 1
– Return Ik such that T(Ik) = {o}
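The two procedures above can be contrasted in code. This is a rough sketch, not the talk's TraceAG or TraceCF: the general version enumerates subsets (exponential in |I|), while the context-free version groups items by pairwise tests (quadratic in the worst case). The aggregator `total_by_name` and its data are hypothetical.

```python
from itertools import combinations

def total_by_name(items):
    """Hypothetical aggregator: total sales (price * quant) per product name."""
    totals = {}
    for (name, price, quant) in items:
        totals[name] = totals.get(name, 0) + price * quant
    return set(totals.items())

def trace_aggregator(T, I, o):
    """Brute force: find I_k with T(I_k) = {o} and T(I - I_k) = T(I) - {o}."""
    items = list(I)
    rest_expected = T(I) - {o}
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            part = set(subset)
            if T(part) == {o} and T(I - part) == rest_expected:
                return part
    return set()

def trace_context_free(T, I, o):
    """Pairwise grouping: i joins a partition if |T({representative, i})| = 1."""
    partitions = []
    for i in I:
        for part in partitions:
            if len(T({part[0], i})) == 1:
                part.append(i)
                break
        else:
            partitions.append([i])
    for part in partitions:
        if T(set(part)) == {o}:
            return set(part)
    return set()

sales = {("imac", 1200, 10), ("vaio", 2400, 10), ("imac", 1200, 20), ("palm", 400, 5)}
target = ("imac", 36000)
slow = trace_aggregator(total_by_name, sales, target)
fast = trace_context_free(total_by_name, sales, target)
```

Both calls return the same partition here; the context-free version is only valid when partition membership does not depend on the other items present.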
Schema Information

Input schema A = (A1…An) and key Akey

Output schema B = (B1…Bn) and key Bkey

Schema mappings: f(A) → B and A ← g(B)

Transformations with special schema mappings
– Forward key-map: f(A) → Bkey
– Backward key-map: Akey ← g(B)
– Backward total-map: A ← g(B)
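With a key-map, tracing needs no calls to T at all, only a scan (or an index lookup) over the input. The sketch below assumes f, g, and the key extractors are supplied as plain Python functions; all names and the sample rows are illustrative.

```python
def trace_forward_keymap(I, f, out_key):
    """Forward key-map f(A) -> Bkey: inputs whose f-value equals the output's key."""
    return {i for i in I if f(i) == out_key}

def trace_backward_keymap(I, in_key, g, o):
    """Backward key-map Akey <- g(B): inputs whose key equals g applied to the output."""
    wanted = g(o)
    return {i for i in I if in_key(i) == wanted}

# A few rows shaped like O3 (oid, name, date, price, quant); the pivot's output key is name.
O3 = {
    (1, "imac", "2/8/99", 1200, 10),
    (2, "palm", "4/5/99", 400, 10),
    (4, "palm", "8/6/99", 400, 5),
    (6, "vaio", "12/1/99", 1800, 10),
}
palm_lineage = trace_forward_keymap(O3, f=lambda row: row[1], out_key="palm")

# Reduced orders (id, cust); a split output row (id, cust, pid, quant) keeps the order id.
orders = {(1, "A"), (2, "C")}
order_lineage = trace_backward_keymap(
    orders, in_key=lambda r: r[0], g=lambda o: o[0], o=(2, "C", 3, 10)
)
```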
Tracing Example: Forward Key-Maps
[Figure: T4 pivots O3 (oid, name, date, price, quant) into O4 (name, Q1, Q2, Q3, Q4), the same tables as in the Aggregator Example]
Other Properties

Provided Tracing Procedure

Provided Transformation Inverse T⁻¹
– If T is an aggregator, then o’s lineage is T⁻¹({o})
– Not always true for dispatchers or black-boxes
Tracing Procedures
Property                Procedure  # T Calls  # Accesses
dispatcher              TraceDS    O(|I|)     O(|I|)
aggregator              TraceAG    O(2^|I|)   O(2^|I|)
black-box               return I;  0          O(|I|)
filter                  return o;  0          0
context-free aggr.      TraceCF    O(|I|^2)   O(|I|^2)
key-preserving aggr.    TraceKP    O(|I|)     O(|I|)
forward key-map         TraceFM    0          O(|I|)
backward key-map        TraceBM    0          O(|I|)
backward total-map      TraceTM    0          0
provided tracing-proc.  provided   ?          ?
Property Hierarchy
[Figure: hierarchy of transformation properties rooted at ANY, covering black-box, aggregator, context-free aggr., key-preserving aggr., dispatcher, filter, forward key-map, backward key-map, total-map, and provided tracing-proc. or inverse]
Summary of Our Approach for
One Transformation

Properties are provided with transformations
– Specified by the transformation author
– Declared in prepackaged transformations
– Derived using recent techniques [Clio01, RB01]

The best property of a transformation is selected
based on the hierarchy

The tracing procedure using the best property is
called at tracing time

Indexing techniques
Transformation Sequences
I → T1 → T2 → T3 → … → Tn → O

Naive algorithm traces backwards one transformation at a time
– Need all intermediate results
– Poor performance for long sequences
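The naive algorithm can be sketched as a forward pass that keeps every intermediate result, followed by a backward pass through the per-transformation tracers. This is an illustrative sketch; the two toy transformations and the function names are assumptions, and both steps happen to be dispatchers so one tracer serves both.

```python
def double_all(items):
    """Toy dispatcher: doubles every item."""
    return {2 * x for x in items}

def keep_large(items):
    """Toy filter: keeps items greater than 4."""
    return {x for x in items if x > 4}

def trace_dispatcher(T, I, o):
    return {i for i in I if o in T({i})}

def trace_sequence(transforms, tracers, I, o):
    # Forward pass: materialize every intermediate result.
    intermediates = [I]
    for T in transforms:
        intermediates.append(T(intermediates[-1]))
    # Backward pass: trace one transformation at a time, from last to first.
    targets = {o}
    for k in range(len(transforms) - 1, -1, -1):
        traced = set()
        for item in targets:
            traced |= tracers[k](transforms[k], intermediates[k], item)
        targets = traced
    return targets

source = {1, 2, 3}
lineage = trace_sequence(
    [double_all, keep_large], [trace_dispatcher, trace_dispatcher], source, 6
)
```

The list `intermediates` grows with the length of the sequence, which is the storage cost the slide points to.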
Transformation Sequences
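[Figure: the sequence I → T1 → T2 → T3 → … → Tn → O is rewritten as I → T’ → … → Tn → O, where T’ combines adjacent transformations]

Combine transformations and trace as one
– Reduces number of intermediate results
– By combining judiciously
  – Reduces tracing cost
  – Doesn’t lose accuracy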
Overall Approach

Algorithm for deriving properties of T = T1 • T2
from properties of T1 and T2

Coarse-grained cost metric for a tracing sequence
based on transformation properties

Greedy algorithm
Example of Greedy Algorithm
[Figure: greedy combination of the sequence T4 fkmap(2), T5 btmap(1), T6 filter(1), T7 bkmap(2). The candidate merges shown are labelled T4’ and T6’; the property labels on the candidates include fkmap(2), bkmap(2), and blkbox(5).]
Multiple-Input Example
O1
id  cust  date     pid  quant
1   A     2/8/99   1    10
1   A     2/8/99   2    10
:   :     :        :    :
5   D     10/8/99  1    5
5   D     10/8/99  3    10
6   B     12/1/99  2    10
6   B     12/1/99  3    10

O2
id  name  price  valid
1   imac  1200   10/1/98-
2   vaio  2400   6/1/98-9/1/99
2   vaio  1800   9/2/99-
3   palm  400    7/2/98-9/1/99
3   palm  300    9/2/99-

O3 = T3(O1, O2); T3 is labelled "dispatcher" on each of its two inputs
oid  name  date     price  quant
1    imac  2/8/99   1200   10
1    vaio  2/8/99   2400   10
:    :     :        :      :
5    imac  10/8/99  1200   5
5    palm  10/8/99  300    10
6    vaio  12/1/99  1800   10
6    palm  12/1/99  300    10
Transformation Graphs
[Figure: transformation graph from inputs I1 and I2 to output O]
Definition time
– Specify properties of each transformation in graph
– Consider each path as a transformation sequence
– Combine transformations in each sequence
Transformation Graphs
[Figure: transformation graph from inputs I1 and I2 to output O]
Definition time

Load time
– Save intermediate results and build indices as desired

Tracing time
– Trace lineage through each sequence
– Combine results
Example Revisited
[Figure: the example graph, in which Order feeds T1 and Product feeds T2, both feed T3, and T3 → T4 → T5 → T6 → T7 produces SalesJump. Transformations are annotated with their properties (dispatcher, filter, bkmap, fkmap, btmap), shown before and after combining.]
Experimental Results
Transformation graph based on a complex TPC-D query (Q12)
Part 3:
View Update Using Data Lineage

View update: translating updates on views to updates on
base tables

Obvious connection to lineage in case of view deletions

Fresh approach with improved results
View Update Translations:
Valid and Exact
[Figure: view tuple t in view V, defined over base tables R1, R2, …, Rn]
Our Algorithm

Uses lineage to:
– Find an exact translation whenever one exists
(in linear time for many cases)
– Find a “good” translation when no exact translation exists

Fully automatic

Previous approaches
– Don’t always find an exact translation
– Often require user input
– Consider restricted classes of views
Related Work

Schema-level lineage tracing (annotation-based)
[BB99, HQGW93, RS98]

Drill-down or drill-through on data cubes [Gray95]

“Weak inverse” for transformations [WS97]

Warehouse load resumption [LGMW00]

Data cleaning [GFSS+01]

View update [DB82, Mas84, Kel85]
Conclusions

Data lineage problem in two scenarios
– Warehouse defined by relational views
– Warehouse defined by general data transformations

For both scenarios, we provide:
– Formal lineage definition
– Lineage tracing algorithms
– Optimization techniques
– System prototype and performance study

Use lineage for the view update problem
Some Open Problems

Lineage of “missing” view or base tuples

Deriving transformation properties

Combining with annotation-based approach

View update
– Translation ambiguity
– Base table constraints
– Multiple interacting views
Lineage Applications

On-line analytical processing (OLAP)

Scientific databases

Sensory and monitoring systems

Data cleaning

Warehouse resumption

Data security

View update
Lineage Tracing

Convert view definition into a segmented normal form
[Figure: view V as an operator tree of π, σ, ⋈, and α over base tables R, S, T, W, shown before and after normalization]

Generate one tracing query for each ASPJ segment

Apply tracing queries top-down through view definition

Lineage result is unaffected by normalization
Tracing Example
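R(K1, X, Y) = {(1, a, p), (2, b, q), (3, c, r)}
S(K2, X, Z) = {(1, b, 2), (2, a, 4), (3, b, 3), (4, d, 8), (5, b, 9)}
V = α_{X,avg(Z)}(σ_{K1<K2}(R ⋈ S)), with schema (X, avg) = {(a, 4), (b, 6)}
Tracing query for t = (b, 6):
TQ = Split_{R,S}(σ_{K1<K2 ∧ X=b}(R ⋈ S))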
Split Lineage Tables (SLT)
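R(K1, X, Y) = {(1, a, p), (2, b, q), (3, c, r)}
S(K2, X, Z) = {(1, b, 2), (2, a, 4), (3, b, 3), (4, d, 8), (5, b, 9)}
V, with schema (X, avg) = {(a, 4), (b, 6)}
Split lineage tables, obtained by applying Split to the σ result:
R'(K1, X, Y) = {(1, a, p), (2, b, q)}
S'(K2, X, Z) = {(2, a, 4), (3, b, 3), (5, b, 9)}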
Base Table Projections (BP)
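R(K1, X, Y) = {(1, a, p), (2, b, q), (3, c, r)}
R’ = π(R), with schema (K1, X) = {(1, a), (2, b), (3, c)}
S(K2, X, Z) = {(1, b, 2), (2, a, 4), (3, b, 3), (4, d, 8), (5, b, 9)}
S’ = π(S), with schema (K2, X) = {(1, b), (2, a), (3, b), (4, d), (5, b)}
V, with schema (X, avg) = {(a, 4), (b, 6)}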
Context-Free Aggregator
Example
[Figure: T4 pivots O3 (oid, name, date, price, quant) into O4 (name, Q1, Q2, Q3, Q4), the same tables as in the Aggregator Example]
Tracing Example 1

Tracing procedure for context-free aggregators
– Partition input I into I1…In such that |T(Ik)| = 1;
– Return Ik s.t. T(Ik) = {o};
Lineage Equivalence

Lineages of equivalent SPJ views are equivalent

Not for ASPJ views
R(X, Y, Z) = {(3, a, 2), (8, b, 0), (8, b, 6)}
U = α_{Y,sum(Z)}(R), with schema (Y, sum) = {(a, 2), (b, 6)}
Lineage Equivalence

Lineages of equivalent SPJ views are equivalent

Not for ASPJ views
R(X, Y, Z) = {(3, a, 2), (8, b, 0), (8, b, 6)}
aB=0
U = α_{Y,sum(Z)}, with schema (Y, sum) = {(a, 2), (b, 6)}
Non-Context-Free Example
Indices Help!

Conventional index
– On input key Akey for a backward key-map with Akey ← g(B)

Functional index
– On f(A) for a forward key-map with f(A) → Bkey
– On T(A) for a dispatcher

Lineage index
– Mapping the key of each output data item o to
the keys of input data items in o’s lineage
Experimental Results
Tracing through an “SP” transformation over TPC-D table PartSupp
Tracing Through Sequences

Tracing cost estimation
– Divide properties into 5 groups
– T’s cost level depends on the group of its best property
– Associate a sequence with N[1..5] where N[k] records
the number of transformations with cost level k

Greedy algorithm
– Pick a combination that results in the lowest N
Lineage Annotation (Appendix)
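[Figure: data items 1 to 4 pass through T1 and then T2; intermediate and output items carry lineage annotations such as {1}, {1,2}, {2,4}, {4}, and {1,2,4}, with T1* and T2* tracing backward]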
Multiple Inputs and Outputs
[Figure: transformation T with inputs I1, I2, …, Im and outputs O1, O2, …, On]

Define properties for each input and output

Trace lineage for each input/output pair using single-input single-output tracing procedures
View Update
[Figure: view update UV takes V to V’; the question is which base update UD takes D to D’]

Deletions on SPJ view → deletions on base database

View tuple deletion request –t and base tuple deletion ΔD

ΔD is a translation for –t if {t} ⊆ ΔV = V(D) – V(D – ΔD)

Side-effect E = ΔV – {t}; ΔD is exact if E = ∅
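These definitions can be exercised on a two-table view. The sketch below assumes a hypothetical SPJ view V = π_{X,Z}(R ⋈ S) over R(X, Y) and S(Y, Z); the function names are illustrative.

```python
def spj_view(R, S):
    """V = pi_{X,Z}(R join S) for R(X, Y) and S(Y, Z)."""
    return {(x, z) for (x, y) in R for (y2, z) in S if y == y2}

def delta_view(R, S, dR, dS):
    """Delta V = V(D) - V(D - Delta D)."""
    return spj_view(R, S) - spj_view(R - dR, S - dS)

def is_translation(R, S, dR, dS, t):
    """Delta D is a translation for -t if t is among the deleted view tuples."""
    return t in delta_view(R, S, dR, dS)

def side_effect(R, S, dR, dS, t):
    """E = Delta V - {t}; the translation is exact when E is empty."""
    return delta_view(R, S, dR, dS) - {t}

R = {(1, "a"), (2, "a")}
S = {("a", 10)}
t = (1, 10)

# Deleting (1, a) from R removes only t, so it is an exact translation.
exact_effect = side_effect(R, S, {(1, "a")}, set(), t)
# Deleting (a, 10) from S also removes (2, 10): a translation with a side-effect.
lossy_effect = side_effect(R, S, set(), {("a", 10)}, t)
```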
Relationships to Data Lineage
For an SPJ view:
∀ti ∈ Ri: ti belongs to t’s lineage Ri* iff
  {t} ⊆ π_A(σ_C(R1 ⋈ … ⋈ {ti} ⋈ … ⋈ Rn))
∀ti ∈ Ri: ti belongs to t’s exclusive lineage Ri** iff
  {t} = π_A(σ_C(R1 ⋈ … ⋈ {ti} ⋈ … ⋈ Rn))
Intuition: ti contributes only to t
[Figure: t is derived by π_A and σ_C over R1, R2, …, Rn]
The Problem

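View update
[Figure: deleting t from view V, defined by π_A and σ_C over R1, R2, …, Rn, yields V’; the unknown is the base update that takes D to D’]

View update for deletions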
Relationships to Data Lineage

Deleting a lineage branch Ri* of t is always a translation for –t

Deleting any subset of t’s exclusive lineage D** never causes a side-effect

If –t has an exact translation ΔD, it must also have an exact translation within t’s lineage
[Figure: t is derived by π_A and σ_C over R1, R2, …, Rn]
Translating View Tuple Deletions
DELETE(t, V, D)
  compute lineage D* and exclusive lineage D**;
  IF D** is a translation THEN RETURN;
  IF ∃i s.t. Ri* causes no side-effect THEN RETURN;
  FOR each subset ΔD of D* DO
    IF ΔD is not a translation
      THEN prune all subsets of ΔD;
    ELSE IF ΔD causes a side-effect
      THEN prune all supersets of ΔD;
    ELSE RETURN;
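The three stages of DELETE can be sketched for a two-table view. This is a simplified illustration, not the talk's algorithm: it assumes the hypothetical view V = π_{X,Z}(R ⋈ S), omits the subset and superset pruning, and returns None instead of searching for a "good" translation when no exact one exists.

```python
from itertools import combinations

def spj_view(R, S):
    """V = pi_{X,Z}(R join S) for R(X, Y) and S(Y, Z)."""
    return {(x, z) for (x, y) in R for (y2, z) in S if y == y2}

def lineage(R, S, t):
    """R*, S*: base tuples that contribute to t."""
    return ({r for r in R if t in spj_view({r}, S)},
            {s for s in S if t in spj_view(R, {s})})

def exclusive_lineage(R, S, t):
    """R**, S**: base tuples that contribute only to t."""
    return ({r for r in R if spj_view({r}, S) == {t}},
            {s for s in S if spj_view(R, {s}) == {t}})

def delete(R, S, t):
    """Return (dR, dS), an exact translation for -t, or None if none is found."""
    V = spj_view(R, S)

    def effect(dR, dS):
        return V - spj_view(R - dR, S - dS)

    r_star, s_star = lineage(R, S, t)
    r_ex, s_ex = exclusive_lineage(R, S, t)
    # Stage 1: the exclusive lineage never causes a side-effect.
    if t in effect(r_ex, s_ex):
        return r_ex, s_ex
    # Stage 2: a whole lineage branch is always a translation; check for exactness.
    for dR, dS in ((r_star, set()), (set(), s_star)):
        if effect(dR, dS) == {t}:
            return dR, dS
    # Stage 3: enumerate subsets of the lineage (no pruning in this sketch).
    tagged = [("R", r) for r in r_star] + [("S", s) for s in s_star]
    for size in range(1, len(tagged) + 1):
        for combo in combinations(tagged, size):
            dR = {x for tag, x in combo if tag == "R"}
            dS = {x for tag, x in combo if tag == "S"}
            if effect(dR, dS) == {t}:
                return dR, dS
    return None

found = delete({(1, "a"), (2, "a")}, {("a", 10)}, (1, 10))
missing = delete({(1, "a"), (2, "a")}, {("a", 10), ("a", 20)}, (1, 10))
```

In the second call every candidate deletion also removes some other view tuple, so the sketch reports that no exact translation exists.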
Detailed Computations

Is D a translation for –t?
if t  pA (sC ((R1*–R1)
then D is a translation

(Rn*–Rn)))
Does D cause side-effect?
E

i=1..n
pA (sC (R1 …Ri…
if E  pA (sC ((R1–R1)
then D is exact

…
…
Rn))) – {t}
(Rn–Rn)))
Further pruning by sizes
85