Lineage Tracing in Data Warehouses
Yingwei Cui
Stanford University Database Group

Motivation: Data Warehousing
Data Warehouse view “Lucrative Fields” (reaction: “Wow?!”)

  Field      Total
  Theory     $320K
  Databases  $8800K
  Networks   $800K

Sources: Courses (Source 1), Enrollments (Source 2), Students (Source 3)

With a Lineage Tracer (reaction: “Oh, I see...”)
Highlighted warehouse tuple: Database, $8800K, 1800

  Courses (Source 1)
  CS154  Theory
  CS145  Databases
  CS244  Networks
  CS245  Databases

  Enrollments (Source 2)
  CS154  Joe
  CS145  Ted
  CS244  Bob
  CS145  Ann
  CS245  Jane
  …

  Students (Source 3)
  Ann   BS   $1K
  Bob   MS   $1K
  Jane  Web  $5K
  Joe   BS   $1K
  Ted   Web  $5K
  …

The Data Lineage Problem
Data warehouses integrate data from multiple sources for analysis and mining
Data lineage: given data item o in the warehouse, which data items in the sources were used to derive o?
Sometimes called “drill-through” in industry

Challenges
Warehouse of relational views over relational sources
  – What is a good formal definition for lineage?
  – How do we trace data lineage for arbitrary views?
  – How do we make it efficient?
Warehouse defined by graph of data transformations
  – No fixed, well-defined relational operators
  – Large transformation sequences and graphs

Contributions
Thesis contributions
  – Basics of lineage tracing for relational views [TODS’00]
  – Lineage tracing system prototype [ICDE’00 demo]
  – Performance study and optimizations [ICDE’00, DMDW’00]
  – Lineage tracing for general data transformations [VLDB’01]
  – View update for deletions using data lineage [TechReport’01]
Other contributions (joint with others)
  – Data warehousing performance issue [VLDB’00]
  – Data management for wireless networks [Infocom’98, Globecom’97]

Outline of Talk
Part 1: Lineage tracing for relational views
Part 2: Lineage tracing for general data transformations
Part 3: View update for deletions using data lineage (time permitting)

Part 1: Lineage Tracing for Relational Views
Declarative definition of data lineage
Lineage tracing algorithms
Using auxiliary views for efficient lineage tracing
Experimental results (small sample)

Views We Consider
Relational algebra σ, π, ⋈
Arbitrary use of aggregation α
Set semantics
Also in thesis
  – Set operators ∪, ∩, −
  – Bag semantics
[Figure: example view tree built from π, α, σ over base tables R, S, T]

Simple Lineage Example
V = α_{Y,sum(Z)}(σ_{X>Z}(R ⋈ S))

  R:  X  Y
      3  a
      8  b

  S:  Y  Z
      a  2
      b  0
      b  9
      b  6

  T = R ⋈ S:  X  Y  Z
              3  a  2
              8  b  0
              8  b  9
              8  b  6

  U = σ_{X>Z}(T):  X  Y  Z
                   3  a  2
                   8  b  0
                   8  b  6

  V = α_{Y,sum(Z)}(U):  Y  sum
                        a  2
                        b  6

Lineage for Relational Operators
Unary relational operators (σ, π, α)
[Figure: op maps R to a result containing t; R* ⊆ R is the part of R that t comes from]
Lineage of t according to op is the maximal subset R* ⊆ R such that
  (1) op(R*) = {t}
  (2) ∀t* ∈ R*: op({t*}) ≠ ∅

Lineage for Relational Operators
Example 1: op = σ_{X>Z}

  R:  X  Y  Z        σ_{X>Z}(R):  X  Y  Z
      3  a  2                     3  a  2
      8  b  0                     8  b  0
      8  b  9                     8  b  6
      8  b  6

Lineage for Relational Operators
Example 2: op = α_{Y,sum(Z)}

  R:  X  Y  Z        α_{Y,sum(Z)}(R):  Y  sum
      3  a  2                          a  2
      8  b  0                          b  6
      8  b  6

Lineage for Relational Operators
N-ary relational
operators (e.g., ⋈)
[Figure: op over R1 and R2, with lineage subsets R1* and R2*]
Lineage of t according to op is the maximal subsets Ri* ⊆ Ri for i = 1..n such that
  (1) op(R1*, …, Rn*) = {t}
  (2) ∀ti* ∈ Ri*: op(R1*, …, {ti*}, …, Rn*) ≠ ∅

Lineage for Relational Views
Lineage of a tuple set is the union of the lineage of each tuple in the set
Lineage for views is defined recursively
[Figure: view V defined through op1 and op2 over R1 and R2 with intermediate result U; t traces to U*, and U* traces to R1*, R2*]
Lineage of t is R1*, R2*

Lineage Tracing
Convert view into a segmented normal form
  – Each segment: α(π(σ(E1 ⋈ … ⋈ En)))
Generate one tracing query for each segment
Apply tracing queries recursively
  – # non-top α + 1
Lineage result is unaffected by normalization and segment-level tracing

Tracing Query for One Segment
V = α_{Y,sum(Z)}(σ_{X>Z}(R ⋈ S))

  R:  X  Y        S:  Y  Z        V:  Y  sum
      3  a            a  2            a  2
      8  b            b  0            b  6
                      b  9
                      b  6

TQ = Split_{R,S}(σ_{X>Z ∧ Y=b}(R ⋈ S))
R* = {(8,b)}, S* = {(b,0),(b,6)}

Recursive Tracing Procedure
V = α_{W,avg(sum)}(α_{Y,sum(Z)}(σ_{X>Z}(R ⋈ S)) ⋈ T)

  R:  X  Y        S:  Y  Z        U:  Y  sum      T:  Y  W        V:  W  avg
      3  a            a  2            a  2            a  p            p  4
      8  b            b  0            b  6            b  p            q  6
                      b  9                            b  q
                      b  6

TQ1 = Split_{U,T}(σ_{W=q}(U ⋈ T))
TQ2 = Split_{R,S}(σ_{X>Z ∧ Y=b}(R ⋈ S))
R* = {(8,b)}, S* = {(b,0),(b,6)}, T* = {(b,q)}

Making It Efficient
Source accesses are usually expensive or impossible
Need some intermediate results for lineage tracing
Store auxiliary views at the warehouse
  – Reduce or eliminate source accesses
  – Reduce recomputation of intermediate results

Auxiliary Views
There are many possible auxiliary views
For single-segment views α(π(σ(R1 ⋈ … ⋈ Rn)))
  – Identified 10 possible auxiliary view schemes
  – Studied performance tradeoffs
For arbitrary views
  – Hard optimization problem
  – Exhaustive and heuristic algorithms
  – Performance study

Auxiliary Views: Performance Tradeoffs
  + Always improve lineage tracing
  – Must be maintained when sources change
  + Can also help with maintenance of original user views

Auxiliary View Schemes for Single-Segment Views
Parameters:
- 3-way SPJ view
- sources: 10MB each
- disk: 1Mbps
- network: 50kbps
- 1000 operations
- q/u ratio = 4
Measurements:
- tracing time
- maintenance time

Auxiliary View Selection Algorithms for Arbitrary Views

Part 2: Transformation Graphs
Lineage definition
Tracing algorithms
Combining transformations for lineage tracing
Experimental results (tiny sample)
[Figure: transformations T1–T6 loading Source 1, Source 2, and Source 3 into the Data Warehouse]

Transformation Example

  Order:  id  cust  date     prod-list
          1   A     2/8/99   1(10),2(10)
          2   C     4/5/99   2(5),3(10)
          3   D     6/1/99   1(20),2(10)
          4   B     8/6/99   1(10),3(5)
          5   D     10/8/99  1(5),3(10)
          6   B     12/1/99  2(10),3(10)

  Product:  id  name  price  valid
            1   imac  1200   10/1/98-
            2   vaio  2400   6/1/98-9/1/99
            2   vaio  1800   9/2/99-
            3   palm  500    2/1/98-7/1/98
            3   palm  400    7/2/98-9/1/99
            3   palm  300    9/2/99-

  SalesJump:  name  avg3  Q4
              palm  2K    6K

[Figure: transformations T1–T7 from Order and Product to SalesJump; operation labels on the slide: split, “join”, pivot, projection, selection, projection, selection]

Lineage for General Transformations
A transformation T can be an arbitrary program:
  select … from … where …
  main(int argc, char** argv) {…}
  sed “s/string1/string2/g” …
  – One extreme: relational operators
  – Another extreme: we know nothing about T
  – Middle ground: based on transformation properties

Transformation Properties
Transformation classes
Additional properties
  – Transformation subclasses
  – Schema information
  – Provided inverse or tracing procedure

Transformation Classes
dispatcher
  ∀I: T(I) = ⋃_{i∈I} T({i})
  T*(o) = {i | o ∈ T({i})}

Dispatcher Example
O1 = T1(Order)

  O1:  id  cust  date     pid  quant
       1   A     2/8/99   1    10
       1   A     2/8/99   2    10
       :   :     :        :    :
       5   D     10/8/99  1    5
       5   D     10/8/99  3    10
       6   B     12/1/99  2    10
       6   B     12/1/99  3    10

Transformation Classes
aggregator
  ∀I and T(I) = {o1…on}: ∃ unique partition I1..In of I s.t. T(Ik) = {ok}
  T*(ok) = Ik

Aggregator Example
O4 = T4(O3)

  O3:  oid  name  date     price  quant
       1    imac  2/8/99   1200   10
       1    vaio  2/8/99   2400   10
       2    vaio  4/5/99   2400   5
       2    palm  4/5/99   400    10
       3    imac  6/1/99   1200   20
       3    vaio  6/1/99   2400   10
       4    imac  8/6/99   1200   10
       4    palm  8/6/99   400    5
       5    imac  10/8/99  1200   5
       5    palm  10/8/99  300    10
       6    vaio  12/1/99  1800   10
       6    palm  12/1/99  300    10

  O4:  name  Q1   Q2   Q3   Q4
       imac  12K  24K  12K  6K
       vaio  24K  12K  24K  18K
       palm  0K   4K   2K   6K

Transformation Classes
black-box
  All others
  T*(o) = I

Transformation Classes
Most transformations are dispatchers, aggregators, or their compositions
A transformation can be both dispatcher and aggregator
  – Lineage definitions are equivalent
Transformations can be relational operators
  – Lineage definitions same as relational definitions

Transformation Subclasses
Permit more efficient lineage tracing
Filter is a special dispatcher
  – Each input data item produces itself or nothing
Context-free aggregator
  – Whether two input data items are in the same partition is independent of other items
Key-preserving aggregator
  – Any subset of an input partition always produces the same output key

Tracing Example: Aggregators
Consider T(I) = {o1…on}
Tracing the lineage of o for aggregator
  – Partition input I into I1…In such that T(Ik) = {ok}
  – Return Ik such that T(Ik) = {o}
Tracing the lineage of o for context-free aggregator
  – Partition input I into I1…In such that |T(Ik)| = 1
  – Return Ik such that T(Ik) = {o}

Schema Information
Input schema A=(A1…An) and key Akey
Output schema B=(B1…Bn) and key Bkey
Schema mappings: f(A) → B and A ← g(B)
Transformations with special schema mappings
  – Forward key-map: f(A) → Bkey
  – Backward key-map:
Akey ← g(B)
  – Backward total-map: A ← g(B)

Tracing Example: Forward Key-Maps
[Tables O3 and O4 = T4(O3), the same data as in the Aggregator Example]

Other Properties
Provided Tracing Procedure
Provided Transformation Inverse T⁻¹
  – If T is an aggregator, then o’s lineage is T⁻¹({o})
  – Not always true for dispatchers or black-boxes

Tracing Procedures

  Property                Procedure   # T Calls   # Accesses
  dispatcher              TraceDS     O(|I|)      O(|I|)
  aggregator              TraceAG     O(2^|I|)    O(2^|I|)
  black-box               return I;   0           O(|I|)
  filter                  return o;   0           0
  context-free aggr.      TraceCF     O(|I|^2)    O(|I|^2)
  key-preserving aggr.    TraceKP     O(|I|)      O(|I|)
  forward key-map         TraceFM     0           O(|I|)
  backward key-map        TraceBM     0           O(|I|)
  backward total-map      TraceTM     0           0
  provided tracing-proc.  provided    ?           ?

Property Hierarchy
[Figure: hierarchy of properties rooted at ANY, with nodes black-box, aggregator, context-free aggr., dispatcher, key-preserving aggr., forward key-map, backward key-map, total-map, filter, and provided tracing-proc. or inverse]

Summary of Our Approach for One Transformation
Properties are provided with transformations
  – Specified by the transformation author
  – Declared in prepackaged transformations
  – Derived using recent techniques [Clio01, RB01]
The best property of a transformation is selected based on the hierarchy
The tracing procedure using the best property is called at tracing time
Indexing techniques

Transformation Sequences
I → T1 → T2 → T3 → … → Tn → O
Naive algorithm traces backwards one transformation at a time
  – Need all intermediate results
  – Poor performance for long sequences

Transformation Sequences
[Figure: the sequence I → T1 → T2 → T3 → … → Tn → O is rewritten as I → T’ → … → Tn → O]
Combine transformations and trace as one
  – Reduces number of intermediate results
  – By combining judiciously
      Reduces tracing cost
      Doesn’t lose accuracy

Overall Approach
Algorithm for deriving properties of T = T1 ∘ T2 from properties of T1 and T2
Coarse-grained cost metric for a tracing sequence based on transformation properties
Greedy algorithm

Example of Greedy Algorithm
Initial sequence: T4 fkmap(2), T5 btmap(1), T6 filter(1), T7 bkmap(2)
[Figure: successive combination steps producing T4’ and T6’; labels shown on the candidate combinations are fkmap(2), bkmap(2), and blkbox(5)]

Multiple-Input Example
O3 = T3(O1, O2); T3 is labeled dispatcher on each of its two inputs

  O1:  id  cust  date     pid  quant
       1   A     2/8/99   1    10
       1   A     2/8/99   2    10
       :   :     :        :    :
       5   D     10/8/99  1    5
       5   D     10/8/99  3    10
       6   B     12/1/99  2    10
       6   B     12/1/99  3    10

  O2:  id  name  price  valid
       1   imac  1200   10/1/98-
       2   vaio  2400   6/1/98-9/1/99
       2   vaio  1800   9/2/99-
       3   palm  400    7/2/98-9/1/99
       3   palm  300    9/2/99-

  O3:  oid  name  date     price  quant
       1    imac  2/8/99   1200   10
       1    vaio  2/8/99   2400   10
       :    :     :        :      :
       5    imac  10/8/99  1200   5
       5    palm  10/8/99  300    10
       6    vaio  12/1/99  1800   10
       6    palm  12/1/99  300    10

Transformation Graphs
[Figure: transformation graph from inputs I1 and I2 to output O]
Definition time
  – Specify properties of each transformation in graph
  – Consider each path as a transformation sequence
  – Combine transformations in each sequence

Transformation Graphs
Definition time
Load time
  – Save intermediate results and build indices as desired
Tracing time
  – Trace lineage through each sequence
  – Combine results

Example Revisited
[Figure: the graph from Order and Product through T1–T7 to SalesJump, annotated with properties (dispatcher, filter, bkmap, fkmap, btmap), shown before and after combining transformations]

Experimental Results
Transformation graph based on a complex TPC-D query (Q12)

Part 3: View Update Using Data Lineage
View update: translating updates on views to updates on base tables
Obvious connection to lineage in case of view deletions
Fresh approach with improved results

View Update Translations: Valid and Exact
[Figure: tuple t in view V defined over base tables R1, R2, …, Rn]

Our Algorithm
Uses lineage to:
  – Find an exact translation whenever one exists (in linear time for many cases)
  – Find a “good” translation when no exact translation exists
Fully automatic
Previous approaches
  – Don’t always find an exact translation
  – Often require user input
  – Consider restricted classes of views

Related Work
Schema-level lineage tracing (annotation-based) [BB99, HQGW93, RS98]
Drill-down or drill-through on data cubes [Gray95]
“Weak inverse” for transformations [WS97]
Warehouse load resumption [LGMW00]
Data cleaning [GFSS+01]
View update [DB82, Mas84, Kel85]

Conclusions
Data lineage problem in two scenarios
  – Warehouse defined by relational views
  – Warehouse defined by general data transformations
For both scenarios, we provide:
  – Formal lineage definition
  – Lineage tracing algorithms
  – Optimization techniques
  – System prototype and performance study
Use lineage for the view update problem

Some Open Problems
Lineage of “missing” view or base tuples
Deriving transformation properties
Combining with annotation-based approach
View update
  – Translation ambiguity
  – Base table constraints
  – Multiple interacting views

Lineage Applications
On-line analytical processing (OLAP)
Scientific databases
Sensory and monitoring systems
Data cleaning
Warehouse resumption
Data security
View update

Lineage Tracing
Convert view definition into a segmented normal form
[Figure: a view definition tree over R, S, T, W built from π, σ, α, before and after conversion to segmented normal form]
Generate one tracing query for each ASPJ segment
Apply tracing queries top-down through view definition
Lineage result is unaffected by normalization

Tracing Example
V = α_{X,avg(Z)}(σ_{K1<K2}(R ⋈ S))

  R:  K1  X  Y        S:  K2  X  Z        V:  X  avg
      1   a  p            1   b  2            a  4
      2   b  q            2   a  4            b  6
      3   c  r            3   b  3
                          4   d  8
                          5   b  9

TQ = Split_{R,S}(σ_{K1<K2 ∧ X=b}(R ⋈ S))

Split Lineage Tables (SLT)
R, S, and V as in the Tracing Example; Split produces:

  R':  K1  X  Y        S':  K2  X  Z
       1   a  p             2   a  4
       2   b  q             3   b  3
                            5   b  9

Base Table Projections (BP)
R, S, and V as in the Tracing Example; the projections are:

  R' = π(R):  K1  X        S' = π(S):  K2  X
              1   a                    1   b
              2   b                    2   a
              3   c                    3   b
                                       4   d
                                       5   b

Context-Free Aggregator Example
[Tables O3 and O4 = T4(O3), the same data as in the Aggregator Example]

Tracing Example 1
Tracing procedure for context-free aggregators
  – Partition input I into I1…In such that |T(Ik)| = 1;
  – Return Ik s.t. T(Ik) = {o};

Lineage Equivalence
Lineage of equivalent SPJ views are equivalent
Not for ASPJ views
[Figure: R = {(3,a,2), (8,b,0), (8,b,6)} and U = α_{Y,sum(Z)}(R) = {(a,2), (b,6)}; a second version of the slide adds an operator whose label was extracted as “aB=0”]

Non-Context-Free Example

Indices Help!
Conventional index
  – On input key Akey for a backward key-map with Akey ← g(B)
Functional index
  – On f(A) for a forward key-map with f(A) → Bkey
  – On T(A) for a dispatcher
Lineage index
  – Mapping the key of each output data item o to the keys of input data items in o’s lineage

Experimental Results
Tracing through an “SP” transformation over TPC-D table PartSupp

Tracing Through Sequences
Tracing cost estimation
  – Divide properties into 5 groups
  – T’s cost level depends on the group of its best property
  – Associate a sequence with N[1..5] where N[k] records the number of transformations with cost level k
Greedy algorithm
  – Pick a combination that results in the lowest N

Lineage Annotation (Appendix)
[Figure: input items 1, 2, 3, 4 pass through T1 and T2; data items are annotated with sets of input ids ({1}, {1,2}, {1,2}, {2,4}, {4}, {4} and {1,2}, {1,2,4}, {4}), labeled T1* and T2*]

Multiple Inputs and Outputs
[Figure: transformation T with inputs I1, I2, …, Im and outputs O1, O2, …, On]
Define properties for each input and output
Trace lineage for each input/output pair using single-input single-output tracing procedures

View Update
[Figure: view update UV takes V to V’; which base update UD takes D to D’?]
Deletions on SPJ view → deletions on base database
View tuple deletion request −t and base tuple deletion ΔD ⊆ D
ΔD is a translation for −t if {t} ⊆ ΔV = V(D) − V(D − ΔD)
Side-effect E = ΔV − {t}; ΔD is exact if E = ∅

Relationships to Data Lineage
For an SPJ view V = π_A(σ_C(R1 ⋈ … ⋈ Rn)):
ti ∈ Ri belongs to t’s lineage Ri* iff {t} ⊆ π_A(σ_C(R1 ⋈ … ⋈ {ti} ⋈ … ⋈ Rn))
ti belongs to t’s exclusive lineage Ri** iff {t} = π_A(σ_C(R1 ⋈ … ⋈ {ti} ⋈ … ⋈ Rn))
Intuition: ti contributes only to t

The Problem
View update for deletions
[Figure: SPJ view π_A σ_C over R1, R2, …, Rn; deleting t takes V to V’; which deletion takes D to D’?]

Relationships to Data Lineage
Deleting a lineage branch Ri* of t is always a translation for −t
Deleting any subset of t’s exclusive lineage D** never causes a side-effect
If −t has an exact translation ΔD, it must also have an exact translation within t’s lineage

Translating View Tuple Deletions
DELETE(t, V, D)
  compute lineage D* and exclusive lineage D**;
  IF D** is a translation THEN RETURN;
  IF ∃i s.t. Ri* causes no side-effect THEN RETURN;
  FOR each subset ΔD of D* DO
    IF ΔD is not a translation THEN prune all subsets of ΔD;
    ELSE IF ΔD causes a side-effect THEN prune all supersets of ΔD;
    ELSE RETURN;

Detailed Computations
Is ΔD a translation for −t?
  if t ∉ π_A(σ_C((R1* − ΔR1) ⋈ … ⋈ (Rn* − ΔRn))) then ΔD is a translation
Does ΔD cause a side-effect?
  Every side-effect tuple is derived from some deleted tuple, so E ⊆ ⋃_{i=1..n} π_A(σ_C(R1 ⋈ … ⋈ ΔRi ⋈ … ⋈ Rn)) − {t}
  if every tuple of that candidate set is still in π_A(σ_C((R1 − ΔR1) ⋈ … ⋈ (Rn − ΔRn))) then ΔD is exact
Further pruning by sizes
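
The lineage test and the translation and side-effect checks from the last slides can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not the thesis implementation: the function names (spj_view, lineage, is_translation, side_effect) are hypothetical, relations are Python sets of tuples, the join condition is folded into the selection predicate, and the view is a plain SPJ view π_Y(σ_{X>Z}(R ⋈ S)) over the R and S tables of the Simple Lineage Example rather than the aggregate view used there.

```python
from itertools import product


def spj_view(relations, cond, proj):
    """Evaluate pi_A(sigma_C(R1 x ... x Rn)) under set semantics.

    The join predicate is assumed to be part of cond, so a cross product suffices.
    """
    return {proj(*combo) for combo in product(*relations) if cond(*combo)}


def lineage(t, relations, cond, proj):
    """Return [R1*, ..., Rn*]: ti is in Ri* iff t is derivable with Ri replaced by {ti}."""
    result = []
    for i, rel in enumerate(relations):
        r_star = set()
        for ti in rel:
            probe = list(relations)
            probe[i] = {ti}
            if t in spj_view(probe, cond, proj):
                r_star.add(ti)
        result.append(r_star)
    return result


def is_translation(t, lin, delta, cond, proj):
    """delta is a translation for -t if t cannot be derived from what remains of its lineage."""
    remaining = [r_star - d for r_star, d in zip(lin, delta)]
    return t not in spj_view(remaining, cond, proj)


def side_effect(t, relations, delta, cond, proj):
    """View tuples other than t that disappear when delta is deleted from the base tables."""
    before = spj_view(relations, cond, proj)
    after = spj_view([r - d for r, d in zip(relations, delta)], cond, proj)
    return (before - after) - {t}


# Base tables from the Simple Lineage Example: R(X, Y) and S(Y, Z).
R = {(3, "a"), (8, "b")}
S = {("a", 2), ("b", 0), ("b", 9), ("b", 6)}


def cond(r, s):
    # Join on Y, then select X > Z.
    return r[1] == s[0] and r[0] > s[1]


def proj(r, s):
    # Project onto Y (an illustrative SPJ view, chosen for this sketch).
    return r[1]


lin_b = lineage("b", [R, S], cond, proj)
delete_r_branch = [{(8, "b")}, set()]   # delete the whole R branch of b's lineage
delete_one_s = [set(), {("b", 0)}]      # delete only one of two supporting S tuples
```

On this data, the lineage of view tuple b is R* = {(8,b)}, S* = {(b,0),(b,6)}. Deleting the R branch is a translation with an empty side-effect, so it is exact. Deleting only (b,0) from S is not a translation, because (b,6) still derives b.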