Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates

Alkis Simitsis (1), Panos Vassiliadis (2), Manolis Terrovitis (1), Spiros Skiadopoulos (1)
(1) National Technical University of Athens, {asimi,mter,spiros}@dbnet.ece.ntua.gr
(2) University of Ioannina, pvassil@cs.uoi.gr

DaWaK'05, Copenhagen, August 2005

Outline
- Background & Motivation
- Database updates and composite transformations
- Zooming in and out the Architecture Graph
- Conclusions and Future Work

Background & Motivation

Extract-Transform-Load (ETL)
[Figure: the ETL process. Data are extracted from the Sources, transformed and cleaned in the DSA, and loaded into the DW.]

Motivation
[Figure: example ETL scenario spanning Sources, DSA and DW. Labels: S1_PARTSUPP, S2_PARTSUPP, FTP1, FTP2, DS.PS_NEW, DS.PS_OLD, DIFF1, DIFF2, DS.PS1, DS.PS2, Add_SPK1, Add_SPK2, SK1, SK2, LOOKUP_PS, $2€, A2EDate, NotNULL, AddDate, CheckQTY, U, Log (rejected rows), DW.PARTSUPP, TIME, Aggregate1 (PKEY, DAY, MIN(COST)) feeding V1, Aggregate2 (PKEY, MONTH, AVG(COST)) feeding V2.]

Background
- Traditional workflow modeling has treated workflows as graphs with control-flow semantics.
- We take advantage of the data-centric, script-based nature of ETL activities to model their internals as a graph too, which we call the Architecture Graph.
- Our previous efforts handled simple cleanings and transformations [DMDW’02] and introduced templates to simplify the definition of scenarios [CAiSE’03].

Background [DMDW’02, CAiSE’03]
[Figure: recordset R feeding the activities NotNull_A1 and SK_A2, each with IN and OUT schemata over the attributes A1 to A4; the parameter POPULATED_FIELD of NotNull_A1 is bound to IN.A1.]
- Black-box model: no semantics in the graph

Why is it important?
- What part of the scenario is affected if we delete an attribute?
- Which attributes/tables are involved in the population of an attribute?
- Straightforward to follow the data propagation chain
- How "good" is my design of the ETL scenario?
  - Detection of inconsistencies
  - Detection of important, vulnerable or useless attributes
  - Well-defined quality measurement theory based on graphs

Graph-based modeling of data-centric workflows is important!!!
[Figure: attribute-level graph of the activities AddSPK1 and SK1 (IN and OUT schemata with PKEY, SUPPKEY, SKEY, SOURCE) and the recordset LOOKUP_PS. SKEY is marked as the vulnerable point in the scenario.]

    Transitive       In      Out     Deg
    SKEY             8       0       8
    Avg/attribute    2.83    2.83    5.67
    Avg/entity       5.67    5.67    11.33

Contribution
- We incorporate update semantics in our graph-based modeling of ETL activities.
- We also cover complex transformations like negation, aggregation and self-joins.
- We introduce a systematic way of transforming the Architecture Graph to allow zooming in and out at multiple levels of abstraction.
- Long version and ER’05: details for adding internal semantics

Database updates and composite transformations

Updates and Transformations
- In this paper, we consider adding graph modeling techniques for several kinds of activities:
  - Update (INS, UPD, DEL) activities
  - Aggregates
  - Rules employing negation and aliases
  - Functions
- We use LDL++ as the language that declaratively describes the semantics.

Updates to the database
An update expression is of the form
    head <- query part, update part
with the following semantics:
1. we pose a query to the database for the tuples that abide by the query part;
2. we update the predicate of the update part as specified in the rule.
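As a rough illustration of these two steps, the following Python sketch evaluates an update rule over an in-memory relation. The helper `apply_update_rule`, the tuple layout and the sample `employee` data are assumptions made for this example; they are not part of LDL++ or of the Architecture Graph model.

```python
def apply_update_rule(relation, query, update):
    """Evaluate 'head <- query part, update part' over a set of tuples.

    relation: set of tuples (the predicate being updated)
    query:    function tuple -> bool (the query part)
    update:   function tuple -> (tuples to delete, tuples to insert)
    """
    # Step 1: query the relation for the tuples that abide by the query part.
    matches = sorted(t for t in relation if query(t))

    # Step 2: update the predicate as specified by the update part.
    updated = set(relation)  # work on a copy; the input is left untouched
    for t in matches:
        to_delete, to_insert = update(t)
        updated -= set(to_delete)
        updated |= set(to_insert)
    return matches, updated


# Hypothetical data: (name, salary) tuples.
employee = {("ann", 1100), ("bob", 900)}

matches, updated = apply_update_rule(
    employee,
    query=lambda t: t[1] == 1100,
    # Delete the old tuple, insert one with a 10% raise (integer arithmetic).
    update=lambda t: ([t], [(t[0], t[1] * 110 // 100)]),
)
print(matches)
print(sorted(updated))
```

The sketch returns both the matched tuples (which would populate the head of the rule) and the updated relation, so the query and the side effect stay visible as separate results.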
    raise1(Name, Sal, NewSal) <-
        employee(Name, Sal), Sal = 1100,     (a)
        NewSal = Sal * 1.1,                  (b)
        - employee(Name, Sal),               (c)
        + employee(Name, NewSal).            (d)

Updates to the database
[Figure: Architecture Graph of the raise1 rule. Labels: raise1 head (Name, Sal, NewSal); Employee with input and output schemata (Name, Sal); function node Function_* (Scale) with the constant 1.1 and result NewSal; constant 1100; edges tagged '=', '+', '-' and '+/-'.]

Updates to the database
- A side-effect rule is treated as an activity, with the corresponding node.
- The output schema of the activity is derived from the structure of the predicate of the head of the rule.
- For every predicate with a + or - in the body of the rule, a respective provider edge from the output schema of the side-effect activity is assumed.
- For every predicate that appears in the rule without a + or - tag, we assume the respective input schema. Provider edges from this predicate towards these schemata are added as usual.
- The same applies for the attributes of the input and output schemata of the side-effect activity.

Aggregation
Aggregation in LDL:
1. grouping of values to a bag and
2. application of an aggregate function over the values of the bag.

    R16: aggregate1.a_in(skey,suppkey,date,qty,cost) <-
             dw.partsupp(skey,suppkey,date,qty,cost).
    R17: temp(skey,day,<cost>) <-
             aggregate1.a_in(skey,suppkey,date,qty,cost).
    R18: aggregate1.a_out(skey,day,min_cost) <-
             temp(skey,day,all_costs),
             aggr(min,all_costs,min_cost).
    R19: v1(skey,day,min_cost) <-
             aggregate1.a_out(skey,day,min_cost).

Aggregation
[Figure: attribute-level Architecture Graph of aggregate1 for rules R17 and R18. Labels: in (skey, suppkey, qty, cost); temp (skey, day, all_costs); aggr (min, all_costs, min_cost); out (skey, day, min_cost); head nodes of R17 and R18; intermediate node '<>' for <cost>; constant min; edges tagged 'g' and '='.]

Aggregation
- Relations which create a set from the values of a field employ a pair of regulator edges through an intermediate node '<>'.
- Provider relations for attributes used as groupers are tagged with 'g'.
- One of the attributes of the aggr function node consumes data from a constant that indicates which aggregate function should be used (e.g., avg, min, max).

Zooming in and out the Architecture Graph

Zooming in and out
- There is a principled way of zooming in and out, in various levels of abstraction:
  - The attribute level
  - The schema level
  - The activity level
[Figure: the same scenario at three levels. Activity Level: nodes S1, S2, S3, S4, A, B, C, D, E. Schema Level: B.IN, B.P, B.OUT1, B.OUT2, S4. Attribute Level: attributes a1, a2, a3.]

Zooming in and out
For each node x of the architecture graph G(V,E) representing a schema:
1. for each provider edge (x_a, y) or (y, x_a), involving an attribute x_a of x and an entity y external to x, introduce the respective provider edge between x and y (unless it already exists);
2. remove the provider edges (x_a, y) and (y, x_a) of the previous step;
3. remove the nodes of the attributes of x and the respective part-of edges.

Zooming in and out
[Figure: the aggregate1 activity (rules R17 and R18) shown as graph Ga and as graph Gs, with the nodes in, temp, aggr, out and the rule heads; the more detailed graph also shows the attributes skey, suppkey, day, qty, cost, all_costs, min_cost, the '<>' node and the constant min.]

    Measure       Definition                     Ga     Gs
    Size          Size(G)                        7      25
    Length        Max provider path              3      2
    Complexity    0.5*ext. edges + int. edges    8      36
    Cohesion      (F_IN+F_OUT) / F(IN+OUT)       1      0.75

Conclusions and Future Work

Summary
- We have incorporated update semantics in our graph-based modeling of ETL activities.
- We also cover complex transformations like negation, aggregation and self-joins.
- We have introduced a systematic way of transforming the Architecture Graph to allow zooming in and out at multiple levels of abstraction.
- Long version and ER’05: internal semantics for activities and quality measures for the design of ETL activities
  http://www.cs.uoi.gr/~pvassil/publications/2005_ER_AG/ETL_blueprints_long.pdf

On-going/Future Work
- This work is part of the Arktos II project
  http://www.cs.uoi.gr/~pvassil/projects/arktos_II
- Future work includes research in
  - What-if analysis of ETL scenarios
  - Measures for the quality of the design of ETL scenarios

Thank you!
http://www.cs.uoi.gr/~pvassil/projects/arktos_II

Backup Slides

Vision – big picture
- The major research goal is to be able to have a metadata repository that incorporates metadata
  - on the static part of an information system, i.e., tables, constraints, query forms, etc.
  - on the dynamic part, i.e., data-centric software modules
- We invest in a graph-based modeling approach, based on the flexibility of graphs as modeling tools.

Preliminaries
- Data types (e.g., Integer)
- Constants (e.g., 1)
- Attributes (e.g., PKEY)
- RecordSets (e.g., R)
- Function types (e.g., $2€)
- Functions (e.g., my$2€)

Relationships
- Instance-Of Relationships
- Part-Of Relationships
- Regulator Relationships
- Provider Relationships
- Derived Provider Relationships

Activities
[Figure: anatomy of an activity. Labels: Activity Name; Input 1, Input 2; Input Schemata; Parameters; Parameter List; Output; Output Schema; Rejected Rows; Rejections Schema; Output/Rejection Operational Semantics.]

Importance Metrics
- Dependency: the in-degree of the node with respect to the provider edges
- Responsibility: the out-degree of the node with respect to the provider edges
- Degree: dependency + responsibility
- Local vs. Transitive

Functions
Functions are treated as any other predicate in LDL, with the following special characteristics:
- The function involves a list of parameters, the last of which is the return value of the function.
- All function parameters referenced in the body of the rule, either as homonyms with attributes of other predicates or through equalities with such attributes, are linked through equality regulator relationships with these attributes.
- The return value is possibly connected to the output through a provider relationship (or with some other predicate of the body, through a regulator relationship).
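The importance metrics defined on the Importance Metrics slide can be sketched in Python over a list of provider edges. The edge list and node names below are hypothetical. The slides do not define the transitive variant, so `transitive_metrics` reflects one possible reading (counting the distinct nodes reachable along provider paths) and should be taken as an assumption rather than as the authors' definition.

```python
from collections import defaultdict


def local_metrics(provider_edges, node):
    """Return (dependency, responsibility, degree) for a node.

    dependency     = in-degree with respect to provider edges
    responsibility = out-degree with respect to provider edges
    """
    dependency = sum(1 for _, dst in provider_edges if dst == node)
    responsibility = sum(1 for src, _ in provider_edges if src == node)
    return dependency, responsibility, dependency + responsibility


def transitive_metrics(provider_edges, node):
    """Assumed transitive reading: count nodes reachable via provider paths."""
    successors, predecessors = defaultdict(set), defaultdict(set)
    for src, dst in provider_edges:
        successors[src].add(dst)
        predecessors[dst].add(src)

    def reachable(adjacency):
        # Iterative depth-first traversal starting from the node of interest.
        seen, stack = set(), [node]
        while stack:
            for nxt in adjacency[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    dependency = len(reachable(predecessors))
    responsibility = len(reachable(successors))
    return dependency, responsibility, dependency + responsibility


# Hypothetical provider edges between attribute nodes.
edges = [
    ("PS.PKEY", "SK.IN.PKEY"),
    ("SK.IN.PKEY", "SK.OUT.PKEY"),
    ("LOOKUP.SKEY", "SK.OUT.SKEY"),
    ("SK.OUT.SKEY", "DW.SKEY"),
]
print(local_metrics(edges, "SK.OUT.SKEY"))
print(transitive_metrics(edges, "SK.OUT.PKEY"))
```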
Aliases & Negation
- Alias relationships. An alias relationship is introduced whenever the same predicate appears more than once in the same rule (e.g., in the case of a self-join). All the nodes representing these occurrences of the same predicate are connected through alias relationships to denote their semantic interrelationship. Note that, because intra-activity programs do not directly interact with external recordsets or activities, this practically involves the rare case of internal intermediate rules.
- Negation. When a predicate appears negated in a rule body, the respective part-of edge between the rule and the literal’s node is tagged with ‘⌐’. Note that negated predicates can appear only in the rule body.

Activity semantics in LDL

    R06: a_in1(pkey,suppkey,date,qty,cost) <-
             ps(pkey,suppkey,date,qty,cost).
    R07: a_in2(pkey,source,skey) <-
             lookUp(l_pkey,source,l_skey),
             pkey=l_pkey, skey=l_skey, source=1.
    R08: a_out(pkey,suppkey,date,qty,cost,skey) <-
             a_in1(pkey,date,qty,cost),
             a_in2(pkey,source,l_skey).
    R09: D2E.a_in(skey,suppkey,date,qty,cost) <-
             sk.a_out(pkey,suppkey,date,qty,cost,skey).

SK Activities Program
[Figure: attribute-level Architecture Graph of the SK activity program for rules R06 to R09. Labels: DSA_PS (pkey, suppkey, date, qty, cost); a_in1; a_out; d2e.a_in (skey, suppkey, date, qty, cost); lookup (l_key, source, l_skey); a_in2 (pkey, source, skey); the rule heads; constant 1; edges tagged '='.]

Zooming in and out
[Figure: activity A shown at two levels of detail, with the schemata A.IN1, A.IN2, A.OUT and A.REJ and the parameters P, P1 and P2.]
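The three zoom-out steps listed on the main "Zooming in and out" slide can be sketched in Python as follows. The `ArchitectureGraph` class, its field names and the sample schemata are assumptions made for this illustration. The sketch also drops provider edges that connect two attributes of the same schema, which the slide does not address.

```python
from dataclasses import dataclass, field


@dataclass
class ArchitectureGraph:
    nodes: set = field(default_factory=set)
    part_of: set = field(default_factory=set)   # (schema, attribute) pairs
    provider: set = field(default_factory=set)  # (source, target) pairs


def zoom_out_schema(g, x):
    """Collapse the attributes of schema node x into x itself."""
    attrs = {a for owner, a in g.part_of if owner == x}
    for src, dst in list(g.provider):
        src_in, dst_in = src in attrs, dst in attrs
        if not (src_in or dst_in):
            continue
        # Step 2: remove the attribute-level provider edge.
        g.provider.discard((src, dst))
        if src_in and dst_in:
            continue  # assumption: edges internal to x are simply dropped
        # Step 1: introduce the edge between x and the external entity.
        # Set semantics cover "unless it already exists".
        g.provider.add((x, dst) if src_in else (src, x))
    # Step 3: remove the attribute nodes and their part-of edges.
    g.part_of = {(owner, a) for owner, a in g.part_of if owner != x}
    g.nodes -= attrs


# Hypothetical example: S.OUT feeds T.IN attribute by attribute.
g = ArchitectureGraph(
    nodes={"S.OUT", "T.IN", "S.OUT.a1", "S.OUT.a2", "T.IN.a1", "T.IN.a2"},
    part_of={("S.OUT", "S.OUT.a1"), ("S.OUT", "S.OUT.a2"),
             ("T.IN", "T.IN.a1"), ("T.IN", "T.IN.a2")},
    provider={("S.OUT.a1", "T.IN.a1"), ("S.OUT.a2", "T.IN.a2")},
)
for schema in ("S.OUT", "T.IN"):
    zoom_out_schema(g, schema)
print(sorted(g.nodes), sorted(g.provider))
```

Applying the function to every schema node in turn leaves a graph with schema-level nodes and provider edges only; the two attribute-level edges in the example collapse into a single edge from S.OUT to T.IN.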