Conceptual Modeling for ETL Processes
Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos
{pvassil,asimi,spiros}@dblab.ece.ntua.gr
National Technical University of Athens, KDBS Laboratory
http://www.dbnet.ece.ntua.gr

General Idea
- The problem: the conceptual part of the definition of ETL processes in the early stages of a DW project
- The key idea: the mapping of the attributes of the data sources to the attributes of the DW tables

Vassiliadis,Simitsis,Skiadopoulos - DOLAP'02

Outline
- Motivation
- Conceptual Model
- Instantiation and Specialization Layers
- Methodology for the usage of the conceptual model
- Conclusions and Future Work

Extract-Transform-Load (ETL)
[Figure: data is extracted from the Sources (Extract), transformed and cleaned in the DSA (Transform & Clean), and loaded into the DW (Load)]

Motivation
- Practical necessity
  - e.g., 80% of the development time in a DW project is spent on ETL
  - In-house development, ad-hoc solutions
- Lack of related work
  - The front end of the DW has monopolized the research on the conceptual part of DW modeling
- Thus, the design, development and deployment of ETL processes need modeling, design and methodological foundations

Motivation
- Early stages of the DW design:
  - Concepts are still fuzzy and changing frequently
  - Lots of interviews with people
  - No time for a full, clean-cut definition of the DW and the ETL workflow
- Still, we can:
  - Trace the mapping of the attributes of the data sources to the attributes of the DW tables
  - Trace necessary constraints and transformations for the ETL process
[Figure: source attribute S1.A, annotated with PK, mapped to warehouse attribute DW.A]

Conceptual Model
Entities of our model:
- Concepts
- Attributes
- Part-of Relationships
- Transformations
- Serial Composition of Transformations
- Provider Relationships
- Notes
- ETL Constraints
- Candidate Relationships

Conceptual Model
- Concepts
  - a name, a finite set of attributes
  - represent an entity in the source database or in the DW
- Attributes
  - same role as in ER/dimensional models
  - a granular module of information
- We do not employ standard UML notation for concepts and attributes, because we need to treat attributes as first-class citizens of our model
[Notation symbols: concept, attribute]

Conceptual Model
- Part-of Relationships
  - a finite set of attributes
  - emphasize the fact that a concept is composed of a set of attributes
[Notation symbol: part of]

Conceptual Model
Example
- Source 1: S1.PARTSUPP {PKEY, SUPPKEY, QTY, COST}
- Data Warehouse: DW.PARTSUPP {PKEY, SUPPKEY, DATE, QTY, COST}

Conceptual Model
[Figure: concept S1.PARTSUPP with attributes PKey, SuppKey, Qty, Cost and concept DW.PARTSUPP with attributes PKey, SuppKey, Date, Qty, Cost]

Conceptual Model
- Transformations
  - a finite set of input/output attributes, a symbol
  - abstractions that represent parts, or full modules of code, executing a single task
  - two categories:
    - filtering or data cleaning operations (e.g., foreign key violations)
    - transformation operations (e.g., aggregation)
[Notation symbol: transformation]

Conceptual Model
- Provider Relationships
  - a finite set of input/output attributes, an appropriate transformation
  - map a set of input attributes to a set of output attributes through a relevant transformation*
  * If the attributes are semantically and physically compatible, no transformation is required
[Notation symbols: provider 1:1, provider N:M]

Conceptual Model
[Figure: provider relationships from the attributes of S1.PARTSUPP (PKey, SuppKey, Qty, Cost) to the attributes of DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost), passing through the transformations SK, f and NN]

Conceptual Model
- Notes
  - informal tags, exactly as in UML modeling
  - used for:
    - simple comments explaining design decisions
    - explanation of the semantics of the applied transformation
    - tracing of runtime constraints
[Notation symbol: Note]

Conceptual Model
[Figure: the S1.PARTSUPP to DW.PARTSUPP diagram with transformations SK, f and NN, now annotated with the note "Date = SysDate()"]

Conceptual Model
- ETL Constraints
  - a finite set of attributes, a single transformation
  - express the fact that the data of a certain concept fulfill several requirements
[Notation symbol: ETL_constraint]

Conceptual Model
[Figure: the S1.PARTSUPP to DW.PARTSUPP diagram with transformations SK, f and NN and the note "Date = SysDate()", now with a PK ETL constraint added]

Conceptual Model
- Candidate Relationships
  - a single candidate concept, a single target concept
  - used when a certain DW concept can be populated by more than one candidate source concept
- Active Candidate Relationship
  - a certain candidate that has been selected for the population of the target concept
  - a specialization of candidate relationships
[Notation symbols: active candidate; candidate 1 ... candidate n pointing to a target under {XOR}]

Conceptual Model
[Figure: S1.PartSupp and S2.PartSupp feed DW.PartSupp through a union (U), with the note "Necessary providers: S1 and S2". Annual PartSupp's and Recent PartSupp's appear as {XOR} candidates, with the note "Due to accuracy and small size (< update window)"]

Conceptual Model
[Figure: the complete conceptual diagram of the example.
 Concepts: S1.PARTSUPP, S2.PARTSUPP, DW.PARTSUPP, with Annual PartSupp's and Recent PartSupp's as {XOR} candidates.
 Attributes shown: PKey, SuppKey, Qty, Cost, Date, Department.
 Transformations: SK, γ, f, NN, U, $2€, American to European Date.
 ETL constraint: PK.
 Aggregation labels: S2.PKey, S2.SuppKey, S2.Date, SUM(S2.Qty), SUM(S2.Cost).
 Notes: "Necessary providers: S1 and S2", "Due to accuracy and small size (< update window)", "{Duration<4h}", "Date = SysDate()"]

Instantiation & Specialization Layers
The key issues:
- genericity: identification of a small set of generic constructs to capture all cases
- usability: construction of a ‘palette’ of frequently used types

Instantiation & Specialization Layers
- Metamodel layer
  - a set of generic entities, able to represent any ETL scenario
  - involves classes: Concept, Attribute, Transformation, ETL Constraint and Relationship
- Template layer
  - a set of ‘built-in’ specializations of the entities of the Metamodel layer, specifically tailored for the most frequent elements of ETL scenarios
- Schema layer
  - a specific ETL scenario
  - all the entities of the Schema layer are instances of the classes of the Metamodel layer

Instantiation & Specialization Layers
[Figure: the three layers.
 Metamodel Layer: Concept, Attribute, Transformation, ETL_Constraint, Relationship.
 Template Layer (linked by IsA): Fact Table, Dimension, ER Entity, ER Relationship, Surrogate Key Assignment, Aggregation, American to European Date, $2€, Part Of, Provider, Candidate, Serial Composition.
 Schema Layer (linked by InstanceOf): S2.PartSupp, Candidate 1, Candidate 2, DW.PartSupp, and the transformations SK, f, γ]

Instantiation & Specialization Layers
Template layer
- Four groups of logical transformations
  - Filters
  - Unary transformations
  - Binary transformations
  - Composite transformations
- Two groups of physical transformations
  - Transfer operations
  - File operations

Instantiation & Specialization Layers
- Filters
  - Selection (σ)
  - Not null (NN)
  - Primary key violation (PK)
  - Foreign key violation (FK)
  - Unique value (UN)
  - Domain mismatch (DM)
- Unary transformations
  - Push
  - Aggregation (γ)
  - Projection (π)
  - Function application (f)
  - Surrogate key assignment (SK)
  - Tuple normalization (N)
  - Tuple denormalization (DN)
- Binary transformations
  - Union (U)
  - Join (⋈)
  - Diff (Δ)
  - Update Detection (ΔUPD)
- Composite transformations
  - Slowly changing dimension (Type 1,2,3) (SDC-1/2/3)
  - Format mismatch (FM)
  - Data type conversion (DTC)
  - Switch (σ*)
  - Extended union (U)
- Transfer operations
  - Ftp (FTP)
  - Compress/Decompress (Z/dZ)
  - Encrypt/Decrypt (Cr/dCr)
- File operations
  - EBCDIC to ASCII conversion (EB2AS)
  - Sort file (Sort)

Methodology
- Step 1: Identification of the proper data stores
- Step 2: Candidates and active candidates for the involved data stores
- Step 3: Attribute mapping between the providers and the consumers
- Step 4: Annotating the diagram with runtime constraints

Conclusions
Our contributions lie in:
- The proposal of a novel conceptual model which is customized for the tracing of inter-attribute relationships and the respective ETL activities
- A customizable and extensible construction
- The introduction of a 'palette' of frequently used ETL activities

On-going/Future Work
The Arktos II project is aimed at the
- Conceptual modeling
- Logical modeling
- Optimization
- What-if analysis
of ETL scenarios
http://www.dblab.ece.ntua.gr/~pvassil/projects/arktos_II

Thank you
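
As a recap of Step 3 of the methodology (attribute mapping between the providers and the consumers), the PARTSUPP example can be written down as plain data structures. This is only a minimal sketch, not the authors' implementation: the class names mirror the Metamodel layer, but the Python field names, the helper `unmapped_targets`, and the pairing of SK with PKey and NN with Cost are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Concept:
    """A named entity with a finite set of attributes (the part-of relationship)."""
    name: str
    attributes: Tuple[str, ...]


@dataclass(frozen=True)
class Transformation:
    """An abstraction of a single task, identified by a name and a symbol."""
    name: str
    symbol: str


@dataclass(frozen=True)
class Provider:
    """Maps input attributes to output attributes, optionally through a transformation."""
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    transformation: Optional[Transformation] = None  # None: compatible attributes


def qualified(concept):
    """Return the attribute names of a concept, prefixed with the concept name."""
    return {f"{concept.name}.{a}" for a in concept.attributes}


def unmapped_targets(target, providers):
    """Hypothetical helper: target attributes that no provider relationship populates."""
    covered = {out for p in providers for out in p.outputs}
    return sorted(qualified(target) - covered)


s1 = Concept("S1.PARTSUPP", ("PKey", "SuppKey", "Qty", "Cost"))
dw = Concept("DW.PARTSUPP", ("PKey", "SuppKey", "Date", "Qty", "Cost"))

sk = Transformation("Surrogate key assignment", "SK")
nn = Transformation("Not null", "NN")

# Assumed pairing of transformations with attributes, for illustration only.
providers = [
    Provider(("S1.PARTSUPP.PKey",), ("DW.PARTSUPP.PKey",), sk),
    Provider(("S1.PARTSUPP.SuppKey",), ("DW.PARTSUPP.SuppKey",)),
    Provider(("S1.PARTSUPP.Qty",), ("DW.PARTSUPP.Qty",)),
    Provider(("S1.PARTSUPP.Cost",), ("DW.PARTSUPP.Cost",), nn),
]

missing = unmapped_targets(dw, providers)
print(missing)
```

In this sketch the only warehouse attribute left without a provider is DW.PARTSUPP.Date, which is the attribute the slides cover with the note Date = SysDate() rather than with a source attribute.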

Back-up slides

Logical Model [DMDW’02]
[Figure: logical-level workflow of the example, spanning Sources, DSA and DW.
 Data stores: S1_PARTSUPP, S2_PARTSUPP, DS.PS_NEW1, DS.PS_OLD1, DS.PS1, DS.PS_NEW2, DS.PS_OLD2, DS.PS2, DW.PARTSUPP, TIME, V1, V2.
 Activities: FTP1, FTP2, DIFF1, DIFF2, Add_SPK1, Add_SPK2, SK1, SK2, $2€, A2EDate, NotNULL, AddDate, CheckQTY, U, Aggregate1, Aggregate2; several activities have a "rejected" output leading to a Log.
 Labels: DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY; SUPPKEY=1; DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY; COST; DATE; DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY; SUPPKEY=2; DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY; COST; DATE=SYSDATE; QTY>0; PKEY, DAY, MIN(COST); DW.PARTSUPP.DATE, DAY; PKEY, MONTH, AVG(COST)]

Conceptual Model
[Figure: notation summary with the symbols for concept, attribute, transformation, Note, ETL_constraint, provider 1:1, provider N:M, serial composition, part of, active candidate, and candidate 1 ... candidate n pointing to a target under {XOR}]

The lifecycle of a Data Warehouse and its ETL processes
[Figure: lifecycle stages and their products.
 Stages: Reverse Engineering of Sources & Requirements Collection; Logical Design; Tuning – Full Activity Description; Software Construction; Administration of DW.
 Products: Conceptual Model for DW, Sources & Activities; Logical Model for DW, Sources & Activities; Physical Model for DW, Sources & Activities; Software & SW Metrics; Metrics]

Conceptual Model
[Figure: UML class diagram of the metamodel.
 Metaclasses: Concept (+name), Attribute (+name), Transformation (+name, +symbol), ETL_Constraint, PartOf, Provider, Serial Composition, Relationship, Candidate, Active Candidate; plus Tag (+content).
 Role names: +attributes, +transformation, +schema, +input, +output, +initiating, +consequent, -candidate, -target]

Conceptual Model
General Notes
- It is not a process/workflow model
- It is orthogonal to the conceptual models which are available for the modeling of DW star schemata
  - It is specifically tailored for the back end of the DW
  - Any of the proposals for the DW front end can be combined with our approach

Conceptual Model
- Serial Composition of Transformations
  - a single initiating transformation, a single subsequent transformation
  - combine several transformations in a single provider relationship
[Notation symbol: serial composition]
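
Serial composition can be illustrated as ordinary function composition. The following is a minimal sketch under stated assumptions, not the authors' implementation: it chains the Not null (NN) filter and the $2€ conversion from the palette on a single cost value; the function names, the choice to raise `ValueError` on rejection, and the exchange rate of 0.5 are all hypothetical.

```python
from functools import reduce
from typing import Any, Callable

Step = Callable[[Any], Any]


def not_null(value):
    """NN filter: reject null values (modelled here by raising ValueError)."""
    if value is None:
        raise ValueError("NN violated: value is null")
    return value


def make_dollars_to_euros(rate):
    """Build a $2€ transformation for a given, hypothetical exchange rate."""
    def convert(cost):
        return cost * rate
    return convert


def serial_composition(initiating: Step, *subsequent: Step) -> Step:
    """Chain an initiating transformation with its subsequent ones, in order."""
    def composed(value):
        return reduce(lambda acc, step: step(acc), subsequent, initiating(value))
    return composed


# One provider relationship for Cost, carrying two transformations in series.
cost_provider = serial_composition(not_null, make_dollars_to_euros(0.5))
print(cost_provider(10.0))  # 5.0
```

Composing the steps into one callable mirrors the point of the slide: the provider relationship stays a single edge between a source attribute and a target attribute, while the chain of transformations it carries can grow.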