Conceptual Modeling for ETL processes

Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos
{pvassil,asimi,spiros}@dblab.ece.ntua.gr
National Technical University of Athens
KDBS Laboratory
http://www.dbnet.ece.ntua.gr
General Idea
The problem:
  The conceptual part of the definition of ETL processes in the early stages of a DW project
The key idea:
  The mapping of the attributes of the data sources to the attributes of the DW tables
Vassiliadis,Simitsis,Skiadopoulos - DOLAP'02
Outline
  Motivation
  Conceptual Model
  Instantiation and Specialization Layers
  Methodology for the usage of the conceptual model
  Conclusions and Future Work
Extract-Transform-Load (ETL)
[Figure: the ETL pipeline. Labels: Sources, Extract, DSA, Transform & Clean, Load, DW]
Motivation
Practical necessity
  ETL takes up, e.g., 80% of the development time in a DW project
  In-house development, ad-hoc solutions
Lack of related work
  The front end of the DW has monopolized the research on the conceptual part of DW modeling
Thus, the design, development and deployment of ETL processes need modeling, design and methodological foundations
Motivation
Early stages of the DW design:
  Concepts are still fuzzy and changing frequently
  Lots of interviews with people
  No time for a full, clean-cut definition of the DW and the ETL workflow
Still, we can:
  Trace the mapping of the attributes of the data sources to the attributes of the DW tables
  Trace necessary constraints and transformations for the ETL process
[Figure labels: S1.A, PK, DW.A]
Conceptual Model
Entities of our model:
  Concepts
  Attributes
  Part-of Relationships
  Transformations
  Serial Composition of Transformations
  Provider Relationships
  Notes
  ETL Constraints
  Candidate Relationships
Conceptual Model
Concepts
  a name, finite set of attributes
  represent an entity in the source database or in the DW
Attributes
  same role as in ER/dimensional models
  a granular module of information
We do not employ standard UML notation for concepts and attributes, because we need to treat attributes as first-class citizens of our model
[Notation symbols: concept, attribute]
Conceptual Model
Part-of Relationships
  finite set of attributes
  emphasize the fact that a concept is composed of a set of attributes
[Notation symbol: part of]
Conceptual Model
Example
  Source 1
    S1.PARTSUPP {PKEY, SUPPKEY, QTY, COST}
  Data Warehouse
    DW.PARTSUPP {PKEY, SUPPKEY, DATE, QTY, COST}
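The example concepts can be written down directly as data. The following is only a minimal Python sketch, not part of the original model: the class names `Attribute` and `Concept` mirror the slide terminology, while the helper `concept()` and the "missing attribute" check are hypothetical additions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Attribute:
    """A granular module of information, kept as a first-class object."""
    name: str


@dataclass(frozen=True)
class Concept:
    """A named entity of a source or of the DW; its attributes are its parts."""
    name: str
    attributes: Tuple[Attribute, ...]

    def attribute_names(self) -> List[str]:
        return [a.name for a in self.attributes]


def concept(name: str, *attribute_names: str) -> Concept:
    """Hypothetical helper: build a concept from plain attribute names."""
    return Concept(name, tuple(Attribute(n) for n in attribute_names))


s1_partsupp = concept("S1.PARTSUPP", "PKEY", "SUPPKEY", "QTY", "COST")
dw_partsupp = concept("DW.PARTSUPP", "PKEY", "SUPPKEY", "DATE", "QTY", "COST")

# DW attributes with no same-named attribute in the source; these need
# some other way of being populated.
missing = [n for n in dw_partsupp.attribute_names()
           if n not in s1_partsupp.attribute_names()]
print(missing)  # ['DATE']
```

Matching attributes by name is a simplification made for this sketch; the model itself relies on explicitly drawn relationships between attributes.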
Conceptual Model
[Figure: the two concepts with their attributes]
  S1.PARTSUPP: PKey, SuppKey, Qty, Cost
  DW.PARTSUPP: PKey, SuppKey, Date, Qty, Cost
Conceptual Model
Transformations
  finite set of input/output attributes, a symbol
  abstractions that represent parts, or full modules of code, executing a single task
  two categories:
    filtering or data cleaning operations (e.g., foreign key violations)
    transformation operations (e.g., aggregation)
[Notation symbol: transformation]
Conceptual Model
Provider Relationships
  finite set of input/output attributes, an appropriate transformation
  map a set of input attributes to a set of output attributes through a relevant transformation*
  * If the attributes are semantically and physically compatible, no transformation is required
[Notation symbols: provider 1:1, provider N:M]
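A provider relationship can be read as "these input attributes feed those output attributes, optionally through a transformation". The sketch below is one possible Python rendering under that reading. The `Provider` class, its `apply` method, and the conversion rate are assumptions made for illustration; only the idea of a dollar-to-euro transformation ($2€) comes from the slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class Provider:
    """Maps input attributes to output attributes, optionally via a transformation."""
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    # None means the attributes are compatible and values are copied as-is.
    transformation: Optional[Callable[..., tuple]] = None

    def apply(self, source_row: Row) -> Row:
        values = tuple(source_row[a] for a in self.inputs)
        if self.transformation is not None:
            values = self.transformation(*values)
        return dict(zip(self.outputs, values))


def dollars_to_euros(cost: float) -> tuple:
    rate = 0.5  # made-up rate, chosen only to keep the arithmetic obvious
    return (cost * rate,)


copy_qty = Provider(("QTY",), ("QTY",))                        # no transformation
convert_cost = Provider(("COST",), ("COST",), dollars_to_euros)

source_row = {"PKEY": 1, "SUPPKEY": 7, "QTY": 3, "COST": 10.0}
target_row: Row = {}
for provider in (copy_qty, convert_cost):
    target_row.update(provider.apply(source_row))
print(target_row)  # {'QTY': 3, 'COST': 5.0}
```

An N:M provider fits the same shape: `inputs` and `outputs` simply hold more than one name, and the transformation returns one value per output.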
Conceptual Model
[Figure: provider relationships between S1.PARTSUPP (PKey, SuppKey, Qty, Cost) and DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost), with the transformations SK, f and NN]
Conceptual Model
Notes
  informal tags, exactly as in UML modeling
  used for:
    simple comments explaining design decisions
    explanation of the semantics of the applied transformation
    tracing of runtime constraints
[Notation symbol: Note]
Conceptual Model
[Figure: S1.PARTSUPP (PKey, SuppKey, Qty, Cost) and DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost), with the transformations SK, f and NN and the note "Date = SysDate()"]
Conceptual Model
ETL Constraints
  finite set of attributes, a single transformation
  express the fact that the data of a certain concept fulfill several requirements
[Notation symbol: ETL_constraint]
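At the data level, an ETL constraint amounts to a check that splits rows into those that fulfill the requirement and those that do not. The following Python sketch shows two such checks that appear in the palette, not null (NN) and primary key violation (PK). The function names, the sample rows, and the choice of `PKEY` as the key are assumptions for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def check_not_null(rows: List[Row], attribute: str) -> Tuple[List[Row], List[Row]]:
    """NN: accept rows whose attribute has a value, reject the rest."""
    accepted = [r for r in rows if r.get(attribute) is not None]
    rejected = [r for r in rows if r.get(attribute) is None]
    return accepted, rejected


def check_primary_key(rows: List[Row], key: Sequence[str]) -> Tuple[List[Row], List[Row]]:
    """PK: accept the first row per key value, reject later duplicates."""
    seen = set()
    accepted: List[Row] = []
    rejected: List[Row] = []
    for r in rows:
        value = tuple(r[a] for a in key)
        if value in seen:
            rejected.append(r)
        else:
            seen.add(value)
            accepted.append(r)
    return accepted, rejected


rows = [
    {"PKEY": 1, "COST": 5.0},
    {"PKEY": 1, "COST": 6.0},   # duplicate key
    {"PKEY": 2, "COST": None},  # missing cost
]
unique_rows, pk_violations = check_primary_key(rows, ["PKEY"])
clean_rows, nn_violations = check_not_null(unique_rows, "COST")
print(clean_rows)  # [{'PKEY': 1, 'COST': 5.0}]
```

Keeping the rejected rows rather than discarding them leaves room for logging or later repair, which is a design choice of this sketch and not something the conceptual model prescribes.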
Conceptual Model
[Figure: S1.PARTSUPP (PKey, SuppKey, Qty, Cost) and DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost), with the transformations SK, f and NN, the ETL constraint PK and the note "Date = SysDate()"]
Conceptual Model
Candidate Relationships
  a single candidate concept, a single target concept
  used when a certain DW concept can be populated by more than one candidate source concept
Active Candidate Relationship
  a certain candidate that has been selected for the population of the target concept
  a specialization of candidate relationships
[Notation symbols: candidate 1, ..., candidate n, target, {XOR}, active candidate]
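The {XOR} annotation says that, among the candidates of a target, exactly one is chosen as the active candidate. A minimal Python sketch of that rule follows. The class `CandidateRelationships` and its `activate` method are hypothetical, and which concept plays the role of the target here is an assumption; the candidate names and the reason text are taken from the example diagram.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateRelationships:
    """All candidate concepts of one target; at most one is active (XOR)."""
    target: str
    candidates: List[str]
    active: Optional[str] = None
    note: str = ""

    def activate(self, candidate: str, note: str = "") -> None:
        """Select one candidate; any previous selection is replaced."""
        if candidate not in self.candidates:
            raise ValueError(f"{candidate!r} is not a candidate for {self.target}")
        self.active = candidate
        self.note = note  # design rationale, kept like a note on the diagram


choice = CandidateRelationships(
    target="S2.PartSupp",  # assumed target, for illustration only
    candidates=["Annual PartSupp's", "Recent PartSupp's"],
)
choice.activate("Recent PartSupp's",
                note="Due to accuracy and small size (< update window)")
print(choice.active)  # Recent PartSupp's
```

Storing a single `active` slot makes the XOR property hold by construction: activating another candidate overwrites the previous choice instead of adding to it.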
Conceptual Model
[Figure labels: S1.PartSupp, S2.PartSupp, DW.PartSupp, U, Annual PartSupp’s, Recent PartSupp’s, {XOR}]
Notes in the figure:
  Necessary providers: S1 and S2
  Due to accuracy and small size (< update window)
Conceptual Model
[Figure: the complete example diagram]
  Concepts: S1.PARTSUPP, S2.PARTSUPP, DW.PARTSUPP, Annual PartSupp’s, Recent PartSupp’s
  Attribute labels: PKey, SuppKey, Date, Department, Qty, Cost
  Transformations and constraints: PK, SK, NN, U, γ, f ($2€), f (American to European Date), {XOR}
  Notes in the figure:
    Necessary providers: S1 and S2
    Due to accuracy and small size (< update window)
    {Duration<4h}
    γ: S2.PKey, S2.SuppKey, S2.Date, SUM(S2.Qty), SUM(S2.Cost)
    Date = SysDate()
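The aggregation (γ) in the diagram is annotated with S2.PKey, S2.SuppKey, S2.Date, SUM(S2.Qty), SUM(S2.Cost). Reading this as "group by the first three attributes and sum the other two" (an interpretation, not something the slide spells out), the operation could be sketched in Python as follows. The function `aggregate` and the sample rows are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Row = Dict[str, object]


def aggregate(rows: List[Row], group_by: Sequence[str], sums: Sequence[str]) -> List[Row]:
    """Group rows on the group_by attributes and sum the listed attributes."""
    totals = defaultdict(lambda: dict.fromkeys(sums, 0))
    for r in rows:
        key = tuple(r[a] for a in group_by)
        for a in sums:
            totals[key][a] += r[a]
    # One output row per group, in order of first appearance.
    return [dict(zip(group_by, key), **vals) for key, vals in totals.items()]


s2_rows = [
    {"PKEY": 1, "SUPPKEY": 10, "DATE": "12/31/2002", "DEPARTMENT": "A", "QTY": 2, "COST": 3},
    {"PKEY": 1, "SUPPKEY": 10, "DATE": "12/31/2002", "DEPARTMENT": "B", "QTY": 5, "COST": 4},
    {"PKEY": 2, "SUPPKEY": 10, "DATE": "12/31/2002", "DEPARTMENT": "A", "QTY": 1, "COST": 9},
]
result = aggregate(s2_rows, ["PKEY", "SUPPKEY", "DATE"], ["QTY", "COST"])
print(result[0])
# {'PKEY': 1, 'SUPPKEY': 10, 'DATE': '12/31/2002', 'QTY': 7, 'COST': 7}
```

The Department attribute disappears in the output because it is neither grouped on nor summed.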
Instantiation & Specialization Layers
The key issues:
  genericity
    identification of a small set of generic constructs to capture all cases
  usability
    construction of a ‘palette’ of frequently used types
Instantiation & Specialization Layers
Metamodel layer
  a set of generic entities, able to represent any ETL scenario
  involves classes: Concept, Attribute, Transformation, ETL Constraint and Relationship
Template layer
  a set of ‘built-in’ specializations of the entities of the Metamodel layer, specifically tailored for the most frequent elements of ETL scenarios
Schema layer
  a specific ETL scenario
  all the entities of the Schema layer are instances of the classes of the Metamodel layer
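The three layers map naturally onto classes, subclasses and objects. The sketch below uses Python's own class mechanism as an analogy: metamodel classes are base classes, template-layer entities are subclasses (IsA), and schema-layer entities are objects (InstanceOf). Only two of the five metamodel classes are shown, and treating DW.PartSupp as a fact table is an assumption made for the example.

```python
# Metamodel layer: generic classes.
class Concept:
    def __init__(self, name: str) -> None:
        self.name = name


class Transformation:
    symbol = ""  # overridden by template-layer specializations


# Template layer: built-in specializations of the metamodel classes (IsA).
class FactTable(Concept):
    pass


class SurrogateKeyAssignment(Transformation):
    symbol = "SK"


class Aggregation(Transformation):
    symbol = "γ"


# Schema layer: the entities of one specific ETL scenario (InstanceOf).
dw_partsupp = FactTable("DW.PartSupp")  # assumed to be a fact table
sk = SurrogateKeyAssignment()

# Every schema-layer entity is also an instance of a metamodel class.
print(isinstance(dw_partsupp, Concept))  # True
print(isinstance(sk, Transformation))    # True
print(sk.symbol)                         # SK
```

Extending the palette then means adding another subclass, without touching the metamodel classes.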
Instantiation & Specialization Layers
[Figure: the three layers, connected by IsA and InstanceOf links]
  Metamodel Layer: Concept, Attribute, Transformation, ETL_Constraint, Relationship
  Template Layer: Fact Table, Dimension, ER Entity, ER Relationship, American to European Date, Surrogate Key Assignment, $2€, Aggregation, Candidate, Serial Composition, Provider, Part Of
  Schema Layer: S2.PartSupp, DW.PartSupp, Candidate 1, Candidate 2, SK, f, γ
Instantiation & Specialization Layers
Template layer
  Four groups of logical transformations
    Filters
    Unary transformations
    Binary transformations
    Composite transformations
  Two groups of physical transformations
    Transfer operations
    File operations
Instantiation & Specialization Layers
Filters
  Selection (σ)
  Not null (NN)
  Primary key violation (PK)
  Foreign key violation (FK)
  Unique value (UN)
  Domain mismatch (DM)
Unary transformations
  Push
  Aggregation (γ)
  Projection (π)
  Function application (f)
  Surrogate key assignment (SK)
  Tuple normalization (N)
  Tuple denormalization (DN)
Binary transformations
  Union (U)
  Join (⋈)
  Diff (Δ)
  Update Detection (ΔUPD)
Composite transformations
  Slowly changing dimension (Type 1,2,3) (SDC-1/2/3)
  Format mismatch (FM)
  Data type conversion (DTC)
  Switch (σ*)
  Extended union (U)
Transfer operations
  Ftp (FTP)
  Compress/Decompress (Z/dZ)
  Encrypt/Decrypt (Cr/dCr)
File operations
  EBCDIC to ASCII conversion (EB2AS)
  Sort file (Sort)
Methodology
Step 1
  Identification of the proper data stores
Step 2
  Candidates and active candidates for the involved data stores
Step 3
  Attribute mapping between the providers and the consumers
Step 4
  Annotating the diagram with runtime constraints
Conclusions
Our contribution lies in:
  The proposal of a novel conceptual model which is customized for the tracing of inter-attribute relationships and the respective ETL activities
  A customizable and extensible construction
  The introduction of a 'palette' of frequently used ETL activities
On-going/Future Work
The Arktos II project is aimed towards the
  Conceptual modeling
  Logical modeling
  Optimization
  What-if analysis
of ETL scenarios
http://www.dblab.ece.ntua.gr/~pvassil/projects/arktos_II
Thank you
Back-up slides
Logical Model [DMDW’02]
[Figure: logical-level workflow of the example ETL scenario, laid out across Sources, DSA and DW]
  Data stores: S1_PARTSUPP, S2_PARTSUPP, DS.PS_NEW1, DS.PS_OLD1, DS.PS1, DS.PS_NEW2, DS.PS_OLD2, DS.PS2, LOOKUP_PS, DW.PARTSUPP, TIME, V1, V2
  Activities: FTP1, FTP2, DIFF1, DIFF2, Add_SPK1, Add_SPK2, SK1, SK2, NotNULL, A2EDate, $2€, AddDate, CheckQTY, U, Aggregate1, Aggregate2
  Rejected rows are routed to Log
  Annotations: DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY; DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY; SUPPKEY=1; SUPPKEY=2; DS.PS1.PKEY, LOOKUP_PS.SKEY; DS.PS2.PKEY, LOOKUP_PS.SKEY; SUPPKEY; COST; DATE; DATE=SYSDATE; QTY>0; DW.PARTSUPP.DATE, DAY; PKEY, DAY, MIN(COST); PKEY, MONTH, AVG(COST)
Conceptual Model
[Figure: summary of the notation. Symbols: concept, attribute, transformation, Note, ETL_constraint, provider 1:1, provider N:M, serial composition, part of, active candidate, and candidate 1, ..., candidate n linked to a target with {XOR}]
The lifecycle of a Data Warehouse and its ETL processes
[Figure: lifecycle diagram]
  Phases: Reverse Engineering of Sources & Requirements Collection; Logical Design; Tuning – Full Activity Description; Software Construction; Administration of DW
  Artifacts: Conceptual Model for DW, Sources & Activities; Logical Model for DW, Sources & Activities; Physical Model for DW, Sources & Activities; Software & SW Metrics; Metrics
Conceptual Model
[Figure: UML class diagram of the metamodel]
  Metaclasses: ETL_Constraint, PartOf, Transformation, Serial Composition, Provider, Attribute, Concept, Relationship, Candidate, Active Candidate
  Other labels: Tag, +attributes, +transformation, +name, +symbol, +initiating, +consequent, +input, +output, +content, +schema, -candidate, -target
Conceptual Model
General Notes
  It is not a process/workflow model
  It is orthogonal to the conceptual models which are available for the modeling of DW star schemata
  It is specifically tailored for the back end of the DW
  Any of the proposals for the DW front end can be combined with our approach
Conceptual Model
Serial Composition of Transformations
  a single initiating transformation, a single subsequent transformation
  combine several transformations in a single provider relationship
[Notation symbol: serial composition]
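Serial composition chains transformations so that the output of one feeds the next, and the chain as a whole sits inside a single provider relationship. A minimal Python sketch of the idea, assuming each transformation is a one-argument function: `serial_composition` and `trim` are hypothetical helpers, and `american_to_european_date` is an illustrative guess at what the "American to European Date" transformation does (MM/DD/YYYY to DD/MM/YYYY).

```python
from functools import reduce
from typing import Callable


def serial_composition(*transformations: Callable) -> Callable:
    """Return one function that applies the transformations left to right."""
    def composed(value):
        return reduce(lambda acc, t: t(acc), transformations, value)
    return composed


def trim(text: str) -> str:
    """Hypothetical cleaning step: drop surrounding whitespace."""
    return text.strip()


def american_to_european_date(text: str) -> str:
    """Assumed semantics: MM/DD/YYYY becomes DD/MM/YYYY."""
    month, day, year = text.split("/")
    return f"{day}/{month}/{year}"


# The initiating transformation runs first, the subsequent one second.
to_dw_date = serial_composition(trim, american_to_european_date)
print(to_dw_date(" 12/31/2002 "))  # 31/12/2002
```

Because the composed function has the same shape as a single transformation, it can be used wherever one transformation is expected.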