Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates
Alkis Simitsis (1), Panos Vassiliadis (2), Manolis Terrovitis (1), Spiros Skiadopoulos (1)
(1) National Technical University of Athens, {asimi,mter,spiros}@dbnet.ece.ntua.gr
(2) University of Ioannina, pvassil@cs.uoi.gr
Outline
Background & Motivation
Database updates and composite
transformations
Zooming in and out the Architecture Graph
Conclusions and Future Work
DaWaK'05, Copenhagen, August 2005
Extract-Transform-Load (ETL)
[Figure: the ETL process. Data is extracted from the Sources, transformed and cleaned in the DSA, and loaded into the DW.]
Motivation
[Figure: example ETL scenario spanning three areas: Sources, DSA and DW. Labels recoverable from the figure: source recordsets S1_PARTSUPP and S2_PARTSUPP with transfers FTP1 and FTP2; DSA recordsets DS.PS_NEW, DS.PS_OLD, DS.PS1 and DS.PS2; activities DIFF1, Add_SPK1 (SUPPKEY=1), SK1, $2€ (COST) and A2EDate (DATE) on the first flow; DIFF2, Add_SPK2 (SUPPKEY=2), NotNULL, SK2, AddDate (DATE=SYSDATE) and CheckQTY (QTY>0) on the second flow; rejected rows written to Log; a union (U) into DW.PARTSUPP; Aggregate1 (PKEY, DAY, MIN(COST)) populating V1; Aggregate2 (PKEY, MONTH, AVG(COST)) populating V2; and the TIME recordset.]
Background
Traditional workflow modeling has treated
workflows as graphs with control-flow semantics
We take advantage of the data-centric, script-based
nature of ETL activities to model their internals as
a graph, too, which we call the Architecture Graph
Our previous work covered simple cleanings and
transformations [DMDW’02] and templates that
simplify the definition of scenarios [CAiSE’03]
Background [DMDW’02, CAiSE’03]
[Figure: Architecture Graph fragment. Recordset R (attributes A1 to A4) is connected to the IN and OUT schemata of activity NotNull_A1, whose parameter POPULATED_FIELD is bound to IN.A1, and then to activity SK_A2. Callout: black-box model, no semantics in the graph.]
Why is it important?
What part of the scenario is affected if we delete
an attribute? Which attributes/tables are involved
in the population of an attribute?
Straightforward to follow the data propagation chain
How “good” is my design of the ETL scenario?
Detection of inconsistencies
Detection of important, vulnerable or useless attributes
Well-defined quality measurement theory based on
graphs
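The two impact questions above reduce to reachability over provider edges. The following is a minimal sketch, assuming the Architecture Graph is available as a plain list of (provider, consumer) pairs; the function name `reachable` and the node names in the example are hypothetical and only loosely modeled on labels from the motivating scenario.

```python
from collections import defaultdict, deque

def reachable(edges, start, forward=True):
    """Return every node reachable from `start` along provider edges.

    forward=True follows the data flow (what is affected by `start`);
    forward=False walks it backwards (what populates `start`).
    """
    adjacency = defaultdict(set)
    for src, dst in edges:
        if forward:
            adjacency[src].add(dst)
        else:
            adjacency[dst].add(src)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # the start node is not its own dependent
    return seen

# Hypothetical provider edges (assumed for illustration only).
provider_edges = [
    ("DS.PS1.PKEY", "SK1.IN.PKEY"),
    ("SK1.IN.PKEY", "SK1.OUT.SKEY"),
    ("LOOKUP_PS.SKEY", "SK1.OUT.SKEY"),
    ("SK1.OUT.SKEY", "DW.PARTSUPP.PKEY"),
]
affected = reachable(provider_edges, "DS.PS1.PKEY")
populators = reachable(provider_edges, "SK1.OUT.SKEY", forward=False)
```

Deleting `DS.PS1.PKEY` would touch everything in `affected`, while `populators` lists the attributes involved in populating `SK1.OUT.SKEY`.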
Graph-based modeling of data-centric workflows is important!
[Figure: attribute-level graph around activities AddSPK1 and SK1 (IN and OUT schemata; attributes PKEY, SUPPKEY, SOURCE and SKEY; recordset LOOKUP_PS). A callout marks a vulnerable point in the scenario.]
Transitive:
                 In      Out     Deg
SKEY             8       0       8
Avg/attribute    2.83    2.83    5.67
Avg/entity       5.67    5.67    11.33
Contribution
We incorporate update semantics in our graph-based modeling of ETL activities.
We also cover complex transformations like
negation, aggregation and self-joins.
We introduce a systematic way of transforming the
Architecture Graph to allow zooming in and out
at multiple levels of abstraction.
Long version and ER’05: details for adding
internal semantics
Updates and Transformations
In this paper, we consider adding graph
modeling techniques for several kinds of
activities:
Update (INS, UPD, DEL) activities
Aggregates
Rules employing negation and aliases
Functions
We use LDL++ as the language that
declaratively describes the semantics
Updates to the database
An update expression is of the form
head <- query part, update part
with the following semantics:
1. we query the database for the tuples that abide by the
query part;
2. we update the predicate of the update part as
specified in the rule.
raise1(Name, Sal, NewSal) <-
    employee(Name, Sal), Sal = 1100,   (a)
    NewSal = Sal * 1.1,                (b)
    - employee(Name, Sal),             (c)
    + employee(Name, NewSal).          (d)
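To make the two-step semantics concrete, here is a minimal Python sketch of what the raise1 rule does to an in-memory employee relation. It is not LDL++ and not the authors' implementation; representing the relation as a set of (name, sal) tuples and rounding the new salary to two decimals are assumptions made for illustration.

```python
def raise1(employee):
    """Apply the raise1 rule to a set of (name, sal) tuples.

    Returns the tuples of the rule head and the updated relation.
    """
    # (a) query part: employees whose salary is exactly 1100
    matched = [(name, sal) for (name, sal) in employee if sal == 1100]
    # (b) compute the new salary; rounding hides binary floating-point noise
    head = [(name, sal, round(sal * 1.1, 2)) for (name, sal) in matched]
    updated = set(employee)
    for name, sal, new_sal in head:
        updated.discard((name, sal))   # (c) - employee(Name, Sal)
        updated.add((name, new_sal))   # (d) + employee(Name, NewSal)
    return head, updated

# Hypothetical data.
employee = {("ann", 1100), ("bob", 900)}
head, employee_after = raise1(employee)
```

The query part runs first and fixes the set of matching tuples; only then are the deletions and insertions of the update part applied.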
Updates to the database
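[Figure: Architecture Graph of the raise1 rule. The raise1 node carries a head plus input and output schemata with attributes Name, Sal and NewSal; the Employee recordset (Name, Sal) is connected through edges tagged '-' and '+'; the constant 1100 is linked to Sal through '='; the function node Function_* (labeled Scale) takes Sal and the constant 1.1 and yields NewSal.]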
Updates to the database
A side-effect rule is treated as an activity, with the
corresponding node. The output schema of the activity is
derived from the structure of the predicate of the head of
the rule.
For every predicate with a + or – in the body of the rule,
a respective provider edge from the output schema of the
side-effect activity is assumed.
For every predicate that appears in the rule without a +
or – tag, we assume the respective input schema.
Provider edges from this predicate towards these
schemata are added as usual. The same applies for the
attributes of the input and output schemata of the
side-effect activity.
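The mapping rules above can be sketched as a small function that turns the body of a side-effect rule into schema-level provider edges. This is an illustration under assumptions: the schema names `<activity>.out` and `<activity>.in<k>` and the (tag, predicate) encoding of the rule body are hypothetical, and attribute-level edges are omitted.

```python
def side_effect_edges(activity, body):
    """Derive schema-level provider edges for a side-effect rule.

    `body` is a list of (tag, predicate) pairs, where tag is "+", "-"
    or None for an untagged predicate. Each edge is returned as a
    (source, target, tag) triple.
    """
    edges = []
    input_count = 0
    for tag, predicate in body:
        if tag in ("+", "-"):
            # tagged predicate: fed from the activity's output schema
            edges.append((f"{activity}.out", predicate, tag))
        else:
            # untagged predicate: feeds a fresh input schema of the activity
            input_count += 1
            edges.append((predicate, f"{activity}.in{input_count}", None))
    return edges

# Body of raise1: one queried predicate, one deletion, one insertion.
edges = side_effect_edges(
    "raise1",
    [(None, "employee"), ("-", "employee"), ("+", "employee")],
)
```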
Aggregation
Aggregation in LDL:
1. grouping of values to a bag and
2. application of an aggregate function over the
values of the bag.
R16: aggregate1.a_in(skey,suppkey,date,qty,cost) <-
         dw.partsupp(skey,suppkey,date,qty,cost).
R17: temp(skey,day,<cost>) <-
         aggregate1.a_in(skey,suppkey,date,qty,cost).
R18: aggregate1.a_out(skey,day,min_cost) <-
         temp(skey,day,all_costs),
         aggr(min,all_costs,min_cost).
R19: v1(skey,day,min_cost) <-
         aggregate1.a_out(skey,day,min_cost).
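The two steps of LDL aggregation (grouping into a bag, then applying the aggregate function) can be mimicked in a few lines of Python. This is a sketch, not LDL++: the input rows are assumed to be already projected to (skey, day, cost), and the names `group_to_bags`, `aggr` and `AGGREGATES` are introduced here for illustration.

```python
from collections import defaultdict

# The aggregate function is selected by a constant, as in aggr(min, ...).
AGGREGATES = {
    "min": min,
    "max": max,
    "avg": lambda bag: sum(bag) / len(bag),
}

def group_to_bags(rows):
    """Step 1 (cf. R17): collect the costs of each (skey, day) group in a bag."""
    bags = defaultdict(list)
    for skey, day, cost in rows:
        bags[(skey, day)].append(cost)
    return bags

def aggr(function_name, bags):
    """Step 2 (cf. R18): apply the chosen aggregate function to every bag."""
    fn = AGGREGATES[function_name]
    return {key: fn(bag) for key, bag in bags.items()}

# Hypothetical rows.
rows = [
    (10, "2005-08-22", 5.0),
    (10, "2005-08-22", 3.5),
    (11, "2005-08-22", 7.0),
]
v1 = aggr("min", group_to_bags(rows))
```

A list is used for each bag rather than a set so that duplicate costs are preserved, which matters for functions such as avg.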
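[Figure: Architecture Graph of activity aggregate1 for rules R17 and R18. It shows the in and out schemata, the intermediate predicate temp and the aggr function node, with attributes skey, suppkey, day, qty, cost, all_costs and min_cost; grouper edges are tagged 'g', the '<>' node produces <cost>, and the constant min feeds the aggr node.]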
Aggregation
Relations which create a set from the values of a
field employ a pair of regulator edges through an
intermediate node ‘<>’.
Provider relations for attributes used as groupers
are tagged with ‘g’.
One of the attributes of the aggr function node
consumes data from a constant that indicates
which aggregate function should be used (e.g.,
avg, min, max).
Zooming in and out
There is a principled way of zooming in and out, at various levels of abstraction:
The attribute level
The schema level
The activity level
[Figure: one scenario shown at three levels. Activity Level: nodes S1, S2, S3, S4 and A, B, C, D, E. Schema Level: B.IN, B.P, B.OUT1, B.OUT2 and S4. Attribute Level: attributes a1, a2, a3.]
Zooming in and out
For each node x of the architecture graph
G(V,E) representing a schema:
1. for each provider edge (xa,y) or (y,xa),
involving an attribute of x and an entity y,
external to x, introduce the respective provider
edge between x and y (unless it already
exists, of course);
2. remove the provider edges (xa,y) and (y,xa) of
the previous step;
3. remove the nodes of the attributes of x and the
respective part-of edges.
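The three steps above can be sketched as follows. The representation is an assumption made for illustration: nodes form a set, part-of edges are given as a dictionary from each schema to its attribute nodes, and provider edges are (source, target) pairs. Dropping provider edges that connect two attributes of the same schema is also an assumption, since the steps above only speak of entities external to x.

```python
def zoom_out(nodes, part_of, provider):
    """Collapse every schema's attributes into the schema node itself."""
    nodes = set(nodes)
    provider = set(provider)
    for schema, attrs in part_of.items():
        lifted = set()
        for src, dst in provider:
            src_inside, dst_inside = src in attrs, dst in attrs
            if src_inside and dst_inside:
                continue  # edge internal to the schema (assumed dropped)
            if src_inside:
                lifted.add((schema, dst))   # steps 1-2: (xa, y) becomes (x, y)
            elif dst_inside:
                lifted.add((src, schema))   # steps 1-2: (y, xa) becomes (y, x)
            else:
                lifted.add((src, dst))      # edge does not involve this schema
        provider = lifted                   # the set removes duplicate edges
        nodes -= set(attrs)                 # step 3: remove attribute nodes
    return nodes, provider

# Hypothetical two-schema graph.
part_of = {"R": {"R.a1", "R.a2"}, "S": {"S.a1", "S.a2"}}
nodes = {"R", "S", "R.a1", "R.a2", "S.a1", "S.a2"}
provider = {("R.a1", "S.a1"), ("R.a2", "S.a2")}
schema_nodes, schema_edges = zoom_out(nodes, part_of, provider)
```

Because edges are kept in a set, the "unless it already exists" condition of step 1 is satisfied automatically: both attribute-level edges collapse into the single edge from R to S.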
[Figure: the aggregate1 activity (rules R17 and R18) drawn as two graphs, Ga and Gs. Ga shows the in and out schemata, temp and aggr with their attributes (skey, suppkey, day, qty, cost, all_costs, min_cost), the 'g' tags, the '<>' node and the constant min; Gs shows only aggregate1, R17, R18, their heads, in, out, temp and aggr.]
Measure       Definition                       Values (columns Ga, Gs)
Size          Size(G)                          7, 25
Length        Max provider path                3, 2
Complexity    0.5*ext. edges + int. edges      8, 36
Cohesion      F_IN+F_OUT, F(IN+OUT)            1, 0.75
Summary
We have incorporated update semantics in our graph-based modeling of ETL activities.
We also cover complex transformations like negation,
aggregation and self-joins.
We have introduced a systematic way of transforming the
Architecture Graph to allow zooming in and out at
multiple levels of abstraction.
Long version and ER’05: internal semantics for activities
and quality measures for the design of ETL activities
http://www.cs.uoi.gr/~pvassil/publications/2005_ER_AG/ETL_blueprints_long.pdf
Arktos II
On-going/Future Work
This work is part of the Arktos II project
http://www.cs.uoi.gr/~pvassil/projects/arktos_II
Future work includes research in
What-if analysis of ETL scenarios
Measures for the quality of the design of
ETL scenarios
Arktos II
Thank you!
Backup Slides
Vision – big picture
The major research goal is a metadata repository
that incorporates metadata
on the static part of an information system, i.e.,
tables, constraints, query forms, etc.
on the dynamic part, i.e., data-centric software
modules
We invest in a graph-based modeling
approach, based on the flexibility of graphs
as modeling tools
Preliminaries
Data types (e.g., Integer)
Constants (e.g., 1)
Attributes (e.g., PKEY)
RecordSets (e.g., R)
Function types (e.g., $2€)
Functions (e.g., my$2€)
Relationships
Instance-Of Relationships
Part-Of Relationships
Regulator Relationships
Provider Relationships
Derived Provider Relationships
Activities
[Figure: anatomy of an activity. An Activity Name with its Input Schemata (Input 1, Input 2), Parameter List (Parameters), Output Schema (Output), Rejections Schema (Rejected Rows) and Output/Rejection Operational Semantics.]
Importance Metrics
Dependency: the in-degree of the node with
respect to the provider edges;
Responsibility: the out-degree of the node
with respect to the provider edges;
Degree: dependency + responsibility
Local vs. Transitive
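The metrics above can be computed directly from the provider edges. The sketch below makes two assumptions not stated on the slide: provider edges are a list of (source, target) pairs without parallel edges, and the transitive variant counts the distinct nodes reachable through provider paths. The function name `importance` is hypothetical.

```python
def importance(edges, transitive=False):
    """Return {node: (dependency, responsibility, degree)} over provider edges."""
    nodes = {node for edge in edges for node in edge}
    successors = {node: set() for node in nodes}
    predecessors = {node: set() for node in nodes}
    for src, dst in edges:
        successors[src].add(dst)
        predecessors[dst].add(src)

    def closure(adjacency, start):
        # all nodes reachable from `start` by following `adjacency`
        seen, stack = set(), [start]
        while stack:
            for nxt in adjacency[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    result = {}
    for node in nodes:
        if transitive:
            providers = closure(predecessors, node)
            consumers = closure(successors, node)
        else:
            providers = predecessors[node]
            consumers = successors[node]
        dependency, responsibility = len(providers), len(consumers)
        result[node] = (dependency, responsibility, dependency + responsibility)
    return result

# Hypothetical provider chain a -> b -> c.
chain = [("a", "b"), ("b", "c")]
local = importance(chain)
transitive = importance(chain, transitive=True)
```

In the chain, c depends locally only on b but transitively on both a and b, which is the difference the Local vs. Transitive distinction captures.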
Functions
Functions are treated as any other predicate in LDL, with the
following special characteristics:
The function involves a list of parameters, the last of which
is the return value of the function.
All function parameters referenced in the body of the rule,
either as homonyms of attributes of other predicates or
through equalities with such attributes, are linked through
equality regulator relationships with these attributes.
The return value is possibly connected to the output
through a provider relationship (or with some other
predicate of the body, through a regulator relationship).
Aliases & Negation
Alias relationships. An alias relationship is introduced
whenever the same predicate appears more than once in the
same rule (e.g., in the case of a self-join). All the nodes
representing these occurrences of the same predicate are
connected through alias relationships to denote their semantic
interrelationship. Note that because intra-activity programs
do not directly interact with external recordsets or activities,
this practically involves the rare case of internal
intermediate rules.
Negation. When a predicate appears negated in a rule
body, the respective part-of edge between the rule and
the literal’s node is tagged with ‘⌐’. Note that negated
predicates can appear only in the rule body.
Activity semantics in LDL
R06: a_in1(pkey,suppkey,date,qty,cost) <-
         ps(pkey,suppkey,date,qty,cost).
R07: a_in2(pkey,source,skey) <-
         lookUp(l_pkey,source,l_skey),
         pkey=l_pkey, skey=l_skey, source=1.
R08: a_out(pkey,suppkey,date,qty,cost,skey) <-
         a_in1(pkey,date,qty,cost),
         a_in2(pkey,source,l_skey).
R09: D2E.a_in(skey,suppkey,date,qty,cost) <-
         sk.a_out(pkey,suppkey,date,qty,cost,skey).
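Read procedurally, rules R06 to R09 describe a surrogate-key lookup: filter the lookup table by source, join on the production key, and pass the rows on with the surrogate key. The following Python sketch mirrors that reading; the function name `surrogate_key`, the tuple layouts and the sample data are assumptions for illustration, not part of the original program.

```python
def surrogate_key(ps_rows, lookup_rows, source=1):
    """Replace production keys with surrogate keys.

    ps_rows:     (pkey, suppkey, date, qty, cost) tuples      (cf. R06)
    lookup_rows: (l_pkey, source, l_skey) tuples              (cf. R07)
    Returns (skey, suppkey, date, qty, cost) tuples           (cf. R09)
    """
    # R07: keep only the lookup entries of the requested source
    skey_of = {
        l_pkey: l_skey
        for (l_pkey, src, l_skey) in lookup_rows
        if src == source
    }
    # R08: join the input rows with the lookup entries on pkey
    joined = [
        (pkey, suppkey, date, qty, cost, skey_of[pkey])
        for (pkey, suppkey, date, qty, cost) in ps_rows
        if pkey in skey_of
    ]
    # R09: hand the rows to the next activity, led by the surrogate key
    return [
        (skey, suppkey, date, qty, cost)
        for (pkey, suppkey, date, qty, cost, skey) in joined
    ]

# Hypothetical data: only pkey 1 has a lookup entry for source 1.
ps = [(1, 7, "2005-08-22", 5, 2.0), (2, 7, "2005-08-22", 1, 3.0)]
lookup = [(1, 1, 100), (2, 2, 200)]
d2e_input = surrogate_key(ps, lookup)
```

Rows without a matching lookup entry are silently dropped in this sketch; routing them to a rejections schema is left out.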
[Figure: Architecture Graph of the program of the SK activity (rules R06 to R09 with their heads). It connects DSA_PS, a_in1, lookup, a_in2, a_out and d2e.a_in through their attributes (pkey, suppkey, date, qty, cost, skey, source); '=' nodes relate the lookup attributes to pkey and skey and tie source to the constant 1.]
Zooming in and out
[Figure: activity A shown before and after zooming, with schemata A.IN1, A.IN2, A.OUT and A.REJ and parameter nodes P, P1 and P2.]