Data Warehouse Design - Faculty of Computer Science & Engineering

Data Warehouse Design
by Duong Tuan Anh
Faculty of Computer Science and Engineering,
HCM City University of Technology.
Sept. 2011
Outline





Requirement Specification
Conceptual Design
Logical Design
Conclusions
Case study
(M. Golfarelli, 1998)
1. Requirement Specification




This phase consists of collecting and filtering the user
requirements.
It involves the designer and the end users of the DW, and
produces specifications concerning:
- the choice of facts
- preliminary indications of the workload
The choice of facts is based on the documentation of the
operational information system.
Facts are concepts of main interest for the decision
making process, and correspond to events occurring in
the enterprise world.





If the operational information system is documented by
ER schemes, a fact can be represented by an entity or an
n-ary relationship.
If it is documented by relational schemes, facts
correspond to relation schemes.
In general, entities or relationships representing
frequently updated data are good candidates for defining
facts.
The preliminary workload is expressed in pseudo-natural
language and is aimed at enabling the designer to
identify dimensions and measures during conceptual
design.
For each fact, it should specify the most interesting
measures and aggregations.
2. Conceptual Design of Data Warehouse





The conceptual design of a DW produces a
dimensional scheme, structured according to the
Dimension Fact Model (DF model). A dimensional
scheme consists of a set of fact schemes.
The basic components of a fact scheme are facts,
dimensions and hierarchies.
A fact is a focus of interest for the enterprise;
A dimension determines a point of view adopted for
representing facts;
A hierarchy determines how fact instances may be
meaningfully aggregated and selected for the
decision-making process.
A fact scheme




A fact scheme is structured as a tree whose root is a
fact.
The fact is represented by a box which reports the fact
name and, typically, one or more numeric attributes
which "measure" the fact from different points of view.
Each vertex directly attached to the fact is a dimension.
Figure 1 reports a fact scheme for fact SALE in a store
chain; quantity sold and returns are fact attributes.
Fig 1. A simple fact scheme for a chain of stores
Hierarchies

Subtrees rooted in dimensions are hierarchies. Their
vertices, represented by circles, are attributes which
may assume a discrete set of values; their arcs
represent x-to-one relationships between pairs of
attributes.
[Figure: a hierarchy rooted in the store dimension, with attributes store, city, state and sales manager]
Hierarchies



The dimension in which a hierarchy is rooted defines its
finest aggregation granularity; the attributes in the
vertices along each sub-path of the hierarchy starting
from the dimension define progressively coarser
granularities.
The fact scheme in Figure 1 has three dimensions:
week, product and store.
Some terminal vertices in the fact scheme may be
represented by lines instead of circles (size and
address in Figure 1); these vertices are non-dimension attributes.
A non-dimension attribute



A non-dimension attribute contains additional
information about an attribute in the hierarchy.
A non-dimension attribute cannot be used for
aggregation.
EX: address is a non-dimension attribute.
Some arcs within the hierarchies may be marked by
a dash: these arcs express optional relationships
between pairs of attributes.
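The structure described so far can be sketched as a small data model. This is only an illustration: the class names `Attribute` and `FactScheme` and the helper `levels` are hypothetical, and the SALE scheme below is reconstructed from the attributes named in the text, so the real Figure 1 may contain more vertices.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Attribute:
    """A vertex of a hierarchy; children are reached through x-to-one arcs."""
    name: str
    children: list[Attribute] = field(default_factory=list)
    non_dimension: bool = False  # informative only, not usable for aggregation


@dataclass
class FactScheme:
    """A fact (the root), its fact attributes and its dimensions."""
    name: str
    measures: list[str]
    dimensions: list[Attribute]


# Assumed shape of the SALE scheme, based only on names used in the text.
store = Attribute("store", [
    Attribute("city", [Attribute("state")]),
    Attribute("sales manager"),
    Attribute("address", non_dimension=True),
])
product = Attribute("product", [
    Attribute("type", [Attribute("category")]),
    Attribute("size", non_dimension=True),
])
week = Attribute("week")
sale = FactScheme("SALE", ["quantity sold", "returns"], [week, product, store])


def levels(attr: Attribute) -> list[str]:
    """List the aggregation levels of a hierarchy, skipping non-dimension attributes."""
    found = [] if attr.non_dimension else [attr.name]
    for child in attr.children:
        found.extend(levels(child))
    return found


print(levels(store))  # ['store', 'city', 'state', 'sales manager']
```

Keeping non-dimension attributes in the same tree, flagged rather than stored separately, mirrors the way they are drawn as terminal vertices of a hierarchy.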
A fact instance




A fact expresses a many-to-many relationship
among dimensions.
Each combination of values of the dimensions
defines a fact instance; fact instances are the
elemental information within the DW.
EX: a fact instance describes “the quantity of one
product sold during one week in a store, and the
corresponding total returns”.
A fact scheme may also have no fact attributes.
[Factless fact table]
Additivity





Querying a DW is aimed at extracting summary data to
fill a report to be analyzed for decision-making purposes.
It is useful to aggregate fact instances into clusters at
different levels of abstraction (roll-up).
Most fact attributes should be additive. This means that
the sum operator can be used to aggregate attribute
values along all hierarchies.
EX: the number of sales is additive: the number of sales for a given
sales manager is the sum of the number of sales for all stores
managed by that sales manager.
A fact attribute is called semi-additive if it is not additive
along one or more dimensions, and non-additive if it is
additive along no dimension.
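As a minimal sketch of what additivity means in practice, the roll-up from store to sales manager can be written as a plain sum. The store names, manager names and figures are invented for illustration, and `roll_up` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical data: number of sales per store, and the store -> sales manager
# arc of the store hierarchy.
sales_per_store = {"S1": 120, "S2": 80, "S3": 50}
manager_of = {"S1": "Ann", "S2": "Ann", "S3": "Bob"}


def roll_up(values, parent_of):
    """Aggregate an additive attribute one level up a hierarchy by summing."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)


sales_per_manager = roll_up(sales_per_store, manager_of)
print(sales_per_manager)  # {'Ann': 200, 'Bob': 50}
```

For a semi-additive or non-additive attribute, the sum in `roll_up` would have to be replaced by another operator (or the roll-up disallowed) along the dimensions where summing is not meaningful.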
Overlapping compatible fact schemes




Different facts are represented in different fact schemes.
The queries the user formulates on the DW may require
comparing fact attributes taken from related schemes.
Two fact schemes are said to be compatible if they share at
least one dimension attribute. Two compatible schemes F
and G may be overlapped to create a resulting scheme H.
In the simplest case:



The set of the fact attributes in H is the union of the sets in F and
G.
The dimensions in H are the intersection of those in F and G;
assuming that a given dimension is common to F and G if at least
one dimension attribute is shared.
Each hierarchy in H includes all and only the dimension attributes
included in the corresponding hierarchies of both F and G.
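The three rules of the simplest case can be sketched as follows. This is an illustration under simplifying assumptions: each hierarchy is written as a linear list from finest to coarsest attribute, the function `overlap` is hypothetical, and the two example schemes are invented (they are not the ones drawn in Figure 2).

```python
def overlap(f, g):
    """Overlap two compatible fact schemes (simplest case only).

    Each scheme is a dict with a set of fact attributes ("measures") and a
    list of hierarchies, each one a list ordered from finest to coarsest.
    """
    hierarchies = []
    for hf in f["hierarchies"]:
        for hg in g["hierarchies"]:
            shared = [a for a in hf if a in hg]  # keeps F's ordering
            if shared:  # the dimension is common to F and G
                hierarchies.append(shared)
    if not hierarchies:
        raise ValueError("no shared dimension attribute: schemes not compatible")
    return {"measures": f["measures"] | g["measures"], "hierarchies": hierarchies}


# Hypothetical schemes used only to exercise the rules.
sale = {"measures": {"quantity sold"},
        "hierarchies": [["week", "month"], ["product", "type"], ["store", "city"]]}
shipment = {"measures": {"quantity shipped"},
            "hierarchies": [["month", "year"], ["product", "type"]]}

h = overlap(sale, shipment)
print(h["hierarchies"])  # [['month'], ['product', 'type']]
```

In the result, the time hierarchy starts at month because week is not shared, and the store dimension disappears because the second scheme has no attribute in common with it.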
Figure 2: Fact scheme overlapping
The two schemes share the time, job and store dimensions. The
scheme resulting from overlapping the two schemes is given in Figure 3.
Figure 3: The scheme resulting from the two overlapping schemes
Representing query patterns on a fact scheme





The basic OLAP operators for formulating typical
queries on DWs are roll-up, drill-down, drill-across and
slice-and-dice.
Roll-up: aggregate fact attributes to view data at a
higher level of abstraction.
Drill-down: disaggregate fact attributes in order to
introduce further details.
Drill-across: relate and compare distinct facts.
Slice-and-dice: select and project facts so as to
reduce their dimensionality.
Query pattern




On a fact scheme, a query may be represented by a
query pattern, which consists of a set of markers placed
on the dimension attributes.
One or more markers can be placed within each
hierarchy, to indicate at what level(s) fact instances must
be aggregated.
A dimension may contain no markers, to indicate that
none of its attributes is involved in the query.
Figure 4 shows the query pattern for the query: “total
quantity sold and average returns per unit sold for each
week and for each type of product”. The average returns
per unit sold is the ratio between the total returns and the
quantity sold.
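The query above places markers on week and on type, and none on the store hierarchy. A minimal sketch of how such a pattern could be evaluated is given below; the fact instances and the product-to-type mapping are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fact instances: (week, product, store, quantity sold, returns).
facts = [
    (1, "milk", "S1", 10, 20.0),
    (1, "cream", "S2", 5, 15.0),
    (1, "cola", "S1", 8, 12.0),
    (2, "milk", "S2", 4, 8.0),
]
type_of = {"milk": "dairy", "cream": "dairy", "cola": "drink"}  # product -> type

# Markers on week and type; the store dimension carries no marker,
# so it is aggregated away completely.
totals = defaultdict(lambda: [0, 0.0])  # (week, type) -> [quantity, returns]
for week, product, _store, qty, returns in facts:
    cell = totals[(week, type_of[product])]
    cell[0] += qty
    cell[1] += returns

# Average returns per unit sold = total returns / total quantity sold,
# i.e. a ratio of two sums, not an average of per-instance ratios.
result = {key: (qty, ret / qty) for key, (qty, ret) in totals.items()}
print(result[(1, "drink")])  # (8, 1.5)
```

Computing the ratio only after both sums have been rolled up matters: the derived attribute is not additive, while the two attributes it is built from are.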
Query pattern
From ER schemes to fact schemes


It’s natural to derive the conceptual model of a DW
from the existing ER schemes.
The methodology to build a DF model consists of
the following steps:



Classifying entities
Defining facts
For each fact:





Building the attribute tree
Pruning and grafting the attribute tree
Defining dimensions
Defining fact attributes
Defining hierarchies
The ER scheme for the sale fact scheme
Figure 5. The (simplified) ER scheme for the sale fact scheme
Note: Each instance of relationship SALE represents an
item referring to a single product within a purchase ticket.
Classifying Entities (Moody & Kortink, 2000)
The first step in producing fact schemes from an ER model is to
classify the entities into three categories: transaction entities,
component entities and classification entities.




Transaction entities. Transaction entities record details about
particular events that occur in the business. These are the events that
decision makers want to understand and analyse.
EX: orders, insurance claims, salary payments and hotel
bookings.
Key characteristics of a transaction entity:
- It describes an event that happens at a point in time.
- It contains measurements or quantities that may be summarized.
EX:
- An insurance claim records a particular business event and the
amount claimed.
- A sale records an event in which a product is sold in an order,
together with the amount and the number of units sold.
Component entities





A component entity is one which is directly related to a
transaction entity via a one-to-many relationship. Component
entities define the details or “components” of each business
transaction.
Component entities answer the “who”, “what”, “when”, “where”, “how” and
“why” of a business event.
EX: a sales transaction may be defined by a number of
components:
- Customer: who made the purchase.
- Product: what was sold.
- Location: where it was sold.
- Period: when it was sold.
An important component of any transaction is time - historical
analysis is an important part of any data warehouse.
Component entities form the basis for constructing dimensions in
fact schemes.
Classification Entities




Classification entities are entities which are
related to component entities by a chain of one-to-many relationships.
They are functionally dependent on a component
entity (directly or transitively).
Classification entities represent hierarchies
embedded in the fact schemes, which may be
collapsed into component entities to form
dimensions in the fact schemes.
EX: store, city, sales-manager, type, category
Defining facts





Facts are concepts of primary interest for the decision making
process. They correspond to events occurring dynamically in
the enterprise world.
A fact is represented in the ER scheme either by an entity type F
or by an n-ary relationship between entity types E1,…,En.
In the latter case, for the sake of simplicity, we should
transform the relationship into an associative entity. The
attributes of the relationship become attributes of the
associative entity, and the identifier of the entity is the
combination of the identifiers of the participant entities.
Figure 6 shows how relationship sale is transformed into an
entity.
Facts come from transaction entities.
Figure 6: Transformation of relationship sale
into an associative entity
Each fact identified in the ER scheme becomes the root of a
different fact scheme.
In the sale example, the fact of main interest is the sale of a
product, represented in the ER scheme by relationship sale.
Building the attribute tree

Given a portion of interest of an ER scheme and an
entity F belonging to it, we call attribute tree the tree
such that:




Each vertex corresponds to an attribute of the scheme;
The root corresponds to the identifier of F;
For each vertex v, the corresponding attribute functionally
determines all the attributes corresponding to the
descendants of v.
The attribute tree will be used to build the fact
scheme for the fact corresponding to F.
Let F be the entity chosen to represent a fact; the attribute tree may
be constructed automatically by calling translate(F, identifier(F)),
where translate is the following recursive procedure:
procedure translate(E, v):
// E is the current entity, v is the current vertex
{
    for each attribute a ∈ E | a ≠ identifier(E) do
        addchild(v, a);    // add child a to vertex v
    for each entity G connected to E by an x-to-one relationship R do
    {
        for each attribute b ∈ R do
            addchild(v, b);
        addchild(v, identifier(G));
        translate(G, identifier(G));
    }
}
Note: If F is identified by a combination of two or more attributes,
identifier(F) denotes their concatenation.
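A runnable sketch of this procedure is given below. The dictionary encoding of the ER scheme (keys `id`, `attrs`, `to_one`) is an assumption made for the example, the ER fragment is a hypothetical simplification of the sale scheme, and the sketch assumes that the x-to-one relationships form no cycle.

```python
def translate(er, entity, vertex, tree):
    """Hang under `vertex` (the identifier of `entity`) its part of the attribute tree.

    `tree` maps each vertex to the ordered list of its children.
    """
    e = er[entity]
    for a in e["attrs"]:                    # attributes of E other than its identifier
        tree.setdefault(vertex, []).append(a)
    for rel_attrs, target in e["to_one"]:   # x-to-one relationships leaving E
        for b in rel_attrs:                 # attributes of the relationship R
            tree.setdefault(vertex, []).append(b)
        child = er[target]["id"]
        tree.setdefault(vertex, []).append(child)
        translate(er, target, child, tree)
    return tree


# Hypothetical, simplified ER fragment for the sale example.
er = {
    "SALE": {"id": "product+ticket number", "attrs": ["qty", "unitPrice"],
             "to_one": [([], "PRODUCT"), ([], "TICKET")]},
    "PRODUCT": {"id": "product", "attrs": ["size"], "to_one": [([], "TYPE")]},
    "TYPE": {"id": "type", "attrs": [], "to_one": []},
    "TICKET": {"id": "ticket number", "attrs": ["date"], "to_one": [([], "STORE")]},
    "STORE": {"id": "store", "attrs": ["address"], "to_one": []},
}

tree = translate(er, "SALE", er["SALE"]["id"], {})
for vertex, children in tree.items():
    print(vertex, "->", children)
```

The root is the concatenated identifier of SALE, as stated in the note above; leaves such as type simply have no entry in `tree`.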
Notes on attribute tree





The arcs corresponding to optional attributes or optional
relationships of the ER scheme should be marked by a
dash.
A one-to-one relationship can be inserted into the
attribute tree. However, it is often worth grafting away
the vertices that follow one-to-one
relationships, or representing them as non-dimension
attributes.
Generalization hierarchies in the ER scheme are
equivalent to one-to-one relationships.
X-to-many relationships cannot be inserted into the
attribute tree.
Notice that an n-ary relationship is equivalent to n binary
relationships.
Attribute tree
Figure 7. Attribute tree for the sale example
Pruning the attribute tree



Not all the attributes represented in the attribute tree
are interesting for the DW. Thus, the attribute tree
may be pruned and grafted in order to eliminate the
unnecessary levels of details.
Pruning is carried out by dropping any subtrees from
the tree. The attributes dropped will not be included
in the fact scheme.
EX: in the sale example, the subtree including city
and state may be dropped.
Grafting the attribute tree
Grafting is used when a vertex of the tree expresses
uninteresting information, but its descendants must be
preserved.
- EX: one may want to classify products directly by category, without
considering the information on their type.
- Let v be the vertex to be eliminated, and v' its father:
graft(v):
{
    for each v'' | v'' is a child of v do
        addchild(v', v'');
    drop v;
}
Thus, grafting is carried out by moving the subtrees rooted in the
children of v to v'. Attribute v will not be included in the fact
scheme and the corresponding aggregation level will be lost;
all the descendant levels will be maintained.
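Pruning and grafting can be sketched on a tree stored as a mapping from each vertex to the list of its children. The representation and the example tree are assumptions made for illustration; `graft` assumes that the vertex is not the root.

```python
def prune(tree, v):
    """Drop vertex v together with its whole subtree."""
    for child in tree.pop(v, []):
        prune(tree, child)
    for children in tree.values():
        if v in children:
            children.remove(v)


def graft(tree, v):
    """Drop vertex v, re-attaching its children to the father of v."""
    father = next(p for p, children in tree.items() if v in children)
    position = tree[father].index(v)
    tree[father][position:position + 1] = tree.pop(v, [])


# Hypothetical attribute tree for the sale example.
tree = {
    "sale": ["qty", "product", "ticket number"],
    "product": ["type"],
    "type": ["category"],
    "ticket number": ["date", "store"],
    "store": ["city"],
    "city": ["state"],
}

graft(tree, "ticket number")  # date and store become children of the root
prune(tree, "city")           # city and state are dropped
graft(tree, "type")           # products are classified directly by category
print(tree)
```

After these three steps the root has the children qty, product, date and store, and category hangs directly under product.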

Figure 8: Attribute tree for the sale example after
grafting vertex ticket number
In the sale example, the attribute ticket number is
uninteresting; after the vertex ticket number is grafted, the
attribute tree becomes the one shown in Figure 8.
Defining dimensions





Dimensions determine how fact instances may be aggregated for
the decision making process. Dimensions come from component
entities.
The dimensions must be chosen in the attribute tree among the
children vertices of the root.
The choice of dimension is crucial for the DW design since it
determines the granularity of fact instances.
Time is an important dimension for DWs.
ER schemes can be classified into snapshot and temporal.
- A snapshot scheme describes the current state of the application.
- A temporal scheme describes the evolution of the application
over a range of times; old versions of data are explicitly
represented and stored.
Time in DWs

DW from a temporal scheme:
- Time is explicitly represented as an attribute and it is a
candidate to define a dimension.
- If time appears in the attribute tree as a child of some non-root
vertex, we can graft the tree so that time becomes a
child of the root (i.e. becomes a dimension).
DW from a snapshot scheme:
- Time is not explicitly represented, but time should be added
as a dimension to the fact scheme.
EX: In the sale example, the attributes chosen as
dimensions are product, store and ranges of the date
attribute corresponding to weeks.
Time as an attribute in a fact scheme can be a time
point or a time interval.
Defining fact attributes





Fact attributes are typically either counts of the number of
instances of entity F, or the sum/average/maximum/minimum
of expressions involving numerical attributes of the attribute
tree (the attributes chosen as dimensions for the fact scheme
are excluded).
A fact may have no attributes, if the only information to be
recorded is the occurrence of the fact.
The fact attributes determined, if any, are reported on the fact
scheme.
At this step, it is useful for the phase of logical design to build
a glossary which associates each fact attribute to an
expression describing how it can be calculated from the
attributes of the ER scheme.
In the sale example, the glossary may be compiled as follows:
quantity sold = SUM(SALE.qty)
total returns = SUM(SALE.qty * SALE.unitPrice)
number of customers = COUNT(SALE)
Defining fact attributes (cont.)




If attribute unitPrice had been placed on entity PRODUCT in the
E/R scheme:
total returns = SUM(SALE.qty * PRODUCT.unitPrice)
The aggregation operators are meant to work on all the instances
of SALE which relate to the same week, store and product.
In some cases, aggregation is not necessary to define fact
attributes, since it has already been executed at the relational
level.
For instance, each instance of entity SALE in the E/R scheme
might describe the total sales for a given product, store and
week; in this case, instances of the entity correspond one-to-one
to fact instances, and entity attributes may be directly translated
into fact attributes.
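The glossary expressions can be read as aggregations over all SALE instances that fall in the same week, store and product. The sketch below illustrates this reading; the SALE rows are invented, and mapping dates to ISO weeks is only one possible way to define the week ranges.

```python
from collections import defaultdict
from datetime import date

# Hypothetical SALE instances: (date, store, product, qty, unitPrice).
sale_rows = [
    (date(2011, 9, 5), "S1", "milk", 2, 1.5),
    (date(2011, 9, 7), "S1", "milk", 1, 1.5),
    (date(2011, 9, 7), "S1", "cola", 3, 1.0),
    (date(2011, 9, 12), "S1", "milk", 4, 1.5),
]

fact_instances = defaultdict(
    lambda: {"quantity sold": 0, "total returns": 0.0, "number of customers": 0}
)
for day, store, product, qty, unit_price in sale_rows:
    week = tuple(day.isocalendar()[:2])  # (ISO year, ISO week): a range of dates
    cell = fact_instances[(week, store, product)]
    cell["quantity sold"] += qty                # SUM(SALE.qty)
    cell["total returns"] += qty * unit_price   # SUM(SALE.qty * SALE.unitPrice)
    cell["number of customers"] += 1            # COUNT(SALE)

for key, values in fact_instances.items():
    print(key, values)
```

Each key of `fact_instances` is one combination of dimension values, so each entry corresponds to one fact instance of the SALE fact scheme.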
Defining hierarchies




The last step in building the fact scheme is the definition of
hierarchies on dimensions. Along each hierarchy, attributes
must be arranged into a tree such that an x-to-one relationship
holds between each node and its descendants.
A hierarchy in an ER scheme is any sequence of entities
joined together by one-to-many relationships, all aligned in the
same direction.
EX: <purchase-ticket, store, city, state>
The attribute tree already shows an organization for
hierarchies; at this stage, it is still possible to prune and graft
the attribute tree in order to eliminate irrelevant details (e.g. in
most cases, a vertex connected to its father by a one-to-one
relationship is grafted).
Defining hierarchies (cont.)



It is possible to add new levels of aggregation by
defining ranges for numerical attributes; typically, this is
done on the time dimension.
In the sale example, the time dimension is enriched by
introducing attribute month, defined as a range of weeks.
During this phase, the attributes which should not be
used for aggregation but only for informative purposes
may be identified as non-dimension attributes.
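Defining a new aggregation level as a range of a numerical attribute can be sketched as below. The helper `range_level` is hypothetical, and the age values (borrowed from the age5 dimension of the case study) are invented for illustration.

```python
def range_level(value, width):
    """Map a numeric value to the half-open range [low, low + width) containing it."""
    low = (value // width) * width
    return (low, low + width)


ages = [3, 17, 42, 44, 68]
age5 = [range_level(a, 5) for a in ages]    # 5-year intervals
age10 = [range_level(a, 10) for a in ages]  # progressively wider ranges

print(age5)   # [(0, 5), (15, 20), (40, 45), (40, 45), (65, 70)]
print(age10)  # [(0, 10), (10, 20), (40, 50), (40, 50), (60, 70)]
```

Because 10 is a multiple of 5, every 5-year range falls inside exactly one 10-year range, so the arc from the finer level to the coarser one is x-to-one, as a hierarchy requires.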
3. Logical Design



At this time, it is necessary to choose the target
logical model, relational or multidimensional.
In this chapter, we consider the relational case.
The dimensional scheme can be mapped on the
relational model by adopting the star scheme.
Translation into tables

During this phase, the fact and dimension tables are created
starting from the dimensional scheme and according to the
logical model adopted.

If the star scheme is adopted, each fact scheme f having the
set of dimensions Dim(f) = {d_1, …, d_n} and the set of measures
M = {m_1, …, m_z} is translated into one fact table:
FT_f(k_1, …, k_n, m_1, …, m_z)
and n dimension tables:
DT_d1(k_1, a_{1,1}, …, a_{1,v_1}, a'_{1,1}, …, a'_{1,u_1})
………………………
DT_dn(k_n, a_{n,1}, …, a_{n,v_n}, a'_{n,1}, …, a'_{n,u_n})
where the hierarchy on d_i includes the dimension attributes
a_{i,1}, …, a_{i,v_i} and the non-dimension attributes a'_{i,1}, …, a'_{i,u_i}.
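The translation rule can be sketched as a function that produces the table signatures. The function name `star_scheme`, the key naming convention and the example input are assumptions made for illustration only.

```python
def star_scheme(fact, measures, dimensions):
    """Return the fact-table signature and the dimension-table signatures.

    `dimensions` maps a dimension name to a pair
    (dimension attributes, non-dimension attributes).
    """
    keys = [name + "Key" for name in dimensions]
    fact_table = "FT_" + fact + "(" + ", ".join(keys + measures) + ")"
    dim_tables = []
    for name, (attrs, extra) in dimensions.items():
        columns = [name + "Key"] + attrs + extra
        dim_tables.append("DT_" + name + "(" + ", ".join(columns) + ")")
    return fact_table, dim_tables


# Hypothetical, reduced version of the SALE fact scheme.
fact_table, dim_tables = star_scheme(
    "SALE",
    ["qtySold", "revenue"],
    {
        "prod": (["product", "type", "category"], ["weight"]),
        "store": (["store", "city", "state"], ["address"]),
    },
)
print(fact_table)  # FT_SALE(prodKey, storeKey, qtySold, revenue)
for table in dim_tables:
    print(table)
```

Each dimension table flattens a whole hierarchy into one table, which is what makes the result a star rather than a snowflake.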

The star scheme for the SALE example turns out to
be:
FT_SALE(prodkey, dateKey, storeKey, qtySold,
revenue, noOfCustomer)
DT_PROD(prodkey, product, weight, diet, brand,
city, type, category, department, deptmanager,…)
DT_DATE(dateKey, date, dayOfWeek, holiday,
month,…)
DT_STORE(storeKey, store, phone, address,
salesManager, city, country, state, saleDistrict)
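To make the star scheme concrete, the sketch below creates a reduced version of these tables in an in-memory SQLite database and runs a roll-up query. Only a subset of the columns is kept, and all rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DT_PROD (prodKey INTEGER PRIMARY KEY, product TEXT, type TEXT, category TEXT);
CREATE TABLE DT_DATE (dateKey INTEGER PRIMARY KEY, date TEXT, month TEXT);
CREATE TABLE DT_STORE (storeKey INTEGER PRIMARY KEY, store TEXT, city TEXT, state TEXT);
CREATE TABLE FT_SALE (
    prodKey INTEGER REFERENCES DT_PROD,
    dateKey INTEGER REFERENCES DT_DATE,
    storeKey INTEGER REFERENCES DT_STORE,
    qtySold INTEGER,
    revenue REAL,
    PRIMARY KEY (prodKey, dateKey, storeKey)
);
-- Hypothetical sample rows.
INSERT INTO DT_PROD VALUES (1, 'milk', 'dairy', 'food'), (2, 'cola', 'drink', 'food');
INSERT INTO DT_DATE VALUES (1, '2011-09-05', '2011-09'), (2, '2011-09-12', '2011-09');
INSERT INTO DT_STORE VALUES (1, 'S1', 'city A', 'state X');
INSERT INTO FT_SALE VALUES (1, 1, 1, 3, 4.5), (2, 1, 1, 3, 3.0), (1, 2, 1, 4, 6.0);
""")

# Roll-up to (type, month): join the fact table with the dimension tables
# involved in the query and group by the chosen hierarchy levels.
rows = con.execute("""
    SELECT p.type, d.month, SUM(f.qtySold), SUM(f.revenue)
    FROM FT_SALE f
    JOIN DT_PROD p ON p.prodKey = f.prodKey
    JOIN DT_DATE d ON d.dateKey = f.dateKey
    GROUP BY p.type, d.month
    ORDER BY p.type
""").fetchall()
print(rows)  # [('dairy', '2011-09', 7, 10.5), ('drink', '2011-09', 3, 3.0)]
```

The store table is not joined at all because the query aggregates over every store, which matches a query pattern with no marker on the store hierarchy.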
Conclusion



There is a methodology to derive the conceptual design
of a DW from the ER scheme describing the
operational information system.
The DF model is independent of the target logical
model (multidimensional or relational).
The DF model is a collection of tree-structured fact
schemes whose basic elements are facts, attributes,
dimensions and hierarchies.
References


1. Golfarelli, M., Maio, D. and Rizzi, S., Conceptual
design of data warehouses from E/R schemes, in
Proc. HICSS-31, VII, Kona, Hawaii, 1998, 334-343.
2. Golfarelli, M. and Rizzi, S., A methodological
framework for data warehouse design, in Proc. ACM
1st Int. Workshop on Data Warehousing and OLAP
(DOLAP 98), Nov. 7, 1998, Washington D.C., USA.
Case Study
Figure 9



The attribute tree for fact ADMISSION represented
by entity ADMISSION is shown in Fig. 10.a.
Notice that a surgery subtree has been created
even though the optional causes relationship is a
many-to-one relationship (with the one side on the
ADMISSION entity).
Transforming this attribute tree into the one shown in
Figure 10.b requires the following steps:
Grafting date+time+op.th
Pruning date of surgery, time of surgery and surgeon
Pruning the subtree rooted in op.th
Pruning name and physician
Grafting patcode






The fact scheme derived is shown in Fig. 11.
Dimension month is defined as a range on attribute
date;
Dimension age5 as a range of attribute age (5 years
intervals).
The hierarchies on dimensions month and age5 are
defined by adopting progressively wider ranges.
Dimension type of surgery is optional.
The glossary for the fact attributes is as below:




Number of admissions = COUNT(ADMISSION)
Value = SUM(has.value)
Number of days = SUM(ADMISSION.nDays)
Score = SUM(DRG.weight)