Chapter 2. Introduction to Databases and Data Warehouses

advertisement
Chapter 2
Databases & Data Warehouses
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
1
Outline

Database Concepts



Steps in Database Design
Entity-Relationship Model
Logical Database Design




Relational Model
Object-Relational Model
Physical Database Design
Data Warehouse Concepts



Logical Data Warehouse Design
Physical Data Warehouse Design
Data Warehouse Architecture
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
2
Steps in Database Design




Requirements specification: Collects information about users’
needs with respect to the database system
Conceptual design: Builds a user-oriented representation of
the database without any implementation considerations
Logical design: Translates the conceptual schema from the
previous phase into an implementation model common to
several DBMSs, e.g., relational or object-relational
Physical design: Customizes the logical schema from the
previous phase to a particular platform, e.g., Oracle or SQL
Server
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
3
Entity Relationship Model: University Example
Academic staff
(0,n)
Employee no
Name
First name
Last name
Address
Email
Homepage
Research areas (1,n)
Participates
Project
(1,n)
Start date
End date
Project id
Project acronym
Project name
Description (0,1)
Start date
End date
Budget
Funding agency
Teaches
(partial,exclusive)
(1,1)
(0,n)
Assistant
Professor
Section
Thesis title
Thesis description
Status
Tenure date (0,1)
/No courses
Semester
Year
Homepage (0,1)
(1,1)
CouSec
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Prerequisite
(0,n)
Has prereq Is a prereq
Course
(1,1)
(0,n)
Advisor
(0,n)
(1,n)
Course id
Course name
Level
4
ER Model: Entity and Relationship Types (1)


Entity Relationship Model: Most often used conceptual model
for database design
Entity type: Represents a set of real-world objects of
interest to an application


Relationship type: Represents an association between
objects



Entity: object belonging to an entity type
Relationship: association belonging to a relationship type
Role: Participation of an entity type in a relationship type
Cardinalities in a role: Minimum and maximum number of
times that the entity may participate in the relationship type
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
5
ER Model: Entity and Relationship Types (2)

Types of roles



Relationship types


Binary vs n-ary: two vs more than two participating entity types
Binary relationship types:


Optional vs mandatory: minimum cardinality 0 vs 1
Monovalued vs multivalued: maximum cardinality 1 vs n
One-to-one, one-to-many, or many-to-many depending on the
maximum cardinality of the two roles
Recursive relationship types: the same entity type
participates more than once in the relationship type

Role names are then necessary to distinguish different roles
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
6
ER Model: Attributes



Attributes: Structural characteristics describing objects and
relationships
Attributes have cardinalities, as for roles
Depending on cardinalities, attributes may be




Optional vs mandatory
Monovalued vs multivalued
Complex attributes: Attributes composed of other attributes
Derived attributes: Their value for each instance may be
calculated from other elements of the schema
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
7
ER Model: Identifiers and Weak Entity Types


Identifier or key: One or several attributes that uniquely
identify a particular object
Weak entity type: Do not have identifier of their own



A weak entity type is dependent on the existence of another
entity type, called the identifying or owner entity type
An identifying relationship type relates a weak entity type to
its owner


Regular entity types are called strong entity types
Normal relationship types are called regular relationship types
Partial key: Set of attributes that can uniquely identify weak
entities related to the same owner entity
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
8
ER Model: Generalization (1)

Generalization or is-a relationship: Relates perspectives of
the same concept at different abstraction levels





Supertype: Entity at higher abstraction level
Subtype: Entity at lower abstraction level
Population inclusion: Every instance of the subtype is also an
instance of the supertype
Inheritance: All characteristics (e.g., attributes, roles) of the
supertype are inherited by the subtype
Substitutability: Each time an instance of a supertype is
required, an instance of the subtype can be used instead
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
9
ER Model: Generalization (2)


Several generalization relationships accounting for the same
criterion can be grouped
Types of generalizations



Total vs partial: depending on whether every instance of the
supertype is also an instance of one of the subtypes
Disjoint vs overlapping: depending on whether an instance may
belong to one or several subtypes
Multiple inheritance: A subtype that has several supertypes

May induce conflicts when a property of the same name is inherited
from two supertypes
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
10
ER Model: Summary of Notation
Professor
Identifier
Other attributes
Entity Type
Teaches
role name
Attributes
(1,1)
Relationship Type
Role
(0,1)
(1,1)
(0,n)
(1,n)
Simple attribute
Complex attribute
Attribute 1
Attribute 2
Optional att. (0,1)
Multivalued attr. (1,n)
Cardinalities
Attribute types
Section
Partial Identifier
Other Attributes
Weak Entity Type
(total,exclusive)
(partial,exclusive)
(total,overlapping)
(partial,overlapping)
CouSec
Identifying
Relationship Type
Generalization
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Generalization types
11
Relational Model: University Example
Academic staff
Employee no
First name
Last name
Address
Email
Homepage
Professor
Employee no
Status
Tenure date (0,1)
No courses
Assistant
Employee no
Thesis title
Thesis description
Advisor
Academic staff
Research area
Employee no
Research area
Participates
Employee no
Project id
Start date
End date
Project
Project id
Project acronym
Project name
Description (0,1)
Start date
End date
Budget
Funding agency
Course
Section
Course id
Semester
Year
Homepage (0,1)
Employee no
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Course id
Course name
Level
Prerequisite
Course id
Has prereq
12
Relational Model (1)

Based on a simple data structure


Each attribute is defined over a domain or data type





A relation or table, composed of attributes or columns
Typical domains: integer, float, date, string, …
Tuple or row: element of the table
Provides several declarative integrity constraints
Not null attribute: a value for the attribute must be provided
Key: column(s) that uniquely identify one row of the table


Simple vs composite key: depending on whether key is composed of
one or several columns
Primary vs alternate key: one of the keys must be chosen as primary
by the designer, the others are alternate keys
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
13
Relational Model (2)

Referential integrity




Check constraint: defines a predicate that must be valid
when inserting or updating a tuple in a relation


Defines a link between two tables (or twice the same one)
A set of attributes in one table (foreign key) references the primary
key of the other table
Values in the foreign key columns must also exist in the primary key
of the referenced table
Predicate can only involve the tuple being inserted/deleted
All other constraints must be expressed by triggers: an
event-condition-action rule that is automatically activated
when a relation is updated
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
14
Translating from ER to Relational Schemas (1)

Rule 1: A strong entity type is associated with a table



Table contains the simple monovalued attributes and the simple
component of the monovalued complex attributes of the entity type
Table also defines not null constraints for mandatory attributes
Identifier of the entity type define the primary key of the table
Academic staff
Employee no
Name
First name
Last name
Address
Email
Homepage
Research areas (1,n)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Academic staff
Employee no
First name
Last name
Address
Email
Homepage
15
Translating from ER to Relational Schemas (2)

Rule 2: A weak entity type is transformed as a strong entity
type, but the associated table also contains the identifier of
the owner entity


A referential integrity constraint relates this identifier to the table
associated with the owner entity type
Primary key of the table: partial identifier of the weak entity type +
identifier of the owner entity type
Section
Semester
Year
...
(1,1)
CouSec
Course
(1,n)
Course id
...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Section
Course id
Semester
Year
...
16
Translating from ER to Relational Schemas (3)

Rule 3: A binary one-to-one relationship type is represented
by including the identifier of one of the participating entity
types in the table associated to the other entity type



A referential integrity constraint relates this identifier to the table
associated with the corresponding entity type
Simple monovalued attributes and simple components of monovalued
complex attributes are also included in the table
Not null constraint for mandatory attributes are also defined
Department
...
(1,1)
Dean
Professor
(0,1)
...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Department
...
Dean
17
Translating from ER to Relational Schemas (4)

Rule 4: A binary one-to-many relationship type is
represented by including the identifier of the entity type with
cardinality 1 in the table associated to the other entity type


A corresponding referential integrity constraint is added
Attributes and not null constraints are also added as in previous rules
Assistant
...
(1,1)
Advisor
Professor
(0,n)
...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Assistant
...
Advisor
18
Translating from ER to Relational Schemas (5)

Rule 5: A binary many-to-many relationship type or an n-ary
relationship type is associated with a table containing the
identifiers of the entity types that participate in all its roles



Corresponding referential integrity constraints are added
Attributes and not null constraints are also added as in previous rules
Identifier if any is the key of the relation, but the combination of all
role identifiers can be used as a key
Academic staff
Employee no
...
(0,n)
Participates
Start date
End date
Project
(1,n)
Project id
...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Participates
Employee no
Project id
Start date
End date
19
Translating from ER to Relational Schemas (6)

Rule 6: A multivalued attribute is associated with a table
containing the identifier of the entity or relationship type to
which it belongs


Corresponding referential integrity constraint is added
Primary key of table is composed of all its attributes
Academic staff
...
Research areas (1,n)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Academic staff
Research area
Employee no
Research area
20
Translating from ER to Relational Schemas (7)

Rule 7: A generalization can be translated in 3 ways

The supertype and the subtype are associated with tables containing
the identifier of the entity type. There is a referential integrity
constraint from the table of the subtype to the table of the subtype
Academic staff
Academic staff
Employee no
Name
...
Assistant
Thesis title
...
Employee no
Name
...
Assistant
Professor
Employee no
Thesis title
...
Employee no
Status
...
Professor
Status
...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
21
Translating from ER to Relational Schemas (8)

Rule 7 (cont.):

Only the supertype is associated with a table, which contains also the
attributes of the subtype (these are optional attributes)
Academic staff
Employee no
Name
…
Thesis title (0,1)
...
Status
...

Only the subtype is associated with a table containing all attributes
of the subtype and the supertype
Assistant
Professor
Employee no
Name
…
Thesis title
...
Employee no
Name
…
Status
...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
22
Defining Relational Schemas in SQL
create table AcademicStaff as (
EmployeeNo integer primary key,
FirstName character varying (30) not null,
LastName character varying (30) not null,
Address character varying (50) not null,
Email character varying (30) not null,
Homepage character varying (64) not null );
create table AcademicStaffResearchArea as (
EmployeeNo integer not null,
ResearchArea character varying (30) not null,
primary key (EmployeeNo,ResearchArea),
foreign key EmployeeNo references AcademicStaff(EmployeeNo) );
…
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
23
Redundancies and Update Anomalies
Participates
Employee no
Project id
Start date
End date
Project acronym
Project name

Employee no
Research area
Department id
Relation with redundancies may induce anomalies in the
presence of insertions, updates, and deletions



Affiliation
In relation Participates the information about a project is repeated
for each staff member who works on that project
In relation Affiliation the information about a research area is
repeated for all departments to which the staff member is affiliated
Dependencies and normal forms are used to describe such
anomalies
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
24
Functional Dependencies

Given a relation R and to sets of attributes X and Y in R, a
functional dependency X Y holds if in all tuples of the
relation, each value of X is associated with at most one
value of Y





In Participates above, Project Id {Project acronym, Project name}
A key is a particular case of a functional dependency
A prime attribute is an attribute that is part of a key
A full functional dependency is a dependency X Y in
which the removal of an attribute from X invalidates the
dependency
A dependency X Z is transitive if there is a set of
attributes Y such that X Y and Y Z hold
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
25
Multivalued Dependencies

Given a relation R and to sets of attributes X and Y in R, a
multivalued dependency X Y holds if the value of X
determines a set of values for Y, independently of any other
attributes


In Affiliation above, Employee no Research area and
Employee no Department Id
A multivalued dependency X Y is trivial if either Y X
or X Y = R, otherwise it is nontrivial
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
26
Normal Forms (1)



Normal Form: Integrity constraint certifying that a relation
schema satisfies particular properties
In the relational model the relations are in first normal form
since all attributes are atomic and monovalues
A relation schema is in the second normal form if every
nonprime attribute is functionally dependent on every key
Participates
Employee no
Project id
Start date
End date
Project acronym
Project name
Participates
Employee no
Project id
Start date
End date
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Project
Project id
Project acronym
Project name
27
Normal Forms (2)


A relation is in the third normal form if it is in the second
normal form and there are no transitive dependencies
between a key and a nonprime attribute
Assistant
Assistant
Employee no
Thesis title
Thesis description
Advisor id
Advisor first name
Advisor last name
Employee no
Thesis title
Thesis description
Advisor id
Professor
Employee no
First name
Last name
A relation is in the Boyce-Codd normal form if for every
nontrivial dependency X  Y, X is a key or contains a key
Participates
Employee no
Project id
Start date
End date
Location
Participates
Employee no
Project id
Start date
End date
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Location
Location
Project id
28
Normal Forms (3)

A relation is in the fourth normal form if for every nontrivial
dependency X Y, X is a key or contains a key
Affiliation
Affiliation
Research
Employee no
Research area
Department id
Employee no
Research area
Employee no
Research area
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
29
Weaknesses of the Relational Model

It provides a very simple data structure, which disallows
multivalued and complex attributes


The set of types provided by relational DBMSs is very
restrictive (integer, float, string, date, …)



complex objects are split into several tables performance problems
Insufficient for complex application domains
There is no integration of operations with data structures
It is not possible to directly reference an object by use of a
surrogate or a pointer

links are based on comparison of values performance problems
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
30
Object-Relational Model: Motivations


Preserves the foundations of the relational model
Organizes data using an object model


Extensions






Allow attributes to have complex types groups related facts into a
single row
Complex and/or multivalued attributes
User-defined types with associated methods
System-generated identifiers
Inheritance among types
Defined in the SQL:1999 and SQL:2003 standards
Current DBMSs (Oracle, …) introduced object-relational
extensions, but do not necessarily comply with the standard
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
31
Object-Relational Model: Definition (1)


Allow composite types, in addition to predefined types
Complex values may be stored in a column of a table




Row type: structured values (i.e., composite attributes)
Array type: variable-sized vectors of values
Multiset type: Unordered collection of values, with duplicates
permitted (bags)
Composite types can be nested

This is considered an “advanced feature” in the standard
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
32
Object-Relational Model: Definition (2)


Two sorts of user-defined types
Distinct types: Represented internally by a predefined type,
but cannot be mixed in expressions


Structured user-defined types:  class declarations in OO
languages





Does not make sense to compare a person’s name and a SSN
May have attributes, which can be of any SQL type
May have methods, which can be instance or static (class) methods
Attributes are encapsulated through observer and mutator functions
Both of them can be used as a domain of a column, of an
attribute of another type, or a table
Comparison/ordering of values: with user-defined methods
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
33
Object-Relational Model: Definition (3)

SQL:2003 supports single inheritance:


A subtype can






A subtype inherits attributes and methods from its supertype
Define additional attributes and methods
Overload inherited methods: provide another definition with a
different signature
Override inherited methods: provide another implementation with a
the same signature
Final types: cannot have subtypes
Final methods: cannot be redefined
Instantiable type: includes constructor methods for creating
instances
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
34
Object-Relational Model: Definition (4)



Two types of tables
Relational tables: usual ones, where domains for attributes
are all predefined or user-defined types
Typed tables: defined on structured user-defined types




Constraints (e.g. keys) defined in the table declaration
Have a self-referencing column that contains identifiers (surrogates)
generated by the DBMS
Row identifiers can be used for establishing links between tables
Ref type: values are the unique identifiers

Always associated with a specified structured type
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
35
Object-Relational Model: University Example
Academic staff Type
Employee no
Name
First name
Last name
Address
Email
Homepage
Research areas (1,n)
Participates (0,n)
Assistant Type
Thesis title
Thesis description
Advisor
Project Type
Participates Type
Employee
Project
Start date
End date
Professor Type
Status
Tenure date (0,1)
No courses
Advises (0,n)
Teaches (0,n)
Project id
Project acronym
Project name
Description (0,1)
Start date
End date
Budget
Funding agency
Participates (1,n)
Section Type
Course Type
Semester
Year
Homepage (0,1)
Professor
Course
Course id
Course name
Level
Sections (1,n)
Has prereq (0,n)
Is a prereq (0,n)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
36
Translating from ER to OR Schemas (1)

Rule 1: A strong entity type is associated with a structured
type containing all its attributes



Multivalued attributes defined with the array or the multiset type
Composite attributes defined with the row or the structured type
A table of this structured type must be declared


Table defines not null constraints for mandatory attributes
Identifier of the entity type define the primary key of the table
Academic staff
Academic staff Type
Employee no
Name
First name
Last name
Address
Email
Homepage
Research areas (1,n)
Employee no
Name
First name
Last name
Address
Email
Homepage
Research areas (1,n)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
37
Translating from ER to OR Schemas (2)

Rule 2: A weak entity type is transformed as a strong entity
type, but the structured type contains the identifier of its
owner
Section
Semester
Year
Homepage (0,1)
(1,1)
CouSec
(1,n)
Course
Course id
...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Section Type
Semester
Year
Homepage (0,1)
Course
Course Type
Course id
...
38
Translating from ER to OR Schemas (3)


Rule 3: A regular relationship type is associated with
structured type containing the identifiers of all participating
entity types, and all attributes of the relationship
A table of this structured type must be declared



Table defines not null constraints for mandatory attributes
Set of identifiers of entity types define the primary key of the table
Inverse references may also be defined
Academic staff
Employee no
...
(0,n)
Participates
Start date
End date
(1,n)
Project
Project id
...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Participates
Type
Acad. staff
Project
Start date
End date
Academic staff
Type
…
Participates (0,n)
Project
Type
…
Participates (1,n)
39
Translating from ER to OR Schemas (4)

Rule 4: Only single inheritance is supported by the model 
multiple inheritance must be removed by transforming some
generalization links into binary one-to-one relationships, to
which Rule 3 is applied
Assistant
Assistant Type
...
Student
Major
...
PhD Student
Thesis title
...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Assistant
Assistant Type
...
Student
Major
...
PhD Student
PhD Student
Thesis title
…
Student
40
Defining Object-Relational Schemas in SQL
create type AcademicStaffType as (
EmployeeNo integer,
Name row (
FirstName character varying (30),
LastName character varying (30) ),
Address character varying (50),
Email character varying (30),
Homepage character varying (64),
ResearchAreas character varying (30) multiset,
Participates ref(ParticipatesType) multiset scope Participates
references are checked )
ref is system generated;
create table AcademicStaff of AcademicStaffType (
constraint AcademicStaffPK primary key (EmployeeNo),
Email with options constraint AcademicStaffEmailNN Email not null,
ref is oid system generated );
…
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
41
Physical Database Design



Specifies how database records are stored, accessed, and
related in order to ensure adequate performance of a
database application
Requires to know the specificities of an application,
properties of data, and usage patterns
Involves analyzing the transactions/queries that run
frequently, that are critical, and the periods of time in which
there will be a high demand on the database (peak load)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
42
Performance of Database Applications

Factors for measuring performance of database applications




A compromise has to be made among these factors




Transaction throughput: # transaction processed in a time interval
Response time: Elapsed time for the completion of a transaction
Disk storage: Space required to store the database file
Space-time trade-off: Reduce time to perform an operation by using
more space, and vice versa
Query-update trade-off: Access to data can be more efficient by
imposing some structure upon it
However the more elaborate the structure, the more time is needed
to built it and maintain it when content changes
After initial physical design is done, it is necessary to
monitor it and tune it
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
43
Data Organization


A database is organized in secondary storage into files, each
composed of records, at their turn composed of fields
In a disk, data is stored in disk blocks (pages)




Set by the operating system during disk formatting
Transfer of data between main memory and disk takes place
in units of disk blocks
DBMSs store data on database blocks (pages)
Selection of a database block depends on several issues


Locking granularity may be at the block level, not the record level
For disk efficiency, database block size must be equal to or be a
multiple of the disk block size
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
44
File Organization


Arrangement of data in a file into records and blocks
Heap files (unordered files): Records placed in the file in the
order as they are inserted


Sequential files (ordered files): Records sorted on the values
of ordering fields


Efficient insertions, slow retrieval
Fast retrieval, slow inserting and deleting
Hash files: Use a hash function that calculates the block
(bucket) of a record based on one or several fields


Collision: Bucket is filled and a new record must be inserted
Fastest retrieval, but collision management degrades performance
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
45
Indexes



Additional access structures that speed up retrieval of
records in response to search conditions
Provide alternative ways of accessing the records based on
the indexing fields on which the index is constructed
Many different types of indexes



Clustering vs nonclustering: Whether records in the file are physically
ordered according to the fields of the index
Single column vs multiple column, depending on the number of
indexing files
Unique vs nonunique: Whether duplicate values are allowed
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
46
Indexes, Clustering

Types of indexes (cont.)


Sparse or dense: Whether there are index records for all search
values
Single-level vs multilevel: Whether an index is split in several smaller
indexes, with an index to these indexes


Multi-level indexes often implemented by using B-trees of B+-trees
Clustering: Tables physically stored together as they share
common columns (cluster key) and are often used together


Improves data retrieval
Cluster key stored only once  storage efficiency
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
47
Outline

Database Concepts



Steps in Database Design
Entity-Relationship Model
Logical Database Design



Relational Model
Object-Relational Model
Physical Database Design
 Data Warehouse Concepts



Logical Data Warehouse Design
Physical Data Warehouse Design
Data Warehouse Architecture
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
48
Data Warehouses: Definition (1)

Operational databases (online transaction processing
systems or OLTP), are not suitable for data analysis




Contain detailed data, do not include historical data, perform poorly
for complex queries due to normalization
Data warehouses address requirements of decision-making
users
A data warehouse is a collection of subject-oriented,
integrated, nonvolatile, and time-varying data to support
data management decisions
Subject-oriented: focus on particular subjects of analysis
(e.g., purchase, inventory, and sales in a retail company)

Operational databases focus on specific functions of an application
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
49
Data Warehouses: Definition (2)

Integrated: Join together data from several operational and
external systems problems due to differences in data
definition and content


Nonvolatile: Data modification and removal is disallowed 
scope of the data is long period of time


Operational DBs keep data for a short period of time, as required for
daily operations
Time-varying: Keep different values for the same information and the time when changes to these values occurred


In operational DBs these problems are solved in the design phase
Operational DBs typically do not have temporal support
Online analytical processing (OLAP): Allows decision-making
users to perform interactive analysis of data
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
50
Multidimensional Model


Dimensions: Perspectives for analyzing data
Cells (facts): Contain measures, values that are to be analyzed
11
30
Q3 26
12
35
32
Q4 14
20
47
31
games
17
14
18
Q2 27
measure
values
20
35
33
18
14
10
12
Q1 21
23
18
Milan 24 18 28 14
Rome 33 25 23 25
Nice 12 20 24 33
Paris
10
dimensions
Time (Quarter)

DWs and OLAP use a multidimensional view of data
Represented as a data cube or an hypercube
S
(C tor
ity e
)

DVDs
books
CDs
Product (Category)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
51
Hierarchies



Data granularity: Level of detail of measures
Data analyzed at different granularities (abstraction levels)
Hierarchies relate low-level (detailed) concepts to higherlevel (general concepts)



Example: Store – City – Region/Province – Country
Given two related levels in a hierarchy, lower level is called
child, higher level is called parent
Instances of these levels are called members
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
52
Kinds of Hierarchies

Balanced: Same number of levels from leaf members to root
All
Store dimension
France
Country level
Region/
Province level
Ile-de-France
Provence-AlpesCôte d'Azur
Paris
Nice
City level
Store level

Store 1
Store 2
Italy
...
Store 3
... ...
...
Lazio
Lombardy
Rome
Milan
Store 10
Store 11
Store 12
Noncovering (ragged): Parent of some members does not
belong to the level immediately above

E.g., small countries do not have Region/Province level
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
53
Kinds of Hierarchies

Unbalanced: May not have instances at the lower levels


E.g., some cities do not have stores, sales are through wholesalers
Parent-child (recursive): Involve twice the same level
Lepage
Ivanov
Vancamp
Jottard
Hallez
Nguyen
Claus
Rochat
Weiss
Pirlot
Spiegel
Quintin
Praet
Berger
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
54
Measure Aggregation and Summarizability


Measures are aggregated when using hierarchies for
visualizing data at different abstraction levels
Summarizability conditions ensure correct aggregation



Disjointness of instances: Grouping of instances in a level with
respect to the parent in the next level must result in disjoint sets
Completeness: All instances are included in the hierarchy and each
instance is related to one parent in the next level
Correct use of aggregation functions: Type of measures determine
the kind of aggregation functions that can be applied
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
55
Measure Classification: Additivity

Additive measures (flow or rate measures): Can be
meaningfully summarized using addition along all
dimensions


Semiadditive measures (stock or level measures): Can be
meaningfully summarized using addition along some (not
all) dimensions


E.g., sales amount can be summarized when the hierarchies in Store,
Time, and Product dimensions are traversed
E.g., inventory quantities, can be aggregated in the Store dimension,
but cannot be aggregated in the Time dimension
Nonadditive measures (value-per-unit measures): Cannot be
meaningfully summarized using addition along any
dimension

E.g., item price, cost per unit, exchange rate
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
56
Measure Classification: Aggregation Complexity

Distributive measures: Defined by an aggregation function
that can be computed in a distributed way



Algebraic measures: Defined by an aggregation function that
has a bounded number of arguments, each is obtained by
applying a distributive function


Data is partitioned into n sets, aggregate function applied to each
set, aggregated value is computed by applying a function to these n
subaggregate values
E.g., count, sum, min, max
E.g., average, computed from sum and count, which are distributive
Holistic measures: Cannot be computed from other
subaggregate values

E.g., median, mode, rank
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
57
OLAP Operations: Roll up
Transforms detailed measures into summarized ones when
one moves up in a hierarchy
12
35
32
Q4 14
20
47
31
games
DVDs
books
CDs
Product (Category)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
43
51
Q1 33
30
42
68
Q2 27
14
11
30
Q3 26
12
35
32
Q4 14
20
47
31
games
books
39
41
Q3 26
57
37
30
17
11
Roll-up to the Country level
18
14
23
Q2 27
14
35
20
18
12
10
33
Q1 21
10
Time (Quarter)
18
Italy
France
51
S
(C tor
ity e
)
Milan 24 18 28 14
Rome 33 25 23 25
Nice
12 20 24 33
Paris
Time (Quarter)

DVDs
CDs
Product (Category)
58
OLAP Operations: Drill down
Opposite to the roll-up operation, i.e., it moves from a more
general level to a detailed level in a hierarchy

Q3 26
12
35
32
Q4 14
20
47
31
games
DVDs
books
CDs
Product (Category)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Time (Quarter)
30
17
11
Drill-down to the
Month level
18
14
23
Q2 27
14
35
20
18
12
10
33
Q1 21
10
Time (Quarter)
18
S
(C tor
ity e
)
S
(C tor
ity e
)
Milan 24 18 28 14
Rome 33 25 23 25
Nice
12 20 24 33
Paris
Milan
8
6
9
5
Rome 10 8 11 8
Nice 4
7
8 10
Paris
6
0
1
Jan 7
2
6 13
4
3
1
7
Feb 8
4
8 12
9 ...
...
Mar 6
4
4 10
...
8
4
1
...
... ... ... ...
5
Dec
4
4
games
16
7
DVDs
books
CDs
Product (Category)
59
OLAP Operations: Pivot or Rotate
Rotates the axes of a cube to provide an alternative
presentation of the data
S
(C tor
ity e
)
(C Pro
a t du
eg c
or t
y)
Milan 24 18 28 14
Rome 33 25 23 25
Nice
12 20 24 33
Paris
games
10
31
14
11
13
Rome
33
28
35
32
Milan
24
23
25
18
33
47
12
18
20
Nice
21
Q4 14
14
17
32
26
20
35
27
28
12
21
47
Q3 26
Pivot
Paris
19
30
Store (City)
11
17
14
18
Q2 27
14
35
20
18
12
10
33
Q1 21
23
18
DVDs 35 30 32 31
CDs 18 11 35 47
games 10 14 12 20
books
10
Time (Quarter)

DVDs
books
CDs
Product (Category)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Q1 Q2 Q3 Q4
Time (Quarter)
60
OLAP Operations: Slice
30
Q3 26
12
35
32
Q4 14
20
47
31
games
17
11
18
14
23
Q2 27
14
35
20
18
12
10
Slice on Store.City = ‘Paris’
33
Q1 21
10
Time (Quarter)
18
S
(C tor
ity e
)
Milan 24 18 28 14
Rome 33 25 23 25
Nice
12 20 24 33
Paris
DVDs
Time (Quarter)
Performs a selection on one dimension of a cube, resulting
in a subcube

Q1 21
10
18
35
Q2 27
14
11
30
Q3 26
12
35
32
Q4 14
20
47
31
games
DVDs
books
CDs
Product (Category)
books
CDs
Product (Category)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
61
OLAP Operations: Dice
30
Q3 26
12
35
32
Q4 14
20
47
31
23
11
17
14
20
Q2 27
Dice on Store.Country = ‘France’
and Time.Quarter= ‘Q1’ or ‘Q2’
18
35
33
18
14
10
12
Q1 21
10
Time (Quarter)
18
Nice
Paris
12
20
24
Q1 21
10
18
35
Q2 27
14
11
30
games
books
33
DVDs
CDs
Product (Category)
games
DVDs
books
CDs
Product (Category)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
62
14
S
(C tor
ity e
)
Milan 24 18 28 14
Rome 33 25 23 25
Nice
12 20 24 33
Paris
S
Time
(C tor
(Quarter)
ity e
)
Defines a selection on two or more dimensions, thus again
defining a subcube

Other OLAP Operations

Drill-across: Executes queries involving more than one cube,
which share at least one dimension


E.g., compare sales data in a cube with purchasing data in another
one
Drill-through: Allows one to move from data at the bottom
level in a cube, to data in the operational systems from
which the cube was derived

E.g., for determining the reason for outlier values in a data cube
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
63
Implementations of a Multidimensional Model


Relational OLAP (ROLAP): Store data in relational databases,
support extensions to SQL and special access methods to
efficiently implement the model and its operations
Multidimensional OLAP (MOLAP): Store data in special data
structures (e.g., arrays) and implement OLAP operations in
these structures


Better performance than ROLAP for query and aggregation,
less storage capacity than ROLAP
Hybrid OLAP (HOLAP): Combine both technologies,

E.g., detailed data stored in relational databases, aggregations kept
in a separate MOLAP store
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
64
Logical Data Warehouse Design


In ROLAP systems, tables organized in specialized structures
Star schema: One fact table and a set of dimension tables



Snowflake schema: Avoids redundancy of star schemas by
normalizing dimension tables



Referential integrity constraints between fact table and dimension
tables
Dimension tables may contain redundancy in the presence of
hierarchies
Normalized tables optimize storage space, but decrease performance
Starflake schema: Combination of the star and snowflake
schemas, some dimensions normalized, other not
Constellation schema: Multiple fact tables that share
dimension tables
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
65
Logical DW Design: Star Schema
Product
Store
Product key
Product number
Product name
Description
Size
Category name
Category descr.
Department name
Department descr.
...
Store key
Store number
Store name
Store address
Manager name
City name
City population
City area
State name
State population
State area
State major activity
...
Promotion
Promotion key
Promotion descr.
Discount pct.
Type
Start date
End date
...
Sales
Product fkey
Store fkey
Promotion fkey
Time fkey
Amount
Quantity
Time
Time key
Date
Event
Weekday flag
Weekend flag
Season
...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
66
Logical DW Design: Snowflake Schemas
Product
Product key
Product number
Product name
Description
Size
Category fkey
...
Promotion
Promotion key
Promotion descr.
Discount pct.
Type
Start date
End date
...
Category
Category key
Category name
Description
Department fkey
...
Sales
Product fkey
Store fkey
Promotion fkey
Time fkey
Amount
Quantity
Department
Department key
Department name
Description
...
Time
Time key
Date
Event
Weekday flag
Weekend flag
Season
...
Store
Store key
Store number
Store name
Store address
Manager name
City fkey
...
City
City key
City name
City population
City area
State fkey
...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
State
State key
State name
State population
State area
State major activity
...
67
Logical DW Design: Constellation Schemas
Promotion
Promotion key
Promotion descr.
Discount pct.
Type
Start date
End date
...
Product
Product key
Product number
Product name
Description
Size
Category name
Category descr.
Department name
Department descr.
...
Sales
Product fkey
Store fkey
Promotion fkey
Time fkey
Amount
Quantity
Time
Time key
Date
Event
Weekday flag
Weekend flag
Season
...
Purchases
Product fkey
Supplier fkey
Order time fkey
Due time fkey
Amount
Quantity
Freight cost
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Store
Store key
Store number
Store name
Store address
Manager name
City name
City population
City area
State name
State population
State area
State major activity
...
Supplier
Supplier key
Supplier name
Contact person
Supplier address
City name
State name
...
68
Physical Data Warehouse Design: Views (1)



Data warehouses are several orders of magnitude larger
than operational database
Physical design is crucial for ensuring adequate response
time for complex queries
View: Table derived from a query


Materialized view : View physically stored in a database



Query may involve base tables or other views
Enhances query performance by precalculating costly operations
This is achieved at the expense of storage space
Updating materialized views: Modifications of underlying
base tables must be propagated into the view

Must be performed incrementally to ensure efficiency
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
69
Physical Data Warehouse Design: Views (2)

In a DW not all possible aggregations can be materialized





Number of aggregations grows exponentially with the number of
dimensions and hierarchies
Selection of materialized views: Identify the queries to be
prioritized, which determines the views to be materialized
DBMSs are providing tools that tune the selection of
materialized views given previous queries
Queries must be rewritten in order to best exploit existing
materialized views to improve response time
Drawback of materialized view approach: Requires one to
anticipate queries to be materialized

DW queries are often ad hoc and cannot always be anticipated
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
70
Physical Data Warehouse Design: Indexes (1)

Traditional indexes are not appropriate for data warehouses


Bitmap indexes: For columns with a low number of distinct
values



Designed for OLTP transactions that access a small number of tuples
Each value is associated with a bit vector whose length is the
number of records in the table
ith bit is set if ith row has that value in the column
Small in size, logical operators used to manipulate them
Professor
Index on Rank
PId
Name
...
Rank
Rec# Assistant Associate
P1
John Smith
...
Assistant
1
1
0
0
P2
Ada Jones
...
Associate
2
0
1
0
P3
Bill Kock
...
Full
3
0
0
1
P4
Ed Low
...
Associate
4
0
1
0
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
Full
71
Physical Data Warehouse Design: Indexes (2)

Join indexes: Materialize relational join by keeping pairs of
row identifiers that participate in the join
Index on
Product
P1
P2
P3
...
Sales
P1,C1,...
P1,C2,...
P2,C1,...
Index on
Client
C1
C2
...
P3,C2,...
...


Index selection: Select indexes to minimize execution cost of
a series of queries and updates
DBMSs are providing tools to assist DBAs to this task
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
72
Fragmentation or Partitioning


Divide the contents of a relation into several files that can be
more efficiently processed
Vertical fragmentation: Splits attribute of a table into groups
that are stored independently


E.g., most often used attributes in one partition
Horizontal fragmentation: Divides records of a table into
groups according to a particular criterion

Range partitioning: Partitioned wrt range of values



E.g., partitioning based on time, each partition contains data about a
particular time period (year, month)
Hash partitioning: Partitioned wrt hash function
Combined partitioning: Combine these two methods

E.g., rows first partitioned according to range values, then subdivided
using a hash function
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
73
Data Warehouse Architecture (1)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
74
Data Warehouse Architecture (2)

Data sources



Back-end tier



Extraction-Transformation-Loading (ETL) tools for manipulating data
from sources
Data staging area: Intermediate database where manipulation is
done
OLAP tier


Operational databases
Other internal or external sources of information (e.g. files)
OLAP Server: Supports multidimensional data and operations
Front-end tier: Deals with data analysis and visualization

Composed of OLAP tools, reporting tools, statistical tools, datamining tools, …
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
75
Back-End Tier

Extraction: Gathers data from multiple heterogeneous data
sources




Transformation: Modifies data to conform to the data
warehouse format




May be operational databases or files in various formats
May be internal or external to the organization
Uses APIs such as ODBC, JDBC, … for achieving interoperability
Cleaning: Removes errors, inconsistencies, format transformation
Integration: Reconciles data from different sources
Aggregation: Summarizes data according to the granularity (level of
detail) of the DW
Loading: Feeds the DW with transformed data

Also includes refreshing the data warehouse at a specified frequency
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
76
Data Warehouse Tier


Enterprise data warehouse: Centralized DW that
encompasses all areas in an organization
Data mart: Specialized DW targeted to a particular
functional area or user group


Their data can be derived from the enterprise DW or collected from
data sources
Metadata repository: Describes the content of the DW


Business metadata: Meaning (semantics) of data, organization rules,
policies, constraints, …
Technical metadata: How data is structured/stored in the computer


Data sources, data warehouse, and data marts: logical and physical
schemas, security information, monitoring information …
ETL process: Data lineage (trace to sources), rules, defaults, refresh and
purging rules, algorithms for summarization, …
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
77
OLAP Tier

OLAP servers that provides multidimensional view from DWs
and data marts



Most database products provide OLAP extensions and
related tools for manipulating cubes
However, no standardized language for querying data cubes



Can be ROLAP, MOLAP, or HOLAP
Oracle uses Java and query language OLAP DML
SQL Server uses .NET and query language MDX
XMLA (XML for Analysis) aims at providing a common
language for exchanging multidimensional data
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
78
Front-End Tier

OLAP tools: Allow interactive exploration and manipulation
of the warehouse data


Reporting tools: Enable production, delivery and
management of reports (paper and web-based)



Facilitate formulation of ad hoc queries (no prior knowledge of them)
Use predefined queries
Statistical tools: Used to analyze and visualize the cube data
using statistical methods
Data-mining tools: Allow users to analyze data to discover
patterns, trends, enable predictions
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
79
Variations of the Architecture

Enterprise DW without data marts, or data marts without
enterprise DW




Building enterprise DW is costly, building a data mart is easier
Integrating independently created data marts is costly
No OLAP server: Client tools access directly DW
Extreme situation: No DW either, client tools access sources


Virtual DW: Set of materialized views over data sources
Easier to build, but does not provide a real DW solution


Does not contain historical data, no centralized metadata, no possibility
to clean and transform the data
No data staging area: If the source systems conform very
closely to the data in the DW

Rarely the case in real-world situations
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi
80
Download