Chapter 2 Databases & Data Warehouses Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Outline Database Concepts Steps in Database Design Entity-Relationship Model Logical Database Design Relational Model Object-Relational Model Physical Database Design Data Warehouse Concepts Logical Data Warehouse Design Physical Data Warehouse Design Data Warehouse Architecture Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 2 Steps in Database Design Requirements specification: Collects information about users’ needs with respect to the database system Conceptual design: Builds a user-oriented representation of the database without any implementation considerations Logical design: Translates the conceptual schema from the previous phase into an implementation model common to several DBMSs, e.g., relational or object-relational Physical design: Customizes the logical schema from the previous phase to a particular platform, e.g., Oracle or SQL Server Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 3 Entity Relationship Model: University Example Academic staff (0,n) Employee no Name First name Last name Address Email Homepage Research areas (1,n) Participates Project (1,n) Start date End date Project id Project acronym Project name Description (0,1) Start date End date Budget Funding agency Teaches (partial,exclusive) (1,1) (0,n) Assistant Professor Section Thesis title Thesis description Status Tenure date (0,1) /No courses Semester Year Homepage (0,1) (1,1) CouSec Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Prerequisite (0,n) Has prereq Is a prereq Course (1,1) (0,n) Advisor (0,n) (1,n) Course id Course name Level 4 ER Model: Entity and Relationship Types (1) Entity Relationship Model: Most often used conceptual model for database design Entity type: Represents a set of real-world objects of interest to an application Relationship type: Represents an association between objects Entity: object belonging to an entity type Relationship: association belonging to a relationship type Role: Participation of an entity type in a relationship type Cardinalities in a role: Minimum and maximum number of times that the entity may participate in the relationship type Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 5 ER Model: Entity and Relationship Types (2) Types of roles Relationship types Binary vs n-ary: two vs more than two participating entity types Binary relationship types: Optional vs mandatory: minimum cardinality 0 vs 1 Monovalued vs multivalued: maximum cardinality 1 vs n One-to-one, one-to-many, or many-to-many depending on the maximum cardinality of the two roles Recursive relationship types: the same entity type participates more than once in the relationship type Role names are then necessary to distinguish different roles Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 6 ER Model: Attributes Attributes: Structural characteristics describing objects and relationships Attributes have cardinalities, as for roles Depending on cardinalities, attributes may be Optional vs mandatory Monovalued vs multivalued Complex attributes: Attributes composed of other attributes Derived attributes: Their value for each instance may be calculated from other elements of the schema Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 7 ER Model: Identifiers and Weak Entity Types Identifier or key: One or several attributes that uniquely identify a particular object Weak entity type: Do not have identifier of their own A weak entity type is dependent on the existence of another entity type, called the identifying or owner entity type An identifying relationship type relates a weak entity type to its owner Regular entity types are called strong entity types Normal relationship types are called regular relationship types Partial key: Set of attributes that can uniquely identify weak entities related to the same owner entity Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 8 ER Model: Generalization (1) Generalization or is-a relationship: Relates perspectives of the same concept at different abstraction levels Supertype: Entity at higher abstraction level Subtype: Entity at lower abstraction level Population inclusion: Every instance of the subtype is also an instance of the supertype Inheritance: All characteristics (e.g., attributes, roles) of the supertype are inherited by the subtype Substitutability: Each time an instance of a supertype is required, an instance of the subtype can be used instead Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 9 ER Model: Generalization (2) Several generalization relationships accounting for the same criterion can be grouped Types of generalizations Total vs partial: depending on whether every instance of the supertype is also an instance of one of the subtypes Disjoint vs overlapping: depending on whether an instance may belong to one or several subtypes Multiple inheritance: A subtype that has several supertypes May induce conflicts when a property of the same name is inherited from two supertypes Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 10 ER Model: Summary of Notation Professor Identifier Other attributes Entity Type Teaches role name Attributes (1,1) Relationship Type Role (0,1) (1,1) (0,n) (1,n) Simple attribute Complex attribute Attribute 1 Attribute 2 Optional att. (0,1) Multivalued attr. (1,n) Cardinalities Attribute types Section Partial Identifier Other Attributes Weak Entity Type (total,exclusive) (partial,exclusive) (total,overlapping) (partial,overlapping) CouSec Identifying Relationship Type Generalization Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Generalization types 11 Relational Model: University Example Academic staff Employee no First name Last name Address Email Homepage Professor Employee no Status Tenure date (0,1) No courses Assistant Employee no Thesis title Thesis description Advisor Academic staff Research area Employee no Research area Participates Employee no Project id Start date End date Project Project id Project acronym Project name Description (0,1) Start date End date Budget Funding agency Course Section Course id Semester Year Homepage (0,1) Employee no Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Course id Course name Level Prerequisite Course id Has prereq 12 Relational Model (1) Based on a simple data structure Each attribute is defined over a domain or data type A relation or table, composed of attributes or columns Typical domains: integer, float, date, string, … Tuple or row: element of the table Provides several declarative integrity constraints Not null attribute: a value for the attribute must be provided Key: column(s) that uniquely identify one row of the table Simple vs composite key: depending on whether key is composed of one or several columns Primary vs alternate key: one of the keys must be chosen as primary by the designer, the others are alternate keys Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 13 Relational Model (2) Referential integrity Check constraint: defines a predicate that must be valid when inserting or updating a tuple in a relation Defines a link between two tables (or twice the same one) A set of attributes in one table (foreign key) references the primary key of the other table Values in the foreign key columns must also exist in the primary key of the referenced table Predicate can only involve the tuple being inserted/deleted All other constraints must be expressed by triggers: an event-condition-action rule that is automatically activated when a relation is updated Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 14 Translating from ER to Relational Schemas (1) Rule 1: A strong entity type is associated with a table Table contains the simple monovalued attributes and the simple component of the monovalued complex attributes of the entity type Table also defines not null constraints for mandatory attributes Identifier of the entity type define the primary key of the table Academic staff Employee no Name First name Last name Address Email Homepage Research areas (1,n) Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Academic staff Employee no First name Last name Address Email Homepage 15 Translating from ER to Relational Schemas (2) Rule 2: A weak entity type is transformed as a strong entity type, but the associated table also contains the identifier of the owner entity A referential integrity constraint relates this identifier to the table associated with the owner entity type Primary key of the table: partial identifier of the weak entity type + identifier of the owner entity type Section Semester Year ... (1,1) CouSec Course (1,n) Course id ... Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Section Course id Semester Year ... 16 Translating from ER to Relational Schemas (3) Rule 3: A binary one-to-one relationship type is represented by including the identifier of one of the participating entity types in the table associated to the other entity type A referential integrity constraint relates this identifier to the table associated with the corresponding entity type Simple monovalued attributes and simple components of monovalued complex attributes are also included in the table Not null constraint for mandatory attributes are also defined Department ... (1,1) Dean Professor (0,1) ... Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Department ... Dean 17 Translating from ER to Relational Schemas (4) Rule 4: A binary one-to-many relationship type is represented by including the identifier of the entity type with cardinality 1 in the table associated to the other entity type A corresponding referential integrity constraint is added Attributes and not null constraints are also added as in previous rules Assistant ... (1,1) Advisor Professor (0,n) ... Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Assistant ... Advisor 18 Translating from ER to Relational Schemas (5) Rule 5: A binary many-to-many relationship type or an n-ary relationship type is associated with a table containing the identifiers of the entity types that participate in all its roles Corresponding referential integrity constraints are added Attributes and not null constraints are also added as in previous rules Identifier if any is the key of the relation, but the combination of all role identifiers can be used as a key Academic staff Employee no ... (0,n) Participates Start date End date Project (1,n) Project id ... Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Participates Employee no Project id Start date End date 19 Translating from ER to Relational Schemas (6) Rule 6: A multivalued attribute is associated with a table containing the identifier of the entity or relationship type to which it belongs Corresponding referential integrity constraint is added Primary key of table is composed of all its attributes Academic staff ... Research areas (1,n) Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Academic staff Research area Employee no Research area 20 Translating from ER to Relational Schemas (7) Rule 7: A generalization can be translated in 3 ways The supertype and the subtype are associated with tables containing the identifier of the entity type. There is a referential integrity constraint from the table of the subtype to the table of the subtype Academic staff Academic staff Employee no Name ... Assistant Thesis title ... Employee no Name ... Assistant Professor Employee no Thesis title ... Employee no Status ... Professor Status ... Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 21 Translating from ER to Relational Schemas (8) Rule 7 (cont.): Only the supertype is associated with a table, which contains also the attributes of the subtype (these are optional attributes) Academic staff Employee no Name … Thesis title (0,1) ... Status ... Only the subtype is associated with a table containing all attributes of the subtype and the supertype Assistant Professor Employee no Name … Thesis title ... Employee no Name … Status ... Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 22 Defining Relational Schemas in SQL create table AcademicStaff as ( EmployeeNo integer primary key, FirstName character varying (30) not null, LastName character varying (30) not null, Address character varying (50) not null, Email character varying (30) not null, Homepage character varying (64) not null ); create table AcademicStaffResearchArea as ( EmployeeNo integer not null, ResearchArea character varying (30) not null, primary key (EmployeeNo,ResearchArea), foreign key EmployeeNo references AcademicStaff(EmployeeNo) ); … Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 23 Redundancies and Update Anomalies Participates Employee no Project id Start date End date Project acronym Project name Employee no Research area Department id Relation with redundancies may induce anomalies in the presence of insertions, updates, and deletions Affiliation In relation Participates the information about a project is repeated for each staff member who works on that project In relation Affiliation the information about a research area is repeated for all departments to which the staff member is affiliated Dependencies and normal forms are used to describe such anomalies Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 24 Functional Dependencies Given a relation R and to sets of attributes X and Y in R, a functional dependency X Y holds if in all tuples of the relation, each value of X is associated with at most one value of Y In Participates above, Project Id {Project acronym, Project name} A key is a particular case of a functional dependency A prime attribute is an attribute that is part of a key A full functional dependency is a dependency X Y in which the removal of an attribute from X invalidates the dependency A dependency X Z is transitive if there is a set of attributes Y such that X Y and Y Z hold Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 25 Multivalued Dependencies Given a relation R and to sets of attributes X and Y in R, a multivalued dependency X Y holds if the value of X determines a set of values for Y, independently of any other attributes In Affiliation above, Employee no Research area and Employee no Department Id A multivalued dependency X Y is trivial if either Y X or X Y = R, otherwise it is nontrivial Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 26 Normal Forms (1) Normal Form: Integrity constraint certifying that a relation schema satisfies particular properties In the relational model the relations are in first normal form since all attributes are atomic and monovalues A relation schema is in the second normal form if every nonprime attribute is functionally dependent on every key Participates Employee no Project id Start date End date Project acronym Project name Participates Employee no Project id Start date End date Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Project Project id Project acronym Project name 27 Normal Forms (2) A relation is in the third normal form if it is in the second normal form and there are no transitive dependencies between a key and a nonprime attribute Assistant Assistant Employee no Thesis title Thesis description Advisor id Advisor first name Advisor last name Employee no Thesis title Thesis description Advisor id Professor Employee no First name Last name A relation is in the Boyce-Codd normal form if for every nontrivial dependency X Y, X is a key or contains a key Participates Employee no Project id Start date End date Location Participates Employee no Project id Start date End date Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Location Location Project id 28 Normal Forms (3) A relation is in the fourth normal form if for every nontrivial dependency X Y, X is a key or contains a key Affiliation Affiliation Research Employee no Research area Department id Employee no Research area Employee no Research area Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 29 Weaknesses of the Relational Model It provides a very simple data structure, which disallows multivalued and complex attributes The set of types provided by relational DBMSs is very restrictive (integer, float, string, date, …) complex objects are split into several tables performance problems Insufficient for complex application domains There is no integration of operations with data structures It is not possible to directly reference an object by use of a surrogate or a pointer links are based on comparison of values performance problems Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 30 Object-Relational Model: Motivations Preserves the foundations of the relational model Organizes data using an object model Extensions Allow attributes to have complex types groups related facts into a single row Complex and/or multivalued attributes User-defined types with associated methods System-generated identifiers Inheritance among types Defined in the SQL:1999 and SQL:2003 standards Current DBMSs (Oracle, …) introduced object-relational extensions, but do not necessarily comply with the standard Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 31 Object-Relational Model: Definition (1) Allow composite types, in addition to predefined types Complex values may be stored in a column of a table Row type: structured values (i.e., composite attributes) Array type: variable-sized vectors of values Multiset type: Unordered collection of values, with duplicates permitted (bags) Composite types can be nested This is considered an “advanced feature” in the standard Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 32 Object-Relational Model: Definition (2) Two sorts of user-defined types Distinct types: Represented internally by a predefined type, but cannot be mixed in expressions Structured user-defined types: class declarations in OO languages Does not make sense to compare a person’s name and a SSN May have attributes, which can be of any SQL type May have methods, which can be instance or static (class) methods Attributes are encapsulated through observer and mutator functions Both of them can be used as a domain of a column, of an attribute of another type, or a table Comparison/ordering of values: with user-defined methods Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 33 Object-Relational Model: Definition (3) SQL:2003 supports single inheritance: A subtype can A subtype inherits attributes and methods from its supertype Define additional attributes and methods Overload inherited methods: provide another definition with a different signature Override inherited methods: provide another implementation with a the same signature Final types: cannot have subtypes Final methods: cannot be redefined Instantiable type: includes constructor methods for creating instances Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 34 Object-Relational Model: Definition (4) Two types of tables Relational tables: usual ones, where domains for attributes are all predefined or user-defined types Typed tables: defined on structured user-defined types Constraints (e.g. keys) defined in the table declaration Have a self-referencing column that contains identifiers (surrogates) generated by the DBMS Row identifiers can be used for establishing links between tables Ref type: values are the unique identifiers Always associated with a specified structured type Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 35 Object-Relational Model: University Example Academic staff Type Employee no Name First name Last name Address Email Homepage Research areas (1,n) Participates (0,n) Assistant Type Thesis title Thesis description Advisor Project Type Participates Type Employee Project Start date End date Professor Type Status Tenure date (0,1) No courses Advises (0,n) Teaches (0,n) Project id Project acronym Project name Description (0,1) Start date End date Budget Funding agency Participates (1,n) Section Type Course Type Semester Year Homepage (0,1) Professor Course Course id Course name Level Sections (1,n) Has prereq (0,n) Is a prereq (0,n) Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 36 Translating from ER to OR Schemas (1) Rule 1: A strong entity type is associated with a structured type containing all its attributes Multivalued attributes defined with the array or the multiset type Composite attributes defined with the row or the structured type A table of this structured type must be declared Table defines not null constraints for mandatory attributes Identifier of the entity type define the primary key of the table Academic staff Academic staff Type Employee no Name First name Last name Address Email Homepage Research areas (1,n) Employee no Name First name Last name Address Email Homepage Research areas (1,n) Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 37 Translating from ER to OR Schemas (2) Rule 2: A weak entity type is transformed as a strong entity type, but the structured type contains the identifier of its owner Section Semester Year Homepage (0,1) (1,1) CouSec (1,n) Course Course id ... Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Section Type Semester Year Homepage (0,1) Course Course Type Course id ... 38 Translating from ER to OR Schemas (3) Rule 3: A regular relationship type is associated with structured type containing the identifiers of all participating entity types, and all attributes of the relationship A table of this structured type must be declared Table defines not null constraints for mandatory attributes Set of identifiers of entity types define the primary key of the table Inverse references may also be defined Academic staff Employee no ... (0,n) Participates Start date End date (1,n) Project Project id ... Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Participates Type Acad. staff Project Start date End date Academic staff Type … Participates (0,n) Project Type … Participates (1,n) 39 Translating from ER to OR Schemas (4) Rule 4: Only single inheritance is supported by the model multiple inheritance must be removed by transforming some generalization links into binary one-to-one relationships, to which Rule 3 is applied Assistant Assistant Type ... Student Major ... PhD Student Thesis title ... Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Assistant Assistant Type ... Student Major ... PhD Student PhD Student Thesis title … Student 40 Defining Object-Relational Schemas in SQL create type AcademicStaffType as ( EmployeeNo integer, Name row ( FirstName character varying (30), LastName character varying (30) ), Address character varying (50), Email character varying (30), Homepage character varying (64), ResearchAreas character varying (30) multiset, Participates ref(ParticipatesType) multiset scope Participates references are checked ) ref is system generated; create table AcademicStaff of AcademicStaffType ( constraint AcademicStaffPK primary key (EmployeeNo), Email with options constraint AcademicStaffEmailNN Email not null, ref is oid system generated ); … Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 41 Physical Database Design Specifies how database records are stored, accessed, and related in order to ensure adequate performance of a database application Requires to know the specificities of an application, properties of data, and usage patterns Involves analyzing the transactions/queries that run frequently, that are critical, and the periods of time in which there will be a high demand on the database (peak load) Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 42 Performance of Database Applications Factors for measuring performance of database applications A compromise has to be made among these factors Transaction throughput: # transaction processed in a time interval Response time: Elapsed time for the completion of a transaction Disk storage: Space required to store the database file Space-time trade-off: Reduce time to perform an operation by using more space, and vice versa Query-update trade-off: Access to data can be more efficient by imposing some structure upon it However the more elaborate the structure, the more time is needed to built it and maintain it when content changes After initial physical design is done, it is necessary to monitor it and tune it Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 43 Data Organization A database is organized in secondary storage into files, each composed of records, at their turn composed of fields In a disk, data is stored in disk blocks (pages) Set by the operating system during disk formatting Transfer of data between main memory and disk takes place in units of disk blocks DBMSs store data on database blocks (pages) Selection of a database block depends on several issues Locking granularity may be at the block level, not the record level For disk efficiency, database block size must be equal to or be a multiple of the disk block size Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 44 File Organization Arrangement of data in a file into records and blocks Heap files (unordered files): Records placed in the file in the order as they are inserted Sequential files (ordered files): Records sorted on the values of ordering fields Efficient insertions, slow retrieval Fast retrieval, slow inserting and deleting Hash files: Use a hash function that calculates the block (bucket) of a record based on one or several fields Collision: Bucket is filled and a new record must be inserted Fastest retrieval, but collision management degrades performance Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 45 Indexes Additional access structures that speed up retrieval of records in response to search conditions Provide alternative ways of accessing the records based on the indexing fields on which the index is constructed Many different types of indexes Clustering vs nonclustering: Whether records in the file are physically ordered according to the fields of the index Single column vs multiple column, depending on the number of indexing files Unique vs nonunique: Whether duplicate values are allowed Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 46 Indexes, Clustering Types of indexes (cont.) Sparse or dense: Whether there are index records for all search values Single-level vs multilevel: Whether an index is split in several smaller indexes, with an index to these indexes Multi-level indexes often implemented by using B-trees of B+-trees Clustering: Tables physically stored together as they share common columns (cluster key) and are often used together Improves data retrieval Cluster key stored only once storage efficiency Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 47 Outline Database Concepts Steps in Database Design Entity-Relationship Model Logical Database Design Relational Model Object-Relational Model Physical Database Design Data Warehouse Concepts Logical Data Warehouse Design Physical Data Warehouse Design Data Warehouse Architecture Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 48 Data Warehouses: Definition (1) Operational databases (online transaction processing systems or OLTP), are not suitable for data analysis Contain detailed data, do not include historical data, perform poorly for complex queries due to normalization Data warehouses address requirements of decision-making users A data warehouse is a collection of subject-oriented, integrated, nonvolatile, and time-varying data to support data management decisions Subject-oriented: focus on particular subjects of analysis (e.g., purchase, inventory, and sales in a retail company) Operational databases focus on specific functions of an application Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 49 Data Warehouses: Definition (2) Integrated: Join together data from several operational and external systems problems due to differences in data definition and content Nonvolatile: Data modification and removal is disallowed scope of the data is long period of time Operational DBs keep data for a short period of time, as required for daily operations Time-varying: Keep different values for the same information and the time when changes to these values occurred In operational DBs these problems are solved in the design phase Operational DBs typically do not have temporal support Online analytical processing (OLAP): Allows decision-making users to perform interactive analysis of data Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 50 Multidimensional Model Dimensions: Perspectives for analyzing data Cells (facts): Contain measures, values that are to be analyzed 11 30 Q3 26 12 35 32 Q4 14 20 47 31 games 17 14 18 Q2 27 measure values 20 35 33 18 14 10 12 Q1 21 23 18 Milan 24 18 28 14 Rome 33 25 23 25 Nice 12 20 24 33 Paris 10 dimensions Time (Quarter) DWs and OLAP use a multidimensional view of data Represented as a data cube or an hypercube S (C tor ity e ) DVDs books CDs Product (Category) Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 51 Hierarchies Data granularity: Level of detail of measures Data analyzed at different granularities (abstraction levels) Hierarchies relate low-level (detailed) concepts to higherlevel (general concepts) Example: Store – City – Region/Province – Country Given two related levels in a hierarchy, lower level is called child, higher level is called parent Instances of these levels are called members Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 52 Kinds of Hierarchies Balanced: Same number of levels from leaf members to root All Store dimension France Country level Region/ Province level Ile-de-France Provence-AlpesCôte d'Azur Paris Nice City level Store level Store 1 Store 2 Italy ... Store 3 ... ... ... Lazio Lombardy Rome Milan Store 10 Store 11 Store 12 Noncovering (ragged): Parent of some members does not belong to the level immediately above E.g., small countries do not have Region/Province level Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 53 Kinds of Hierarchies Unbalanced: May not have instances at the lower levels E.g., some cities do not have stores, sales are through wholesalers Parent-child (recursive): Involve twice the same level Lepage Ivanov Vancamp Jottard Hallez Nguyen Claus Rochat Weiss Pirlot Spiegel Quintin Praet Berger Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 54 Measure Aggregation and Summarizability Measures are aggregated when using hierarchies for visualizing data at different abstraction levels Summarizability conditions ensure correct aggregation Disjointness of instances: Grouping of instances in a level with respect to the parent in the next level must result in disjoint sets Completeness: All instances are included in the hierarchy and each instance is related to one parent in the next level Correct use of aggregation functions: Type of measures determine the kind of aggregation functions that can be applied Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 55 Measure Classification: Additivity Additive measures (flow or rate measures): Can be meaningfully summarized using addition along all dimensions Semiadditive measures (stock or level measures): Can be meaningfully summarized using addition along some (not all) dimensions E.g., sales amount can be summarized when the hierarchies in Store, Time, and Product dimensions are traversed E.g., inventory quantities, can be aggregated in the Store dimension, but cannot be aggregated in the Time dimension Nonadditive measures (value-per-unit measures): Cannot be meaningfully summarized using addition along any dimension E.g., item price, cost per unit, exchange rate Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 56 Measure Classification: Aggregation Complexity Distributive measures: Defined by an aggregation function that can be computed in a distributed way Algebraic measures: Defined by an aggregation function that has a bounded number of arguments, each is obtained by applying a distributive function Data is partitioned into n sets, aggregate function applied to each set, aggregated value is computed by applying a function to these n subaggregate values E.g., count, sum, min, max E.g., average, computed from sum and count, which are distributive Holistic measures: Cannot be computed from other subaggregate values E.g., median, mode, rank Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 57 OLAP Operations: Roll up Transforms detailed measures into summarized ones when one moves up in a hierarchy 12 35 32 Q4 14 20 47 31 games DVDs books CDs Product (Category) Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 43 51 Q1 33 30 42 68 Q2 27 14 11 30 Q3 26 12 35 32 Q4 14 20 47 31 games books 39 41 Q3 26 57 37 30 17 11 Roll-up to the Country level 18 14 23 Q2 27 14 35 20 18 12 10 33 Q1 21 10 Time (Quarter) 18 Italy France 51 S (C tor ity e ) Milan 24 18 28 14 Rome 33 25 23 25 Nice 12 20 24 33 Paris Time (Quarter) DVDs CDs Product (Category) 58 OLAP Operations: Drill down Opposite to the roll-up operation, i.e., it moves from a more general level to a detailed level in a hierarchy Q3 26 12 35 32 Q4 14 20 47 31 games DVDs books CDs Product (Category) Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Time (Quarter) 30 17 11 Drill-down to the Month level 18 14 23 Q2 27 14 35 20 18 12 10 33 Q1 21 10 Time (Quarter) 18 S (C tor ity e ) S (C tor ity e ) Milan 24 18 28 14 Rome 33 25 23 25 Nice 12 20 24 33 Paris Milan 8 6 9 5 Rome 10 8 11 8 Nice 4 7 8 10 Paris 6 0 1 Jan 7 2 6 13 4 3 1 7 Feb 8 4 8 12 9 ... ... Mar 6 4 4 10 ... 8 4 1 ... ... ... ... ... 5 Dec 4 4 games 16 7 DVDs books CDs Product (Category) 59 OLAP Operations: Pivot or Rotate Rotates the axes of a cube to provide an alternative presentation of the data S (C tor ity e ) (C Pro a t du eg c or t y) Milan 24 18 28 14 Rome 33 25 23 25 Nice 12 20 24 33 Paris games 10 31 14 11 13 Rome 33 28 35 32 Milan 24 23 25 18 33 47 12 18 20 Nice 21 Q4 14 14 17 32 26 20 35 27 28 12 21 47 Q3 26 Pivot Paris 19 30 Store (City) 11 17 14 18 Q2 27 14 35 20 18 12 10 33 Q1 21 23 18 DVDs 35 30 32 31 CDs 18 11 35 47 games 10 14 12 20 books 10 Time (Quarter) DVDs books CDs Product (Category) Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Q1 Q2 Q3 Q4 Time (Quarter) 60 OLAP Operations: Slice 30 Q3 26 12 35 32 Q4 14 20 47 31 games 17 11 18 14 23 Q2 27 14 35 20 18 12 10 Slice on Store.City = ‘Paris’ 33 Q1 21 10 Time (Quarter) 18 S (C tor ity e ) Milan 24 18 28 14 Rome 33 25 23 25 Nice 12 20 24 33 Paris DVDs Time (Quarter) Performs a selection on one dimension of a cube, resulting in a subcube Q1 21 10 18 35 Q2 27 14 11 30 Q3 26 12 35 32 Q4 14 20 47 31 games DVDs books CDs Product (Category) books CDs Product (Category) Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 61 OLAP Operations: Dice 30 Q3 26 12 35 32 Q4 14 20 47 31 23 11 17 14 20 Q2 27 Dice on Store.Country = ‘France’ and Time.Quarter= ‘Q1’ or ‘Q2’ 18 35 33 18 14 10 12 Q1 21 10 Time (Quarter) 18 Nice Paris 12 20 24 Q1 21 10 18 35 Q2 27 14 11 30 games books 33 DVDs CDs Product (Category) games DVDs books CDs Product (Category) Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 62 14 S (C tor ity e ) Milan 24 18 28 14 Rome 33 25 23 25 Nice 12 20 24 33 Paris S Time (C tor (Quarter) ity e ) Defines a selection on two or more dimensions, thus again defining a subcube Other OLAP Operations Drill-across: Executes queries involving more than one cube, which share at least one dimension E.g., compare sales data in a cube with purchasing data in another one Drill-through: Allows one to move from data at the bottom level in a cube, to data in the operational systems from which the cube was derived E.g., for determining the reason for outlier values in a data cube Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 63 Implementations of a Multidimensional Model Relational OLAP (ROLAP): Store data in relational databases, support extensions to SQL and special access methods to efficiently implement the model and its operations Multidimensional OLAP (MOLAP): Store data in special data structures (e.g., arrays) and implement OLAP operations in these structures Better performance than ROLAP for query and aggregation, less storage capacity than ROLAP Hybrid OLAP (HOLAP): Combine both technologies, E.g., detailed data stored in relational databases, aggregations kept in a separate MOLAP store Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 64 Logical Data Warehouse Design In ROLAP systems, tables organized in specialized structures Star schema: One fact table and a set of dimension tables Snowflake schema: Avoids redundancy of star schemas by normalizing dimension tables Referential integrity constraints between fact table and dimension tables Dimension tables may contain redundancy in the presence of hierarchies Normalized tables optimize storage space, but decrease performance Starflake schema: Combination of the star and snowflake schemas, some dimensions normalized, other not Constellation schema: Multiple fact tables that share dimension tables Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 65 Logical DW Design: Star Schema Product Store Product key Product number Product name Description Size Category name Category descr. Department name Department descr. ... Store key Store number Store name Store address Manager name City name City population City area State name State population State area State major activity ... Promotion Promotion key Promotion descr. Discount pct. Type Start date End date ... Sales Product fkey Store fkey Promotion fkey Time fkey Amount Quantity Time Time key Date Event Weekday flag Weekend flag Season ... Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 66 Logical DW Design: Snowflake Schemas Product Product key Product number Product name Description Size Category fkey ... Promotion Promotion key Promotion descr. Discount pct. Type Start date End date ... Category Category key Category name Description Department fkey ... Sales Product fkey Store fkey Promotion fkey Time fkey Amount Quantity Department Department key Department name Description ... Time Time key Date Event Weekday flag Weekend flag Season ... Store Store key Store number Store name Store address Manager name City fkey ... City City key City name City population City area State fkey ... Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi State State key State name State population State area State major activity ... 67 Logical DW Design: Constellation Schemas Promotion Promotion key Promotion descr. Discount pct. Type Start date End date ... Product Product key Product number Product name Description Size Category name Category descr. Department name Department descr. ... Sales Product fkey Store fkey Promotion fkey Time fkey Amount Quantity Time Time key Date Event Weekday flag Weekend flag Season ... Purchases Product fkey Supplier fkey Order time fkey Due time fkey Amount Quantity Freight cost Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Store Store key Store number Store name Store address Manager name City name City population City area State name State population State area State major activity ... Supplier Supplier key Supplier name Contact person Supplier address City name State name ... 68 Physical Data Warehouse Design: Views (1) Data warehouses are several orders of magnitude larger than operational database Physical design is crucial for ensuring adequate response time for complex queries View: Table derived from a query Materialized view : View physically stored in a database Query may involve base tables or other views Enhances query performance by precalculating costly operations This is achieved at the expense of storage space Updating materialized views: Modifications of underlying base tables must be propagated into the view Must be performed incrementally to ensure efficiency Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 69 Physical Data Warehouse Design: Views (2) In a DW not all possible aggregations can be materialized Number of aggregations grows exponentially with the number of dimensions and hierarchies Selection of materialized views: Identify the queries to be prioritized, which determines the views to be materialized DBMSs are providing tools that tune the selection of materialized views given previous queries Queries must be rewritten in order to best exploit existing materialized views to improve response time Drawback of materialized view approach: Requires one to anticipate queries to be materialized DW queries are often ad hoc and cannot always be anticipated Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 70 Physical Data Warehouse Design: Indexes (1) Traditional indexes are not appropriate for data warehouses Bitmap indexes: For columns with a low number of distinct values Designed for OLTP transactions that access a small number of tuples Each value is associated with a bit vector whose length is the number of records in the table ith bit is set if ith row has that value in the column Small in size, logical operators used to manipulate them Professor Index on Rank PId Name ... Rank Rec# Assistant Associate P1 John Smith ... Assistant 1 1 0 0 P2 Ada Jones ... Associate 2 0 1 0 P3 Bill Kock ... Full 3 0 0 1 P4 Ed Low ... Associate 4 0 1 0 Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi Full 71 Physical Data Warehouse Design: Indexes (2) Join indexes: Materialize relational join by keeping pairs of row identifiers that participate in the join Index on Product P1 P2 P3 ... Sales P1,C1,... P1,C2,... P2,C1,... Index on Client C1 C2 ... P3,C2,... ... Index selection: Select indexes to minimize execution cost of a series of queries and updates DBMSs are providing tools to assist DBAs to this task Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 72 Fragmentation or Partitioning Divide the contents of a relation into several files that can be more efficiently processed Vertical fragmentation: Splits attribute of a table into groups that are stored independently E.g., most often used attributes in one partition Horizontal fragmentation: Divides records of a table into groups according to a particular criterion Range partitioning: Partitioned wrt range of values E.g., partitioning based on time, each partition contains data about a particular time period (year, month) Hash partitioning: Partitioned wrt hash function Combined partitioning: Combine these two methods E.g., rows first partitioned according to range values, then subdivided using a hash function Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 73 Data Warehouse Architecture (1) Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 74 Data Warehouse Architecture (2) Data sources Back-end tier Extraction-Transformation-Loading (ETL) tools for manipulating data from sources Data staging area: Intermediate database where manipulation is done OLAP tier Operational databases Other internal or external sources of information (e.g. files) OLAP Server: Supports multidimensional data and operations Front-end tier: Deals with data analysis and visualization Composed of OLAP tools, reporting tools, statistical tools, datamining tools, … Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 75 Back-End Tier Extraction: Gathers data from multiple heterogeneous data sources Transformation: Modifies data to conform to the data warehouse format May be operational databases or files in various formats May be internal or external to the organization Uses APIs such as ODBC, JDBC, … for achieving interoperability Cleaning: Removes errors, inconsistencies, format transformation Integration: Reconciles data from different sources Aggregation: Summarizes data according to the granularity (level of detail) of the DW Loading: Feeds the DW with transformed data Also includes refreshing the data warehouse at a specified frequency Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 76 Data Warehouse Tier Enterprise data warehouse: Centralized DW that encompasses all areas in an organization Data mart: Specialized DW targeted to a particular functional area or user group Their data can be derived from the enterprise DW or collected from data sources Metadata repository: Describes the content of the DW Business metadata: Meaning (semantics) of data, organization rules, policies, constraints, … Technical metadata: How data is structured/stored in the computer Data sources, data warehouse, and data marts: logical and physical schemas, security information, monitoring information … ETL process: Data lineage (trace to sources), rules, defaults, refresh and purging rules, algorithms for summarization, … Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 77 OLAP Tier OLAP servers that provides multidimensional view from DWs and data marts Most database products provide OLAP extensions and related tools for manipulating cubes However, no standardized language for querying data cubes Can be ROLAP, MOLAP, or HOLAP Oracle uses Java and query language OLAP DML SQL Server uses .NET and query language MDX XMLA (XML for Analysis) aims at providing a common language for exchanging multidimensional data Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 78 Front-End Tier OLAP tools: Allow interactive exploration and manipulation of the warehouse data Reporting tools: Enable production, delivery and management of reports (paper and web-based) Facilitate formulation of ad hoc queries (no prior knowledge of them) Use predefined queries Statistical tools: Used to analyze and visualize the cube data using statistical methods Data-mining tools: Allow users to analyze data to discover patterns, trends, enable predictions Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 79 Variations of the Architecture Enterprise DW without data marts, or data marts without enterprise DW Building enterprise DW is costly, building a data mart is easier Integrating independently created data marts is costly No OLAP server: Client tools access directly DW Extreme situation: No DW either, client tools access sources Virtual DW: Set of materialized views over data sources Easier to build, but does not provide a real DW solution Does not contain historical data, no centralized metadata, no possibility to clean and transform the data No data staging area: If the source systems conform very closely to the data in the DW Rarely the case in real-world situations Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 80