BDDW-03-Multidimensional Model

advertisement

Bases de Dados e Data Warehouse

03

Multidimensional Modeling

BDDW 2006/2007

Notice

!

Author

" João Moura Pires (jmp@di.fct.unl.pt)

!

This material can be freely used for personal or academic purposes without any previous authorization from the author, only if this notice is maintained with.

!

For commercial purposes the use of any part of this material requires the previous authorization from the author.

Multidimensional Modeling - 2

Bibliography

!

Many examples are extracted and adapted from

" The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling

(Second Edition) - Ralph Kimball, Margy Ross

Table of Contents

!

Introduction

" Different users (OLTP versus OLAP)

" Basic Elements of a Data Warehouse

!

Basic Concepts

" Fact Tables

" Dimension Tables

" Star Schemas

Multidimensional Modeling - 3

Multidimensional Modeling - 4

Multidimensional Modeling

Introduction

Multidimensional Modeling 5

The different users

!

The users of an operational system

" turn the wheels of the organization. They take orders, sign up new customers, and log complaints. Users of an operational system almost always deal with one record at a time . They repeatedly perform the same operational tasks over and over .

!

The users of a data warehouse

" watch the wheels of the organization turn. They count the new orders and compare them with last week’s orders and ask why the new customers signed up and what the customers complained about. Users of a data warehouse almost never deal with one row at a time. Rather, their questions often require that hundreds or thousands of rows be searched and compressed into an answer set . To further complicate matters, users of a data warehouse continuously change the kinds of questions they ask.

Introduction

[Kimball, 2002]

Multidimensional Modeling - 6

Basic elements of the data warehouse

Introduction

[Kimball, 2002]

Multidimensional Modeling - 7

Data Staging Area

!

The data staging area of the data warehouse is both a storage area and a set of processes commonly referred to as extract-transformation-load ( ETL ). The data staging area is everything between the operational source systems and the data presentation area.

!

Extraction is the first step in the process of getting data into the data warehouse environment. Extracting means reading and understanding the source data and copying the data needed for the data warehouse into the staging area for further manipulation:

" cleansing the data (correcting misspellings, resolving domain conflicts, dealing with missing elements, or parsing into standard formats), combining data from multiple sources, deduplicating data , and assigning warehouse keys .

Introduction

[Kimball, 2002]

Multidimensional Modeling - 8

Data Staging Area

!

The data staging area is dominated by the simple activities of sorting and sequential processing. In many cases , the data staging area is not based on relational technology but instead may consist of a system of flat files . However, a normalized database for data staging storage is acceptable.

!

It is acceptable to create a normalized database to support the staging processes; however, this is not the end goal. The normalized structures must be off-limits to user queries because they defeat understandability and performance. As soon as a database supports query and presentation services, it must be considered part of the data warehouse presentation area. By default, normalized databases are excluded from the presentation area, which should be strictly dimensionally structured.

Introduction

[Kimball, 2002]

Multidimensional Modeling - 9

The Presentation Area

!

The presentation area is a series of integrated data marts . A data mart is a wedge of the overall presentation area pie. In its most simplistic form, a data mart presents the data from a single business process (that cross the boundaries of organizational functions).

!

A dimensional model contains the same information as a normalized model but packages the data in a format whose design goals are user understandability, query performance, and resilience to change.

!

The presentation area data marts must contain detailed, atomic data . Atomic data is required to withstand assaults from unpredictable ad hoc user queries.

!

All the data marts must be built using common dimensions and facts, which are refered has being conformed . Without shared, conformed dimensions and facts, a data mart is a standalone stovepipe application.

Introduction

[Kimball, 2002]

Multidimensional Modeling - 10

The Presentation Area

!

Data in the queryable presentation area of the data warehouse must be dimensional , must be atomic , and must adhere to the data warehouse bus architecture . Using the bus architecture is the secret to building distributed data warehouse systems.

!

If the presentation area is based on a relational database, then these dimensionally modeled tables are referred to as star schemas .

!

If the presentation area is based on multidimensional database or online analytic processing

(OLAP) technology, then the data is stored in cubes .

!

Dimensional modeling is applicable to both relational and multidimensional databases .

Introduction

[Kimball, 2002]

Multidimensional Modeling - 11

Multidimensional Modeling

Basic Concepts

Multidimensional Modeling 12

Fact Table

!

A fact table is the primary table in a dimensional model where the numerical performance measurements of the business are stored

!

The term fact is used to represent a business measure .

!

A measurement is taken at the intersection of all the dimensions (day, product, and store). This list of dimensions defines the grain of the fact table .

!

A row in a fact table corresponds to a measurement . A measurement is a row in a fact table. All the measurements in a fact table must be at the same grain .

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 13

Numerical values and Additive

!

The most useful facts are numeric and additive

" Additivity is crucial because data warehouse applications almost never retrieve a single fact table row. Rather, they bring back hundreds, thousands, or even millions of fact rows at a time, and the most useful thing to do with so many rows is to add them up.

" Semi-additive facts can be added only along some of the dimensions,

" Non-additive facts simply can’t be added at all. They can be counted or averaged.

!

It is theoretically possible for a measured fact to be textual; however, the condition arises rarely. In most cases, a textual measurement is a description of something and is drawn from a discrete list of values.

" Unless the text is unique for every row in the fact table, it belongs in the dimension table .

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 14

Fact table does not store “non-events”

!

It is very important that we do not try to fill the fact table with zeros representing nothing happening because these zeros would overwhelm most of our fact tables.

" if there is no sales activity on a given day in a given store for a given product, we leave the row out of the table.

!

By only including true activity, fact tables tend to be quite sparse. Despite their sparsity, fact tables usually make up 90 percent or more of the total space consumed by a dimensional database.

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 15

Main Categories of Fact tables

!

The fact table grains fall into one of three fundamental categories:

" transaction

" periodic snapshot

" accumulating snapshot.

!

Transaction grain fact tables are among the most common.

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 16

Primary Keys and Fact Tables

!

All fact tables have two or more foreign keys, that connect to the dimension tables’ primary keys

" For example, the product key in the fact table always will match a specific product key in the product dimension table.

!

The fact table itself generally has its own primary key made up of a subset of the foreign keys .

" Every fact table in a dimensional model has a composite key, and conversely, every table that has a composite key is a fact table.

" In a dimensional model , every table that expresses a many-to-many relationship must be a fact table . All other tables are dimension tables.

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 17

Primary Keys and Fact Tables (cont)

!

Only a subset of the components in the fact table composite key typically is needed to guarantee row uniqueness .

" Sometimes there are as few as two dimensions, such as the invoice number and the product key.

!

Once this subset has been identified, the rest of the dimensions take on a single value in the context of the fact table row’s primary key.

!

In most cases , there is no advantage to introducing a unique ROWID key to serve as the primary key in the fact table.

" However, such a key may be required to placate the database management system, especially if you can legitimately, from a business perspective, load multiple identical rows into the fact table .

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 18

Dimension Tables

!

The dimension tables contain the textual descriptors of the business .

" Well-designed dimensional model, dimension tables have many columns or attributes.

These attributes describe the rows in the dimension table.

" It is not uncommon for a dimension table to have 50 to 100 attributes. Dimension tables tend to be relatively shallow in terms of the number of rows

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 19

Dimension Tables

!

Each dimension is defined by its single primary key , which serves as the basis for referential integrity with any given fact table to which it is joined.

" The primary key defines the dimension granularity

!

Dimension attributes serve as the primary source of query constraints , groupings , and report labels .

" they are key to making the data warehouse usable and understandable. In many ways, the data warehouse is only as good as the dimension attributes

" Dimension tables are the entry points into the fact table . Robust dimension attributes deliver robust analytic slicing and dicing capabilities . The dimensions implement the user interface to the data warehouse

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 20

Dimension Tables - Attributes

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 21

Dimension Tables - attributes

!

The best attributes are textual and discrete .

!

Sometimes when we are designing a database it is unclear whether a numeric data field extracted from a production data source is a fact or dimension attribute .

" If it takes on lots of values and participates in calculations - it is a fact

" It is a discretely valued description that is more or less constant and participates in constraints - it is a dimensional attribute.

" Occasionally, we can’t be certain of the classification. In such cases, it may be possible to model the data field either way, as a matter of designer’s prerogative.

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 22

Dimension Tables - hierarchies

!

Dimension tables often represent hierarchical relationships in the business.

" In our sample product dimension table, products roll up into brands and then into categories.

!

The hierarchical descriptive information is stored redundantly , in the spirit of ease of use and query performance . Dimension tables typically are highly denormalized.

!

The dimensions are usually quite small (less than 10 percent of the total data storage requirements). Since dimension tables typically are geometrically smaller than fact tables, improving storage efficiency by normalizing or snowflaking has virtually no impact on the overall database size.

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 23

Star Schemas

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 24

Star Schemas

!

Model simplicity: symmetry and reduced number of tables

" Business users benefit from the simplicity because the data is easier to understand and navigate .

" The simplicity of a dimensional model also has performance benefits .

" Dimensional models are gracefully extensible to accommodate change .

!

Can accommodate completely new dimensions to the schema as long as a single value of that dimension is defined for each existing fact row

!

Can accommodate new, unanticipated facts to the fact table, assuming that the level of detail is consistent with the existing fact table.

!

Can supplement preexisting dimension tables with new , unanticipated attributes .

!

Can break existing dimension rows down to a lower level of granularity from a certain point in time forward

[Kimball, 2002]

Basic Concepts Multidimensional Modeling - 25

Typical result

Time

tempo_key dia_semana mês trimestre ano

Sales

tempo_key produto_key loja_key valor_vendido unidades_vendidas valor_custo

Basic Concepts

Product

produto_key descrição marca categoria

Store

loja_key nome morada tipo

[Kimball, 2002]

Multidimensional Modeling - 26

Typical result

!

Data for the first quarter for all stores by brand

Marca

M-1

M-2

M-3

M-4

Valor Vendido

780

1044

213

95

Unidades Vendidas

263

509

444

39

Metrics

Distinct values for the selected attribute

Textual Attribute of a Dimension

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 27

Typical SQL query for OLAP

Selecting the columns select p.marca, sum (f.valor_vendido), sum (f.unidades_vendidas) from vendas f, produto p, tempo t

aliases

where f.produto_key = p.produto_key

!

and f.tempo_key = t.time_key

!

and f.trimestre = “T1 1996” group by p.marca

order by p.marca

Join constraint

Join constraint

Application constraint

Grouping

Sorting

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 28

Processing the SQL query for OLAP

!

First the application constraints (slicing) are processed for each dimension.

" Ex: Month = “Mars”; Year = “1997”; Type of Store = “Hyper”;

Region = “..”

!

Each dimension produce a set of candidate keys.

" Ex: Date: all date_key referentes a Março de 1997

!

The candidate kes are concatenated (by cartesian product) to build the concatenated keys to be searched in the fact table

!

All the hits in the fact table are selected, grouped and aggregated accordingly the query

Basic Concepts

[Kimball, 2002]

Multidimensional Modeling - 29

Multidimensional Modeling

Further Reading and Summary

Multidimensional Modeling 30

Further Reading and Summary

!

Readings

!

(Kimball - The Data Warehouse toolkit, 2002) - pag 16 to 27.

!

What you should know

!

The basic idea of Kimball school: the DW is a collection of Multidimensional Data Marts that are built incrementally and are made compatible

!

The basic concepts of multidimensional modeling:

Facts, dimensions and Star-Schema

The key of a fact table as a subset of foreign keys.

Grain of a fact table. All the measurements in a fact table must be at the same grain.

Fact table does not store “non-events”

Dimension granularity.

Dimension attributes serve as the primary source of query constraints, groupings, and report labels.

Dimensional models are gracefully extensible to accommodate change

Multidimensional Modeling - 31

Download