Building Dimensional Models

1. Matrix Method for Getting Started
1.1 Build the Matrix
1.2 Use the Four-step Method to Design Each Fact Table
2. Managing the Dimensional Modelling Project
2.1 Data Warehouse Bus Architecture Matrix
2.2 Fact Table Diagram
2.3 Fact Table Detail
2.4 Dimension Table Detail
2.5 Steps for the Dimensional Modelling Team
2.6 Identifying the Sources for Each Fact Table and Dimension Table
2.7 Using a Data Modelling Tool
1. Matrix Method for Getting Started
To decide which dimensional model to build, we begin with a top-down
approach we call the Data Warehouse Bus Architecture matrix. The matrix
forces us to name all the data marts we could possibly build and to name all
the dimensions implied by those data marts.
Once we have identified all the possible data marts and dimensions, we can
get very specific about the design of individual fact tables within these marts.
For each fact table we apply the four-step method described earlier.
1.1 Build the Matrix
The rows for the Data Warehouse Bus Architecture matrix are the data marts
and the columns are the dimensions.
List the Data Marts (rows in the matrix)
To start building the matrix, we list a series of single-source data marts. A
single-source data mart is deliberately chosen to revolve around a single kind
of data source. In a second step, we propose multiple-source data marts that
combine the single-source designs into broader views of the business.
Example: Data marts for a large telephone company
Possible candidate set of single-source data marts for the telephone
company:
• customer billing statements
• scheduled service and installation orders
• trouble reports
• yellow page advertising orders
• customer service and billing inquiries
• marketing promotions
• call detail from the billing perspective
• customer inventory
• network inventory (switches, lines, computers)
• real estate inventory (poles, rights of way, buildings, …)
• labor and payroll
• computer system job processing
• purchase orders to suppliers
• deliveries from suppliers
Multiple-source data marts for the telephone company might include:
• combined field operations tracking (service orders, trouble reports,
  customer/network inventory)
• customer relationship management (customer billing, customer inquiries,
  promotion tracking)
• customer profitability
List the dimensions (columns in the matrix)
List the conceivable dimensions for each data mart we found.
For example, in the data mart ‘customer billing statements’ we might list the
following dimensions:
• Time (date of billing)
• Customer
• Service
• Rate category (including promotion)
• Local service provider
• …
Mark the intersections
With the rows (data marts) and columns (dimensions) of our matrix defined,
we systematically mark all the intersections where a dimension exists for a
data mart.
For our telephone company example, we might end up with a matrix that
looks like figure 7.1.
Figure 7.1 The Data Warehouse Bus Architecture matrix for a telephone
company
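As a minimal sketch, the matrix can be held as a mapping from each data mart to the dimensions it uses. Only the dimensions of 'customer billing statements' come from the list above; the other two rows, and the names MARTS, build_matrix and shared_dimensions, are hypothetical and serve only as illustration.

```python
# Hypothetical subset of the telephone-company matrix. Only the first row is
# taken from the text; the other rows are illustrative assumptions.
MARTS = {
    "customer billing statements": [
        "Time", "Customer", "Service", "Rate category", "Local service provider",
    ],
    "trouble reports": ["Time", "Customer", "Service"],             # assumed
    "marketing promotions": ["Time", "Customer", "Rate category"],  # assumed
}


def build_matrix(marts):
    """Return (dimensions, rows); rows map each mart to a list of booleans."""
    dimensions = []
    for dims in marts.values():
        for dim in dims:
            if dim not in dimensions:
                dimensions.append(dim)  # keep first-seen column order
    rows = {
        mart: [dim in dims for dim in dimensions]
        for mart, dims in marts.items()
    }
    return dimensions, rows


def shared_dimensions(dimensions, rows):
    """Dimensions marked for more than one mart."""
    return [
        dim for i, dim in enumerate(dimensions)
        if sum(marks[i] for marks in rows.values()) > 1
    ]


dimensions, rows = build_matrix(MARTS)
for mart, marks in rows.items():
    # Print one matrix row per data mart, with X at each marked intersection.
    print(f"{mart:30}", " ".join("X" if m else "." for m in marks))
```

Listing the dimensions that appear in more than one row is a quick way to see which dimensions several data marts would have to share.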
1.2 Use the Four-step Method to design Each Fact Table
Start the detailed logical and physical design of individual tables.
We discussed the four-step method in detail earlier, but we will review the
highlights here.
Step 1. Choose the Data Mart
We look down the row headings of our matrix and choose one of the data
marts. The first fact table in the design should come from a single-source data
mart.
Step 2. Declare the Grain
Declaring the grain is equivalent to saying exactly what an individual fact
table record represents. If an individual fact table record represents the
daily sales total in a retail store, then that is the grain. If an individual
fact table record is a line item on an order, then that is the grain. If an
individual fact table record is a customer transaction, then that is the grain.
Step 3. Choose the Dimensions
Choose the dimensions for the particular fact table. For example, a grain of
‘daily inventory levels of individual stock items in a distribution center’
specifies the time dimension, the stock item dimension and the location
dimension.
Note that billing month is not the same as calendar month and separate fact
tables using these two interpretations of time must label them as distinct
dimensions. This is one of the lessons of conformed dimensions. When things
are the same, they should be exactly the same and have the same names.
When things are different, they must have different names.
Once a dimension is chosen, there may be a large number of descriptive
attributes, which can be used to populate the dimension. These descriptive
attributes may come from several sources. At this point it is helpful to make a
list of all the known descriptive attributes available to describe an item in the
dimension (a product, a customer, a service, a location, or a day).
Step 4. Choose the Facts
Add as many facts as possible within the context of the declared grain.
In the example of the telephone company:
1. Data Mart: Customer billing
2. Grain: the individual line item on each monthly customer bill
3. Dimensions: a time dimension, a customer dimension, a service (line item)
dimension, and perhaps a rate or promotion dimension
4. Facts: Line item amount, Line item quantity
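The four choices above can be sketched as a small star schema. This is only an illustration using Python's built-in sqlite3 module: all table names, column names and sample rows are hypothetical, and only the grain, the dimensions and the two facts follow the telephone billing example.

```python
import sqlite3

# Hypothetical table and column names; the grain, dimensions, and facts follow
# the telephone billing example in the text.
DDL = """
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, billing_date TEXT);
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE service_dim  (service_key INTEGER PRIMARY KEY, service_name TEXT);
CREATE TABLE rate_dim     (rate_key INTEGER PRIMARY KEY, rate_category TEXT);

-- Grain: one row per line item on each monthly customer bill.
CREATE TABLE billing_line_item_fact (
    time_key           INTEGER NOT NULL REFERENCES time_dim(time_key),
    customer_key       INTEGER NOT NULL REFERENCES customer_dim(customer_key),
    service_key        INTEGER NOT NULL REFERENCES service_dim(service_key),
    rate_key           INTEGER NOT NULL REFERENCES rate_dim(rate_key),
    line_item_amount   REAL    NOT NULL,
    line_item_quantity INTEGER NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# Made-up sample rows: one bill with two line items.
conn.execute("INSERT INTO time_dim VALUES (1, '2024-01-31')")
conn.execute("INSERT INTO customer_dim VALUES (1, 'Example Customer')")
conn.executemany("INSERT INTO service_dim VALUES (?, ?)",
                 [(1, "Local line"), (2, "Voicemail")])
conn.execute("INSERT INTO rate_dim VALUES (1, 'Standard')")
conn.executemany(
    "INSERT INTO billing_line_item_fact VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, 1, 1, 20.0, 1), (1, 1, 2, 1, 5.0, 1)],
)

# Roll the line items up to one row per customer by summing both facts.
bill_total = conn.execute(
    """SELECT c.customer_name, SUM(f.line_item_amount), SUM(f.line_item_quantity)
       FROM billing_line_item_fact f
       JOIN customer_dim c ON c.customer_key = f.customer_key
       GROUP BY c.customer_name"""
).fetchall()
print(bill_total)
```

Stating the grain as a comment next to the fact table keeps the declaration from Step 2 visible to anyone who later adds dimensions or facts.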
2. Managing the Dimensional Modelling Project
Four graphical tools to facilitate the project:
• Data Warehouse Bus Architecture matrix
• Fact table diagram
• Fact table detail
• Dimension table detail
2.1 Data Warehouse Bus Architecture matrix
The matrix (fig 7.1) developed by the design team can be used as a
presentation aid for meetings with other designers, administrators and end
users.
2.2 Fact table Diagram
Prepare a logical diagram for each completed fact table.
The fact table diagram not only shows the specifics of a given fact table but
also the context of the fact table in the overall data mart. The fact table
diagram names the fact table, clearly states its grain, and shows all the
dimensions to which it is connected. It also shows all the other dimensions
that have been identified for the business, without connections.
The fact table diagram for the telephone billing line item example is shown in
Figure 7.2.
To aid in the understanding of the model, it is important to retain consistency
in the fact table diagrams. Think about the order in which you place the
dimensions around the hub. We recommend that you put the time dimension
at the top and work your way around the fact table in order of importance of
the dimension.
Figure 7.2 The telephone billing fact table diagram. Disconnected
dimensions are shown on both sides of the diagram.
The supporting information for the Fact Table Diagram includes the name of
each dimension and a description of that dimension.
Keep your descriptions and diagrams together.
Dimension Name
Dimension Description
2.3 Fact Table Detail
The fact table detail provides a complete list of all the facts available through
the fact table. See Figure 7.3. This list includes actual facts in the physical
table, derived facts presented through DBMS views, and other facts that are
possible to calculate from the first two groups. Aggregation rules should be
provided with each fact to warn the reviewer that some facts may be
semiadditive or nonadditive across certain dimensions. For instance, facts
such as temperatures are completely nonadditive across all dimensions, but
they can be averaged.
Figure 7.3 Fact table detail diagram showing dimension keys, basic facts,
and derived facts (with asterisks).
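One way to make the aggregation rules explicit is to record a rule next to each fact and to route every roll-up through it. The sketch below is an assumption about how this could look; the dictionary AGGREGATION_RULES, the rule names and the sample values are hypothetical.

```python
from statistics import mean

# Hypothetical aggregation rules keyed by fact name. "sum" marks an additive
# fact; "average" marks a nonadditive fact that may only be averaged.
AGGREGATION_RULES = {
    "line_item_amount": "sum",
    "temperature": "average",
}


def aggregate(fact_name, values):
    """Roll up values according to the rule documented for the fact."""
    rule = AGGREGATION_RULES[fact_name]
    if rule == "sum":
        return sum(values)
    if rule == "average":
        return mean(values)
    raise ValueError(f"unknown aggregation rule {rule!r}")


print(aggregate("line_item_amount", [20.0, 5.0]))
print(aggregate("temperature", [10.0, 20.0, 30.0]))
```

Summing the three temperatures would produce a number with no business meaning, which is exactly the mistake the documented rule is meant to prevent.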
2.4 Dimension Table Detail
The dimension table detail diagram (see Figure 7.4) shows the individual
attributes within a single dimension. Each dimension will have a separate
diagram. The diagram shows the explicit grain of each dimension. The
dimension table detail shows the approximate cardinality of each dimension
attribute and allows the users to quickly see the multiple hierarchies and
relationships between the attributes.
Figure 7.4 Dimension table detail diagram (relative cardinalities shown in
parentheses)
Again, full descriptive information must be provided to support the diagram.
Figure 7.5 shows a sample of what should be documented for each dimension
attribute.
• Attribute name. The official business attribute name.
• Attribute definition. A brief but meaningful description of the business
  attribute.
• Cardinality. The best intuitive estimate of how many distinct values this
  attribute has relative to the number of rows in the whole dimension table.
• Sample data. Sample values that this attribute will contain. These are
  particularly useful for understanding what the attribute really holds.
• Slowly changing policy. One of Type 0 (the value is never updated), Type 1
  (the value is overwritten), Type 2 (a new record is created when a change
  is detected), or Type 3 (an old and a new version of the attribute are
  both maintained in the dimension record).
Figure 7.5 Dimension attribute detail descriptions.
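The slowly changing policies can be illustrated with a short sketch. It assumes a dimension held as a list of dictionaries with a business key in 'key' and a 'current' flag; this layout, the function apply_change and the sample values are hypothetical, and surrogate keys and effective dates are left out for brevity.

```python
from copy import deepcopy


def apply_change(rows, key, attribute, new_value, policy):
    """Apply an attribute change to a dimension held as a list of dicts.

    Hypothetical row layout: each row has 'key' (business key), 'current'
    (bool), the attribute itself, and for Type 3 an extra
    'previous_<attribute>' entry.
    """
    rows = deepcopy(rows)  # leave the caller's rows untouched
    current = next(r for r in rows if r["key"] == key and r["current"])
    if policy == 0:        # Type 0: the value is never updated
        return rows
    if policy == 1:        # Type 1: overwrite in place
        current[attribute] = new_value
    elif policy == 2:      # Type 2: create a new record, retire the old one
        new_row = dict(current, **{attribute: new_value})
        current["current"] = False
        rows.append(new_row)
    elif policy == 3:      # Type 3: keep the old and the new value
        current[f"previous_{attribute}"] = current[attribute]
        current[attribute] = new_value
    else:
        raise ValueError(f"unknown policy {policy}")
    return rows


dimension = [{"key": 7, "current": True, "city": "Ghent"}]
for policy in (0, 1, 2, 3):
    print(policy, apply_change(dimension, 7, "city", "Antwerp", policy))
```

Only Type 2 changes the number of rows, which is why the policy chosen for each attribute affects the size estimate of the dimension table.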
Many-to-many relationships, slowly changing dimensions, and artificial
attributes.
Advanced dimensional modelling concepts can also be reflected in the
dimension table detail diagram. The specific situations that should be
reflected in the model include many-to-many relationships, slowly changing
dimensions, and artificial attributes.
See Figure 7.6 for an example of Many-to-many and slowly changing
dimension attributes.
Figure 7.7 shows an example of an artificial attribute: a new attribute that
does not exist in the business today, but is created when we need to combine
similar dimensions into a single dimension.
Figure 7.6 Many-to-many and slowly changing dimension attributes.
Figure 7.7 A dimension with a correlated attribute.
2.5 Steps for the Dimensional Modelling Team
• Create the initial draft (data marts, dimensions, data matrix and diagrams)
• Track base facts
• Track derived facts (see Figure 7.8)
• Get IS team input (present initial design to rest of IS team)
• Work with core business users (select some key users to work on project)
• Present to business users (confirm the design with a broader set of users)
Figure 7.8 Derived fact worksheet.
2.6 Identifying the Sources for Each Fact Table and Dimension Table
While it may seem very basic to pull together a list of candidate sources with
descriptive information about each, such a list does not exist in a single place
for most organizations today. Each individual source may or may not have a
general description, and there is rarely a consolidated list of all of them.
Figure 7.9 shows a sample data source definition.
In your data source definitions, make sure that you include the following
information:
• Source. The name of the source system.
• Business owner. The name of the primary contact within the business
  who is responsible for this data.
• IS owner. The name of the person who is responsible for this source
  system.
• Platform. The operating environment where this system runs.
• Location. The actual location of the system. The name of the city and the
  specific machine where this system runs. For distributed applications,
  include the number of locations, too.
• Description. A brief description of what this system does.
Figure 7.9 Data Source definitions.
Source Data Ownership
Establishing responsibility for data quality and integrity can be extremely
difficult in the data warehousing environment. Operational system owners are
not measured on how accurate or complete the fields are. Fields like end user
name or customer SIC code are not usually validated, if they get filled in at all.
Data Providers
Operational system owners need to feel a responsibility to provide data to the
warehouse as a regular part of their operational processes.
Data providers include external-source data vendors, external sources from
other divisions of the organization, as well as people who are responsible for
a system and need to provide an extract file to the data warehouse team.
Detailed criteria for Selecting the Data Sources
Successful projects generally begin with one or two primary data sources.
Additional sources may be used for dimension construction, but the facts are
usually retrieved only from a single source system.
A decision must be made about which source the data will be extracted from.
Some criteria to consider are data accessibility, data accuracy and project
scheduling.
Browsing the Data Content
To better understand the actual data content, a study can be performed
against current source system data. The browsing process helps you
understand the contents of the source files and get a sense for the cleanliness
of the data. Essentially, the process involves building a set of queries that
count the number of rows with each value of a given attribute, such as gender,
size, or status, or that bring back all the distinct values of an attribute,
such as customer name, sorted so that duplicates or similar spellings for the
same customer stand out.
If the source systems are already in relational tables, it may be possible to
point the ad hoc query tool at the source and query away.
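The two kinds of browsing query described above might look like the following sketch, which uses Python's built-in sqlite3 module in place of an ad hoc query tool. The customer table, its columns and the sample rows are hypothetical and exist only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical source table and sample rows, for illustration only.
conn.execute("CREATE TABLE customer (customer_name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [
        ("Acme Corp", "active"),
        ("ACME Corp.", "active"),
        ("Baker Ltd", "closed"),
        ("Baker Ltd", None),
    ],
)

# Count the number of rows for each value of an attribute (NULLs included).
status_counts = conn.execute(
    "SELECT status, COUNT(*) FROM customer GROUP BY status ORDER BY COUNT(*) DESC"
).fetchall()

# Bring back the distinct values, sorted so similar spellings sit together.
distinct_names = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT customer_name FROM customer "
        "ORDER BY customer_name COLLATE NOCASE"
    )
]

print(status_counts)
print(distinct_names)
```

The count per value exposes missing or unexpected codes, while the sorted distinct list places near-duplicate spellings next to each other for review.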
Mapping Data from Source to Target
The source-to-target data map is the foundation for the development of the
data staging process. The map documents specifically where the source data can
be found. Figure 7.10 shows a sample source-to-target data map. This is
just one sample of what could be used to collect the information that is known
at this time. The columns that are included are:
• Table name. The name of the logical table in the data warehouse.
• Column name. The name of the logical column in the data warehouse.
• Data type. The data type of the logical column in the data warehouse
  (char, number, date).
• Length. The length of the field of the logical column.
• Target column description. A description of the logical column.
• Source system. The name of the source system where data feeds the
  target logical column. There may be more than one source system for a
  single logical column.
• Source table/file. The name of the specific table or file where data feeds
  the target logical column.
• Source column/field. The name of the specific column or field where data
  feeds the target logical column.
• Data transform. Notes about any transformations that are required to
  translate the source information into the format required by the target
  column. This is not intended to be a complete list of the specific business
  rules but a place to note any anomalies or issues that are known at this
  time.
• Dimension/data mart. The name of the dimension or data mart that this
  column represents.
• Attribute/fact. The name of the specific attribute within a dimension table
  or fact table.
Figure 7.10 Sample source-to-target data map.
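The columns listed above can be captured in a small record type so that the map can be checked as well as read. This is a sketch under assumptions: the class MapEntry, its field names, the helper unmapped and the two sample entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    """One row of a source-to-target data map (field names are assumptions)."""
    table_name: str
    column_name: str
    data_type: str
    length: int
    target_description: str
    source_system: str
    source_table: str
    source_column: str
    data_transform: str = ""
    dimension_or_mart: str = ""
    attribute_or_fact: str = ""


def unmapped(entries):
    """Return target columns that still lack a complete source reference."""
    return [
        f"{e.table_name}.{e.column_name}"
        for e in entries
        if not (e.source_system and e.source_table and e.source_column)
    ]


# Hypothetical entries for illustration only.
data_map = [
    MapEntry("customer_dim", "customer_name", "char", 60, "Customer's full name",
             "billing", "CUST_MASTER", "CUST_NM", "trim trailing blanks",
             "Customer", "Customer name"),
    MapEntry("customer_dim", "sic_code", "char", 4, "Customer SIC code",
             "", "", ""),
]
print(unmapped(data_map))
```

A check like this shows at a glance which target columns still have no identified source before the data staging process is developed.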
2.7 Using a Data Modelling Tool
You should use a data modelling tool to develop the physical data model,
preferably one that stores your model’s structure in a relational database.
As the data staging tools mature, information will flow more naturally from the
popular data modelling tools, through the transformation engine, and into the
metadata that users will access to learn about the data in the warehouse.
Retaining the relationship between the logical table design and the business
dimensional model is important to ensure the final data does indeed tie back
to the original business requirements.
Summary
In these chapters we saw the process used to apply dimensional modelling
techniques to a project. We introduced the Bus Architecture matrix to lay out
the data marts and dimensions. We reviewed the four-step method to design
a single data mart. We introduced diagramming techniques to use during the
modelling process. We also reviewed data sourcing and mapping.
Some supporting templates are collected on the CD-ROM that accompanies the
book. The available templates are:
Template 7.1 Data Mart Matrix
Template 7.2 Dimensional Model Document
Template 7.3 Derived Fact Worksheet
Template 7.4 Logical table design
Template 7.5 Data Source Definition Document
Template 7.6 Source to Target Data Map
In this section we have focused on the data. In the next part, we step back
and follow a different path of the lifecycle: the technical architecture.