Building Dimensional Models

1. Matrix Method for Getting Started
   1.1. Build the Matrix
   1.2. Use the 4-step method to design Fact Tables
2. Managing the Dimensional Modelling Project
   2.1. Data Warehouse Bus Architecture Matrix
   2.2. Fact Table Diagram
   2.3. Fact Table Detail
   2.4. Dimension Table Detail
   2.5. Steps for the Dimensional Modelling Team
   2.6. Identifying the Sources for each Fact Table and Dimension Table
   2.7. Using a Data Modelling Tool

1. Matrix Method for Getting Started

To decide which dimensional model to build, we begin with a top-down approach we call the Data Warehouse Bus Architecture matrix. The matrix forces us to name all the data marts we could possibly build and to name all the dimensions implied by those data marts. Once we have identified all the possible data marts and dimensions, we can get very specific about the design of individual fact tables within these marts. For each fact table we apply the four-step method described earlier.

1.1 Build the Matrix

The rows of the Data Warehouse Bus Architecture matrix are the data marts and the columns are the dimensions.

List the Data Marts (rows in the matrix)
To start building the matrix, we list a series of single-source data marts. A single-source data mart is deliberately chosen to revolve around a single kind of data source. In a second step, we propose multiple-source data marts that combine the single-source designs into broader views of the business.
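The matrix itself can be sketched as a small data structure. The following is a minimal illustration, not a prescribed format: the data mart and dimension names are borrowed from the telephone company example in this section, while the cells marked for "trouble reports" and "marketing promotions" and the helper names (dimension_columns, shared_dimensions, render) are assumptions made for the sketch.

```python
# Rows are data marts, columns are dimensions. Each data mart maps to the set
# of dimensions marked for it. The marks for the second and third rows are
# assumptions for illustration; Figure 7.1 holds the real matrix.
bus_matrix = {
    "customer billing statements": {
        "Time", "Customer", "Service", "Rate category", "Local service provider",
    },
    "trouble reports": {"Time", "Customer", "Service"},
    "marketing promotions": {"Time", "Customer", "Rate category"},
}


def dimension_columns(matrix):
    """Return every dimension named by any data mart, in sorted order."""
    return sorted(set().union(*matrix.values()))


def shared_dimensions(matrix):
    """Return the dimensions that are marked for more than one data mart."""
    return [
        dim for dim in dimension_columns(matrix)
        if sum(dim in dims for dims in matrix.values()) > 1
    ]


def render(matrix):
    """Format one text line per data mart, with X at each marked intersection."""
    columns = dimension_columns(matrix)
    lines = []
    for mart, dims in matrix.items():
        marks = ", ".join(f"{col}={'X' if col in dims else '-'}" for col in columns)
        lines.append(f"{mart}: {marks}")
    return "\n".join(lines)


print(render(bus_matrix))
print("Shared:", shared_dimensions(bus_matrix))
```

Dimensions that appear in more than one row are the ones that most need a single agreed definition across data marts.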
Example: Data marts for a large telephone company

Possible candidate set of single-source data marts for the telephone company:
  - customer billing statements
  - scheduled service and installation orders
  - trouble reports
  - yellow page advertising orders
  - customer service and billing inquiries
  - marketing promotions
  - call detail from the billing perspective
  - customer inventory
  - network inventory (switches, lines, computers)
  - real estate inventory (poles, rights of way, buildings,…)
  - labor and payroll
  - computer system job processing
  - purchase orders to suppliers
  - deliveries from suppliers

Multiple-source data marts for the telephone company might include:
  - combined field operations tracking (service orders, trouble reports, customer/network inventory)
  - customer relationship management (customer billing, customer inquiries, promotion tracking)
  - customer profitability

List the Dimensions (columns in the matrix)
List the conceivable dimensions for each data mart we found. For example, in the data mart ‘customer billing statements’ we might list the following dimensions:
  - Time (date of billing)
  - Customer
  - Service
  - Rate category (including promotion)
  - Local service provider
  - …

Mark the Intersections
With the rows (data marts) and columns (dimensions) of our matrix defined, we systematically mark all the intersections where a dimension exists for a data mart. For our telephone company example, we might end up with a matrix that looks like Figure 7.1.

Figure 7.1 The Data Warehouse Bus Architecture matrix for a telephone company

1.2 Use the Four-step Method to Design Each Fact Table

Start the detailed logical and physical design of individual tables. We discussed the four-step method in detail earlier, but we will review the highlights here.

Step 1. Choose the Data Mart
We look down the row headings of our matrix and choose one of the data marts. The first fact table in the design should come from a single-source data mart.

Step 2.
Declare the Grain
Declaring the grain means stating exactly what an individual fact table record represents. If an individual fact table record represents the daily sales total in a retail store, then that is the grain. If an individual fact table record is a line item on an order, then that is the grain. If an individual fact table record is a customer transaction, then that is the grain.

Step 3. Choose the Dimensions
Choose the dimensions for the particular fact table. For example, ‘daily inventory levels of individual stock items in a distribution center’ specifies the time dimension, the stock item dimension and the location dimension.

Note that billing month is not the same as calendar month, and separate fact tables using these two interpretations of time must label them as distinct dimensions. This is one of the lessons of conformed dimensions. When things are the same, they should be exactly the same and have the same names. When things are different, they must have different names.

Once a dimension is chosen, there may be a large number of descriptive attributes that can be used to populate the dimension. These descriptive attributes may come from several sources. At this point it is helpful to make a list of all the known descriptive attributes available to describe an item in the dimension (a product, a customer, a service, a location, or a day).

Step 4. Choose the Facts
Add as many facts as possible within the context of the declared grain.

In the example of the telephone company:
  1. Data Mart: Customer billing
  2. Grain: the individual line item on each monthly customer bill
  3. Dimensions: a time dimension, a customer dimension, a service (line item) dimension, and perhaps a rate or promotion dimension
  4. Facts: Line item amount, Line item quantity

2.
Managing the Dimensional Modelling Project

Four graphical tools to facilitate the project:
  - Data Warehouse Bus Architecture matrix
  - Fact table diagram
  - Fact table detail
  - Dimension table detail

2.1 Data Warehouse Bus Architecture Matrix

The matrix (Figure 7.1) developed by the design team can be used as a presentation aid for meetings with other designers, administrators and end users.

2.2 Fact Table Diagram

Prepare a logical diagram for each completed fact table. The fact table diagram not only shows the specifics of a given fact table but also the context of the fact table in the overall data mart. The fact table diagram names the fact table, clearly states its grain, and shows all the dimensions to which it is connected. It also shows all the other dimensions that have been identified for the business, without connections. The fact table diagram for the telephone billing line item example is shown in Figure 7.2.

To aid in the understanding of the model, it is important to retain consistency in the fact table diagrams. Think about the order in which you place the dimensions around the hub. We recommend that you put the time dimension at the top and work your way around the fact table in order of importance of the dimension.

Figure 7.2 The telephone billing fact table diagram. Disconnected dimensions are shown on both sides of the diagram.

The supporting information for the fact table diagram includes the name of each dimension and a description of that dimension. Keep your descriptions and diagrams together.

  Dimension Name    Dimension Description

2.3 Fact Table Detail

The fact table detail provides a complete list of all the facts available through the fact table. See Figure 7.3. This list includes actual facts in the physical table, derived facts presented through DBMS views, and other facts that are possible to calculate from the first two groups.
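The three groups of facts can be illustrated with a short sketch using Python's built-in sqlite3 module. It assumes the telephone billing line item grain described earlier; the table, view, and column names, the sample rows, and the derived fact amount_per_unit are all hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Group 1: actual facts stored in the physical table (hypothetical names).
conn.execute(
    """
    CREATE TABLE billing_line_item_fact (
        time_key INTEGER,
        customer_key INTEGER,
        service_key INTEGER,
        line_item_amount REAL,
        line_item_quantity INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO billing_line_item_fact VALUES (?, ?, ?, ?, ?)",
    [(1, 10, 100, 30.0, 3), (1, 11, 100, 20.0, 1)],
)

# Group 2: a derived fact presented through a DBMS view.
conn.execute(
    """
    CREATE VIEW billing_line_item_detail AS
    SELECT *, line_item_amount / line_item_quantity AS amount_per_unit
    FROM billing_line_item_fact
    """
)
per_row = conn.execute(
    "SELECT customer_key, amount_per_unit "
    "FROM billing_line_item_detail ORDER BY customer_key"
).fetchall()

# Group 3: a fact calculated at query time from the first two groups.
# The overall ratio is computed from the summed facts, not by averaging
# the per-row ratios.
(overall,) = conn.execute(
    "SELECT SUM(line_item_amount) / SUM(line_item_quantity) "
    "FROM billing_line_item_detail"
).fetchone()

print(per_row)
print(overall)
```

Listing a derived fact in the fact table detail alongside the view or expression that defines it keeps the definition in one place for reviewers.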
Aggregation rules should be provided with each fact to warn the reviewer that some facts may be semiadditive or nonadditive across certain dimensions. For instance, facts such as temperatures are completely nonadditive across all dimensions, but they can be averaged.

Figure 7.3 Fact table detail diagram showing dimension keys, basic facts, and derived facts (with asterisks).

2.4 Dimension Table Detail

The dimension table detail diagram (see Figure 7.4) shows the individual attributes within a single dimension. Each dimension will have a separate diagram. The diagram shows the explicit grain of each dimension. The dimension table detail shows the approximate cardinality of each dimension attribute and allows the users to quickly see the multiple hierarchies and relationships between the attributes.

Figure 7.4 Dimension table detail diagram (relative cardinalities shown in parentheses)

Again, full descriptive information must be provided to support the diagram. Figure 7.5 shows a sample of what should be documented for each dimension attribute.
  - Attribute name. The official business attribute name.
  - Attribute definition. A brief but meaningful description of the business attribute.
  - Cardinality. The best intuitive estimate of how many distinct values this attribute has relative to the number of rows in the whole dimension table.
  - Sample data. Sample values that this attribute will contain. These are particularly useful for understanding what the attribute really holds.
  - Slowly changing policy. One of Type 0 (value is never updated), Type 1 (overwritten), Type 2 (new record created when a change is detected), or Type 3 (an old and a new version of this attribute are continuously maintained in the dimension record).

Figure 7.5 Dimension attribute detail descriptions.

Many-to-many relationships, slowly changing dimensions, and artificial attributes
Advanced dimensional modelling concepts can also be reflected in the dimension table detail diagram.
The specific situations that should be reflected in the model include many-to-many relationships, slowly changing dimensions, and artificial attributes. See Figure 7.6 for an example of many-to-many and slowly changing dimension attributes. Figure 7.7 shows an example of an artificial attribute: a new attribute that does not exist in the business today but is created when we need to combine similar dimensions into a single dimension.

Figure 7.6 Many-to-many and slowly changing dimension attributes.

Figure 7.7 A dimension with a correlated attribute.

2.5 Steps for the Dimensional Modelling Team

  - Create the initial draft (data marts, dimensions, data matrix and diagrams)
  - Track base facts
  - Track derived facts (see Figure 7.8)
  - Get IS team input (present the initial design to the rest of the IS team)
  - Work with core business users (select some key users to work on the project)
  - Present to business users (confirm the design with a broader set of users)

Figure 7.8 Derived fact worksheet.

2.6 Identifying the Sources for Each Fact Table and Dimension Table

While it may seem very basic to pull together a list of candidate sources with descriptive information about each, such a list does not exist in a single place in most organizations today. Each individual source may or may not have a general description, and there is rarely a consolidated list of all of them. Figure 7.9 shows a sample data source definition. In your data source definitions, make sure that you include the following information:
  - Source. The name of the source system.
  - Business owner. The name of the primary contact within the business who is responsible for this data.
  - IS owner. The name of the person who is responsible for this source system.
  - Platform. The operating environment where this system runs.
  - Location. The actual location of the system: the name of the city and the specific machine where this system runs. For distributed applications, include the number of locations, too.
  - Description.
    A brief description of what this system does.

Figure 7.9 Data Source definitions.

Source Data Ownership
Establishing responsibility for data quality and integrity can be extremely difficult in the data warehousing environment. Operational system owners are not measured on how accurate or complete the fields are. Fields like end user name or customer SIC code are not usually validated, if they get filled in at all.

Data Providers
Operational system owners need to feel a responsibility to provide data to the warehouse as a regular part of their operational processes. Data providers include external-source data vendors, external sources from other divisions of the organization, as well as people who are responsible for a system and need to provide an extract file to the data warehouse team.

Detailed Criteria for Selecting the Data Sources
Successful projects generally begin with one or two primary data sources. Additional sources may be used for dimension construction, but the facts are usually retrieved only from a single source system. A decision must be made about where the data will be extracted from. Some criteria to consider are data accessibility, data accuracy and project scheduling.

Browsing the Data Content
To better understand the actual data content, a study can be performed against current source system data. The browsing process helps you understand the contents of the source files and get a sense of the cleanliness of the data. Essentially, the process involves building a set of queries that count the number of rows with each value of a given attribute, like gender, size, or status, or that bring back all the distinct values of an attribute, such as customer name, sorted so that duplicates or similar spellings for the same customer stand out. If the source systems are already in relational tables, it may be possible to point the ad hoc query tool at the source and query away.
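The two kinds of browsing query described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions: the customer table, its columns, and the sample rows are invented, and the normalization used to spot similar spellings (lowercasing and dropping punctuation and spaces) is just one simple choice.

```python
import sqlite3
from collections import defaultdict

# Hypothetical source table and sample rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [
        ("Acme Corp", "active"),
        ("ACME Corp.", "active"),
        ("Baker Ltd", "closed"),
        ("Baker Ltd", None),
        ("Chen & Sons", "actve"),
    ],
)


def value_counts(conn, table, column):
    """Count the rows for each value of one attribute, most frequent first."""
    # Identifiers cannot be bound as query parameters, so pass trusted names only.
    sql = (
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"GROUP BY {column} ORDER BY COUNT(*) DESC, {column}"
    )
    return conn.execute(sql).fetchall()


def similar_spellings(conn, table, column):
    """Group distinct values that match after lowercasing and dropping punctuation."""
    sql = (
        f"SELECT DISTINCT {column} FROM {table} "
        f"WHERE {column} IS NOT NULL ORDER BY {column}"
    )
    groups = defaultdict(list)
    for (value,) in conn.execute(sql):
        key = "".join(ch for ch in value.lower() if ch.isalnum())
        groups[key].append(value)
    return [values for values in groups.values() if len(values) > 1]


status_counts = value_counts(conn, "customer", "status")
name_variants = similar_spellings(conn, "customer", "name")
print(status_counts)
print(name_variants)
```

In this sample the status counts surface a missing value and a misspelled code, and the name grouping flags two spellings of the same customer, which is the kind of cleanliness finding the browsing step is meant to produce.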
Mapping Data from Source to Target
The source-to-target data map is the foundation for the development of the data staging process. The map documents specifically where each piece of data can be found. Figure 7.10 shows a sample source-to-target data map. This is just one sample of what could be used to collect the information that is known at this time. The columns that are included are:
  - Table name. The name of the logical table in the data warehouse.
  - Column name. The name of the logical column in the data warehouse.
  - Data type. The data type of the logical column in the data warehouse (char, number, date).
  - Length. The length of the field of the logical column.
  - Target column description. A description of the logical column.
  - Source system. The name of the source system where data feeds the target logical column. There may be more than one source system for a single logical column.
  - Source table/file. The name of the specific table or file where data feeds the target logical column.
  - Source column/field. The name of the specific column or field where data feeds the target logical column.
  - Data transform. Notes about any transformations that are required to translate the source information into the format required by the target column. This is not intended to be a complete list of the specific business rules but a place to note any anomalies or issues that are known at this time.
  - Dimension/data mart. The name of the dimension or data mart that this column represents.
  - Attribute/fact. The name of the specific attribute within a dimension table or fact table.

Figure 7.10 Sample source-to-target data map.

2.7 Using a Data Modelling Tool

You should use a data modelling tool to develop the physical data model, preferably one that stores your model’s structure in a relational database.
As the data staging tools mature, information will flow more naturally from the popular data modelling tools, through the transformation engine, and into the metadata that users will access to learn about the data in the warehouse. Retaining the relationship between the logical table design and the business dimensional model is important to ensure that the final data does indeed tie back to the original business requirements.

Summary

In these chapters we saw the process used to apply dimensional modelling techniques to a project. We introduced the Bus Architecture matrix to lay out the data marts and dimensions. We reviewed the four-step method to design a single data mart. We introduced diagramming techniques to use during the modelling process. We also reviewed data sourcing and mapping.

Some supporting templates are collected on the CD-ROM that accompanies the book. The available templates are:
  - Template 7.1 Data Mart Matrix
  - Template 7.2 Dimensional Model Document
  - Template 7.3 Derived Fact Worksheet
  - Template 7.4 Logical table design
  - Template 7.5 Data Source Definition Document
  - Template 7.6 Source to Target Data Map

In this section we have focused on the data. In the next part, we step back and follow a different path of the lifecycle: the technical architecture.