Data Warehouse Design
by Duong Tuan Anh
Faculty of Computer Science and Engineering, HCM City University of Technology. Sept. 2011

Outline
- Requirement Specification
- Conceptual Design
- Logical Design
- Conclusions
- Case study
(M. Golfarelli, 1998)

1. Requirement Specification

This phase consists of collecting and filtering the user requirements. It involves the designer and the end users of the DW, and produces specifications concerning:
- the choice of facts
- preliminary indications of the workload

The choice of facts is based on the documentation of the operational information system. Facts are concepts of main interest for the decision-making process, and correspond to events occurring in the enterprise world.

If the operational information system is documented by ER schemes, a fact can be represented by an entity or an n-ary relationship. If it is documented by relational schemes, facts correspond to relation schemes. In general, entities or relationships representing frequently updated data are good candidates for defining facts.

The preliminary workload is expressed in pseudo-natural language and is aimed at enabling the designer to identify dimensions and measures during conceptual design. For each fact, it should specify the most interesting measures and aggregations.

2. Conceptual Design of Data Warehouse

The conceptual design of a DW produces a dimensional scheme, structured according to the Dimensional Fact Model (DF model). A dimensional scheme consists of a set of fact schemes. The basic components of a fact scheme are facts, dimensions and hierarchies.
- A fact is a focus of interest for the enterprise.
- A dimension determines a point of view adopted for representing facts.
- A hierarchy determines how fact instances may be aggregated and selected significantly for the decision-making process.

A fact scheme

A fact scheme is structured as a tree whose root is a fact.
The fact is represented by a box that reports the fact name and, typically, one or more numeric attributes which "measure" the fact from different points of view. Each vertex directly attached to the fact is a dimension. Figure 1 reports a fact scheme for fact SALE in a store chain; quantity sold and returns are fact attributes.

Fig 1. A simple fact scheme for a chain of stores

Hierarchies

Subtrees rooted in dimensions are hierarchies. Their vertices, represented by circles, are attributes which may assume a discrete set of values; their arcs represent x-to-one relationships between pairs of attributes.

[Figure: example hierarchy over the attributes store, city, state and sales manager]

Hierarchies (cont.)

The dimension in which a hierarchy is rooted defines its finest aggregation granularity; the attributes in the vertices along each sub-path of the hierarchy starting from the dimension define progressively coarser granularities.

The fact scheme in Figure 1 has three dimensions: week, product and store.

Some terminal vertices in the fact scheme may be represented by lines instead of circles (size and address in Figure 1); these vertices are non-dimension attributes.

A non-dimension attribute

A non-dimension attribute contains additional information about an attribute in the hierarchy. A non-dimension attribute cannot be used for aggregation.

EX: address is a non-dimension attribute.

Some arcs within the hierarchies may be marked by a dash: these arcs express optional relationships between pairs of attributes.

A fact instance

A fact expresses a many-to-many relationship among dimensions. Each combination of values of the dimensions defines a fact instance; fact instances are the elemental information represented within the DW.

EX: a fact instance describes “the quantity of one product sold during one week in a store, and the corresponding total returns”.

A fact scheme may also have no fact attributes.
[Factless fact table]

Additivity

Querying a DW is aimed at extracting summary data to fill reports that are analysed for decision-making purposes. It is therefore useful to aggregate fact instances into clusters at different levels of abstraction (roll-up).

Most fact attributes should be additive. This means that the sum operator can be used to aggregate attribute values along all hierarchies.

EX: the number of sales for a given sales manager is the sum of the number of sales for all stores managed by that sales manager.

A fact attribute is called semi-additive if it is not additive along one or more dimensions, and non-additive if it is additive along no dimension.

Overlapping compatible fact schemes

Different facts are represented in different fact schemes. Queries formulated on the DW may require comparing fact attributes taken from distinct but related schemes.

Two fact schemes are said to be compatible if they share at least one dimension attribute. Two compatible schemes F and G may be overlapped to create a resulting scheme H. In the simplest case:
- The set of fact attributes in H is the union of the sets in F and G.
- The dimensions in H are the intersection of those in F and G, assuming that a given dimension is common to F and G if at least one dimension attribute is shared.
- Each hierarchy in H includes all and only the dimension attributes included in the corresponding hierarchies of both F and G.

Figure 2: Fact scheme overlapping

The two schemes share the time, job and store dimensions. The scheme resulting from overlapping the two schemes is given in Figure 3.

Figure 3: The scheme resulting from the two overlapping schemes

Representing query pattern on a fact scheme

The basic OLAP operators for formulating typical queries on DWs are roll-up, drill-down, drill-across and slice-and-dice.
Roll-up: aggregate fact attributes to view data at a higher level of abstraction.
Drill-down: disaggregate fact attributes in order to introduce further details.
Drill-across: relate and compare distinct facts.
Slice-and-dice: select and project facts so as to reduce their dimensionality.

Query pattern

On a fact scheme, a query may be represented by a query pattern, which consists of a set of markers placed on the dimension attributes. One or more markers can be placed within each hierarchy, to indicate at what level(s) fact instances must be aggregated. A dimension may contain no markers, to indicate that none of its attributes is involved in the query.

Figure 4 shows the query pattern for the query: “total quantity sold and average returns per unit sold for each week and for each type of product”. The average returns per unit sold is the ratio between the total returns and the quantity sold.

[Figure 4: query pattern]

From ER schemes to fact schemes

It is natural to derive the conceptual model of a DW from the existing ER schemes. The methodology to build a DF model consists of the following steps:
1. Classifying entities
2. Defining facts
3. For each fact:
   - Building the attribute tree
   - Pruning and grafting the attribute tree
   - Defining dimensions
   - Defining fact attributes
   - Defining hierarchies

The ER scheme for the sale fact scheme

Figure 5. The (simplified) ER scheme for the sale fact scheme

Note: Each instance of relationship SALE represents an item referring to a single product within a purchase ticket.

Classifying Entities (Moody & Kortink, 2000)

The first step in producing fact schemes from an ER model is to classify the entities into three categories: transaction entities, component entities and classification entities.

Transaction entities

Transaction entities record details about particular events that occur in the business. These are the events that decision makers want to understand and analyse.

EX: orders, insurance claims, salary payments and hotel bookings.

Key characteristics of a transaction entity:
- It describes an event that happens at a point of time.
- It contains measurements or quantities that may be summarized.
EX: an insurance claim records a particular business event and the amount claimed. A sale records the event of a product being sold in an order, together with the amount and the number of units sold.

Component entities

A component entity is one which is directly related to a transaction entity via a one-to-many relationship. Component entities define the details or “components” of each business transaction. Component entities answer the “who”, “what”, “when”, “where”, “how” and “why” of a business event.

EX: a sales transaction may be defined by a number of components:
- Customer: who made the purchase.
- Product: what was sold.
- Location: where it was sold.
- Period: when it was sold.

An important component of any transaction is time, since historical analysis is an important part of any data warehouse.

Component entities form the basis for constructing dimensions in fact schemes.

Classification Entities

Classification entities are entities which are related to component entities by a chain of one-to-many relationships. They are functionally dependent on a component entity (directly or transitively). Classification entities represent hierarchies embedded in the fact schemes, which may be collapsed into component entities to form dimensions in the fact schemes.

EX: store, city, sales-manager, type, category

Defining facts

Facts are concepts of primary interest for the decision-making process. They correspond to events occurring dynamically in the enterprise world.

A fact is represented on the ER scheme either by an entity type F or by an n-ary relationship between entity types E1,…,En. In the latter case, for the sake of simplicity, we should transform the relationship into an associate entity. The attributes of the relationship become attributes of the associate entity, and the identifier of the entity is the combination of the identifiers of the participating entities. Figure 6 shows how relationship sale is transformed into an entity.

Facts come from transaction entities.
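The three-way classification of entities described above can be sketched in Python. This is only an illustration under stated assumptions: the function name classify_entities, the dictionary encoding of many-to-one relationships and the entity names (loosely modelled on the sale example) are hypothetical and are not part of the cited methodology. The transaction entities are assumed to be chosen by the designer and passed in as input.

```python
from collections import deque


def classify_entities(transaction_entities, many_to_one):
    """Return (component_entities, classification_entities).

    transaction_entities: set of entity names picked by the designer.
    many_to_one: dict mapping an entity to the entities on the "one" side
                 of its many-to-one relationships (hypothetical encoding).
    """
    transactions = set(transaction_entities)

    # Component entities: directly related to a transaction entity
    # via a one-to-many relationship.
    components = set()
    for t in transactions:
        components.update(many_to_one.get(t, ()))
    components -= transactions

    # Classification entities: reachable from a component entity
    # through a chain of one-to-many relationships.
    classifications = set()
    queue = deque(components)
    while queue:
        entity = queue.popleft()
        for parent in many_to_one.get(entity, ()):
            seen = transactions | components | classifications
            if parent not in seen:
                classifications.add(parent)
                queue.append(parent)
    return components, classifications


# Hypothetical ER fragment; each edge points from the "many" side to the "one" side.
er_edges = {
    "SALE": ["PURCHASE_TICKET", "PRODUCT"],
    "PURCHASE_TICKET": ["STORE"],
    "STORE": ["CITY"],
    "CITY": ["STATE"],
    "PRODUCT": ["TYPE"],
    "TYPE": ["CATEGORY"],
}

components, classifications = classify_entities({"SALE"}, er_edges)
print(sorted(components))
print(sorted(classifications))
```

In this encoding the component entities are candidates for dimensions, and the classification entities are candidates for the levels of the hierarchies rooted in them.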
Figure 6: Transformation of relationship sale into an associate entity

Each fact identified in the ER scheme becomes the root of a different fact scheme. In the sale example, the fact of main interest is the sale of a product, represented in the ER scheme by relationship sale.

Building the attribute tree

Given a portion of interest of an ER scheme and an entity F belonging to it, we call attribute tree the tree such that:
- Each vertex corresponds to an attribute of the scheme;
- The root corresponds to the identifier of F;
- For each vertex v, the corresponding attribute functionally determines all the attributes corresponding to the descendants of v.

The attribute tree will be used to build the fact scheme for the fact corresponding to F.

Let F be the entity chosen to represent a fact; the attribute tree may be constructed automatically by the following procedure:

translate(F, identifier(F))

procedure translate(E, v):    // E is the current entity, v is the current vertex
{
    for each attribute a ∈ E | a ≠ identifier(E) do
        addchild(v, a);       // add child a to vertex v
    for each entity G connected to E by an x-to-one relationship R do
    {
        for each attribute b ∈ R do
            addchild(v, b);
        addchild(v, identifier(G));
        translate(G, identifier(G));
    }
}

Note: If F is identified by a combination of two or more attributes, identifier(F) denotes their concatenation.

Notes on attribute tree

- The arcs corresponding to optional attributes or optional relationships of the ER scheme should be marked by a dash.
- A one-to-one relationship can be inserted into the attribute tree. However, it is often worth grafting the attributes that follow one-to-one relationships out of the attribute tree, or representing them as non-dimension attributes.
- Generalization hierarchies in the ER scheme are equivalent to one-to-one relationships.
- X-to-many relationships cannot be inserted into the attribute tree.
- Notice that an n-ary relationship is equivalent to n binary relationships.

Attribute tree

Figure 7.
Attribute tree for the sale example

Pruning the attribute tree

Not all the attributes represented in the attribute tree are interesting for the DW. Thus, the attribute tree may be pruned and grafted in order to eliminate unnecessary levels of detail.

Pruning is carried out by dropping any subtree from the tree. The attributes dropped will not be included in the fact scheme.

EX: in the sale example, the subtree including city and state may be dropped.

Grafting the attribute tree

Grafting is used when a vertex of the tree carries uninteresting information but its descendants must be preserved.

EX: one may want to classify products directly by category, without considering the information on their type.

Let v be the vertex to be eliminated, and v’ its parent:

graft(v):
{
    for each v’’ | v’’ is a child of v do
        addchild(v’, v’’);
    drop v;
}

Thus, grafting is carried out by moving the subtrees rooted in the children of v to v’. Attribute v will not be included in the fact scheme and the corresponding aggregation level will be lost; all the descendant levels will be maintained.

Figure 8: Attribute tree for the sale example after grafting vertex ticket number

In the sale example, the attribute ticket number is uninteresting, so the vertex ticket number is grafted and the attribute tree is transformed into the one shown in Figure 8.

Defining dimensions

Dimensions determine how fact instances may be aggregated for the decision-making process. Dimensions come from component entities.

The dimensions must be chosen in the attribute tree among the children vertices of the root. The choice of dimensions is crucial for the DW design since it determines the granularity of fact instances.

Time is an important dimension for DWs. ER schemes can be classified into snapshot and temporal.
- A snapshot scheme describes the current state of the application.
- A temporal scheme describes the evolution of the application over a range of times; old versions of data are explicitly represented and stored.
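The translate procedure and the pruning and grafting operations on the attribute tree can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the Vertex class, the dictionary encoding of the ER scheme (keys id, attributes, to_one) and the simplified sale scheme are assumptions made for illustration, and the ER scheme is assumed to contain no cycles of x-to-one relationships, otherwise the recursion would not terminate.

```python
class Vertex:
    """A vertex of the attribute tree; `name` is an attribute name."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def find(self, name):
        """Depth-first search for a vertex by name; None if absent."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


def translate(er, entity, vertex):
    """Attach to `vertex` the attributes functionally determined by `entity`."""
    spec = er[entity]
    for a in spec["attributes"]:
        if a != spec["id"]:              # skip the identifier of the current entity
            vertex.add_child(Vertex(a))
    for rel_attributes, target in spec["to_one"]:
        for b in rel_attributes:         # attributes of the relationship itself
            vertex.add_child(Vertex(b))
        child = vertex.add_child(Vertex(er[target]["id"]))
        translate(er, target, child)


def prune(vertex):
    """Drop the whole subtree rooted in `vertex`."""
    vertex.parent.children.remove(vertex)
    vertex.parent = None


def graft(vertex):
    """Remove `vertex` but keep its children by moving them to its parent."""
    parent = vertex.parent
    for child in list(vertex.children):
        parent.add_child(child)
    parent.children.remove(vertex)
    vertex.parent, vertex.children = None, []


# Hypothetical, simplified encoding of the sale ER scheme.
er = {
    "SALE": {"id": "sale", "attributes": ["sale", "qty", "unitPrice"],
             "to_one": [((), "TICKET"), ((), "PRODUCT")]},
    "TICKET": {"id": "ticket number", "attributes": ["ticket number", "date"],
               "to_one": [((), "STORE")]},
    "STORE": {"id": "store", "attributes": ["store", "address"],
              "to_one": [((), "CITY")]},
    "CITY": {"id": "city", "attributes": ["city"], "to_one": [((), "STATE")]},
    "STATE": {"id": "state", "attributes": ["state"], "to_one": []},
    "PRODUCT": {"id": "product", "attributes": ["product", "size"],
                "to_one": [((), "TYPE")]},
    "TYPE": {"id": "type", "attributes": ["type"], "to_one": [((), "CATEGORY")]},
    "CATEGORY": {"id": "category", "attributes": ["category"], "to_one": []},
}

root = Vertex(er["SALE"]["id"])
translate(er, "SALE", root)
graft(root.find("ticket number"))   # date and store become children of the root
prune(root.find("city"))            # city and state are dropped
print([child.name for child in root.children])
```

After the graft, date and store are children of the root and can therefore be chosen as dimensions, which mirrors the effect described for the sale example.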
Time in DWs

DW from a temporal scheme: Time is explicitly represented as an attribute and it is a candidate to define a dimension. If time appears in the attribute tree as a child of some non-root vertex, we can graft the tree so that time becomes a child of the root (i.e. becomes a dimension).

DW from a snapshot scheme: Time is not explicitly represented, but time should be added as a dimension to the fact scheme.

EX: In the sale example, the attributes chosen as dimensions are product, store and ranges of the date attribute corresponding to weeks.

Time as an attribute in a fact scheme can be a time point or a time interval.

Defining fact attributes

Fact attributes are typically either counts of the number of instances of entity F, or the sum/average/maximum/minimum of expressions involving numerical attributes of the attribute tree (the attributes chosen as dimensions for the fact scheme are excluded).

A fact may have no attributes, if the only information to be recorded is the occurrence of the fact.

The fact attributes determined, if any, are reported on the fact scheme.

At this step, it is useful for the phase of logical design to build a glossary which associates each fact attribute with an expression describing how it can be calculated from the attributes of the ER scheme.

In the sale example, the glossary may be compiled as follows:

quantity sold = SUM(SALE.qty)
total returns = SUM(SALE.qty * SALE.unitPrice)
number of customers = COUNT(SALE)

Defining fact attributes (cont.)

If attribute unitPrice had been placed on entity PRODUCT in the E/R scheme:

total returns = SUM(SALE.qty * PRODUCT.unitPrice)

The aggregation operators are meant to work on all the instances of SALE which relate to the same week, store and product.

In some cases, aggregation is not necessary to define fact attributes, since it has already been executed at the relational level.
For instance, each instance of entity SALE in the E/R scheme might describe the total sales for a given product, store and week; in this case, instances of the entity correspond one-to-one to fact instances, and entity attributes may be directly translated into fact attributes.

Defining hierarchies

The last step in building the fact scheme is the definition of hierarchies on dimensions. Along each hierarchy, attributes must be arranged into a tree such that an x-to-one relationship holds between each node and its descendants.

A hierarchy in an ER scheme is any sequence of entities joined together by one-to-many relationships, all aligned in the same direction.

EX: <purchase-ticket, store, city, state>

The attribute tree already shows an organization for hierarchies; at this stage, it is still possible to prune and graft the attribute tree in order to eliminate irrelevant details (e.g. in most cases, a vertex connected to its parent by a one-to-one relationship is grafted).

Defining hierarchies (cont.)

It is possible to add new levels of aggregation by defining ranges for numerical attributes; typically, this is done on the time dimension.

In the sale example, the time dimension is enriched by introducing attribute month, defined as a range of weeks.

During this phase, the attributes which should not be used for aggregation but only for informative purposes may be identified as non-dimension attributes.

3. Logical Design

At this point, it is necessary to choose the target logical model, relational or multidimensional. In this chapter, we consider the relational case.

The dimensional scheme can be mapped onto the relational model by adopting the star scheme.

Translation into tables

During this phase, the fact and dimension tables are created starting from the dimensional scheme and according to the logical model adopted.
If the star scheme is adopted, each fact scheme f having the set of dimensions Dim(f) = {d1,…,dn} and the set of measures M = {m1,…,mz} is translated into one fact table:

FT_f(k1,…,kn, m1,…,mz)

and n dimension tables:

DT_d1(k1, a11,…,a1v1, a’11,…,a’1u1)
...
DT_dn(kn, an1,…,anvn, a’n1,…,a’nun)

where the hierarchy on di includes the dimension attributes ai1,…,aivi and the non-dimension attributes a’i1,…,a’iui.

The star scheme for the SALE example turns out to be:

FT_SALE(prodkey, dateKey, storeKey, qtySold, revenue, noOfCustomer)
DT_PROD(prodkey, product, weight, diet, brand, city, type, category, department, deptmanager,…)
DT_DATE(dateKey, date, dayOfWeek, holiday, month,…)
DT_STORE(storeKey, store, phone, address, salesManager, city, country, state, saleDistrict)

Conclusion

- There is a methodology to derive the conceptual design of a DW from the ER scheme describing the operational information system.
- The DF model is independent of the target logical model (multidimensional or relational).
- The DF model is a collection of tree-structured fact schemes whose basic elements are facts, attributes, dimensions and hierarchies.

References

1. Golfarelli, M., Maio, D. and Rizzi, S., Conceptual Design of Data Warehouses from E/R Schemes, in Proc. HICSS-31, VII, Kona, Hawaii, 1998, 334-343.
2. Golfarelli, M. and Rizzi, S., A Methodological Framework for Data Warehouse Design, in Proc. ACM 1st Int. Workshop on Data Warehousing and OLAP (DOLAP 98), Nov. 7, 1998, Washington D.C., USA.

Case Study

Figure 9

The attribute tree for fact ADMISSION, represented by entity ADMISSION, is shown in Fig. 10.a. Notice that a surgery subtree has been created even though the optional causes relationship is a many-to-one relationship (with the one side on the ADMISSION entity).
Transforming this attribute tree into the one shown in Figure 10.b requires 4 steps:
1. Grafting date+time+op.th
2. Pruning date of surgery, time of surgery and surgeon
3. Pruning the subtree rooted in op.th
4. Pruning name and physician; grafting patcode.

The fact scheme derived is shown in Fig. 11.

Dimension month is defined as a range on attribute date; dimension age5 is defined as a range on attribute age (5-year intervals). The hierarchies on dimensions month and age5 are defined by adopting progressively wider ranges. Dimension type of surgery is optional.

The glossary for the fact attributes is as follows:

Number of admissions = COUNT(ADMISSION)
Value = SUM(has.value)
Number of days = SUM(ADMISSION.nDays)
Score = SUM(DRG.weight)
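As a rough illustration of how two of the glossary entries above could be evaluated, the following Python sketch computes Number of admissions and Number of days per (month, age5) cell and then rolls them up to month. The rows, the ISO date strings, the helper names (month, age5, aggregate, roll_up) and the label format of the age ranges are all hypothetical. Value and Score are left out because they depend on other entities of the ER scheme that are not described here.

```python
from collections import defaultdict

# Hypothetical ADMISSION instances; only the attributes needed here are kept.
admissions = [
    {"date": "1998-01-12", "age": 34, "nDays": 3},
    {"date": "1998-01-20", "age": 31, "nDays": 5},
    {"date": "1998-01-25", "age": 67, "nDays": 10},
    {"date": "1998-02-03", "age": 36, "nDays": 2},
]


def month(date):
    """Range on attribute date: keep the 'YYYY-MM' prefix of an ISO date."""
    return date[:7]


def age5(age):
    """Range on attribute age: 5-year intervals such as '30-34'."""
    low = age // 5 * 5
    return f"{low}-{low + 4}"


def aggregate(rows, key):
    """COUNT(ADMISSION) and SUM(ADMISSION.nDays) for each value of key(row)."""
    cells = defaultdict(lambda: {"admissions": 0, "days": 0})
    for row in rows:
        cell = cells[key(row)]
        cell["admissions"] += 1
        cell["days"] += row["nDays"]
    return dict(cells)


def roll_up(cells, coarser):
    """Sum additive fact attributes after mapping each cell key to a coarser one."""
    totals = defaultdict(lambda: {"admissions": 0, "days": 0})
    for key, measures in cells.items():
        total = totals[coarser(key)]
        total["admissions"] += measures["admissions"]
        total["days"] += measures["days"]
    return dict(totals)


# Fact instances at the granularity (month, age5).
fine = aggregate(admissions, lambda r: (month(r["date"]), age5(r["age"])))

# Rolling up along age5 gives the same result as aggregating directly by month,
# because both fact attributes are additive.
by_month = roll_up(fine, lambda key: key[0])
direct = aggregate(admissions, lambda r: month(r["date"]))
print(by_month == direct)
```

The equality check at the end only holds because COUNT and SUM are additive along the age5 hierarchy; an average, for example, could not be rolled up by summing the finer cells.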