Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)
Business Systems Intelligence: 1. Data Warehousing 2

Entity Relationship Data Model (ERD)

The ER approach works by dividing the data into many discrete entities
Each entity becomes a table in the physical schema
Why it has been so successful
  – Coupled with the concept of Normalization, it drives all the redundancy out of the database
  – Change (or add or delete) the data at just one point
  – Can build very fast access methods (indexes)
  – Results in efficient transactional processing
Where is the catch?

ERD: Where Is The Catch?

Let's have a look at a typical ER data model first
Some Observations
  – A Symmetric Model
    • All the tables look the same
    • Which table is more important? Which is the largest?
    • Which tables contain numerical measurements of the business?
    • Which tables contain nearly static descriptive attributes?
  – Very hard to visualize and keep in your head
  – A large number of possible connections between any two (or more) tables

A Typical OLTP Oriented ER Data Model

ERD: Catch Continues

ERD and Normalization result in a large number of tables
  – Hard for the users (db programmers) to understand
  – Hard for DBMS software to navigate in an optimal way
The real value of ERD is in using tables individually or in pairs
Too complex for queries that span multiple tables with a large number of records

How to Simplify a Data Model

Two general methods
  – De-Normalization
  – Dimensional Modeling
De-Normalization
  – Reverses the effect of Normalization
  – Reintroduces redundancy while reducing the number of tables
  – Popular approaches: Pre-Join de-normalization, Column Replication or Movement, and Aggregation

Data Warehouse Design

The database component of a data warehouse is described using a technique called dimensionality modelling
A logical design technique that aims to present the data in a standard, intuitive form that allows for high-performance access
Uses the concepts of ER
modelling with some important restrictions

Data Warehouse Design (cont…)

Every dimensional model (DM) is composed of
  – One table with a composite primary key, called the fact table
  – A set of smaller tables called dimension tables
Each dimension table has a simple (non-composite) primary key that corresponds exactly to one of the components of the composite key in the fact table.

Data Warehouse Design

Forms a star-like structure, which is called a star schema or star join
  – A star schema is a logical structure that has a fact table containing factual data in the centre, surrounded by dimension tables containing reference data, which can be denormalised.
Facts are generated by events that occurred in the past, and are unlikely to change, regardless of how they are analysed

Data Warehouse Design (cont…)

[Figure: Star Schema. A central Fact Table surrounded by four Dimension Tables]

Data Warehouse Design (cont…)

The bulk of the data in a data warehouse is in fact tables, which can be extremely large.
It is important to treat fact data as read-only reference data that will not change over time.
The most useful fact tables contain one or more numerical measures, or ‘facts’, that occur for each record and are numeric and additive.

Data Warehouse Design (cont…)

Dimension tables usually contain descriptive textual information.
Dimension attributes are used as the constraints in data warehouse queries.
Star schemas can be used to speed up query performance by denormalizing reference information into a single dimension table.

Dimensional Modeling

Models the data around two basic concepts: Facts & Dimensions.
Facts
  – Facts are numeric measurements (values) that represent a specific business aspect or activity.
  – Facts can be computed or derived at run-time (metrics).
  – Examples: Unit Cost, Sale Amount, Quantity Sold

Dimensional Modeling (cont…)

Dimensions
  – Dimensions are qualifying characteristics that provide additional perspectives to a given fact.
  – Examples: Date (Day, Month, Qtr, Year), Product (Type, Category)

Dimensional Modeling (cont…)

Every dimensional model (DM) is composed of one (or more) fact tables, and a set of smaller dimension tables.
Look at the fact table through one (or more) dimensions.
  – What is the sale amount in the Consumer Product category, for elderly customers in the second quarter of 2004?
Forms a ‘star-like’ structure, which is called a star schema or star join.

Example Dimensional Model

Time Dimension Exercise

Every Data Warehouse will need Time information
  – I.e. a Time Dimension
Compose a generic Time Dimension Table
  – E.g. what are the different attributes you can use to describe 11th February, 2008?

Time Dimension Exercise

    create table time_dimension (
        date_key                    Number not null,
        full_date                   Date,
        day_of_week                 Number,
        day_num_in_month            Number,
        day_num_overall             Number,
        day_name                    Varchar2(9),
        day_abbrev                  Varchar2(3),
        weekday_flag                Varchar2(1),
        week_num_in_year            Number,
        week_num_overall            Number,
        week_begin_date             Date,
        week_begin_date_key         Number,
        month                       Number,
        month_num_overall           Number,
        month_name                  Varchar2(9),
        month_abbrev                Varchar2(3),
        quarter                     Number,
        year                        Number,
        yearmo                      Number,
        fiscal_month                Number,
        fiscal_quarter              Number,
        fiscal_year                 Number,
        last_day_in_month_flag      Varchar2(1),
        same_weekday_year_ago_date  Date,
        primary key (date_key));

Data Model Design for Data Warehouses

The Nine-Step Methodology includes the following steps:
  1. Choosing the subject
  2. Choosing the grain
  3. Identifying and conforming the dimensions
  4. Choosing the facts
  5. Storing pre-calculations in the fact table
  6. Rounding out the dimension tables
  7. Choosing the duration of the database
  8. Tracking slowly changing dimensions
  9. Deciding the query priorities and the query modes

Step 1: Choosing The Subject

The subject (or function) refers to the subject matter of a particular data mart
A business process is a major operational process in an organization
  – Typically supported by a legacy system (database) or an OLTP
  – Examples: Orders, Invoices, Inventory etc.

Step 1: Choosing The Subject (cont…)

The first data mart built should be the one that is most likely to be
  – Delivered on time
  – Within budget
  – Able to answer the most commercially important business questions

Step 2: Choosing The Grain

Grain is the fundamental, atomic level of data to be represented.
Decide what a record of the fact table is to represent.
  – Grain is also termed the unit of analysis.
  – Typical grains
    • Individual Transactions
    • Daily aggregates (snapshots)
    • Monthly aggregates

Step 2: Choosing The Grain (cont…)

Identify the dimensions of the fact table.
The grain decision for the fact table also determines the grain of each dimension table.
Also include time as a core dimension, which is always present in star schemas.
Sometimes grain varies for different facts within the same business process. How?

Step 3: Identifying & Conforming The Dimensions

Dimensions set the context for asking questions about the facts in the fact table.
If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a mathematical subset of the other.
A dimension used in more than one data mart is referred to as being conformed.

Step 3: Identifying & Conforming The Dimensions

Choose the dimensions that apply to each fact in the fact table.
  – Typical dimensions: time, product, customer etc.
  – Need to identify the descriptive attributes that explain each dimension
  – Need to determine hierarchies within each dimension

Steps 4 & 5: Choosing The Facts

The grain of the fact table determines which facts can be used in the data mart.
Facts should be numeric and additive.
  – Example: Quantity Sold, Amount etc.
Unusable facts include:
  – Non-numeric facts
  – Non-additive facts
  – Facts at a different granularity from other facts in the table
Storing Pre-Calculations in the Fact Table
  – Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations.

Step 6: Rounding Out The Dimension Tables

Text descriptions are added to the dimension tables.
Text descriptions should be as intuitive and understandable to the users as possible.
The usefulness of a data mart is determined by the scope and nature of the attributes of the dimension tables.
(See exercise on Time Dimension)

Step 7: Choosing The Duration Of The Database

Duration measures how far back in time the fact table goes.
Very large fact tables raise at least two very significant data warehouse design issues.
  – It is often difficult to source increasingly old data.
  – It is mandatory that the old versions of the important dimensions be used, not the most current versions. This is known as the ‘Slowly Changing Dimension’ problem.

Step 8: Tracking Slowly Changing Dimensions

The slowly changing dimension problem means that the proper description of the old dimension data must be used with old fact data.
Often, a generalized key must be assigned to important dimensions in order to distinguish multiple snapshots of dimensions over a period of time.

Step 8: Tracking Slowly Changing Dimensions (cont…)

Three basic types of slowly changing dimensions:
  – Where a changed dimension attribute is overwritten.
  – Where a changed dimension attribute causes a new dimension record to be created.
  – Where a changed dimension attribute causes an alternate attribute to be created so that both the old and new values of the attribute are simultaneously accessible in the same dimension record.

Step 9: Deciding The Query Priorities And The Query Modes

The most critical physical design issues affecting the end-user’s perception include:
  – Physical sort order of the fact table on disk
  – Presence of pre-stored summaries or aggregations.
Additional physical design issues include administration, backup, indexing performance, and security.

Dimensional Modelling

Dimensional Modelling is a logical design technique that seeks to present the data in a standard framework that is intuitive and allows for high-performance access

Dimensional Modelling (cont…)

Fact table
  – Consists of a multi-part primary key and usually contains numeric data
  – Numeric data is aggregated based on the multi-part primary key
  – Additivity is crucial because DW applications typically retrieve data based on more than one set of facts
Dimension table
  – Contains descriptive information
  – Is the entry point for queries

Dimensional Modelling

Dimensional Model:

    -- The fact table alias is f rather than of, since OF is a reserved word in Oracle.
    SELECT description, SUM(quoted_price), SUM(quantity),
           SUM(unit_price), SUM(total_comm)
    FROM order_fact f, part_dimension pd
    WHERE f.part_nr = pd.part_nr
    GROUP BY description;

ER-Model:

    -- ORDER is a reserved word in SQL; a real schema would need to quote or rename this table.
    SELECT description, SUM(quoted_price), SUM(quantity),
           SUM(unit_price), SUM(total_comm)
    FROM order o, order_detail od, part p, customer c, slsrep s
    WHERE o.order_nr = od.order_nr
      AND p.part_nr = od.part_nr
      AND o.customer_nr = c.customer_nr
      AND s.slsrep_nr = c.slsrep_nr
    GROUP BY description;

Notice that the dimensional model only joins two tables, while the ER model joins all five in the ER Diagram. This is very typical of highly normalized ER models.
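The two-table dimensional query can be tried out on a toy data set. The following is only a minimal sketch, assuming Python's standard sqlite3 module as the host database; the column types, the composite key on (order_nr, part_nr) and all sample rows are invented for illustration and are not taken from the lecture's schema.

```python
import sqlite3

# Hypothetical miniature of the order_fact / part_dimension pair.
# Column types and sample rows are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE part_dimension (
    part_nr     INTEGER PRIMARY KEY,   -- simple (non-composite) key
    description TEXT NOT NULL
);
CREATE TABLE order_fact (
    order_nr     INTEGER,
    part_nr      INTEGER REFERENCES part_dimension(part_nr),
    quoted_price REAL,
    quantity     INTEGER,
    unit_price   REAL,
    total_comm   REAL,
    PRIMARY KEY (order_nr, part_nr)    -- composite key of the fact table
);
INSERT INTO part_dimension VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO order_fact VALUES
    (100, 1, 10.0, 2,  5.0, 1.0),
    (100, 2, 30.0, 3, 10.0, 3.0),
    (101, 1, 20.0, 4,  5.0, 2.0);
""")


def totals_by_description(connection):
    """Join the fact table to one dimension and sum the additive facts."""
    query = """
        SELECT pd.description,
               SUM(f.quoted_price), SUM(f.quantity),
               SUM(f.unit_price), SUM(f.total_comm)
        FROM order_fact f
        JOIN part_dimension pd ON f.part_nr = pd.part_nr
        GROUP BY pd.description
        ORDER BY pd.description
    """
    return connection.execute(query).fetchall()


rows = totals_by_description(conn)
for row in rows:
    print(row)
```

Because the descriptive attribute lives in a single dimension table, the query needs one join only; adding a second constraint (for example on a time dimension) would add one more join rather than a chain of them.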
Imagine a typical normalized database with 100s of tables

Dimensional Modelling

Simple DW Example

Supermarket (Chain Store)
  – Business Area: Sales
  – Grain: Individual Purchases
  – Dimensions:
    • Time
    • Product
    • Store
    • Customer
    • Employee
  – Facts
    • Total Sales
    • Number of items
    • Total Cost Value

Data Warehouse Design (cont…)

[Figure: star schema with Individual Purchases at the centre, surrounded by Customer, Time, Employee, Products and Store]

More Detailed Exercise

A more detailed exercise
  – See handout

Exam Question Example

You are working on a data warehousing project for the examinations department at the Dublin Institute of Technology (DIT). The examinations department looks after all exam and continuous assessment results for all of the students within the Institute. The purpose of the data warehousing project is to allow new reporting capabilities so that examinations department staff can examine grade patterns for particular courses; monitor average grades for modules and patterns in grades for courses given by particular staff members; and help with student retention efforts.

Currently, in the examinations department's transactional systems, information about each student is indexed by a student number, and includes name, address, date of birth, etc. Similarly, information stored about the instructors working within the Institute is indexed by staff ID and includes name, address, department, etc. The information that needs to be stored about each course includes the course title, course code, and the weighting given to the continuous assessment element and exam element of a particular course.

At the Institute’s progression boards each year the continuous assessment and exam results for each student, for each course, are entered into the Institute’s transactional grade storage system, and this information should be transferred across to the data warehouse.

Design a star schema for the above scenario.
(20 marks)
Discuss how the star schema supports the reporting requirements outlined in the above scenario. (15 marks)

Different Types Of Dimensional Model

The star schema can be extended in two ways:
  – Snowflake model
  – Multi-star model (also known as fact constellation)

Example Star Schema

Sales Fact Table: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country

Example Snowflake Schema

Sales Fact Table: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country

Example Fact Constellation

Sales Fact Table: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type

Summary

Over the last two lectures we have introduced the idea of data warehouses
Data warehouses evolved to address the issues of using transactional databases to answer new kinds of questions
IBM have a very detailed RedBook on dimensional modelling that is well worth looking at: http://www.redbooks.ibm.com/redbooks/pdfs/sg247138.pdf
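Worked sketch for the Time Dimension exercise

Returning to the Time Dimension exercise, the attributes describing 11th February, 2008 can be derived mechanically from the date itself. The sketch below is one possible approach in Python, not the lecture's own solution. It assumes a YYYYMMDD integer for date_key, Monday as the first day of the week, ISO week numbering, and "Y"/"N" flag values. The fiscal columns and the "overall" counters are left out because they depend on an organisation's fiscal calendar and on a chosen start date.

```python
import calendar
from datetime import date, timedelta


def time_dimension_row(d):
    """Derive a subset of the time_dimension columns for one calendar date."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    week_begin = d - timedelta(days=d.weekday())      # assumes weeks start on Monday
    return {
        "date_key": int(d.strftime("%Y%m%d")),        # assumed YYYYMMDD key format
        "full_date": d,
        "day_of_week": d.isoweekday(),                # 1 = Monday ... 7 = Sunday
        "day_num_in_month": d.day,
        "day_name": calendar.day_name[d.weekday()],   # English in the default locale
        "day_abbrev": calendar.day_abbr[d.weekday()],
        "weekday_flag": "Y" if d.weekday() < 5 else "N",
        "week_num_in_year": d.isocalendar()[1],       # ISO week number (assumption)
        "week_begin_date": week_begin,
        "month": d.month,
        "month_name": calendar.month_name[d.month],
        "month_abbrev": calendar.month_abbr[d.month],
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
        "yearmo": d.year * 100 + d.month,
        "last_day_in_month_flag": "Y" if d.day == last_day else "N",
        # 52 weeks back lands on the same weekday one year earlier.
        "same_weekday_year_ago_date": d - timedelta(weeks=52),
    }


row = time_dimension_row(date(2008, 2, 11))
for column, value in row.items():
    print(f"{column:28} {value}")
```

Calling the function for every date in the chosen duration of the warehouse would populate the dimension once, after which queries can constrain on attributes such as quarter or weekday_flag without any date arithmetic.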