IMS 4212: Data Warehousing / Business Intelligence
Dr. Lawrence West, Management Dept., University of Central Florida, lwest@bus.ucf.edu

Data Warehousing Part 1: Topics
• Doing vs. Deciding: OLTP vs. OLAP
• Data Warehouses
  – Fact tables, Dimension tables, Granularity
  – DW in an integrated Business Intelligence system
• Design Steps
• Designing Fact Tables
• Designing Dimension Tables
  – The Time dimension
• Fact Table Exercises

"With uncertainty present…"
With the introduction of uncertainty—the fact of ignorance and necessity of acting upon opinion rather than knowledge—into this Eden-like situation, its character is completely changed. With uncertainty absent, man's energies are devoted altogether to doing things; it is doubtful whether intelligence itself would exist in such a situation; in a world so built that perfect knowledge was theoretically possible, it seems likely that all organic readjustments would become mechanical, all organisms automata. With uncertainty present, doing things, the actual execution of activity, becomes in a real sense a secondary part of life; the primary problem or function is deciding what to do and how to do it.
The two most important characteristics of social organization brought about by the fact of uncertainty have already been noticed. In the first place, goods are produced for a market, on the basis of an entirely impersonal prediction of wants, not for the satisfaction of the wants of the producers themselves. The producer takes the responsibility of forecasting the consumers' wants. In the second place, the work of forecasting and at the same time a large part of the technological direction and control of production are still further concentrated upon a very narrow class of the producers, and we meet with a new economic functionary, the entrepreneur.
Frank H. Knight, University of Chicago, 1921

Doing vs. Deciding
• Organizations do many things
  – List thirty transactions that your project organization executes or does
• Managers decide things
  – List thirty decisions that your project organization makes
  – Identify where in the organizational hierarchy the decision lies
  – What is the consequence/importance of the decision?
  – What information influences each decision?

Doing vs. Deciding / OLTP vs. OLAP
• Are systems designed to support the execution of events suitable for the making of decisions?
• Event/transaction support requires
  – High throughput
  – High reliability
  – Accuracy
  – DB structures tuned for storage & performance
• Online Transaction Processing (OLTP) systems support events
  – Provide data or information to support transactions
  – Record acts → New data

OLTP vs. OLAP: Let Me Count the Ways…
• Online Analytical Processing (OLAP) or Business Intelligence (BI) systems are oriented toward decision making and analysis
• What are the problems with using our OLTP databases to support managerial decision making?

Still Counting…
• Managers are unlikely to understand where the data they need is in the database
• Managers are not skilled at reassembling the component data of a concept recorded in multiple tables
• Managers need summarized and aggregated data more than they need details
• Different summaries are needed on the same data
• Retrieval slows down the OLTP systems
• Historical and current data are likely to be useful together
• Data from multiple systems is likely to be useful
• External data is equally likely to be useful
  – Economic conditions
  – Competitor actions
  – Demographic information
• And the granddaddy of all: What-if analysis requires messing with the data

The Data Warehouse
• The DW is a separate storage structure
• Designed to optimize query execution
  – Not storage efficiency
  – Not transaction throughput
• Expected to be loaded during down times
• Supports "readability"
• May sacrifice details for summaries
• Data and structures anticipate user needs
  – Recurring decisions
  – Flexible exploration

Steps and Components
• Source Systems: provide raw data to the DW
• Integration Services: provide transformation and loading services from source data to DW
• Data Warehouse: customized data store for Business Intelligence
• Analysis Services: tools for data mining and reporting
• Reporting Services: our old friend, acting on an enhanced data store

Storage Strategies
• The DW stores transformed data that
  – May be accessed directly to support analysis
  – Supports actions of the Analysis Services to provide enhanced and efficient analysis
• Multiple strategies exist
• We will look at the widely used approach using
  – Fact tables,
  – Dimension tables,
  – Arranged in a Star Schema or Snowflake Schema (or both)
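To make the fact-table / dimension-table arrangement concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is an illustration under stated assumptions, not the course's implementation: the attribute columns (ProductName, CategoryName, CustomerName) and all sample rows are hypothetical, and the fact table is cut down to two keys and two measures.

```python
import sqlite3


def build_star_schema(conn: sqlite3.Connection) -> None:
    """Create one fact table and two dimension tables with sample rows.

    Table names follow the deck's naming style; the descriptive columns
    and all data values are hypothetical.
    """
    conn.executescript(
        """
        CREATE TABLE DimProduct (
            ProductKey   INTEGER PRIMARY KEY,
            ProductName  TEXT,
            CategoryName TEXT
        );
        CREATE TABLE DimCustomer (
            CustomerKey  INTEGER PRIMARY KEY,
            CustomerName TEXT
        );
        CREATE TABLE Sales (
            ProductKey  INTEGER REFERENCES DimProduct(ProductKey),
            CustomerKey INTEGER REFERENCES DimCustomer(CustomerKey),
            UnitsSold   INTEGER,
            ValueSold   REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO DimProduct VALUES (?, ?, ?)",
        [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware"), (3, "Manual", "Books")],
    )
    conn.executemany(
        "INSERT INTO DimCustomer VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],
    )
    conn.executemany(
        "INSERT INTO Sales VALUES (?, ?, ?, ?)",
        [(1, 1, 10, 100.0), (2, 1, 5, 75.0), (3, 2, 2, 40.0), (1, 2, 4, 40.0)],
    )


def sales_by_category(conn: sqlite3.Connection) -> list:
    """Summarize the fact table 'by' an attribute of a dimension table."""
    return conn.execute(
        """
        SELECT p.CategoryName, SUM(s.UnitsSold), SUM(s.ValueSold)
        FROM Sales AS s
        JOIN DimProduct AS p ON p.ProductKey = s.ProductKey
        GROUP BY p.CategoryName
        ORDER BY p.CategoryName
        """
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    for row in sales_by_category(connection):
        print(row)
```

The point of the sketch is that a manager's question ("sales by category") needs only one join from the fact table to a dimension table, rather than a chain of joins through normalized OLTP tables.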

Fact Tables Contain Facts (duhhhh) of Interest
SALES: TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CategoryKey, CustomerKey, SalesTerrKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts
• No PK designated for fact table
• Natural PK is TimeKeyOrdered, ProductKey, CustomerKey
  – This defines the granularity of the data
• CategoryKey is functionally dependent (FD) on ProductKey
• SalesTerrKey and SalesRepKey are FD on CustomerKey
• UnitsSold, TotalDiscounts
  – Summed from source data
  – Additive
• SalesPrice is not additive
• ValueSold is derivable and additive

Star Schema & Dimension Tables
Diagram: the SALES fact table (columns as listed above) linked to the dimension tables DimDate (TimeKey), DimCustomer (CustomerKey), DimSalesRep (SalesRepKey), DimSalesTerr (SalesTerrKey), DimCategory (CategoryKey), and DimProduct (ProductKey)
• Dimension Tables represent concepts (entities) used to group data in the fact tables
• Also contain descriptive attributes of the entity represented by the dimension table
• Simplest way for nontechnical users to picture the data
• Relate to FKs in the fact tables

Snowflake Schema & Dimension Tables
Diagram: SALES (TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CustomerKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts) with the dimension tables DimDate (TimeKey), DimSalesRep (SalesRepKey), DimSalesTerr (SalesTerrKey), DimProduct (ProductKey), DimCategory (CategoryKey), DimCustomer (CustomerKey), and DimGeography (GeographyKey)
• Fewer direct links from dimension tables to fact table
• Dimension tables relate to each other
• Natural hierarchical relationships in data are preserved
  – Implications for drilldown reports
• Increases complexity of data retrieval for nontechnical users

Granularity
SALES: SalesRecID, FDOWeek, ProductID, CategoryID, CustomerID, SalesTerrID, SalesRepID, UnitsSold, SalesPrice, ValueSold, TotalDiscounts
• The granularity of the fact tables is critical
• There are alternative levels of granularity
  – Finer granularity → more detail, more records (e.g., use SalesDate instead of SalesMonth)
  – Coarser granularity → less detail, fewer records (e.g., use SalesMonth instead of SalesDate)
• Finer granularity can be aggregated in the DW to find the coarser granularity values
• Coarse granularity cannot be decomposed
• Granularity decisions are made for each of the FKs from the dimension tables

Fact Tables (Part 2)
• Identifying Fact Tables and their facts is an art
• No obvious mapping from OLTP tables to Fact or Dimension Tables
• The same DB table can contribute to multiple fact tables
• Requires analysis to discover central concepts that will become fact tables
  – Decision maker interviews
  – Reporting requirements

Fact Tables (Part 2, cont.)
• Look for a logical concept or event that the measures of interest are about
  – A sale (invoice)
  – An order (purchase order)
  – An enrollment (college DB)
• The concept/event should support the requirements
• The event is likely to be based on an OLTP table
  – Not every OLTP table will become a fact table
• This concept/event will form the foundation for a fact table

Fact Tables: Measures
• Measures are the facts to be recorded for each row in the fact table
• Measures are often additive
  – UnitsSold, TotalDiscounts, ValueSold
• Some are not additive
  – SalesPrice
• Sometimes nonadditive measures are transformed into additive measures
  – ValueSold = (UnitsSold * SalesPrice) - TotalDiscounts

Fact Tables: Measures (cont.)
• Measures may come from several sources, often not just values from a single OLTP source table
• Other candidates in our example
  – COGS
  – CurrentInterestRate
  – CompetitorPrice
  – GrossMargin
  – NetMargin
  – ShippingCost
  – ShippingWeight
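The additive / non-additive distinction can be checked with a few rows. The sketch below is illustrative only: the class name, field names, and sample values are hypothetical, and the only formula taken from the slides is ValueSold = (UnitsSold * SalesPrice) - TotalDiscounts.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SalesFact:
    """One fact row; fields mirror three SALES measures from the slides."""

    units_sold: int
    sales_price: float      # unit price: not additive across rows
    total_discounts: float

    @property
    def value_sold(self) -> float:
        # Derived, additive measure:
        # ValueSold = (UnitsSold * SalesPrice) - TotalDiscounts
        return self.units_sold * self.sales_price - self.total_discounts


def summarize(facts: list[SalesFact]) -> dict[str, float]:
    """Aggregate a set of fact rows the way a report would."""
    total_units = sum(f.units_sold for f in facts)
    total_value = sum(f.value_sold for f in facts)
    gross = sum(f.units_sold * f.sales_price for f in facts)
    return {
        "UnitsSold": total_units,   # additive: a plain sum is meaningful
        "ValueSold": total_value,   # additive: a plain sum is meaningful
        # SalesPrice cannot simply be summed; a units-weighted average
        # is one meaningful way to summarize it.
        "AvgSalesPrice": gross / total_units if total_units else 0.0,
    }


if __name__ == "__main__":
    rows = [
        SalesFact(10, 2.50, 1.00),
        SalesFact(4, 3.00, 0.00),
        SalesFact(1, 20.00, 2.00),
    ]
    print(summarize(rows))
```

Summing sales_price over these three rows would give 25.50, a number with no business meaning, which is why the price is converted into ValueSold before it is stored as a measure.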

Fact Tables: Dimensions
• Dimensions are ways of looking at the data
  – Users may indicate they look at {fact table subject} "by" {dimension name}
  – Sales by week
  – Sales by customer
  – Sales by product category
• Dimensions lead us to Dimension Tables
  – Descriptive attributes about the dimension
  – Related to a foreign key in the fact table

Dimension Tables
• Dimension tables are often based on an OLTP entity
• Denormalized to include descriptive attributes from other tables
  – Product might include
    • SupplierName
    • CategoryName
    • SubCategoryName
    • SupplierCountry
• In Snowflake dimension tables, related hierarchical information may be retained in the hierarchical tables

Dimension Tables: Primary Keys
DimCustomer: CustomerKey, CustomerID, SourceSystem, LastName, FirstName, …
• Dimension tables should always be given an artificial identity PK, even if there is a suitable OLTP table PK
• If tables are ever loaded from multiple sources the natural PK may become invalid
  – E.g., merging sales data from two business units with different databases
• Retain the business PK as an attribute in the dimension table
• Possibly include source system identifier for the row

Dimension Tables: Time
DimDate: TimeKey, CalendarYear, CalendarQuarter, CalendarMonthName, CalendarNumberOfMonth, DayNumberOfMonth, DayNumberOfYear, DayNumberOfWeek, FiscalYear, FiscalQuarter, ManufacturingYear, ManufacturingQuarter, ManufacturingMonth, SeasonName, HolidayFlag, ThanksgivingWeekendFlag, PreChristmasFlag
DimTimeOfDay: HourNum, TimeOfDayName
• Time is a hugely common "by" dimension
• Decide on time granularity
  – Daily, Weekly, Hourly?
• You might consider two time dimensions
  – Daily for grossest categorization
  – Hour for additional precision

Dimensions: Time (cont.)
• The time dimension table maps from the measured time attribute associated with the fact table record to various labels and aggregations associated with that value
• Facilitates summarizing by various aggregates with a single time dimension measure
• The TimeKey PK is often a datetime value stored at date-level precision

Fact Tables: Granularity
• In the olden days granularity decisions were made at the DW DB design stage
• Granularity decisions traded off
  – Number of records and computational overhead associated with more detailed granularity
  – Lack of precision with coarser granularity
• Modern computational power supports finer granularity
• Analysis Services provides support for fast computation over large data sets
• Just don't go overboard
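The time dimension and the granularity trade-off can be sketched together. The code below is a rough illustration, not the course's loading procedure: it fills in only the calendar columns of DimDate (fiscal, manufacturing, season, and holiday columns depend on business rules the slides do not give), assumes Monday = 1 for DayNumberOfWeek, and uses made-up daily unit counts.

```python
from __future__ import annotations

import calendar
from collections import defaultdict
from datetime import date, timedelta


def build_dim_date(start: date, end: date) -> dict[date, dict]:
    """Build calendar attributes for every day from start to end inclusive."""
    rows: dict[date, dict] = {}
    day = start
    while day <= end:
        rows[day] = {
            "CalendarYear": day.year,
            "CalendarQuarter": (day.month - 1) // 3 + 1,
            "CalendarMonthName": calendar.month_name[day.month],
            "CalendarNumberOfMonth": day.month,
            "DayNumberOfMonth": day.day,
            "DayNumberOfYear": day.timetuple().tm_yday,
            "DayNumberOfWeek": day.isoweekday(),  # assumption: Monday = 1
        }
        day += timedelta(days=1)
    return rows


def roll_up(
    daily_units: dict[date, int],
    dim_date: dict[date, dict],
    level_columns: tuple[str, ...],
) -> dict[tuple, int]:
    """Aggregate day-grain facts to a coarser grain named by DimDate columns."""
    totals: dict[tuple, int] = defaultdict(int)
    for day, units in daily_units.items():
        label = tuple(dim_date[day][col] for col in level_columns)
        totals[label] += units
    return dict(totals)


if __name__ == "__main__":
    dim = build_dim_date(date(2024, 1, 1), date(2024, 12, 31))
    units = {
        date(2024, 1, 30): 5,
        date(2024, 1, 31): 7,
        date(2024, 2, 1): 3,
        date(2024, 4, 2): 10,
    }
    print(roll_up(units, dim, ("CalendarYear", "CalendarNumberOfMonth")))
    print(roll_up(units, dim, ("CalendarYear", "CalendarQuarter")))
```

Day-grain facts roll up to months or quarters through the dimension table, but the reverse is impossible: once only monthly totals are stored, the daily values cannot be recovered.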

Fact Table Exercise #2
ERD (relationship lines are labeled "Has"):
  COURSE: DeptCode, CourseNo, Name, CreditHrs, LabHrs
  SECTION: SectionID, DeptCode <AK>, CourseNo <AK>, SecNo <AK>, Term <AK>, Year <AK>, Room <FK1>, Days, Time, InstructorID <FK2>
  ENROLLMENT: SectionID <FK1>, StudentID <FK2>, Grade <FK3>
  STUDENT: StudentID, LastName, FirstName, …
  GRADE: Grade, GradePts
• Expand entities around the core of our University ERD
  – See next slide
• Consider two business goals
  – Understand real credit hour revenue
  – Understand classroom utilization
• Identify and design Fact and Dimension Tables

Fact Table Exercise #2 (Cont.)
Expanded ERD:
  Payment: PaymentID, StudentID, PaymentDate, PaymentAmt
  Enrollment: SectionID, StudentID, Grade
  SectionBook: ISBN, SectionID, QtyOrdered, QtySold
  Textbook: ISBN, Title, Year, …, WholsalePrice, ListPrice
  Student: StudentID, Lname, …, Resident Y/N
  BrightFutureStat: StudentID, AcadYear, PercentCovered
  Section: SectionID, DeptCode, CourseNum, SecNum, Term, Year, Days, Time, InstructorID, Capacity, BldgID, RoomNum
  Course: DeptCode, CourseNum, CrHr, Title, Description, LecHr, LabHr, DeptID
  Room: BldgID, RoomNum, Type, Capacity
  Building: BldgID, Name, Floors, SqFt
  SalaryHistory: EmployeeID, StartDate, EndDate, Salary, Status
  Instructor: EmployeeID, LastName, FirstName, DeptID, CurrentSalary
  Department: DeprtID, Name, ChairID, DeptID, OfficeBldg, OfficeRm
  FeeSchedule: AcadYear, InStateUGCrHr, OutStateUGCrHr, InStateGrCrHr, OutStateGrCrHr, InStateDCrHr, OutStateDCrHr, HealthFee, ActivityFee

External Data
• What external data might you want to have in a sales-oriented DW?

Next Time
• Transformations to load the DW from the source OLTP (and other) data sources
  – Automated support
  – Do it yourself
• Analysis Services: putting our DW to work