18_DWhse_1 - University of Central Florida

IMS 4212: Data Warehousing / Business Intelligence
Data Warehousing Part 1—Topics
• Doing vs. Deciding—OLTP vs. OLAP
• Data Warehouses
– Fact tables, Dimension tables, Granularity
– DW in an integrated Business Intelligence system
• Design Steps
• Designing Fact Tables
• Designing Dimension Tables
– The Time dimension
• Fact Table Exercises
Dr. Lawrence West, Management Dept., University of Central Florida
lwest@bus.ucf.edu
"With uncertainty present…"
With the introduction of uncertainty—the fact of ignorance and necessity of acting
upon opinion rather than knowledge—into this Eden-like situation, its character is
completely changed. With uncertainty absent, man's energies are devoted altogether
to doing things; it is doubtful whether intelligence itself would exist in such a
situation; in a world so built that perfect knowledge was theoretically possible, it
seems likely that all organic readjustments would become mechanical, all organisms
automata. With uncertainty present, doing things, the actual execution of activity,
becomes in a real sense a secondary part of life; the primary problem or function is
deciding what to do and how to do it. The two most important characteristics of
social organization brought about by the fact of uncertainty have already been
noticed. In the first place, goods are produced for a market, on the basis of an
entirely impersonal prediction of wants, not for the satisfaction of the wants of the
producers themselves. The producer takes the responsibility of forecasting the
consumers' wants. In the second place, the work of forecasting and at the same time
a large part of the technological direction and control of production are still further
concentrated upon a very narrow class of the producers, and we meet with a new
economic functionary, the entrepreneur.
Frank H. Knight
University of Chicago 1921
Doing vs. Deciding
• Organizations do many things
– List thirty transactions that your project organization
executes or does
• Managers decide things
– List thirty decisions that your project organization
makes
– Identify where in the organizational hierarchy the
decision lies
– What is the consequence/importance of the decision?
– What information influences each decision?
Doing vs. Deciding / OLTP vs OLAP
• Are systems designed to support the execution of events
suitable for the making of decisions?
• Event/transaction support requires
– High throughput
– High reliability
– Accuracy
– DB structures tuned for storage & performance
• Online Transaction Processing (OLTP) systems
support events
– Provide data or information to support transactions
– Record acts → New data
OLTP vs. OLAP—Let Me Count the Ways…
• Online Analytical Processing (OLAP) or Business
Intelligence (BI) systems are oriented at decision
making and analysis
• What are the problems with using our OLTP databases
to support managerial decision making?
Still Counting…
• Managers unlikely to
understand where the data
they need is in the database
• Managers not skilled at
reassembling the component
data of a concept recorded in
multiple tables
• Managers need summarized
and aggregated data more
than they need details
• Different summaries needed
on the same data
• Retrieval slows down the
OLTP systems
• Historical and current data
likely to be useful together
• Data from multiple systems is
likely to be useful
• External data is equally likely
to be useful
– Economic conditions
– Competitor actions
– Demographic information
• And the granddaddy of all:
What-if analysis requires
messing with the data
The Data Warehouse
• The DW is a separate storage structure
• Designed to optimize query execution
– Not storage efficiency
– Not transaction throughput
• Expected to be loaded during down times
• Supports "readability"
• May sacrifice details for summaries
• Data and structures anticipate user needs
– Recurring decisions
– Flexible exploration
Steps and Components
• Source Systems—provide raw data to the DW
• Integration Services—Provide transformation and
loading services from source data to DW
• Data Warehouse—Customized data store for Business
Intelligence
• Analysis Services—Tools for data mining and reporting
• Reporting Services—Our old friend acting on an
enhanced data store
Storage Strategies
• The DW stores transformed data that
– May be accessed directly to support analysis
– Supports actions of the Analysis Services to provide
enhanced and efficient analysis
• Multiple Strategies
• We will look at the widely used approach using
– Fact tables,
– Dimension tables,
– Arranged in a Star Schema or Snowflake Schema
(or both)
Fact Tables Contain Facts (duhhhh) of Interest
SALES
TimeKeyOrdered
TimeKeyShipped
TimeKeyPmntRcvd
ProductKey
CategoryKey
CustomerKey
SalesTerrKey
SalesRepKey
UnitsSold
SalesPrice
ValueSold
TotalDiscounts
• No PK designated for fact table
• Natural PK is TimeKeyOrdered,
ProductKey, CustomerKey
– This defines the granularity of the
data
• CategoryKey is functionally dependent (FD) on ProductKey
• SalesTerrKey and SalesRepKey are FD on
CustomerKey
• UnitsSold, TotalDiscounts
– Summed from source data
– Additive
• SalesPrice is not additive
• ValueSold is derivable and additive
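A minimal sketch of how source rows might be summed to the fact table's grain, one row per (TimeKeyOrdered, ProductKey, CustomerKey). The `order_lines` data and the `build_sales_fact` helper are hypothetical, and plain dates and codes stand in for the real keys.

```python
from collections import defaultdict

# Hypothetical OLTP order lines:
# (order_date, product, customer, units, unit_price, discount)
order_lines = [
    ("2024-03-01", "P1", "C1", 2, 10.0, 1.0),
    ("2024-03-01", "P1", "C1", 3, 10.0, 0.0),
    ("2024-03-01", "P2", "C1", 1, 25.0, 5.0),
    ("2024-03-02", "P1", "C2", 4, 10.0, 2.0),
]


def build_sales_fact(lines):
    """Sum source rows to one fact row per (date, product, customer)."""
    totals = defaultdict(lambda: {"UnitsSold": 0, "TotalDiscounts": 0.0, "gross": 0.0})
    for order_date, product, customer, units, price, discount in lines:
        row = totals[(order_date, product, customer)]
        row["UnitsSold"] += units            # additive measure
        row["TotalDiscounts"] += discount    # additive measure
        row["gross"] += units * price        # needed to derive ValueSold
    fact = {}
    for key, row in totals.items():
        fact[key] = {
            "UnitsSold": row["UnitsSold"],
            "TotalDiscounts": row["TotalDiscounts"],
            # Derived and additive: gross value minus discounts
            "ValueSold": row["gross"] - row["TotalDiscounts"],
        }
    return fact


sales_fact = build_sales_fact(order_lines)
```

The two order lines for the same date, product, and customer collapse into a single fact row, which is what the grain implies.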
Star Schema & Dimension Tables
Tables in the diagram (key column in parentheses):
DimDate (TimeKey)
DimCustomer (CustomerKey)
DimSalesRep (SalesRepKey)
DimSalesTerr (SalesTerrKey)
DimCategory (CategoryKey)
DimProduct (ProductKey)
SALES: TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CategoryKey, CustomerKey, SalesTerrKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts
• Dimension Tables
represent concepts
(entities) used to group
data in the fact tables
• Also contain
descriptive attributes
of the entity
represented by the
dimension table
• Simplest way for
nontechnical users to
picture the data
• Relate to FKs in the
fact tables
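As a rough illustration of the star layout, the sketch below builds a cut-down version in an in-memory SQLite database and groups the fact rows by one dimension. Only two dimensions are shown; the descriptive columns (`ProductName`, `CategoryName`) and all sample rows are assumptions, not part of the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct  (ProductKey  INTEGER PRIMARY KEY, ProductName  TEXT);
CREATE TABLE DimCategory (CategoryKey INTEGER PRIMARY KEY, CategoryName TEXT);
CREATE TABLE Sales (
    ProductKey  INTEGER REFERENCES DimProduct(ProductKey),
    CategoryKey INTEGER REFERENCES DimCategory(CategoryKey),
    UnitsSold   INTEGER,
    ValueSold   REAL
);
INSERT INTO DimProduct  VALUES (1, 'Road bike'), (2, 'Helmet'), (3, 'Gloves');
INSERT INTO DimCategory VALUES (10, 'Bikes'), (20, 'Accessories');
INSERT INTO Sales VALUES
    (1, 10, 2, 1800.0), (2, 20, 5, 250.0), (3, 20, 4, 80.0), (1, 10, 1, 900.0);
""")

# In a star schema every dimension is one join away from the fact table.
star_rows = conn.execute("""
    SELECT c.CategoryName, SUM(s.UnitsSold), SUM(s.ValueSold)
    FROM Sales AS s
    JOIN DimCategory AS c ON c.CategoryKey = s.CategoryKey
    GROUP BY c.CategoryName
    ORDER BY c.CategoryName
""").fetchall()
```

Because CategoryKey sits directly in the fact table, the category summary needs a single join.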
Snowflake Schema & Dimension Tables
Tables in the diagram (key column in parentheses):
DimDate (TimeKey)
DimSalesRep (SalesRepKey)
DimSalesTerr (SalesTerrKey)
DimProduct (ProductKey)
DimCategory (CategoryKey)
DimCustomer (CustomerKey)
DimGeography (GeographyKey)
SALES: TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CustomerKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts
• Fewer direct links
from dimension tables
to fact table
• Dimension tables
relate to each other
• Natural hierarchical
relationships in data
are preserved
– Implications for
drilldown reports
• Increases complexity
of data retrieval for
nontechnical users
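A comparable sketch for the snowflake layout, again in in-memory SQLite with assumed column names and sample rows. Here the fact table carries only ProductKey, and the category is reached through the product dimension, so a category-to-product drilldown takes two joins.

```python
import sqlite3

snow = sqlite3.connect(":memory:")
snow.executescript("""
CREATE TABLE DimCategory (CategoryKey INTEGER PRIMARY KEY, CategoryName TEXT);
CREATE TABLE DimProduct (
    ProductKey  INTEGER PRIMARY KEY,
    ProductName TEXT,
    CategoryKey INTEGER REFERENCES DimCategory(CategoryKey)
);
CREATE TABLE Sales (
    ProductKey INTEGER REFERENCES DimProduct(ProductKey),
    UnitsSold  INTEGER
);
INSERT INTO DimCategory VALUES (10, 'Bikes'), (20, 'Accessories');
INSERT INTO DimProduct  VALUES (1, 'Road bike', 10), (2, 'Helmet', 20), (3, 'Gloves', 20);
INSERT INTO Sales VALUES (1, 2), (2, 5), (3, 4), (1, 1);
""")

# Drilldown from category to product follows the preserved hierarchy:
# Sales -> DimProduct -> DimCategory.
snowflake_rows = snow.execute("""
    SELECT c.CategoryName, p.ProductName, SUM(s.UnitsSold)
    FROM Sales AS s
    JOIN DimProduct  AS p ON p.ProductKey  = s.ProductKey
    JOIN DimCategory AS c ON c.CategoryKey = p.CategoryKey
    GROUP BY c.CategoryName, p.ProductName
    ORDER BY c.CategoryName, p.ProductName
""").fetchall()
```

The extra join is the retrieval complexity the last bullet refers to; the hierarchy itself is stored only once.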
Granularity
SALES
SalesRecID
FDOWeek
ProductID
CategoryID
CustomerID
SalesTerrID
SalesRepID
UnitsSold
SalesPrice
ValueSold
TotalDiscounts
• The granularity of the fact tables is critical
• There are alternative levels of granularity
– Finer granularity → more detail, more records
Use SalesDate instead of SalesMonth
– Coarser granularity → less detail, fewer records
Use SalesMonth instead of SalesDate
• Finer granularity can be aggregated in the DW to find
the coarser granularity values
• Coarse granularity cannot be decomposed
• Granularity decisions are made for each of the FKs
from the dimension tables
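The one-way nature of granularity can be sketched in a few lines. The `daily_units` figures and the `roll_up_to_month` helper are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fine-grained facts: units sold per day
daily_units = {
    date(2024, 1, 30): 4,
    date(2024, 1, 31): 6,
    date(2024, 2, 1): 5,
}


def roll_up_to_month(daily):
    """Aggregate daily values to the coarser (year, month) grain."""
    monthly = defaultdict(int)
    for day, units in daily.items():
        monthly[(day.year, day.month)] += units
    return dict(monthly)


monthly_units = roll_up_to_month(daily_units)
# There is no inverse: the January total alone cannot tell us
# how the units were spread across individual days.
```

Storing the daily rows keeps both views available; storing only the monthly totals does not.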
Fact Tables (Part 2)
SALES: TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CategoryKey, CustomerKey, SalesTerrKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts
• Identifying Fact Tables and their facts is
an art
• No obvious mapping from OLTP tables
to Fact or Dimension Tables
• The same DB table can contribute to
multiple fact tables
• Requires analysis to discover central concepts that will
become fact tables
– Decision maker interviews
– Reporting requirements
Fact Tables (Part 2—cont.)
SALES: TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CategoryKey, CustomerKey, SalesTerrKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts
• Look for a logical concept or event that the
measures of interest describe
– A sale (invoice)
– An order (purchase order)
– An enrollment (college DB)
• The concept/event should support the requirements
• The event is likely to be based on an OLTP table
– Not every OLTP table will become a fact table
• This concept/event will form the foundation for a fact
table
Fact Tables--Measures
SALES: TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CategoryKey, CustomerKey, SalesTerrKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts
• Measures are the facts to be recorded for
each row in the fact table
• Measures are often additive
– UnitsSold, TotalDiscounts, ValueSold
• Some are not additive
– SalesPrice
• Sometimes nonadditive measures are transformed into
additive measures
– ValueSold = (UnitsSold * SalesPrice) - TotalDiscounts
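A small worked sketch of the additive/nonadditive distinction, using two made-up fact rows. It applies the ValueSold formula above and shows why summing SalesPrice gives a number with no business meaning.

```python
# Two hypothetical fact rows
rows = [
    {"UnitsSold": 2, "SalesPrice": 10.0, "TotalDiscounts": 1.0},
    {"UnitsSold": 8, "SalesPrice": 5.0, "TotalDiscounts": 0.0},
]


def value_sold(row):
    """ValueSold = (UnitsSold * SalesPrice) - TotalDiscounts"""
    return row["UnitsSold"] * row["SalesPrice"] - row["TotalDiscounts"]


# Additive measures can simply be summed across rows.
total_units = sum(r["UnitsSold"] for r in rows)
total_value = sum(value_sold(r) for r in rows)

# Summing a price is arithmetically possible but meaningless.
summed_price = sum(r["SalesPrice"] for r in rows)

# A sensible price figure is derived from the additive totals instead.
average_net_price = total_value / total_units
```

Here `summed_price` is 15.0, which describes nothing, while the average net price per unit comes out of the two additive totals.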
Fact Tables—Measures (cont.)
SALES: TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CategoryKey, CustomerKey, SalesTerrKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts
• Measures may come from several
sources—often not just values from a
single OLTP source table
• Other candidates in our example
– COGS
– CurrentInterestRate
– CompetitorPrice
– GrossMargin
– NetMargin
– ShippingCost
– ShippingWeight
Fact Tables--Dimensions
• Dimensions are ways of looking at the data
– Users may indicate they look at {fact table subject}
"by" {dimension name}
– Sales by week
– Sales by customer
– Sales by product category
• Dimensions lead us to Dimension Tables
– Descriptive attributes about the dimension
– Primary key that the fact table references as a foreign key
Dimension Tables
• Dimension tables are often based
on an OLTP entity
• Denormalized to include descriptive
attributes from other tables
– Product might include
• SupplierName
• CategoryName
• SubCategoryName
• SupplierCountry
• In a snowflake schema, related hierarchical
information may be kept in separate, linked dimension tables
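One way to picture the denormalization step is the sketch below, which flattens a product row and its related supplier, subcategory, and category rows into a single dimension row. The source table layouts, the ID columns, and the `build_dim_product` helper are assumptions for illustration.

```python
# Hypothetical normalized OLTP tables, keyed by their business IDs
suppliers = {1: {"SupplierName": "Acme", "SupplierCountry": "US"}}
categories = {10: {"CategoryName": "Bikes"}}
subcategories = {100: {"SubCategoryName": "Road", "CategoryID": 10}}
products = [
    {"ProductID": 7, "ProductName": "Road bike", "SupplierID": 1, "SubCategoryID": 100},
]


def build_dim_product(products, suppliers, subcategories, categories):
    """Copy descriptive attributes from related tables into one wide row."""
    dim = []
    for key, product in enumerate(products, start=1):
        supplier = suppliers[product["SupplierID"]]
        subcategory = subcategories[product["SubCategoryID"]]
        category = categories[subcategory["CategoryID"]]
        dim.append({
            "ProductKey": key,                  # artificial warehouse key
            "ProductID": product["ProductID"],  # business key kept as an attribute
            "ProductName": product["ProductName"],
            "SupplierName": supplier["SupplierName"],
            "SupplierCountry": supplier["SupplierCountry"],
            "SubCategoryName": subcategory["SubCategoryName"],
            "CategoryName": category["CategoryName"],
        })
    return dim


dim_product = build_dim_product(products, suppliers, subcategories, categories)
```

The repeated supplier and category names cost storage, but a user can filter or group by any of them without further joins.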
Dimension Tables—Primary Keys
DimCustomer
CustomerKey
CustomerID
SourceSystem
LastName
FirstName
:
• Dimension tables should always
be given an artificial identity PK—
even if there is a suitable OLTP
table PK
• If tables are ever loaded from multiple
sources the natural PK may become invalid
– E.g., merging sales data from two business units with
different databases
• Retain the business PK as an attribute in the dimension
table
• Possibly include source system identifier for the row
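A minimal sketch of why the artificial key matters when loading from more than one source. The two business units, their customer rows, and the `load_dim_customer` helper are hypothetical; the column names follow the DimCustomer layout above.

```python
from itertools import count


def load_dim_customer(sources):
    """Assign a fresh CustomerKey to every row, whatever its source ID is."""
    next_key = count(1)
    dim = []
    for source_system, customers in sources.items():
        for customer_id, last_name, first_name in customers:
            dim.append({
                "CustomerKey": next(next_key),   # artificial identity PK
                "CustomerID": customer_id,       # business PK retained as an attribute
                "SourceSystem": source_system,   # which system the row came from
                "LastName": last_name,
                "FirstName": first_name,
            })
    return dim


# Both units happen to use CustomerID 100 for different people.
sources = {
    "UnitA": [(100, "Lee", "Ann"), (101, "Diaz", "Raul")],
    "UnitB": [(100, "Okafor", "Chi")],
}
dim_customer = load_dim_customer(sources)
```

CustomerID 100 appears twice, yet each row has its own CustomerKey, and the (SourceSystem, CustomerID) pair still traces a row back to its origin.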
Dimension Tables—Time
• Time is a hugely common
"by" dimension
• Decide on time granularity
– Daily, Weekly, Hourly?
• You might consider two time
dimensions
– Day for the coarser categorization
– Hour for additional precision
DimDate
TimeKey
CalendarYear
CalendarQuarter
CalendarMonthName
CalendarNumberOfMonth
DayNumberOfMonth
DayNumberOfYear
DayNumberOfWeek
FiscalYear
FiscalQuarter
ManufacturingYear
ManufacturingQuarter
ManufacturingMonth
SeasonName
HolidayFlag
ThanksgivingWeekendFlag
PreChristmasFlag
DimTimeOfDay
HourNum
TimeOfDayName
Dimensions—Time (cont.)
DimDate: TimeKey, CalendarYear, CalendarQuarter, CalendarMonthName, CalendarNumberOfMonth, DayNumberOfMonth, DayNumberOfYear, DayNumberOfWeek, FiscalYear, FiscalQuarter, ManufacturingYear, ManufacturingQuarter, ManufacturingMonth, SeasonName, HolidayFlag, ThanksgivingWeekendFlag, PreChristmasFlag
• The time dimension table
maps from the measured time
attribute associated with the
fact table record to various
labels and aggregations
associated with that value
• Facilitates summarizing by
various aggregates with a
single time dimension measure
• The TimeKey PK is often a
datetime data type stored at date-level precision
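The mapping from one date value to its labels can be sketched as a row generator. This covers only some of the DimDate columns; the July fiscal-year start, the convention of naming a fiscal year after the calendar year in which it ends, and ISO weekday numbering (Monday = 1) are all assumptions, as are the helper names.

```python
from datetime import date, timedelta

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]


def dim_date_row(day, fiscal_year_start_month=7):
    """Derive calendar and fiscal labels for a single date."""
    # Assumed convention: the fiscal year is named for the year it ends in.
    starts_new_fiscal_year = (fiscal_year_start_month > 1
                              and day.month >= fiscal_year_start_month)
    fiscal_year = day.year + 1 if starts_new_fiscal_year else day.year
    fiscal_month = (day.month - fiscal_year_start_month) % 12 + 1
    return {
        "TimeKey": day,
        "CalendarYear": day.year,
        "CalendarQuarter": (day.month - 1) // 3 + 1,
        "CalendarMonthName": MONTH_NAMES[day.month - 1],
        "CalendarNumberOfMonth": day.month,
        "DayNumberOfMonth": day.day,
        "DayNumberOfYear": day.timetuple().tm_yday,
        "DayNumberOfWeek": day.isoweekday(),  # Monday = 1 ... Sunday = 7
        "FiscalYear": fiscal_year,
        "FiscalQuarter": (fiscal_month - 1) // 3 + 1,
    }


def build_dim_date(start, end):
    """One row per calendar day from start to end, inclusive."""
    days = (end - start).days + 1
    return [dim_date_row(start + timedelta(days=offset)) for offset in range(days)]


dim_date = build_dim_date(date(2024, 1, 1), date(2024, 12, 31))
```

A fact row only needs to carry the TimeKey; any of these labels can then be used to group it.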
Fact Tables--Granularity
• In the olden days granularity decisions were made at
the DW DB design stage
• Granularity decisions traded off
– Number of records and computational overhead
associated with more detailed granularity
– Lack of precision with coarser granularity
• Modern computational power supports finer
granularity
• Analysis Services provides support for fast computation
over large data sets
• Just don't go overboard
Fact Table Exercise #2
COURSE: DeptCode, CourseNo, Name, CreditHrs, LabHrs
SECTION: SectionID, DeptCode <AK>, CourseNo <AK>, SecNo <AK>, Term <AK>, Year <AK>, Room <FK1>, Days, Time, InstructorID <FK2>
ENROLLMENT: SectionID <FK1>, StudentID <FK2>, Grade <FK3>
STUDENT: StudentID, LastName, FirstName, …
GRADE: Grade, GradePts, …
(Relationships between the entities are labeled "Has")
• Expand entities around
the core of our
University ERD
– See next slide
• Consider two business goals
– Understand real credit hour revenue
– Understand classroom utilization
• Identify and design Fact and Dimension Tables
Fact Table Exercise #2 (Cont.)
Payment: PaymentID, StudentID, PaymentDate, PaymentAmt
Enrollment: SectionID, StudentID, Grade
SectionBook: ISBN, SectionID, QtyOrdered, QtySold
Textbook: ISBN, Title, Year, …, WholsalePrice, ListPrice
Student: StudentID, Lname, …, Resident Y/N
BrightFutureStat: StudentID, AcadYear, PercentCovered
Section: SectionID, DeptCode, CourseNum, SecNum, Term, Year, Days, Time, InstructorID, Capacity, BldgID, RoomNum
Course: DeptCode, CourseNum, CrHr, Title, Description, LecHr, LabHr, DeptID
Room: BldgID, RoomNum, Type, Capacity
Building: BldgID, Name, Floors, SqFt
Department: DeprtID, Name, ChairID, DeptID, OfficeBldg, OfficeRm
Instructor: EmployeeID, LastName, FirstName, DeptID, CurrentSalary
SalaryHistory: EmployeeID, StartDate, EndDate, Salary, Status
FeeSchedule: AcadYear, InStateUGCrHr, OutStateUGCrHr, InStateGrCrHr, OutStateGrCrHr, InStateDCrHr, OutStateDCrHr, HealthFee, ActivityFee
External Data
• What external data might you want to have in a sales-oriented DW?
Next Time
• Transformations to load the DW from the source
OLTP (and other) data sources
– Automated support
– Do it yourself
• Analysis Services—putting our DW to work