04-1Design Case 2 Notes
Case:
Mail out
This modifies the previous exercise
A part of the day-to-day operations (OLTP system) is kept on an Oracle database. The section we
will look at is known as the Order Entry System.
Drawing of the system (ERD, not reproduced here). Entities shown: Categories, Customer, Warehouse,
Location, Salesrep, Product, Orders, Items (or Order line).
The following shows some of the attributes found in the above tables.

CUSTOMER TABLE
CID
Customer Name
Shipping Address
Billing Address
Phone
Credit Limit
YTD Sales
Last Year Sales
Contact Person
Comments
Terms of Sale
Sales Rep ID

PRODUCT TABLE
PID
Product Name
Category ID
Brand
Cost (unit cost)
Sell (unit price)
Size
Weight
Supplier ID
Warehouse Location ID
Taxable PST
Taxable GST
Minimum Order Quantity
ETC …

SALESREP TABLE
SID
Salesrep Name
Street Address
City
Province
Phone Home
Phone Cell
Email Address
Date Of Birth
Life Insurance Y or N
Long Term Disability Y or N
Salary
Commission Rate
Hire Date
OBJECTIVE:
Design and draw the STAR SCHEMA
– The attributes of the tables are shown above
Document1 by rt -- 12 March 2016
DEVELOPING A DESIGN FOR A DATA WAREHOUSE
STEP 1 – Understanding the business
What do you want to analyze?
Example - Who is the best customer?
- Are sales up or down?
- Who makes the most sales?
- And others …
Where does it come from?
Talk to the business executive to see how THEY measure the performance of the
business.
Note: Design is not usually done from an ERD. Management wouldn’t understand a complicated
ERD (the one above is an oversimplification). For our purposes in this tutorial it is easier to use the
ERD to see the system since there is no management to meet with and you are familiar with ERDs.
***Also need to know:
1. How big is the business?
- # of salesreps
- # of orders (especially this one: 100 per day or 15,000 per day?)
The size of the business affects the choice of grain.
2. What measures do they look at?
- How does the company measure its business?
Assumption:
We will assume the business is a wholesaler of Sporting Goods to sporting and outdoor stores as well
as large department stores that have sporting goods sections. An example might be Canadian Tire,
Sears, Wal-Mart or smaller stores like Outdoor Oriented Camping Stores.
The company has 3 warehouses across the country where it stores stock for faster shipment to its
customers.
The company does not manufacture any product but buys it from the suppliers it deals with.
The company has over 500 orders a day from each warehouse. Some orders are very big and some
orders have only a few lines of products on them. The size of the orders from retail stores varies from
$50 (but could be smaller) to over $100,000.
Measures:
The normal business measures are Revenue, Costs of Product, Costs of operations, Profits and
Quantity Sold.
ASIDE:
In the example above, Total Revenue (or Total $ Sales) and Quantity Sold would be measures in the
system, but sometimes costs of operations are not. The total $ cost of product sold is in the system,
but the cost of processing an order, or the cost of operations, is not so easily found. It is more likely
that those costs will be lumped into a bigger category of expenses. For example, it takes vehicles,
people, and warehouse heating, just to name a few, to process an order. That kind of information may
still be analyzed, but the information needed to do that will come from a different part of the OLTP
system. It is harder to know what percentage of the President’s salary is apportioned to each product
sold. As a simple example, if the President earns $1,000 a week and we sold 1,000 items, you could say
that $1 has been added to the cost of each item. However, if next week the business has sales of 275,
or zero because it is seasonal, that salary has to be apportioned differently. This is far too difficult
to track in this way and would distort reporting, which is the basis of analysis.
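The arithmetic in the aside can be sketched in a few lines of Python. This is only an illustration of why the per-item figure is unstable; the function name `overhead_per_unit` is made up for this sketch, and the numbers are the ones used in the aside.

```python
def overhead_per_unit(weekly_salary, units_sold):
    """Spread one week's salary evenly over the units sold that week."""
    if units_sold == 0:
        # Nothing was sold, so there is no item to carry the cost.
        return None
    return weekly_salary / units_sold


# Same salary, three different weeks of sales (1000, 275 and 0 items).
for units in (1000, 275, 0):
    print(units, overhead_per_unit(1000, units))
```

The same $1,000 salary adds $1 per item in the first week, roughly $3.64 per item in the second, and cannot be apportioned at all in the third, which is why this kind of cost is usually left out of the measures.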
STEP 2 - GRAIN
How to determine the grain?
The easiest way is probably to use the measures.
- Revenue is a measure ($ Sales or Total $ Sales for the period of time)
What time periods does management want to compare the measure over? Would management want
the measure reported on a monthly, daily, weekly, or yearly basis? What is a reasonable time period
that would show enough detail for decision making? There has to be a balance between too much
detail and too little detail. For example, yearly revenue is too little detail. Yearly revenue can also be
obtained at any time from the OLTP system, so having a data warehouse would not add extra
value. Having sales reported on a daily basis, and certainly on an hourly basis, would be far too much
detail and would not improve the decision making. It is really hard to compare revenue on a daily
basis over a five-year period. That’s over 1,500 entries.
For this example we have decided to choose the time period as Month.
- Do we want Revenue broken down also by SALESREP, CUSTOMER LOCATION?
The designer, in consultation with management, needs to understand what information would be of
value to the management of the company. Since we don’t have management available to ask for our
exercise, I would suggest that the company would like the measures broken down by customer, by
customer location, by sales rep, by product and by product category.
The choices of how the Revenue or Total $ Sales is broken down determine the grain.
Assume the GRAIN chosen is as follows:
- Monthly Revenue (or Total $ Sales), by product, product category, by salesrep, by
customer and customer location.
Based on the above grain:
Can you do analysis by order#?
- NO -> data will be kept on a MONTHLY basis
- Not by order
If you want analysis by order, then use the OLTP system, because the data there is already by order.
You will discover that the data warehouse is not broken down by order, because there is little
decision value to be gained.
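To make the effect of the grain concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `order_line` table, its columns and the sample rows are assumptions made up for this illustration; they are not the real Order Entry System tables. The query rolls order lines up to one row per month, product, salesrep and customer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical OLTP order lines: one row per product on an order.
CREATE TABLE order_line (
    order_no INTEGER, order_date TEXT,
    cid INTEGER, sid INTEGER, pid INTEGER,
    qty INTEGER, sell REAL
);
INSERT INTO order_line VALUES
    (1001, '2016-01-05', 1, 10, 100, 2, 50.0),
    (1002, '2016-01-20', 1, 10, 100, 3, 50.0),
    (1003, '2016-02-02', 1, 10, 100, 1, 50.0);
""")

# Roll the order lines up to the chosen grain. order_no is not selected,
# so it no longer exists in the result.
fact_rows = con.execute("""
    SELECT strftime('%Y-%m', order_date) AS month, pid, sid, cid,
           SUM(qty * sell) AS revenue,
           SUM(qty)        AS quantity_sold
    FROM order_line
    GROUP BY month, pid, sid, cid
    ORDER BY month
""").fetchall()

for row in fact_rows:
    print(row)
```

The two January orders collapse into a single row, which is why a question about one specific order has to go back to the OLTP system.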
STEP 3 - Determine Dimension tables and attributes
DISCUSS DIMENSIONS, meaning -> WHAT TO KEEP from the OLTP
Given the following 6 breakdowns of the measures as decided upon from the grain we need to
determine the dimensions.
TIME
PRODUCT
PRODUCT CATEGORY
SALES REP
LOCATION OF CUSTOMER
CUSTOMER
Remember that in a star schema we would like all the dimension tables to be connected to the fact
table by a single join in order to enhance performance. Looking at the above six items, are any of
them closely related on a one-to-many basis? You can see that product category and product are
related on a one-to-many basis. That would mean that an analysis by product category
would require a join between the product category table, the product table, and the fact table, along
with any other table required in the analysis. We can eliminate one of the joins by denormalizing the
tables product category and product.
When looking at the ERD we can often see areas to be denormalized. You
will notice in the OLTP system that when you see a hierarchy of 1:M relationships, these can be
denormalized. If we decide to use those tables in a DW, then they should be denormalized.
Looking at the remaining five breakdowns, ask yourself again: is there any kind of close relationship
between two of them? It would appear that customer and location of the customer, such as
city, will be closely related. In fact, that detail in the OLTP is stored in one table and would also be
one table in the data warehouse.
The remaining dimensions are as follows:
TIME
PRODUCT includes PRODUCT CATEGORY
SALES REP
CUSTOMER includes LOCATION
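As a rough sketch of what denormalizing PRODUCT and PRODUCT CATEGORY means in practice, the following uses Python's sqlite3 module. The table names `category`, `product` and `product_dim`, and the sample rows, are invented for this illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Normalized OLTP side: the product row only carries the category ID.
CREATE TABLE category (catid INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product (
    pid INTEGER PRIMARY KEY, product_name TEXT, brand TEXT,
    catid INTEGER REFERENCES category(catid)
);
INSERT INTO category VALUES (1, 'Camping'), (2, 'Hockey');
INSERT INTO product VALUES
    (100, 'Tent',  'BrandA', 1),
    (101, 'Stove', 'BrandB', 1),
    (200, 'Stick', 'BrandC', 2);

-- Denormalized DW side: the category name is copied onto every product row,
-- so the fact table reaches it with a single join to product_dim.
CREATE TABLE product_dim AS
    SELECT p.pid, p.product_name, p.brand, c.category_name
    FROM product p
    JOIN category c ON c.catid = p.catid;
""")

product_dim = con.execute(
    "SELECT pid, product_name, category_name FROM product_dim ORDER BY pid"
).fetchall()
print(product_dim)
```

The category name is now repeated on every product in that category. That redundancy is the price paid for removing a join from every analysis by category.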
Let us now look at each of the tables to determine what attributes they would have.
DIMENSION -> TIME DIMENSION TABLE
Since there is no TIME table shown in the OLTP system, what do we use as dimension attributes?
Going back to the grain, the requirement was for MONTHLY analysis.
The business would also like to analyze things on a less detailed basis.
The following choices might be made for the TIME dimension table:
TIME
  TIME-KEY
  YEAR
  QUARTER
  MONTH
YEAR, QUARTER and MONTH form a hierarchy, or levels, that lets you drill up or drill down:
  UP   -> summary views
  DOWN -> detail views
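Because the OLTP system has no TIME table, the rows of the time dimension have to be generated. A minimal sketch in Python follows; the function name `build_time_dimension` is made up, the key is assumed to be a simple running number, and quarters are assumed to be calendar quarters.

```python
def build_time_dimension(start_year, end_year):
    """Return one (time_key, year, quarter, month) row per calendar month."""
    rows = []
    time_key = 1
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            quarter = (month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
            rows.append((time_key, year, quarter, month))
            time_key += 1
    return rows


time_dim = build_time_dimension(2012, 2016)
print(len(time_dim), time_dim[0], time_dim[-1])
```

Five years at a monthly grain is only 60 rows. Drilling up from MONTH to QUARTER or YEAR is then a GROUP BY on a different column of the same table.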
DIMENSION -> CUSTOMER
– What to keep from the list of attributes available.
The company sells sporting goods to stores, not directly to people. As an aside, students often want to
keep things like GENDER. First, stores don’t have gender, and second, there is no gender in the OLTP
system. You have to look at what is available in the OLTP system and determine what is useful
based on what management wants to look at.
Example of a few of the attributes in the OLTP table for customer:
CID
Customer Name
Shipping Address
Billing Address
Phone
Credit Limit
YTD Sales
Last Year Sales
Contact Person
Comments
Terms of Sale
Sales Rep ID
Which ones would you keep?
NOTE: For the moment I will leave LOCATION aside and deal with it later as a separate dimension, so
you can see the analysis of why it is included in CUSTOMER.
CID
  Yes
Customer Name
  Yes. On reports, customer names have more meaning than showing an ID.
Shipping Address
  No. Maybe you want to keep postal code, city or province, but not street.
  The depth of detail depends on how many customers there are in a city.
  Example: The Toronto Star newspaper may want a breakdown of businesses by postal code.
  A supplier of lumber would not have a lot of customers (lumber stores) in a postal code.
Billing Address
  No
Phone
  No. Any analysis by phone number will be the same as analysis by CID.
  The phone and ID have a 1 to 1 relationship <- important hint on what to keep
Credit Limit
  ??? Maybe, if the business wants revenue by credit limit. It is more likely to be wanted
  if it is by GOOD, POOR or several descriptive or range values and not by
  specific numbers. Example: revenue by customers' specific credit limits
  would generate numbers like 11,000 limit $285,634
YTD Sales
  No. This is data kept in the OLTP system so that when an employee like the
  sales rep, credit manager or marketing manager looks at a customer they
  can see what sales are YTD. Perhaps the customer wants an extra
  discount for a special anniversary sale they have. The amount of
  business they have done so far this year may determine the amount of
  discount. This is not needed in the data warehouse.
Last Year Sales
  No. Same reasoning as YTD Sales.
Contact Person
  No. The contact person is needed by the sales rep but does not provide any
  additional analysis. It is not required in the data warehouse.
Comments
  No. Only if the comment is GOOD, POOR, EXCELLENT, as in credit limit.
Terms of Sale
  ??? This is things like NET 30. Probably not.
Sales Rep ID
  No. This is a foreign key connection done in the FACT table for analysis. It is not
  stored in the customer dimension table.
DIMENSION -> LOCATION
We decided earlier that this had to do with customer location and that it will be included in the
customer table. But I wanted to go through the analysis to show you what would happen if it was
originally thought to be a separate table.
In this case, are we talking about analysis by customer location?
IF the grain was
- Monthly Revenue (or Total $ Sales), by product, by salesrep, by location, by customer
The grain certainly indicates that we need to keep location information, so we probably need to keep
- CITY
- PROVINCE
Do we keep city and province in a separate dimension table?
Example:
LOCATION ID, CITY and PROVINCE ????
We could keep it in a separate table, but we need to ensure that in the fact table the customer is
associated with the correct customer location. This means the 2 fields must be validated as working
together. See the fact table example below.
TIME | CUSTOMER | LOCATION | PRODUCT | SALES REP | MEASURE | MEASURE | MEASURE
You can see that it would require extra coding to ensure that every entry in the fact table, and there
are millions of entries, has the customer ID and location ID matching for that customer. It would be
easier to keep the LOCATION data with the customer.
By keeping it in the customer table, the CUSTOMER table would have a built-in hierarchy for analysis:
- By CUSTOMER
- By CITY
- By PROVINCE
Plus whatever measure they use. Example: Revenue or Units shipped by Month by Province. The
PROVINCE value can be specifically limited in the WHERE clause to any combination of provinces, and
so can the TIME periods.
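A query of that shape can be sketched as follows with Python's sqlite3 module. The table names `customer_dim`, `time_dim` and `sales_fact`, the columns and the sample rows are all assumptions made up for this illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer_dim (cid INTEGER PRIMARY KEY, customer_name TEXT,
                           city TEXT, province TEXT);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INTEGER,
                       quarter INTEGER, month INTEGER);
CREATE TABLE sales_fact (time_key INTEGER, cid INTEGER, revenue REAL);

INSERT INTO customer_dim VALUES
    (1, 'Store A', 'Toronto', 'ON'), (2, 'Store B', 'Ottawa',  'ON'),
    (3, 'Store C', 'Calgary', 'AB'), (4, 'Store D', 'Halifax', 'NS');
INSERT INTO time_dim VALUES (1, 2016, 1, 1), (2, 2016, 1, 2);
INSERT INTO sales_fact VALUES
    (1, 1, 100.0), (1, 2, 50.0), (1, 3, 70.0), (1, 4, 30.0),
    (2, 1, 20.0),  (2, 3, 10.0);
""")

# Revenue by month by province, limited to two provinces and one year.
rows = con.execute("""
    SELECT t.year, t.month, c.province, SUM(f.revenue) AS revenue
    FROM sales_fact f
    JOIN customer_dim c ON c.cid = f.cid
    JOIN time_dim t ON t.time_key = f.time_key
    WHERE c.province IN ('ON', 'AB') AND t.year = 2016
    GROUP BY t.year, t.month, c.province
    ORDER BY t.month, c.province
""").fetchall()

for row in rows:
    print(row)
```

Because city and province sit on the customer row, no check is needed that a fact row's customer and location belong together. Changing the GROUP BY from province to city moves down one level of the hierarchy.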
DIMENSION -> SALESREP
Here are some sample attributes. Again, decide what to extract from the OLTP and move to the DW
dimension table for SALESREP
SID        Y
NAME       Y
ADDRESS    N     City maybe, but not likely. Don’t need analysis of where the sales person lives.
                 Would analysis of student revenue by where the professor lives make sense?
SALARY     ????
HIREDATE   N     All different
EXT#       N
… DOB      N     May keep a range
SALARY
Should you keep salary? If you do, then this is what you analyze:
- Monthly revenue by salesrep salary -> Is that important?
If all salesreps have different salaries, then a GROUP BY on salary will give the
same result as a GROUP BY on salesrep.
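The point about GROUP BY can be checked with a small sketch in plain Python. The salesrep IDs, salaries and revenue figures are invented for this illustration, and `revenue_by` is a hypothetical helper that totals revenue under whatever grouping key it is given.

```python
from collections import defaultdict

# sid -> salary; every salesrep has a different salary (made-up figures).
salesreps = {10: 52000, 11: 48000, 12: 61000}
# (sid, revenue) rows standing in for the fact table.
sales = [(10, 500.0), (11, 200.0), (10, 100.0), (12, 300.0)]


def revenue_by(key_of):
    """Total revenue per group, where key_of maps a sid to its group key."""
    totals = defaultdict(float)
    for sid, revenue in sales:
        totals[key_of(sid)] += revenue
    return dict(totals)


by_salesrep = revenue_by(lambda sid: sid)
by_salary = revenue_by(lambda sid: salesreps[sid])
print(by_salesrep)
print(by_salary)
```

Both groupings produce the same three totals; only the label on each group changes. A salary attribute adds analysis value only if it is turned into a few ranges shared by several salesreps.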
HIREDATE?
Might use a range value to denote experience
- Only of value if you were looking for analysis similar to
  - Are dollar sales affected by the experience level of the sales rep?
Another question that needs to be addressed: it might be nice to analyze by age, but is it done often
enough to warrant keeping that data?
Here is an idea to consider. Keeping hire date in the salesrep dimension table is not a lot of data. It is
when you make decisions that add a column to a fact table that the amount of data stored
increases considerably for little return.
DIMENSION -> PRODUCTS
PID                 Y
PRODUCT NAME        Y
CATID               See note 2
BRAND               Y – It is part of naming a product
COST (unit cost)    See note 1
SELL (unit price)   See note 1
SIZE                N
WEIGHT              N
SUPPLIER ID         N
WHSE LOCAT’N ID     N
TAXABLE             N
ETC
1 Should we keep the unit price, i.e. the selling price for one unit of product?
Ask the question.
What is the measure? Revenue
What is the descriptive attribute or dimension to create the grain?
- Revenue by unit price -> is this meaningful? NO
- Revenue by unit cost -> is this meaningful? NO
Here is what an analysis might mean:
- Sales in January for products costing 45.23 were $4,389.97
Would there be a meaningful question management may want answered?
Maybe -> does high-priced stuff generate more revenue than low-priced stuff?
Since there are a lot of values for unit price, it might be better to create a range.
Values such as 10, 100, 500, 1000, meaning values between 0 and 10, 11 to 100, and so on.
This is the same for CREDIT LIMIT in customer. It may be important to see if customers with low
credit limits or small customers generate more revenue. However, this is a decision made by the
business, so on the surface it looks like you wouldn’t keep it.
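If the business did ask for a price range, the banding could be done when the dimension is loaded. The sketch below uses Python's standard bisect module with the boundary values from note 1. The band labels, the choice to include each boundary value in the lower band, and the function name `price_band` are assumptions for illustration.

```python
from bisect import bisect_left

# Upper bounds of the bands, taken from the values 10, 100, 500, 1000.
UPPER_BOUNDS = [10, 100, 500, 1000]
LABELS = ["up to 10", "up to 100", "up to 500", "up to 1000", "over 1000"]


def price_band(unit_price):
    """Map a unit price to the label of the first band whose bound covers it."""
    return LABELS[bisect_left(UPPER_BOUNDS, unit_price)]


for price in (4.99, 10, 45.23, 999.0, 2500):
    print(price, "->", price_band(price))
```

Revenue by price band then has five groups instead of one group per distinct price. The same approach would apply to a credit limit band in the customer dimension.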
2 Should we keep Categories?
In the OLTP system CATID is an ID referencing a CATEGORY table. This is done because tables are
normalized for transaction performance and to reduce redundancy or space.
Since the relationship of category to product is 1:M, we can put the category name back into the
product dimension, that is, denormalize.
Some OTHER ATTRIBUTES you may want to consider from other tables.
In the ORDER TABLE
Would ORDER STATUS be kept?
Example:
- Cancelled orders
- Back ordered
NO, the orders are not kept
WAREHOUSE LOCATION?
The relationship of WAREHOUSE LOCATION to PRODUCT is 1:M
You could put WAREHOUSE LOCATION into PRODUCT and denormalize the two tables, if you wanted
analysis by warehouse.
Once decisions have been made
Now draw the STAR SCHEMA
NOTE: STAR SCHEMA has little meaning to management
-> Grain has meaning
Example:
Daily revenue by product, by salesrep, by location of warehouse, by customer…
These can be stated in terms that management can evaluate.
Daily revenue
- By product
- By category of product
- By brand
Daily revenue
- By brand by city
- By brand by customer … etc
Once the grain is approved
The implementation is next
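As a starting point for the implementation, here is one possible star schema that follows the decisions worked through in these notes, sketched with Python's sqlite3 module. It is not the only valid answer. All table and column names are assumptions, the fact table is at the monthly grain, and only the attributes marked as kept are included.

```python
import sqlite3

STAR_SCHEMA = """
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER, month INTEGER
);
CREATE TABLE product_dim (      -- category name denormalized into product
    pid INTEGER PRIMARY KEY, product_name TEXT, brand TEXT, category_name TEXT
);
CREATE TABLE salesrep_dim (
    sid INTEGER PRIMARY KEY, salesrep_name TEXT
);
CREATE TABLE customer_dim (     -- location kept with the customer
    cid INTEGER PRIMARY KEY, customer_name TEXT, city TEXT, province TEXT
);
CREATE TABLE sales_fact (       -- one row per month, product, salesrep, customer
    time_key INTEGER REFERENCES time_dim(time_key),
    pid      INTEGER REFERENCES product_dim(pid),
    sid      INTEGER REFERENCES salesrep_dim(sid),
    cid      INTEGER REFERENCES customer_dim(cid),
    revenue REAL, cost_of_product REAL, quantity_sold INTEGER,
    PRIMARY KEY (time_key, pid, sid, cid)
);
"""

con = sqlite3.connect(":memory:")
con.executescript(STAR_SCHEMA)

tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
# Column 2 of each foreign_key_list row is the referenced (parent) table.
fact_parents = {r[2] for r in con.execute("PRAGMA foreign_key_list(sales_fact)")}
print(tables)
print(fact_parents)
```

Every dimension is one join away from the fact table, which is the property the star schema is meant to provide.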