02-2-DATA MODELING and IMPORTANCE
Topics Covered:
WHY DO DATA MODELING?
MODELING
-- Entity Relational Model
-- Dimensional Modeling – Quick Look
-- Fact Table
-- Dimension Tables
Document1 by rt -- 12 March 2016
1 of 11
Why do data modeling?
Data modeling is the process of analyzing and representing the
"things" (entities) about which an enterprise must know.
(Source reported from Larry English's teachings as far back as 1994)
When they built the CN Tower they used a lot of plans; plans for HVAC, Telecommunications, and the
physical structure. The problem of building the Tower was complex.
When you designed your database in an earlier subject, you were drawing up the "plan" before
implementing an actual database.
Data modeling is
-- A plan
-- A guide
on how to build something.
As mentioned above, you are already familiar with data modeling from the design course. You
should remember doing
• Entity Attribute Lists
• Entity Relationship Diagrams
MODELING
Entity Relational Model
The ER model describes data as
• Entities,
• Relationships and
• Attributes
-- What are the advantages?
- Data exists in a highly normalized form
- The model seeks to drive all the redundancy out of the data
→ This means that a transaction that changes any data only needs to touch the database in one
place. → This speeds up transaction processing
-- What are the disadvantages?
For queries that span many records or many tables, the ER model is
- Too complex for users to understand
- Too complicated to be successfully navigated by DBMS software (code)
- Very slow when summarizing large amounts of data
- Resource-hungry, because multiple JOINs are required
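To make the JOIN problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The normalized tables, column names and sample rows are all invented for illustration; they are not taken from any system described in these notes. Notice that one simple management question (dollar sales by category and province) already needs five joins.

```python
import sqlite3

# Hypothetical normalized (ER-style) schema; every name and row is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category   (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product    (product_id INTEGER PRIMARY KEY, product_name TEXT,
                         category_id INTEGER REFERENCES category(category_id));
CREATE TABLE province   (province_id INTEGER PRIMARY KEY, province_name TEXT);
CREATE TABLE store      (store_id INTEGER PRIMARY KEY, store_name TEXT,
                         province_id INTEGER REFERENCES province(province_id));
CREATE TABLE orders     (order_id INTEGER PRIMARY KEY, order_date TEXT,
                         store_id INTEGER REFERENCES store(store_id));
CREATE TABLE order_line (order_id INTEGER REFERENCES orders(order_id),
                         product_id INTEGER REFERENCES product(product_id),
                         quantity INTEGER, unit_price REAL);

INSERT INTO category VALUES (1, 'Fruit'), (2, 'Frozen');
INSERT INTO product  VALUES (10, 'Bananas', 1), (11, 'Ice cream', 2);
INSERT INTO province VALUES (1, 'Ontario'), (2, 'Quebec');
INSERT INTO store    VALUES (100, 'Store A', 1), (101, 'Store B', 2);
INSERT INTO orders   VALUES (1000, '2016-03-01', 100), (1001, '2016-03-02', 101);
INSERT INTO order_line VALUES (1000, 10, 3, 2.0), (1000, 11, 1, 5.0), (1001, 10, 4, 2.0);
""")

# Dollar sales by category and province: five JOINs across six tables.
rows = conn.execute("""
SELECT c.category_name, pv.province_name, SUM(ol.quantity * ol.unit_price)
FROM order_line ol
JOIN orders   o  ON o.order_id     = ol.order_id
JOIN store    s  ON s.store_id     = o.store_id
JOIN province pv ON pv.province_id = s.province_id
JOIN product  p  ON p.product_id   = ol.product_id
JOIN category c  ON c.category_id  = p.category_id
GROUP BY c.category_name, pv.province_name
ORDER BY c.category_name, pv.province_name
""").fetchall()

for row in rows:
    print(row)
```

With only a handful of rows this runs instantly; the point is the shape of the query, which a business user would be unlikely to write unaided.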
So what is the alternative?
Dimensional Modeling – Quick Look
Dimensional modeling is a technique that helps the database designer
• Build information structures that match the way a business user asks for information.
The most important advantage is that
• It matches end users' needs for simplicity.
The purpose of dimensional modeling is to
• Provide the data warehouse and managed-query tools with a database definition that lends itself
to subject-oriented information processing.
Information can be re-arranged and
• Presented to the end user in multiple views, from different perspectives, and very fast.
Components of a DW
First, it is still a relational database. That means it has tables that can be accessed like any other
relational database: by SQL.
2 types of tables
• FACT TABLES
• DIMENSION TABLES
This is where the name dimensional model comes from.
2 styles of dimensional models
• STAR SCHEMA
• SNOWFLAKE SCHEMA
General description
• There is one dominant table (the fact table) in the center of the schema
• with multiple joins connecting it to other tables (the dimension tables).
DIMENSIONAL MODEL: STAR SCHEMA
The concentration in this course will be on the STAR Schema
(Diagram: the SALES FACT table joined to three dimension tables)
SALES FACT: Time key, Product key, Store key, Dollars sold, Units sold, Dollars cost
TIME DIMENSION: Time key, Day-of-week, Month, Quarter, Year, Holiday flag
PRODUCT DIMENSION: Product key, Description, Brand, Category
STORE DIMENSION: Store key, Store name, City or Province, Floor plan type
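The star schema in the diagram could be declared roughly as follows. This is a sketch only, using Python's sqlite3 module: the table and column names are adapted from the diagram labels, while the data types, the sample rows (including the brand "Acme") and the helper try_insert are assumptions made for the example. The fact table's three foreign keys together form its composite primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day_of_week TEXT, month TEXT,
    quarter TEXT, year INTEGER, holiday_flag TEXT);
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY, description TEXT, brand TEXT, category TEXT);
CREATE TABLE store_dim (
    store_key INTEGER PRIMARY KEY, store_name TEXT,
    city_or_province TEXT, floor_plan_type TEXT);
CREATE TABLE sales_fact (
    time_key     INTEGER NOT NULL REFERENCES time_dim(time_key),
    product_key  INTEGER NOT NULL REFERENCES product_dim(product_key),
    store_key    INTEGER NOT NULL REFERENCES store_dim(store_key),
    dollars_sold REAL, units_sold INTEGER, dollars_cost REAL,
    PRIMARY KEY (time_key, product_key, store_key));
""")

# Made-up sample rows: one member per dimension and one fact row.
conn.execute("INSERT INTO time_dim VALUES (1, 'Monday', 'March', 'Q1', 2016, 'N')")
conn.execute("INSERT INTO product_dim VALUES (1, 'Bananas', 'Acme', 'Fruit')")
conn.execute("INSERT INTO store_dim VALUES (1, 'Store A', 'Ontario', 'Standard')")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 120.0, 60, 80.0)")

def try_insert(row):
    """Hypothetical helper: attempt a fact insert and report what happened."""
    try:
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

duplicate_result = try_insert((1, 1, 1, 10.0, 5, 7.0))   # same key combination again
orphan_result = try_insert((1, 99, 1, 10.0, 5, 7.0))     # product 99 is not in product_dim
print(duplicate_result)
print(orphan_result)
```

Both extra inserts should be rejected: the first because the key combination already exists, the second because the product key has no matching dimension row.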
Fact Table
• Part of the star schema in a DW
Contains 2 things
Fact tables contain 2 things ONLY. Each row in a fact table is composed of a
• Set of keys and a
• Set of measures.
Keys
The keys are foreign keys that connect to the primary keys of the dimension tables. Together, the set of
keys forms a composite primary key for the fact table.
Measures
The most useful fact tables also contain one or more numerical measures, or facts, that occur for the
combination of keys that define each record. Measures represent how the business measures
performance. As you can see from the above diagram, total dollar sales, total dollar costs and total
units sold are ways of measuring the business. We can look at a measure such as dollars
sold broken down by various time periods, by products, by a store attribute, or by a combination of
any of these. The best and most useful facts (measures) are
1 Numeric
2 Additive
3 Continuously valued
This makes analysis of the data much easier, and distinguishes the attributes in the fact table from
the attributes in the dimension tables.
To repeat, an important point to understand is that the fact table contains numeric
values. It does not contain character fields such as name or province.
Additive (More on this topic later)
Additivity is crucial because data warehouse applications almost never retrieve a single fact
table record; rather, they fetch back hundreds, thousands, or even millions of these records at a time,
and often the most useful thing to do with so many records is to add them up and present a summary
by category for management to analyze.
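As a small illustration of "add them up and present a summary by category", the sketch below sums additive measures over many fact rows at once. It uses Python's sqlite3 module; the cut-down tables and all sample values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT, category TEXT);
CREATE TABLE sales_fact  (time_key INTEGER, product_key INTEGER, store_key INTEGER,
                          dollars_sold REAL, units_sold INTEGER);

-- Made-up sample data
INSERT INTO product_dim VALUES
    (1, 'Bananas', 'Fruit'), (2, 'Grapefruit', 'Fruit'), (3, 'Ice cream', 'Frozen');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 100.0, 50), (1, 2, 1, 40.0, 10), (1, 3, 1, 60.0, 12),
    (2, 1, 1, 110.0, 55), (2, 3, 2, 30.0, 6);
""")

# The query never looks at one fact row on its own: it adds up every row
# that falls into each category and returns one summary line per category.
summary = conn.execute("""
SELECT p.category, COUNT(*), SUM(f.dollars_sold), SUM(f.units_sold)
FROM sales_fact f
JOIN product_dim p ON p.product_key = f.product_key
GROUP BY p.category
ORDER BY p.category
""").fetchall()

for category, fact_rows, dollars, units in summary:
    print(category, fact_rows, dollars, units)
```

Because dollars sold and units sold are additive, the same SUM works whether the category covers five rows or five million.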
The fact table is the largest table in a data warehouse. It can hold millions of rows, up to several
hundred billion rows. Think of a company like Wal-Mart. For Wal-Mart to understand its customers,
each customer's purchasing habits need to be recorded, along with purchasing habits by store,
by time of week, and by province or region. The combinations of how inventory or products are sold
result in a massive amount of data extracted from EVERY purchase in EVERY store over a period of
years. As another example, consider the amount of data stored by Facebook or Google. These are
massive databases of customer information.
To show cumulative results, the data needs to be summarized, for example as Total $ Sales of a PRODUCT, Category or
Brand across many time periods.
So what is stored will be the different groupings of sales. What is NOT stored is a single sale of a
product on an order. Similarly, in a banking operation, the account balances over a period of a month are not
stored. These values would be unlikely to have value to the organization and so are not additive.
NOTE: It is a common error to include numeric data in the fact table that is not additive. You know
that all numeric data can be added. In fact, since people's names are really a set of bits, in theory
they could be added too. However, if we added up all the names in a class and displayed the result,
the result would be meaningless. So for a numeric
item to be put in the fact table, it must also be meaningful to add up the results. For example, if
we took the total dollar sales for each day and added those up for a whole week, the result would still
have meaning. You would have an important value, namely weekly dollar sales. However, if instead
the numeric quantity on hand (QOH) of a single product in a store was added to that of all the other products
in the same aisle, the total quantity of products in that aisle would have no business value or
meaning. For example, in a grocery store the QOH of all canned goods is 23,721. Is that all beans? Is
it large cans, small cans, mostly vegetables, or is it juice? The number 23,721 has no value to the
business.
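The contrast in the note can be tried out with a few lines of Python. All the figures below are invented for illustration, apart from the canned-goods total of 23,721 quoted above; the breakdown of that total into three products is an assumption.

```python
# Additive measure: daily dollar sales for one store (made-up figures).
daily_dollar_sales = {
    "Mon": 1500.0, "Tue": 1320.0, "Wed": 1410.0, "Thu": 1280.0,
    "Fri": 2100.0, "Sat": 2650.0, "Sun": 1740.0,
}
# Adding the days gives weekly dollar sales, which the business can use.
weekly_dollar_sales = sum(daily_dollar_sales.values())

# Numeric but not meaningfully additive: quantity on hand (QOH) per product
# in the canned-goods aisle (hypothetical breakdown of the 23,721 total).
qoh_by_product = {
    "Beans, large can": 10000,
    "Mixed vegetables, small can": 8721,
    "Apple juice, can": 5000,
}
# Python will happily add these up, but the total mixes unlike items and
# no longer says what is actually on the shelf.
aisle_qoh_total = sum(qoh_by_product.values())

print("Weekly dollar sales:", weekly_dollar_sales)
print("Aisle QOH total:", aisle_qoh_total)
```

Both sums are equally easy to compute. Only the designer can decide which of them means something to the business.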
Another star schema example
A better choice of names for a measure might be Total Dollar Cost.
Using a smaller, less detailed fact table with the following attributes as an example:
Time ID
Product ID
Store ID
Total Dollar Sales
Total Dollar Cost
Total Units Sold
A single row of data represents the total dollar sales, total dollar cost and total number of
units sold broken down by the keys: for 1 product, in 1 store, in 1 time period. Assuming the
time period is a week or a month, there would be many sales of that product in a single store.
The row is an accumulation of many transactions from the OLTP system.
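One way to picture this accumulation is a GROUP BY that rolls individual OLTP sales up into one fact row per time period, product and store. The sketch below uses Python's sqlite3 module; the oltp_sale table, its columns and the sample transactions are assumptions made for the example, and a real load process would be more involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical OLTP table: one row per individual sale.
CREATE TABLE oltp_sale (
    sale_id INTEGER PRIMARY KEY, week_id INTEGER, product_id INTEGER,
    store_id INTEGER, quantity INTEGER, unit_price REAL, unit_cost REAL);

-- Fact table with the attributes listed above.
CREATE TABLE sales_fact (
    time_id INTEGER, product_id INTEGER, store_id INTEGER,
    total_dollar_sales REAL, total_dollar_cost REAL, total_units_sold INTEGER,
    PRIMARY KEY (time_id, product_id, store_id));

-- Made-up transactions: product 1 sold several times in weeks 10 and 11.
INSERT INTO oltp_sale (week_id, product_id, store_id, quantity, unit_price, unit_cost) VALUES
    (10, 1, 1, 2, 3.0, 2.0),
    (10, 1, 1, 5, 3.0, 2.0),
    (10, 1, 2, 1, 3.0, 2.0),
    (11, 1, 1, 4, 3.0, 2.0);

-- Accumulate the transactions into one row per (week, product, store).
INSERT INTO sales_fact
SELECT week_id, product_id, store_id,
       SUM(quantity * unit_price), SUM(quantity * unit_cost), SUM(quantity)
FROM oltp_sale
GROUP BY week_id, product_id, store_id;
""")

fact_rows = conn.execute(
    "SELECT * FROM sales_fact ORDER BY time_id, product_id, store_id"
).fetchall()
for row in fact_rows:
    print(row)
```

Four transactions become three fact rows, because the two sales of product 1 in store 1 during week 10 are combined into a single row.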
Dimensional Model (DM) vs Entity Relation (ER) model
• The key to understanding the relationship between DM and ER is that a single ER diagram breaks
down into multiple DM diagrams, or 'stars'.
• Think of a large ER diagram as representing every possible business process within an
application. The ER diagram may have Sales Calls, Order Entries, Shipment Invoices, Customer
Payments, and Product Returns, all on the same diagram.
(Diagram labels: Returns, Shipments, Orders, Sales Contact, Payments)
Up until now you have been dealing with small sets of tables, maybe 12 to 15. However, corporations
have far more tables. On the next page is an example of a slightly more complicated ERD. It is by
no means a full representation of a large company: it contains very little of the accounting functions
that need to be stored, and few of the tables associated with employees, such as benefits.
Sample of larger ERD
Dimension Tables
Dimension tables hold textual descriptions of the dimensions, in other words the attributes of a
store, customer or product.
Looking at the star schema example a few pages back, you can see that if you were to add sample
data to the dimension table attributes, the data would be textual.
Examples for CUSTOMER would be customer name, address etc.
Examples for PRODUCT would be product name, product category name, brand name and other
text-like attributes.
The purpose of these attributes is to filter how the facts are presented. Why do numeric
measures like total dollar sales need to be filtered? Management wants something more than a total
sales figure. Management asks for total dollar sales in the last quarter by single product, by
product sub category, by product category or by brand. Sometimes it wants the data broken down further or
differently, by province or by store within categories of products. For example: what is happening in
sales of FROZEN ITEMS (fish, vegetables, packaged dinners, ice cream) by province and store for
last week compared to the same week the previous year? From this we can see that we can break
down large totals, like the total sales dollars for all products in this quarter, into sub amounts by
displaying total dollar sales by product category. In this way the textual attributes
are ways of breaking down or filtering the measures. Another example might be showing sales of
Bananas (text) and Grapefruit (text):
              January   February
Bananas        12,375     13,200
Grapefruit     10,101     10,917
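The Bananas and Grapefruit figures above could come from a query like the one sketched here, where the textual dimension attributes do the filtering (product name) and the breaking down (month). The sketch uses Python's sqlite3 module. The table layout, the third product and the split of each monthly total across two stores are all invented; only the four monthly totals come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE sales_fact  (time_key INTEGER, product_key INTEGER, store_key INTEGER,
                          sales INTEGER);

INSERT INTO time_dim VALUES (1, 'January'), (2, 'February');
INSERT INTO product_dim VALUES
    (1, 'Bananas', 'Fruit'), (2, 'Grapefruit', 'Fruit'), (3, 'Ice cream', 'Frozen');
-- The per-store split is invented; it is chosen to add up to the totals shown above.
INSERT INTO sales_fact VALUES
    (1, 1, 1, 7000), (1, 1, 2, 5375),
    (2, 1, 1, 6200), (2, 1, 2, 7000),
    (1, 2, 1, 5101), (1, 2, 2, 5000),
    (2, 2, 1, 5917), (2, 2, 2, 5000),
    (1, 3, 1, 800),  (2, 3, 1, 900);
""")

# Filter on a text attribute (product name) and break the measure down by month.
rows = conn.execute("""
SELECT p.product_name, t.month, SUM(f.sales)
FROM sales_fact f
JOIN product_dim p ON p.product_key = f.product_key
JOIN time_dim    t ON t.time_key    = f.time_key
WHERE p.product_name IN ('Bananas', 'Grapefruit')
GROUP BY p.product_name, t.time_key, t.month
ORDER BY p.product_name, t.time_key
""").fetchall()

for product_name, month, total in rows:
    print(f"{product_name:<12}{month:<10}{total:>8,}")
```

The Ice cream rows are in the fact table but never reach the result, because the text filter on product name excludes them.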
ASIDE: Notice how important time is in giving meaning to the data.
It is important to determine what type of textual data is to be stored in a dimension table. Not
all the data found in the OLTP system that describes a product is worth moving to the data warehouse.
For example, sales of products coloured yellow might not be worth tracking unless the product is cars
rather than groceries. Later on, with practice on some cases, you will be able to determine which
attributes should go into a dimension table.
Dimension tables have levels or categories that populate the table. (more later)
Another characteristic of dimension tables is that they have levels. In the example above we can see
that products are contained within a level called product sub category name, which in turn is contained in a
level called product category name. These levels are a way to provide different groupings for the
measures. Examples of levels:
Week → Month → Quarter → Year
City → Province → Region → Country
Customer → Branch → Region
ASIDE: These levels are due to denormalizing the OLTP tables into a single table.
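To see how levels give different groupings of the same measure, here is a sketch of a denormalized store dimension carrying the City, Province, Region and Country levels in one table. It uses Python's sqlite3 module. The stores, the sales figures, the grouping of provinces into regions and the helper function rollup are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One denormalized dimension table: every level is a column on the same row.
CREATE TABLE store_dim (
    store_key INTEGER PRIMARY KEY, store_name TEXT,
    city TEXT, province TEXT, region TEXT, country TEXT);
CREATE TABLE sales_fact (store_key INTEGER, total_dollar_sales REAL);

-- Made-up stores and sales; the region names are an invented grouping.
INSERT INTO store_dim VALUES
    (1, 'Store A', 'Toronto',   'Ontario',          'Central', 'Canada'),
    (2, 'Store B', 'Ottawa',    'Ontario',          'Central', 'Canada'),
    (3, 'Store C', 'Montreal',  'Quebec',           'Central', 'Canada'),
    (4, 'Store D', 'Vancouver', 'British Columbia', 'West',    'Canada');
INSERT INTO sales_fact VALUES (1, 100.0), (2, 50.0), (3, 70.0), (4, 30.0), (1, 20.0);
""")

LEVELS = ("city", "province", "region", "country")

def rollup(level):
    """Hypothetical helper: total dollar sales grouped by one level of the hierarchy."""
    if level not in LEVELS:  # only known column names are placed into the SQL text
        raise ValueError(f"unknown level: {level}")
    sql = f"""
        SELECT d.{level}, SUM(f.total_dollar_sales)
        FROM sales_fact f
        JOIN store_dim d ON d.store_key = f.store_key
        GROUP BY d.{level}
        ORDER BY d.{level}
    """
    return conn.execute(sql).fetchall()

for level in LEVELS:
    print(level, rollup(level))
```

Every level needs only the one join between the fact table and the dimension table, which is the benefit of keeping all the levels in a single denormalized table.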