02-5-HO Data Warehousing Fundamentals
This handout is a compact summary, so it may be hard to follow if you have had no previous exposure to data warehousing. If you have had exposure, it will provide a different way of saying things. The more explanations you read, the better you will understand the subject.
What is it about?
Scenario for a relatively small business (Not the Sears, Bell, Wal-Mart, or TD Bank size)
The company needs to keep growing in order to survive. The information that suggests this need came
from sources such as the business section of the Star, Globe & Mail and National Post. The trend in this
industry is to have mid-size corporations expand their markets by buying up competitors.
The information about trends comes mainly from internal sources within the company, and also from external sources such as trade magazines, business papers, business organizations and the Government of Canada (Statistics Canada).
Most information that allows for comparisons, trends or answers to questions is numerical.
A data warehouse then is software that allows for the storage of all this data from different sources in
numerical format that will allow business analysis to take place.
Requirements would be:
1 The information is available in a timely manner – fast
2 Users must be able to manipulate the data to obtain the needed analysis. No IT involvement should be needed to produce reports.
3 The format of the data presentation must be in a form easily understood. Charts, tables, graphs and lists have been common forms of presenting information in the past. Forms such as spreadsheets are excellent ways of portraying data.
4 The values in the data warehouse must be consistent with those from the day-to-day system.
5 Warehouse data must draw from multiple sources within the company to ensure a complete enterprise picture.
6 Processes need to be in place to gather and convert the data to this more usable form.
Later, when you use SQL Server Analysis Services, you will find that it provides the presentation tools that retrieve data quickly and allow the data to be manipulated so that it can be seen from different perspectives, such as different levels of detail.
With Excel and PivotTables it is possible to browse the data in different ways.
The integrity of the data, which ensures that decisions are based on reliable values, is enforced in the process of storing data into the data warehouse. That means the RDBMS that holds the data is loaded before the analysis tools can present the information.
Document1 by rt -- March 22, 2016
Comparing transaction database with a data warehouse
Transaction or OLTP (Online Transaction Processing) is the operation of running the business on a day-to-day basis. The system will tell me how many are enrolled in a class, what room is assigned and how
many people paid tuition as of this date. The processing of payments, enrollment or assigning room
usage is handled by application software that optimizes the transaction process.
The data warehouse database handles questions about historical trends.
Example of DW knowledge: If classes can only hold 40 students maximum then the historical trend of
180 students enrolling in September would require 4.5 classes, which isn’t possible. You would need 5
classes of 30 to 40 students. However, suppose when we looked at long-term analysis over the last 5
years the following appeared: that 5% of enrolled students for some reason either never show or attend
one class. Another 6% drop within one or two weeks and an additional 15% are shown to have almost
no attendance in the last 7 weeks. Knowing this means that the classes can be overbooked with 45 students each, because the 11% not attending at the beginning of the semester will result in classes of about 40 students. In fact, analysis shows that the later classes (after 3 PM) and the 8 AM classes tend to have the highest dropout rate. These could be overloaded even further.
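The arithmetic behind this example can be checked with a short sketch. The figures come from the text above; the variable names, and the conclusion that overbooking would reduce the number of classes needed, are illustrative assumptions rather than part of the original example.

```python
import math

# Figures taken from the enrollment example in the text.
expected_enrollment = 180
room_capacity = 40
early_attrition = 0.05 + 0.06   # never show (5%) + drop within two weeks (6%)

# Without trend data: plan strictly by room capacity.
classes_by_capacity = math.ceil(expected_enrollment / room_capacity)

# With trend data: book 45 per class and see how many are still attending
# once the early drop-off has happened.
booked_per_class = 45
still_attending = booked_per_class * (1 - early_attrition)

# Assumption for illustration: every class is overbooked to 45.
classes_with_overbooking = math.ceil(expected_enrollment / booked_per_class)

print(classes_by_capacity, round(still_attending), classes_with_overbooking)
```

The transaction system could only supply the first number; the attrition percentages come from looking at several years of history, which is the kind of question the data warehouse is built for.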
In a transaction database the values change often. The quantity on hand of a product changes with every order of the product (shipped out) and whenever the product level is replenished (received or built). These changes can occur many times per day or hour. This type of change makes the database volatile. A data warehouse has data that changes, but only at set periodic updates. Over time, there is new data to add to the data warehouse. An example might be adding sales for the day or month for each product. The new values added do not change the existing values.
The transaction database deals in details. The Sales Order department wants to know how many units of product X are in stock because a customer requires 10 of them. The sales order person does not need to know the average dollar investment tied up in product X during the last year, or the average turnover for the line of products that X belongs to. Data warehouses focus on higher-level questions rather than on questions such as whether customer 2345 bought product 52756 (a lawn mower) and how many were purchased. Questions about averages and total sales by time period, or profitability by grocery line within region or by store, mean that a lot of summarizing is needed. To do this type of analysis, the values need to be numeric.
Transaction databases provide the source data to feed the data warehouse. The transaction database
could be one or many databases.
Dimensions and Measures in Data Analysis
Data needs to be summarized, and the summary data provides information about how the business performs. These numeric values are called measures. Examples of measures would be Total Dollar Sales, Total Dollar Cost, Total Units Sold, units manufactured per day, and weight of goods shipped through a port.
When you want to do analysis the question becomes what measure should be used. Suppose we
wanted to know about units sold. The answer one would get might be 3 or it might be 3000. Both of
these figures are meaningless. They both are numeric and both are summarized or accumulations, but
they are still of little value.
Units Sold    3000
Units Sold       3
To make it more informative the question might become, how many units were shipped by year for the
last 4 years?
2009    3000
2008    2507
2007    2618
2006    2300
Suppose the question is sales by month for last year
January      100
February      98
March        215
etc
To a total of 3000
Suppose for example the company sells 3 products
Product             January   February   March
Chicken Perogies         25         30     120
Cheese Perogies          35         35      40
Potato Perogies          40         33      55
The example above is more informative to some people in the business. You can imagine if there are
60,000 products, that many senior levels of management won’t require all of the detail. Note that even at
the more detailed level the value of 25 Chicken Perogies is still a summary of the quantity sold for that
product in January.
As an aside the data could also be displayed differently. Which would you prefer?
Product             Month       Quantity Sold
Chicken Perogies    January                25
Chicken Perogies    February               30
Chicken Perogies    March                 120
Cheese Perogies     January                35
Cheese Perogies     February               35
Cheese Perogies     March                  40
Potato Perogies     January                40
Potato Perogies     February               33
Potato Perogies     March                  55
Now suppose you had 25 kiosks in malls that sell this product. The report would need to be tailored to
handle the changes.
            Store 1                                                Store 2
Product     Chicken Perogies  Cheese Perogies  Potato Perogies    Chicken Perogies  Cheese Perogies  Potato Perogies
January     2   3   4   3   4
February    3   35  3   3   3
March       12  4   5   2   5
etc
The more variables you wish to show, the more values appear in the tables.
For example:
Total sales by month for one year generated 12 values
Total sales by month for the 3 products gave 36 values in one year
Total sales for 25 stores for the 3 products by month would be 25 * 3 * 12 = 900 values
Do the same for the 60,000 products that a grocery chain might carry and the number of cells holding values becomes very large. There are too many values for a person to grasp.
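The counts above can be reproduced by listing every combination of members. This is a minimal sketch: the month abbreviations and store names are placeholders, and `cell_count` is a hypothetical helper, not something provided by any reporting tool.

```python
from itertools import product

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
products = ["Chicken Perogies", "Cheese Perogies", "Potato Perogies"]
stores = [f"Store {n}" for n in range(1, 26)]   # 25 kiosks

def cell_count(*dimensions):
    """Count the cells in a report that crosses every member of each dimension."""
    return sum(1 for _ in product(*dimensions))

print(cell_count(months))                    # 12
print(cell_count(products, months))          # 36
print(cell_count(stores, products, months))  # 900
```

Each extra variable multiplies the number of cells by its number of members, which is why reports grow so quickly.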
The three variables used (product, store and month) are called dimensions.
Although the above report has 3 dimensions, the number of dimensions can be higher. The question becomes how useful the report is. If you looked at Potato Perogies in store 17 by month, the report would be manageable: it still contains 3 dimensions, but with a restriction on some of them. (Using SQL for the retrieval, that would mean a WHERE condition.)
Note that each dimension added simply subdivides the same numeric measure.
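As a rough illustration of restricting dimensions with a WHERE condition, the sketch below uses Python's built-in sqlite3 module with an in-memory table. The table name, column names and unit values are invented for the example and do not come from the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (store INTEGER, product TEXT, month TEXT, units INTEGER)"
)
# Invented sample rows, for illustration only.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (17, "Potato Perogies", "January", 4),
        (17, "Potato Perogies", "February", 3),
        (17, "Cheese Perogies", "January", 3),
        (2, "Potato Perogies", "January", 5),
    ],
)

# Restrict two dimensions (store and product); report by the third (month).
rows = conn.execute(
    """
    SELECT month, SUM(units)
    FROM sales
    WHERE store = ? AND product = ?
    GROUP BY month
    """,
    (17, "Potato Perogies"),
).fetchall()

units_by_month = dict(rows)
print(units_by_month)
```

The report still involves all three dimensions, but two of them are pinned to a single member, so only one value per month remains.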
Interesting example
If you have 128 members in a single dimension (128 products in a product dimension) then there are 128 possible subdivisions of the value of a Total Dollar Sales measure. If the 128 products were instead spread over 64 product-grouping dimensions (just two members in each), the report would have 18,446,744,073,709,551,616 possible values, which is 2 multiplied by itself 64 times. It is perfectly fine to have many dimensions, or ways to analyze the measures, but be careful about adding dimensions to a report.
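The large figure can be verified directly: the number of possible cells is the product of the member counts of all the dimensions. A minimal check (the variable names are illustrative; `math.prod` requires Python 3.8 or later):

```python
import math

one_dimension = [128]              # a single product dimension with 128 members
sixty_four_dimensions = [2] * 64   # 64 dimensions with 2 members each

cells_one = math.prod(one_dimension)
cells_many = math.prod(sixty_four_dimensions)

print(cells_one)            # 128
print(f"{cells_many:,}")    # 18,446,744,073,709,551,616
```

The same 128 products give either 128 cells or 2 to the power of 64 cells, depending only on how they are arranged into dimensions.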
Hierarchies
If there are few products and the analysis is for only a few months a single report may be easy to
comprehend. With more products and longer time spans, humans need to reduce the number of values
they are looking at in order to comprehend the data. Grouping things does this. Another term might be
aggregating.
Things that belong together form a dimension. Products belong in one dimension and months in another.
These dimensions are independent column and row labels for values. If one is used as a column
heading and the other as a row heading a value will exist at the intersection of the two.
Suppose we have product line and product as two labels. Are they independent?
Product Line  18” gas  18” electric  Saw  Hammer  25’ rubber  25’ green  50’ rubber
Mowers             12            25
Tools                                 25      63
Hoses                                                    100       1234         540
Looking at the report, there can be no values at the intersection of Hoses and 18” gas mowers, so most of the cells are empty. This generates a ridiculous report. Adding Product Line as a separate dimension also means that the report grows substantially (see the earlier example on the number of values). Product Line and Product are not completely independent. A Product Line total is a grouping or aggregation of the products within the same Product Line. Therefore they belong in the same dimension.
The relationship between Product and Product Line is called a hierarchy. Product and Product Line each represent a degree of summarization and are referred to as levels.
One of the dimensions you will almost always have is the Time dimension. Depending upon the level of
detail or summarization you could have Day, Week, Month, Quarter and Year as the hierarchy structure.
Each of them is a level.
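A hierarchy can be pictured as a mapping from each member of the lower level to its parent in the level above. The sketch below rolls the product values from the report above up to product-line totals. The assignment of each value to a product is an assumption based on reading that report, and `roll_up` is a hypothetical helper.

```python
from collections import defaultdict

# Leaf-level values, keyed by product (assumed reading of the report above).
units_by_product = {
    '18" gas': 12, '18" electric': 25,
    "Saw": 25, "Hammer": 63,
    "25' rubber": 100, "25' green": 1234, "50' rubber": 540,
}

# The hierarchy: each product belongs to exactly one product line.
product_line_of = {
    '18" gas': "Mowers", '18" electric': "Mowers",
    "Saw": "Tools", "Hammer": "Tools",
    "25' rubber": "Hoses", "25' green": "Hoses", "50' rubber": "Hoses",
}

def roll_up(values, parent_of):
    """Aggregate lower-level values into totals for the level above."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)

units_by_line = roll_up(units_by_product, product_line_of)
print(units_by_line)
```

Because every product has exactly one parent, the product-line totals never need to be stored as a separate dimension; they can always be derived from the product level.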
Members
The term member applies to members of a level in a hierarchy or to all members of the dimension. The
members at the lowest level of detail are called leaf members.
Types of Hierarchies
Balanced
A balanced hierarchy has members at each level. The time dimension is the easiest example. Under every Year is a Quarter and under every Quarter is a Month. If a single Month has Days below it, then so do all other Months. Balanced does not mean that every member must have the same number of members below it, even though in the Time example every Year has four Quarters and every Quarter has three Months.
Unbalanced
These hierarchies may be similar to an employee hierarchy. A graphic might portray this better.
President
    VP Corporate
    VP Marketing
        Mgr Print Media
        Mgr TV Media
    VP Sales
        Mgr Sales
            Mgr Eastern Canada
            Mgr Western Canada
Notice that at some levels, such as VP, not every member has “children” below it. Notice also that it is harder to give some of the levels a name that describes the level. You can see that Manager applies to 2 different levels. Suppose also that there were salespeople directly under Mgr Sales and also under Mgr Eastern Canada and Mgr Western Canada. Then some salespeople would be at the same level as the regional managers, and it would be hard to call this level Regional Managers. What would you call this level?
Leaf members are always the ones with no children below them.
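One way to see that the hierarchy is unbalanced is to find the leaf members and measure how far each one is from the top. The reporting lines below are an assumed reading of the organization chart, and the names `reports_to` and `depth` are made up for this sketch.

```python
# Org chart as member -> parent (None marks the top of the hierarchy).
# Assumed reporting lines, for illustration only.
reports_to = {
    "President": None,
    "VP Corporate": "President",
    "VP Marketing": "President",
    "VP Sales": "President",
    "Mgr Print Media": "VP Marketing",
    "Mgr TV Media": "VP Marketing",
    "Mgr Sales": "VP Sales",
    "Mgr Eastern Canada": "Mgr Sales",
    "Mgr Western Canada": "Mgr Sales",
}

def depth(member):
    """Number of steps from this member up to the top of the hierarchy."""
    steps = 0
    while reports_to[member] is not None:
        member = reports_to[member]
        steps += 1
    return steps

# A leaf member is one that never appears as anybody's parent.
parents = {p for p in reports_to.values() if p is not None}
leaf_depths = {m: depth(m) for m in reports_to if m not in parents}
print(leaf_depths)
```

In a balanced hierarchy every leaf would have the same depth. Here the leaves sit at three different depths, which is what makes the levels hard to name.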
Balanced Ragged
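Some hierarchies are balanced, in that all leaf members belong to the same level, but appear to be unbalanced because some branches skip an intermediate level. These are called ragged hierarchies.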
North America
    USA
        Eastern Group
            New York
            Pennsylvania
    Canada
            Ontario
    Mexico
            Yucatan
In Analysis Services, you can define balanced and unbalanced hierarchies, whether they are ragged or not. A dimension will always have leaf members; otherwise it would be a table with no content. The hierarchy structure simply defines how the values of leaf members are summarized.
DATA WAREHOUSE STRUCTURE
Analysis Services makes it easy for a client application to create reports that use multiple dimensions, but the values displayed in the report must come from a relational data warehouse. Analysis Services assumes that you already have a relational data warehouse. The process of gathering data, transforming and cleaning the data, and loading the data into a data warehouse (ETL) requires considerable skill and work.
For Analysis Services to work it needs a number of measures in a table. It needs a fact table in the data
warehouse and it needs the fact table to be in a certain form.
FACT TABLE
A fact table is a table in the relational data warehouse that stores the detailed values for measures, or facts. A fact table that stores Dollar Sales and Units Sold by Province, by Product and by Month has five columns like the following:
Province   Product                      Month   Dollar Sales   Units Sold
BC         18” 4HP Gas Lawn Mower       June         7600.00           38
BC         18” 4.5 HP Gas Lawn Mower    June         3000.00           10
ON         18” 4HP Gas Lawn Mower       June        22000.00         1100
ON         18” 4.5 HP Gas Lawn Mower    June         3000.00           10
ON         22” 5 HP Gas Lawn Mower      June         1000.00            2
BC         18” 4HP Gas Lawn Mower       July         2600.00           13
BC         18” 4.5 HP Gas Lawn Mower    July         2400.00            8
Etc ..
Although the above table contains descriptive names to make it easier for us to understand right now, in practice each of the key columns would contain an integer value. The descriptive names, if they were needed in a report, would come from the dimension tables. Because the fact table contains many rows, possibly millions, the use of integers can substantially reduce the space requirements.
In the sample table above, the three columns Province, Product and Month represent keys (the 3 parts of a composite primary key). The remaining two columns, Dollar Sales and Units Sold, are called measures.
The number of measures found in a fact table depends upon the application. For example, a warehouse might contain two measure columns, Dollar Sales and Units Sold. A manufacturing operation may have different measures, such as units manufactured and the number of defects.
To be usable by Analysis Services, the fact table must contain rows at the lowest level of detail that you might want to retrieve for a measure. That means the fact table contains rows only for leaf members of the dimension tables. If the fact table contains aggregates or summaries, such as quarter and year totals, then Analysis Services cannot use the fact table. As an example, if a Province dimension includes the hierarchy Province, Region and Country, only the members from the Province level appear in the fact table. Analysis Services will create all the summarized values. In the fact table, specifying a single leaf member for each dimension should identify a single row.
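A fact table with integer keys and a composite primary key might look like the sketch below, using Python's built-in sqlite3 module. The table and column names and the province and month key values are hypothetical; the product keys 57400 and 57450 and the sales figures are taken from the handout's sample tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales_fact (
        province_key INTEGER NOT NULL,
        product_key  INTEGER NOT NULL,
        month_key    INTEGER NOT NULL,
        dollar_sales REAL    NOT NULL,
        units_sold   INTEGER NOT NULL,
        PRIMARY KEY (province_key, product_key, month_key)
    )
    """
)

# Hypothetical keys: province 1 = BC, 2 = ON; month 6 = June, 7 = July.
rows = [
    (1, 57400, 6, 7600.00, 38),
    (1, 57450, 6, 3000.00, 10),
    (2, 57400, 6, 22000.00, 1100),
    (1, 57400, 7, 2600.00, 13),
]
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", rows)

def insert_is_rejected(row):
    """Return True if the row violates the composite primary key."""
    try:
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False

# A second row for BC / product 57400 / June repeats an existing key.
duplicate_rejected = insert_is_rejected((1, 57400, 6, 1.00, 1))
row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(duplicate_rejected, row_count)
```

The composite key enforces the rule stated above: one leaf member from each dimension identifies exactly one row.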
DIMENSION TABLES
A fact table contains only the lowest level of detail. If you have eight products grouped into two product lines, the fact table does not contain any rows for the product lines, only rows for the products. The same is true of quarter totals or year totals: the fact table holds only values for the months, which are its lowest level of detail. The information necessary to create summaries by product line will be contained in the Product table. The information needed to summarize by quarter or year will be stored in the Time table.
A dimension table contains one row for each leaf member of the dimension. A product dimension table with three products will have three rows. A dimension table contains a numeric key column, known as the primary key, that uniquely identifies each member. A dimension table also contains the names of the members. If the dimension table is involved in a balanced hierarchy, the dimension table will also have, for each higher level, an additional column that gives the parent of each member.
Example: Note there are several hierarchies in the table below.
PRODUCT TABLE
Product ID   Product Description          Product Sub Category   Product Category   Department
57400        18” 4HP Gas Lawn Mower       Lawn Mower             Lawn & Garden      Hardware
57450        18” 4.5 HP Gas Lawn Mower    Lawn Mower             Lawn & Garden      Hardware
57500        22” 5 HP Gas Lawn Mower      Lawn Mower             Lawn & Garden      Hardware
60400        18” Gas Snow Blower          Snow Blower            Lawn & Garden      Hardware
Etc …
99607        12.6 Volt Cordless Drill     Cordless Tools         Power Tools        Hardware
In the above Product table, Product ID is the primary key for the table and, for performance purposes, this primary key column should be indexed. The primary key column of each dimension table must match one of the key columns in the related fact table. A key such as 57400 will appear only once in the product table, but in the fact table it may appear many thousands of times. The relationship between a dimension table and a fact table is one-to-many.
Notice that in the above product table, the descriptions under Product Sub Category, Product Category and Department repeat themselves. Normally, in an OLTP system the tables will be normalized and what is a single product table above would in fact be four different tables. The product table above is denormalized, and the hierarchy forms a chain of one-to-many relationships. The denormalization is meant to increase the speed of processing queries, as it requires fewer joins.
In some data warehouses, the tables are kept separate rather than combined into a single table (a snowflake design). Both of these methods (star and snowflake), and the theoretical advantages of each, will be discussed in this course.
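With the denormalized (star) layout, summarizing by any level of the product hierarchy needs only one join between the fact table and the dimension table. The sketch below uses sqlite3 with hypothetical table and column names; the product rows follow the Product table above, while some of the unit figures are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product_dim (
        product_id   INTEGER PRIMARY KEY,
        description  TEXT,
        sub_category TEXT,
        category     TEXT,
        department   TEXT
    );
    CREATE TABLE sales_fact (
        product_id INTEGER REFERENCES product_dim (product_id),
        month      TEXT,
        units_sold INTEGER
    );
    INSERT INTO product_dim VALUES
        (57400, '18in 4HP Gas Lawn Mower', 'Lawn Mower', 'Lawn & Garden', 'Hardware'),
        (60400, '18in Gas Snow Blower', 'Snow Blower', 'Lawn & Garden', 'Hardware'),
        (99607, '12.6 Volt Cordless Drill', 'Cordless Tools', 'Power Tools', 'Hardware');
    -- Unit figures below are partly invented, for illustration only.
    INSERT INTO sales_fact VALUES
        (57400, 'June', 38), (57400, 'July', 13),
        (60400, 'June', 2),  (99607, 'June', 5);
    """
)

# One join is enough to summarize at the Product Category level.
units_by_category = dict(
    conn.execute(
        """
        SELECT d.category, SUM(f.units_sold)
        FROM sales_fact AS f
        JOIN product_dim AS d ON d.product_id = f.product_id
        GROUP BY d.category
        """
    ).fetchall()
)
print(units_by_category)
```

In a snowflake design the category would live in its own table, so the same question would need additional joins through the sub category and category tables.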
OTHER DIMENSION TABLE STRUCTURES (within Analysis Services)
Most data warehouses include a Time dimension. Sometimes the fact table will store a date column as the key for the Time dimension rather than an integer key. In that case there is no need for a separate Time dimension table. Analysis Services can still use the date column from the fact table as the source for a Time dimension and can also build hierarchy levels such as month, quarter and year.
An employee table has an unbalanced hierarchy. In the employee dimension table every member has its own row, as all members are employees. Each employee has a manager; however, the manager is also an employee. If we were using a snowflake design, the manager would be in a separate table. In the employee dimension table, the parent member simply points back to another row of the same dimension table. This is called a parent-child dimension, because the parent member and the child member are in the same table. A parent-child dimension provides a great deal of flexibility in how a hierarchy is organized.
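A parent-child dimension can be sketched as a table in which each row carries the key of its parent row. The employee keys, reporting lines and sales figures below are hypothetical, and the recursive function only illustrates the idea of rolling values up through such a table; it is not how Analysis Services is implemented.

```python
# Each row: employee key -> (name, manager key). The manager key points at
# another row of the same table; None marks the top of the hierarchy.
employee_dim = {
    1: ("President", None),
    2: ("VP Sales", 1),
    3: ("Mgr Sales", 2),
    4: ("Mgr Eastern Canada", 3),
    5: ("Mgr Western Canada", 3),
}

# Hypothetical fact values: dollar sales recorded against an employee key.
sales_fact = {4: 1200.0, 5: 800.0, 3: 500.0}

def rolled_up_sales(employee_key):
    """Own sales plus the sales of everyone below this employee."""
    total = sales_fact.get(employee_key, 0.0)
    for key, (_name, manager_key) in employee_dim.items():
        if manager_key == employee_key:
            total += rolled_up_sales(key)
    return total

print(rolled_up_sales(4))   # one regional manager
print(rolled_up_sales(3))   # Mgr Sales plus both regions
print(rolled_up_sales(1))   # whole company
```

Adding a new layer of management only means adding rows and changing parent keys; no new columns or tables are needed, which is the flexibility mentioned above.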
In some organizations an employee will report to two different managers. Analysis then requires that the employee's information is aggregated under two or more different managers, which means there are two or more hierarchies. In Analysis Services, you can create these separate hierarchies, which use the same leaf members but aggregate using different parent members. How to do it is not covered in this course.
A dimension table can also contain columns that are not part of a hierarchy structure. For example, a
product dimension may contain columns such as color and type of packaging. A column that is not part
of a hierarchy is called a member property.
CONCEPT OF A CUBE
A cube in Analysis Services is a logical construct. It allows a client application to retrieve values as if
every possible summarized value existed in the cube.
Conceptually, the cube is a fact table, but with a few significant differences. Like a fact table, a cube
contains one column for each dimension and one column for each measure. Also like a fact table, a
cube contains a row for each possible combination of members for all dimensions. A fact table, however,
contains only the lowest level members for each dimension. A cube contains the same lowest level plus
the rows based on summarizations. The summarized rows would never appear in a fact table.
The cube conceptually contains a value for each measure summarized at each possible hierarchy level
for each dimension, but again, that doesn't mean that the cube will actually store all those possible
summarized values. The cube can calculate any value by dynamically summarizing leaf level values.
For example, in a structure with Country, Province and City levels, with cities as the leaf level, city values can quickly be summarized into provinces, and the province summaries can then be summarized into countries.
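The idea that any summarized value can be produced on demand from leaf-level values can be sketched as follows. The cities, amounts and the `summarize` helper are invented for illustration and say nothing about how Analysis Services stores or computes its aggregations internally.

```python
from collections import defaultdict

# Hypothetical leaf-level facts: (country, province, city) -> dollar sales.
leaf_sales = {
    ("Canada", "Ontario", "Toronto"): 500.0,
    ("Canada", "Ontario", "Ottawa"): 300.0,
    ("Canada", "British Columbia", "Vancouver"): 400.0,
}

def summarize(level):
    """Sum leaf values up to a hierarchy level.

    level 1 = country, 2 = province, 3 = city (the leaf level itself).
    """
    totals = defaultdict(float)
    for key, value in leaf_sales.items():
        totals[key[:level]] += value
    return dict(totals)

by_province = summarize(2)
by_country = summarize(1)
print(by_province)
print(by_country)
```

A client asking for the Ontario or Canada total gets the same answer whether the value was stored in advance or summed at query time; only the response time differs.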
Analysis Services allows the cube designer the flexibility to control how many aggregations are
physically created. The more aggregations the more space required to store them. Aside from
performance differences, whether or not the cube physically stores a particular summarized value is
completely invisible to a user of a cube.
When storing values in a cube, Analysis Services stores values for only simple aggregations, such as summing, counting and taking minimum or maximum values. However, business analysis requires more than simple aggregations. You can create calculated members that perform calculations on aggregated values. Calculated members make it easy to create values such as average sales. An average sale is the sum of all the sales divided by the count of sales, and the cube provides both of those aggregations.
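The reason an average is handled as a calculated member, rather than stored, can be seen in a small sketch. The sums and counts below are invented, and `average_sale` is a hypothetical stand-in for a calculated member.

```python
# Hypothetical stored aggregates for two months: only sum and count are kept.
january = {"sum": 300.0, "count": 3}
february = {"sum": 100.0, "count": 2}

def average_sale(cell):
    """Calculated value: derived at query time from the stored aggregates."""
    return cell["sum"] / cell["count"] if cell["count"] else None

# Sums and counts can simply be added to move up a level ...
both_months = {
    "sum": january["sum"] + february["sum"],
    "count": january["count"] + february["count"],
}

# ... but averages cannot: the average of the two monthly averages is wrong.
correct_average = average_sale(both_months)
average_of_averages = (average_sale(january) + average_sale(february)) / 2
print(correct_average, average_of_averages)
```

Sums and counts stay correct when they are added together at a higher level, while stored averages would not, so the average is computed from the aggregated sum and count each time it is requested.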
Virtual Cube
A virtual cube combines measures from cubes that share at least one common dimension.