ISAM 5931: Data Warehouse and Data Mining
Group Project Report
Contents
Business Scenario
    Business Background
    Defining Problems
    Expected Value
Dimensional Modeling
    Defining Dimensions
    Defining Fact
    Dimensional Model – Star Schema
    Hierarchies in Dimensions
    Snowflake Schema
    Database Schema
Database Implementation
    Microsoft Access Database – Origin of Data
    Design of Tables
    Relationships in Access
Cube Implementation and OLAP
    Linked Tables from Microsoft Access
    OLAP Database Schema
    Dimensions Hierarchy and Schema
    Storage Design and Process the Cube
    Browsing the Cube
    OLAP Techniques (Slice & Dice, Drill-Down)
Implementation of Data Mining
Conclusion and Discussion
Business Scenario
Business Background
Harley-Davidson Motor Company is the epitome of rugged adventure and customer service.
Since its founders built the first three motorcycles in 1903, the company has strived to provide
an unforgettable experience to every customer who rides one of its motorcycles. Three men
decided to build a motorcycle that would generate an overwhelming desire in individuals to
enjoy the ride of their lives. They succeeded. Through the trying times of the Great
Depression and the phenomenal demand during and after World Wars I and II, the
company has continued to capitalize on the strength of the brand that comes with the name
“Harley.”
Harley-Davidson remains committed to its customers, as evidenced by its mission statement:
“We Fulfill Dreams Through the Experiences of Motorcycling by Providing to Motorcyclists
and to the General Public an Expanding Line of Motorcycles, Branded Products and Services
in Selected Market Segments.” The company believes that satisfying its customers takes more
than building products; it takes providing an unforgettable experience.
Defining Problems
Since the 1960s, Harley-Davidson has been using computers to automate business processes.
Most of these processes deal with the day-to-day operation of the business, such as order
processing, retail sales through dealers, bank transactions, claims processing, and payment
posting, and they generate a huge amount of data.
During the 1990s, Harley-Davidson’s business grew more complex and spread globally, and
competition became fiercer. Business executives needed information to gain a thorough
knowledge of their company’s operations, learn about the key business factors and how these
affect one another, monitor how the business factors change over time, and compare their
company’s performance with that of competitors and with industry standards. In addition,
managers needed to focus their attention on key issues such as customers’ needs and preferences,
emerging technologies, sales and marketing results, and product and service quality.
Along with the need to capture and process business transactions, computer technology has
advanced to store large amounts of data efficiently and to process that data into meaningful
information. As a result, the company has moved from files to databases, and now from
databases to data warehouses.
Today, a data warehouse provides an integrated and total view of an enterprise and makes the
enterprise’s current and historical information easily accessible through Online Analytical
Processing (OLAP) tools. OLAP is an analysis technique that facilitates the summarization,
consolidation, and aggregation of data and provides the ability to view information
from different angles. Its operations include slice and dice, drill-down, and roll-up, which
allow the user to select subsets of the data and view them at different degrees of summarization.
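The slice and dice operations named above can be illustrated on a toy cube. The following is a minimal Python sketch, assuming a handful of made-up cells keyed by (product line, state, quarter); the amounts and the helper names `slice_cube` and `dice_cube` are hypothetical and are not taken from the project data or from any OLAP product.

```python
# Hypothetical cube cells: (product_line, state, quarter) -> sales amount.
CUBE = {
    ("Touring", "TX", "Q1"): 100.0,
    ("Touring", "TX", "Q2"): 150.0,
    ("Touring", "CA", "Q1"): 200.0,
    ("Dyna", "TX", "Q1"): 50.0,
    ("Dyna", "CA", "Q2"): 75.0,
}
AXES = ("product_line", "state", "quarter")


def slice_cube(cube, axis, value):
    """Slice: fix one dimension to a single value."""
    i = AXES.index(axis)
    return {key: v for key, v in cube.items() if key[i] == value}


def dice_cube(cube, **allowed):
    """Dice: keep cells whose coordinates fall inside the allowed sets."""
    def keep(key):
        return all(key[AXES.index(a)] in vals for a, vals in allowed.items())
    return {key: v for key, v in cube.items() if keep(key)}


# Slice on quarter = Q1; dice on state in {TX} and product line in {Touring}.
q1 = slice_cube(CUBE, "quarter", "Q1")
tx_touring = dice_cube(CUBE, state={"TX"}, product_line={"Touring"})
```

A slice fixes one dimension and keeps the others free, while a dice restricts several dimensions at once; both return a smaller cube that can then be summarized.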
Expected Value
In this project, we use OLAP tools to analyze the company’s operational activities and help
managers keep track of product sales at the various dealers over a given period. The analyses
include:
• Total sales of a particular product and the quantity sold, by dealer, identifying the state.
• Total sales of all products, per month, for a specific dealer in a state.
• Total sales of each dealer, per state.
• Total sales in a category of products, for each dealer of a state.
• Total sales of each product for the whole country.
• Total sales of each dealer, per quarter, for a state.
• Average sale per transaction, per dealer, for all states.
To support managers in making future operational decisions, we also provide data
mining tools and techniques such as clustering and decision trees. These predictive techniques
reveal customers’ trends and preferences by product and region.
Dimensional Modeling
Defining Dimensions
Data warehouses and OLAP tools are based on a dimensional data model. A dimensional
model is built from dimensions, facts, cubes, and schemas such as the star and snowflake
schemas. In developing a data warehouse, managers think of the business in terms of business
dimensions. Harley-Davidson is a retail business that provides motorcycle products to customers
through a dealer network around the world, so we take time, product, and dealer as the business
dimensions.
When a business dimension is abstracted and represented in a database table, it is called a
dimension table. A dimension table provides the textual description of a business
dimension through its attributes:
Product Dimension Table
Product ID (PK)
Product Name
Product Description
SKU Number
MSRP
Size
Length
Weight
Fuel Capacity
Oil Capacity
Online Purchase
Product Department
Product Line
Product Category
Product Subcategory
…
Dealer Dimension Table
Dealer ID (PK)
Dealer Number
Dealer Name
Dealer Manager
Address
City
State
Zip code
District
Region
Phone
Fax
…
Time Dimension Table
Time ID (PK)
Date
Day of week
Month
Day of month
Year
Week of year
Month of year
Quarter
…
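As an illustration of how such a time dimension could be populated, the sketch below derives one row per calendar day. It is a minimal Python example under stated assumptions: the column names follow the Time_by_day table used later in this report, while the "Q1" to "Q4" quarter labels, the ISO week numbering, and the function names are our own choices and may differ from the project database.

```python
from datetime import date, timedelta

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
             "Saturday", "Sunday")
MONTH_NAMES = ("January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November",
               "December")


def time_dimension_row(time_id, d):
    """Build one Time_by_day row from a calendar date."""
    return {
        "time_id": time_id,
        "the_date": d.isoformat(),
        "the_day": DAY_NAMES[d.weekday()],      # weekday() is 0 for Monday
        "the_month": MONTH_NAMES[d.month - 1],
        "the_year": d.year,
        "day_of_month": d.day,
        "week_of_year": d.isocalendar()[1],     # ISO week (an assumption)
        "month_of_year": d.month,
        "quarter": f"Q{(d.month - 1) // 3 + 1}",  # label format is assumed
    }


def build_time_dimension(start, end):
    """Return one row per day from start to end, inclusive."""
    days = (end - start).days + 1
    return [time_dimension_row(i + 1, start + timedelta(days=i))
            for i in range(days)]


rows = build_time_dimension(date(2003, 1, 1), date(2003, 12, 31))
```

Storing these derived attributes once per day lets every query group by month, quarter, or year without recomputing them from the date.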
Defining Fact
A fact is a measurement captured from an event in the marketplace. Facts are observations, the
raw material for knowledge. A customer buys a product at a certain location at a certain time.
In our model, a sale is made when a product, a dealer, and a time intersect. Thus, a meaningful
and measurable event of significance to the business occurs at the intersection point of the
business dimensions. This event is the fact, and we use it to represent a business measure.
A fact table is the primary table in a dimensional model, where the numerical performance
measurements of the business are stored. A fact table tends to be deep in terms of the number
of rows but narrow in terms of the number of columns. The sales fact table has three foreign
keys, Product ID, Dealer ID, and Time ID, that connect to the primary keys of the dimension
tables. Together they form the composite primary key of the fact table. We access the fact
table via the dimension tables joined to it.
Sales Fact Table
Product ID (FK)
Dealer ID (FK)
Time ID (FK)
Units Sold
Sales Amount
The Dimensional Model – Star Schema
In the dimensional model, the fact table, consisting of numeric measurements, is joined to a set
of dimension tables filled with descriptive attributes. In the characteristic structure of the
star schema, the fact table sits at the center and the dimension tables are arranged around it
like the points of a star. When a product ID, a dealer ID, and a time period are used to determine
which rows are selected from the fact table, this way of retrieving data is called a star join.
Figure: Star schema

Sales_fact
PK product_ID
PK dealer_ID
PK time_ID
units_sold
sales_amount

Product
PK product_ID
Product_name
Product_desc
SKU
MSRP
Gender
Color
Size
Length
Weight
Fuel_capacity
Oil_capacity
Online_purchase
Product_department
Product_line
Product_category
Product_subcategory

Dealer
PK dealer_ID
dealer_number
dealer_name
dealer_addr
dealer_city
dealer_zipcode
dealer_country
dealer_phone
dealer_fax
sales_city
sales_state
sales_region
sales_country

Time_by_day
PK time_ID
the_date
the_day
the_month
the_year
day_of_month
week_of_year
month_of_year
quarter
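A star join over this schema can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: SQLite stands in for the Access and SQL Server databases used in the project, each dimension is cut down to a few columns, and all rows (for example "Bike A" and "Dealer X") are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT,
                      product_line TEXT);
CREATE TABLE dealer (dealer_id INTEGER PRIMARY KEY, dealer_name TEXT,
                     sales_state TEXT);
CREATE TABLE time_by_day (time_id INTEGER PRIMARY KEY, the_year INTEGER,
                          quarter TEXT);
-- The three foreign keys together form the composite primary key.
CREATE TABLE sales_fact (
    product_id   INTEGER REFERENCES product(product_id),
    dealer_id    INTEGER REFERENCES dealer(dealer_id),
    time_id      INTEGER REFERENCES time_by_day(time_id),
    units_sold   INTEGER,
    sales_amount REAL,
    PRIMARY KEY (product_id, dealer_id, time_id)
);
INSERT INTO product VALUES (1, 'Bike A', 'Touring'), (2, 'Bike B', 'Dyna');
INSERT INTO dealer VALUES (1, 'Dealer X', 'TX'), (2, 'Dealer Y', 'CA');
INSERT INTO time_by_day VALUES (1, 2003, 'Q1'), (2, 2003, 'Q2');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 2, 40000.0),
    (1, 2, 1, 1, 20000.0),
    (2, 1, 1, 3, 45000.0),
    (2, 1, 2, 1, 15000.0);
""")

# Star join: constrain the dimensions, then aggregate the matching fact rows.
STAR_JOIN = """
SELECT d.sales_state, SUM(f.units_sold), SUM(f.sales_amount)
FROM sales_fact AS f
JOIN product AS p ON p.product_id = f.product_id
JOIN dealer AS d ON d.dealer_id = f.dealer_id
JOIN time_by_day AS t ON t.time_id = f.time_id
WHERE p.product_line = ? AND t.the_year = ? AND t.quarter = ?
GROUP BY d.sales_state
ORDER BY d.sales_state
"""
result = conn.execute(STAR_JOIN, ("Touring", 2003, "Q1")).fetchall()
```

The filters are applied to the small dimension tables, and only the fact rows whose keys match are summed, which is the access pattern the star schema is designed for.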
Hierarchies in Dimensions
In a data warehouse, measures are stored in the fact table at a level of detail that lets users
roll up to various levels of summarization; this is also called aggregation. Only the sales data
at the lowest level are kept in the fact table, while the descriptions of the various levels are
kept in the dimension tables, so that appropriate tools can summarize the data at any level.
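The idea that only the lowest level is stored, with higher levels derived from the dimension attributes, can be sketched as follows. This is a minimal Python illustration with invented rows; the `roll_up` helper is hypothetical, and the attribute names are borrowed from the time dimension.

```python
from collections import defaultdict

# Hypothetical dimension rows: time_id -> descriptive attributes.
TIME_BY_DAY = {
    1: {"the_year": 2003, "quarter": "Q1", "the_month": "January"},
    2: {"the_year": 2003, "quarter": "Q1", "the_month": "February"},
    3: {"the_year": 2003, "quarter": "Q2", "the_month": "April"},
}
# Hypothetical fact rows at the lowest level: (time_id, sales_amount).
SALES_FACT = [(1, 100.0), (1, 50.0), (2, 25.0), (3, 200.0)]


def roll_up(facts, time_dim, levels):
    """Sum sales_amount grouped by the given time-dimension attributes."""
    totals = defaultdict(float)
    for time_id, amount in facts:
        row = time_dim[time_id]
        totals[tuple(row[level] for level in levels)] += amount
    return dict(totals)


# The same fact rows summarized at three levels of the time hierarchy.
by_month = roll_up(SALES_FACT, TIME_BY_DAY, ("the_year", "quarter", "the_month"))
by_quarter = roll_up(SALES_FACT, TIME_BY_DAY, ("the_year", "quarter"))
by_year = roll_up(SALES_FACT, TIME_BY_DAY, ("the_year",))
```

Dropping the last attribute from `levels` moves one step up the hierarchy; drill-down is the reverse, adding a finer attribute back.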
A hierarchy defines a sequence of mappings from low-level concepts to higher-level, more
general ones. The following are the hierarchies for the Dealer and Product dimensions:
Figure: Location hierarchy (Dealer dimension)
Levels (top to bottom): All, Country, Region, District
Member labels in the figure: Canada, USA, Europe; Canada West, Central, West, South West,
North West, Western, Eastern; Vancouver, San Francisco, San Diego, Los Angeles, Seattle,
London, Italy, Moscow

Figure: Product hierarchy
Levels (top to bottom): All, Product Department, Product Line, Product Category
All
    Motorcycles
        Sportster®: 883/R, 1200, …
        Touring: FLHR, FLHT, …
        …
    Motorclothes
        Jackets: Leather, …, Knit/Nylon
        Gloves: Gauntlet, …, Open tip
        …
    …
Snowflake Schema
Because a dimension table in a star schema stores every level of its hierarchy in one flat
table, its data contain duplicate, redundant values; in other words, the dimension tables are
not normalized. The redundancy exists only in the dimensions, not in the fact table. The
snowflake schema is a variant of the star schema in which some dimension tables are normalized:
the redundant attributes are removed from the flat, denormalized dimension tables and placed in
normalized secondary tables. The resulting schema has a shape similar to a snowflake.
The figure below shows a partial snowflake schema in which the Product and Dealer
dimensions are expanded into multiple tables with their associated PK-FK relationships:
Sales Fact
Product ID (FK)
Dealer ID (FK)
Time ID (FK)
Units Sold
Sales Amount

Product Dimension
Product ID (PK)
Product Class ID (FK)
Product Name
Product Description
SKU Number
MSRP
Size
Length
Weight
Fuel Capacity
Oil Capacity
Online Purchase

Product Class Dimension
Product Class ID (PK)
Product Subcategory
Product Category
Product Line
Product Department

Dealer Dimension
Dealer ID (PK)
Region ID (FK)
Dealer Number
Dealer Name
Dealer Manager
Address
City
State
Zip code
District
Region
Phone
Fax

Region Dimension
Region ID (PK)
Sales City
Sales District
Sales Region
Sales Country
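The normalization step that turns a flat Product dimension into Product plus Product Class can be sketched in a few lines. The following Python example is a minimal illustration, assuming invented product rows and placeholder class values ("S1", "C1"); the `snowflake` function is hypothetical and only shows the principle of moving repeated hierarchy attributes into a secondary table.

```python
# Hypothetical flat (star-schema) product rows with repeated class values.
FLAT_PRODUCTS = [
    {"product_id": 1, "product_name": "Bike A",
     "product_subcategory": "S1", "product_category": "C1",
     "product_line": "Touring", "product_department": "Motorcycles"},
    {"product_id": 2, "product_name": "Bike B",
     "product_subcategory": "S1", "product_category": "C1",
     "product_line": "Touring", "product_department": "Motorcycles"},
    {"product_id": 3, "product_name": "Jacket C",
     "product_subcategory": "S2", "product_category": "Leather",
     "product_line": "Jackets", "product_department": "Motorclothes"},
]
CLASS_COLUMNS = ("product_subcategory", "product_category",
                 "product_line", "product_department")


def snowflake(flat_rows):
    """Split flat rows into product rows and product_class rows."""
    class_ids = {}  # tuple of class attributes -> product_class_id
    products = []
    for row in flat_rows:
        key = tuple(row[c] for c in CLASS_COLUMNS)
        # Reuse the id of an existing class, or assign the next one.
        class_id = class_ids.setdefault(key, len(class_ids) + 1)
        product = {k: v for k, v in row.items() if k not in CLASS_COLUMNS}
        product["product_class_id"] = class_id
        products.append(product)
    classes = [dict(zip(CLASS_COLUMNS, key), product_class_id=cid)
               for key, cid in class_ids.items()]
    return products, classes


products, classes = snowflake(FLAT_PRODUCTS)
```

Each distinct combination of class attributes is stored once, and the product rows keep only a foreign key to it, at the cost of one extra join when querying.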
Database Schema
Based on the snowflake schema and the relationships between the fact and the dimensions, we
design the database schema. In the database design, the dimension and fact tables become
entities, which are tables in the ER diagram, and their attributes become attributes of those
tables. Every table that expresses a many-to-many relationship must be a fact table; all other
tables are dimension tables. The following figure shows the database schema:
Figure: Database schema (ER diagram)

Product
PK product_id
FK product_class_id
product_name
SKU
MSRP
gender
color
size
length
weight
fuel_capacity
oil_capacity
online_purchase
desc

Product_class
PK product_class_id
product_subcategory
product_category
product_line
product_department

Time_by_day
PK time_id
the_date
the_day
the_month
the_year
day_of_month
week_of_year
month_of_year
quarter

Sales_fact
PK,FK product_id
PK,FK time_id
PK,FK dealer_id
units_sold
sales_amount

Dealer
PK dealer_id
FK region_id
dealer_number
dealer_name
dealer_addr
dealer_city
dealer_state
dealer_zipcode
dealer_country
dealer_manager
dealer_phone
dealer_fax

Region
PK region_id
sales_city
sales_state
sales_district
sales_region
sales_country

Relationships
Product_class has Product; Product belongs to Product_class
Product has Sales_fact; Sales_fact is of Product
Time_by_day has Sales_fact; Sales_fact is of Time_by_day
Dealer has Sales_fact; Sales_fact is of Dealer
Region has Dealer; Dealer belongs to Region
Database Implementation
Microsoft Access Database (Origin of Data)
Design of Product Table
Design of Product_class Table
Product, Product_class Data
Dealer Table Design
Dealer Data
Region Table Design
Region Data
Time_by_day Table Design
Time Data
Sales_fact Table Design
Sales_fact Data
Relationships in Access
Cube Implementation and OLAP
Based on the Harley-Davidson database, we use SQL Server Analysis Services to perform
online analytical processing (OLAP) and data mining. Analysis Services retrieves data from an
existing database and then forms the schema and cubes. The basic steps of the process are:
• Create an OLAP database in SQL Server that uses the HarleyDavidson database in
Access as its data source, and make a connection to this database.
• Build the data cube schema, named Sales, using the existing linked tables.
• Build the Sales_fact table and the Product, Time, and Dealer dimension tables.
• Process the Sales cube to populate data in the various hierarchies.
Linked Table from Microsoft Access
OLAP Database Schema
Dimensions Hierarchy and Schema: Dealer
Dimensions Hierarchy and Schema: Product
Storage Design and Process the Cube
Resulting Cube
Browsing the Cube
The figure below shows the units sold, sales amount, and average sale per transaction of
Harley-Davidson in 2002 and 2003, for all products and for all countries.
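The average sale per transaction is a derived measure rather than a stored one. The sketch below shows one way it could be computed; it is a minimal Python illustration that assumes each fact row represents one transaction and uses invented amounts, so it does not reproduce the calculated member defined in the cube.

```python
from collections import defaultdict

# Hypothetical fact rows: (dealer, sales_amount), one row per transaction.
SALES = [("Dealer X", 4000.0), ("Dealer X", 5000.0), ("Dealer X", 6000.0),
         ("Dealer Y", 3000.0)]


def average_sale_per_transaction(rows):
    """Return {dealer: (total, count, average)} and the overall average."""
    total = defaultdict(float)
    count = defaultdict(int)
    for dealer, amount in rows:
        total[dealer] += amount
        count[dealer] += 1
    per_dealer = {d: (total[d], count[d], total[d] / count[d]) for d in total}
    # The overall figure divides the grand total by the grand count.
    overall = sum(total.values()) / sum(count.values())
    return per_dealer, overall


per_dealer, overall = average_sale_per_transaction(SALES)
# Averaging the per-dealer averages ignores how many sales each dealer made.
mean_of_means = sum(avg for _, _, avg in per_dealer.values()) / len(per_dealer)
```

With these rows the overall average (4500.0) differs from the mean of the dealer averages (4000.0), which is why a ratio measure should be recomputed from its total and count at every level of aggregation rather than averaged again.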
Slice & Dice: Total sales of the Dyna Motorcycles product line in quarter 1 of 2003 for each state
Slice & Dice: Total sales of the Dyna Motorcycles – FXDBI – Street Bob product in quarter 1 of 2003 for each state
Drill-Down: Total sales of all products, per month, for a particular dealer in a state
Drill-Down: Total sales of each store, per state
Drill-Down: Total sales in a category of products, for each store of a state
Drill-Down: Total sales of each product for the whole country
Drill-Down: Total sales of each store, per quarter, for a state
Average sale per transaction, per store, for all states
Implementation of Data Mining
The visualization tools supplied with Analysis Services are well suited to the evaluation of data
mining models. The Data Mining Model Browser displays the statistical information contained
within a data mining model in an understandable graphical format. It is used to inspect the
structure of a generated data mining model from the viewpoint of a single predictable
attribute, providing insight into the effects that input variables have in predicting output variables.
Because the most significant input variables appear early within decision tree data mining models,
generating a decision tree model and then viewing its structure can provide insight into the most
significant input variables to be used in other data mining models. One of the main benefits of the
decision tree algorithm is the generation of easily understandable rules. By following the nodes
along a single series of branches, a rule can be constructed to derive a single classification of cases.
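Why the most significant input variables appear near the root can be illustrated with information gain, one common split criterion for decision trees. The sketch below is a minimal Python example under stated assumptions: the cases and the `high_sale` target are invented, and the decision tree algorithm in Analysis Services may use a different scoring method.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on one input attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical cases; "high_sale" is an invented target column.
CASES = [
    {"sales_country": "USA", "product_department": "Motorcycles", "high_sale": "yes"},
    {"sales_country": "Canada", "product_department": "Motorcycles", "high_sale": "yes"},
    {"sales_country": "USA", "product_department": "Motorclothes", "high_sale": "no"},
    {"sales_country": "Canada", "product_department": "Motorclothes", "high_sale": "no"},
]
gains = {a: information_gain(CASES, a, "high_sale")
         for a in ("sales_country", "product_department")}
root = max(gains, key=gains.get)  # attribute chosen for the first split
```

In this toy data the department separates the classes perfectly, so it is chosen for the first split; reading the branch from the root down to a leaf then gives a rule such as "if product_department is Motorcycles, then high_sale is yes".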
The following figures present the data mining model built with the decision tree algorithm. For
this decision tree, we use data from the relational tables, with “sales_fact” as the case key table,
“sales_amount” as the case key column, and “sales_country” and “product_department” as the
prediction columns.
Conclusion and Discussion
A data warehouse is an information environment that provides an integrated and total view of an
enterprise. Building a data warehouse requires collecting current and historical data from
multiple sources over a long period of time. The data in our project are not the actual data of
Harley-Davidson Inc.; we used Harley-Davidson data only as a reference to build a
pseudo-database. After applying OLAP techniques (drill-down, roll-up, slicing and
dicing) and data mining techniques (decision tree) to this pseudo-database, we reached the
following conclusions:
• The average sale per transaction, per dealer, for all countries ranges from $4,367 to $4,612.
• Among the motorcycle product lines, Touring is the most popular (highest sales amount).
• Among the U.S. regions, Northwest contributes the most to total sales.
• Based on the decision tree, sales of the Sportster® and Touring product lines are likely to
increase in the future (probabilities of 24.93% and 24.65% of 16,116 cases, respectively).
• Customers in Western Europe are likely to become increasingly interested in the brand.
Therefore, this market should be expanded in the next few years.
This report demonstrates the process of building a data warehouse and designing a
multidimensional model, and shows how OLAP and data mining techniques help managers in
decision making. In reality, the task of implementing data mining on a data warehouse is much more
complex. The process of data mining consists of three stages: (1) the initial exploration, (2) model
building or pattern identification with validation/verification, and (3) deployment (i.e., the
application of the model to new data in order to generate predictions). In the second stage, different
models are applied to the same data set and their performance is compared to choose the best one,
which is then applied to new data in order to generate predictions.
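The model-building and deployment stages described above can be sketched as follows. This is a minimal Python illustration with invented cases and two deliberately simple candidate models (a majority-class baseline and a one-attribute rule); it is not the procedure used in this project, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_majority(train):
    """Baseline model: always predict the most common label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def train_one_rule(train, attribute):
    """One-level tree: predict the most common label per attribute value."""
    by_value = defaultdict(Counter)
    for x, y in train:
        by_value[x[attribute]][y] += 1
    fallback = train_majority(train)  # used for unseen attribute values
    table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return lambda x: table.get(x[attribute], fallback(x))


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


# Hypothetical labelled cases: (inputs, product line bought).
TRAIN = [({"region": "West"}, "Touring"), ({"region": "West"}, "Touring"),
         ({"region": "West"}, "Touring"), ({"region": "East"}, "Sportster"),
         ({"region": "East"}, "Sportster")]
VALIDATION = [({"region": "West"}, "Touring"), ({"region": "East"}, "Sportster")]

# Stage 2: fit several models on the same data and compare on held-out cases.
candidates = {
    "majority": train_majority(TRAIN),
    "one_rule(region)": train_one_rule(TRAIN, "region"),
}
scores = {name: accuracy(m, VALIDATION) for name, m in candidates.items()}
best_name = max(scores, key=scores.get)

# Stage 3: deploy the best model on a new, unlabelled case.
prediction = candidates[best_name]({"region": "East"})
```

Comparing the candidates on cases that were not used for training is what makes the choice of the best model meaningful before it is applied to new data.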