ISAM 5931: Data Warehouse and Data Mining
Group Project Report

Contents

Business Scenario
    Business Background
    Defining Problems
    Expected Value
Dimensional Modeling
    Defining Dimensions
    Defining Fact
    Dimensional Model: Star Schema
    Hierarchies in Dimensions
    Snowflake Schema
    Database Schema
Database Implementation
    Microsoft Access Database: Origin of Data
    Design of Tables
    Relationships in Access
Cube Implementation and OLAP
    Linked Tables from Microsoft Access
    OLAP Database Schema
    Dimensions Hierarchy and Schema
    Storage Design and Process the Cube
    Browsing the Cube
    OLAP Techniques (Slice & Dice, Drill-Down)
Implementation of Data Mining
Conclusion and Discussion

Business Scenario

Business Background

Harley-Davidson Motor Company is the epitome of rugged adventure and customer service. Since its founders built the first three motorcycles in 1903, the company has strived to provide an unforgettable experience to every customer who rides one of its motorcycles. Three men decided to build a motorcycle that would generate an overwhelming desire in individuals to enjoy the ride of their lives. They succeeded.
Through the trying times of the Great Depression and the phenomenal demand during and after World Wars I and II, the company has continued to capitalize on the strength of the “Harley” brand name. Harley-Davidson remains committed to its customers, as evidenced by its mission statement: “We Fulfill Dreams Through the Experiences of Motorcycling by Providing to Motorcyclists and to the General Public an Expanding Line of Motorcycles, Branded Products and Services in Selected Market Segments.” The company believes that satisfying its customers takes more than building products; it takes providing an unforgettable experience.

Defining Problems

Since the 1960s, Harley-Davidson has been using computers to automate business processes. Most of these processes deal with the day-to-day operation of the business, such as order processing, retail sales through dealers, bank transactions, claims processing, and payment posting, and they generate a huge amount of data. During the 1990s, Harley-Davidson’s business grew more complex, spread globally, and faced fiercer competition. Business executives need information to gain a thorough knowledge of the company’s operations, learn about the key business factors and how they affect one another, monitor how those factors change over time, and compare the company’s performance with the competition and with industry standards. Managers also need to focus on key issues such as customers’ needs and preferences, emerging technologies, sales and marketing results, and product and service quality.

Along with the need to capture and process business transactions, computer technology has advanced to store large amounts of data efficiently and to process that data into meaningful information.
Therefore, the company has been moving from files to databases, and now from databases to data warehouses. Today, a data warehouse provides an integrated, enterprise-wide view and makes the enterprise’s current and historical information easily accessible through Online Analytical Processing (OLAP) tools. OLAP is an analysis technique that facilitates summarization, consolidation, and aggregation of data, and lets users view information from different angles. Its operations include slice and dice, drill-down, and roll-up, which allow the user to view the data at different degrees of summarization.

Expected Value

In this project, we use OLAP tools to analyze the company’s operational activities, helping managers keep track of product sales at various dealers over a given period. The analyses include:

- Total sales of a particular product and the quantity sold, by dealer and by state.
- Total sales of all products, per month, for a specific dealer in a state.
- Total sales of each dealer, per state.
- Total sales in a category of products, for each dealer of a state.
- Total sales of each product for the whole country.
- Total sales of each dealer, per quarter, for a state.
- Average sale per transaction, per dealer, for all states.

To support managers’ future operational decisions, we also apply data mining techniques such as clustering and decision trees. These predictive techniques reveal customers’ trends and preferences by product and region.

Dimensional Modeling

Defining Dimensions

Data warehouse and OLAP tools are based on a dimensional data model. A dimensional model is built from dimensions, facts, cubes, and schemas such as the star and the snowflake. In developing a data warehouse, managers think of the business in terms of business dimensions.
Harley-Davidson is a retail business that provides motorcycle products to customers through a dealer network around the world. The business dimensions are therefore time, product, and dealer. When a business dimension is abstracted and represented in a database table, it is called a dimension table. A dimension table provides the textual description of a business dimension through its attributes:

Product Dimension Table
    Product ID (PK), Product Name, Product Description, SKU Number, MSRP, Size, Length, Weight, Fuel Capacity, Oil Capacity, Online Purchase, Product Department, Product Line, Product Category, Product Subcategory, …

Dealer Dimension Table
    Dealer ID (PK), Dealer Number, Dealer Name, Dealer Manager, Address, City, State, Zip code, District, Region, Phone, Fax, …

Time Dimension Table
    Time ID (PK), Date, Day of week, Month, Day of month, Year, Week of year, Month of year, Quarter, …

Defining Fact

A fact is a measurement captured from an event in the marketplace. Facts are the raw material for knowledge: observations. A customer buys a product at a certain location at a certain time; when these dimensions intersect (in our model: product, dealer, and time), a sale is made. Thus, a meaningful and measurable event of significance to the business occurs at the intersection point of the business dimensions. That event is the fact, and we use it to represent a business measure.

A fact table is the primary table in a dimensional model, where the numerical performance measurements of the business are stored. Fact tables tend to be deep in terms of the number of rows but narrow in terms of the number of columns. The sales fact table has three foreign keys that connect to the primary keys of the dimension tables: Product ID, Dealer ID, and Time ID. Together they form the composite key of the fact table. We access the fact table via the dimension tables joined to it.
Sales Fact Table
    Product ID (FK), Dealer ID (FK), Time ID (FK), Units Sold, Sales Amount

The Dimensional Model: Star Schema

In the dimensional model, the fact table of numeric measurements is joined to a set of dimension tables filled with descriptive attributes. In the characteristic structure of the star schema, the fact table sits at the center and the dimension tables are arranged around it like the points of a star. When a product ID, a dealer ID, and a time period are used to determine which rows are selected from the fact table, this way of retrieving data is called a star join.

Figure: star schema

Sales_fact
    PK: product_ID, dealer_ID, time_ID
    units_sold, sales_amount

Product
    PK: product_ID
    Product_name, Product_desc, SKU, MSRP, Gender, Color, Size, Length, Weight, Fuel_capacity, Oil_capacity, Online_purchase, Product_department, Product_line, Product_category, Product_subcategory

Dealer
    PK: dealer_ID
    dealer_number, dealer_name, dealer_addr, dealer_city, dealer_zipcode, dealer_country, dealer_phone, dealer_fax, sales_city, sales_state, sales_region, sales_country

Time_by_day
    PK: time_ID
    the_date, the_day, the_month, the_year, day_of_month, week_of_year, month_of_year, quarter

Hierarchies in Dimensions

In a data warehouse, measures are stored in the fact table at a level of detail that lets users roll up to various levels of summarization; this is also called aggregation. Only the sales data at the lowest level are kept in the fact table, while the descriptions of the various levels are kept in the dimension tables, so that appropriate tools can summarize the data at any level. A hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
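The star join and roll-up described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the rows are invented rather than taken from the project database, the tables are trimmed to a few of the columns in the star schema, and the helper name total_sales_by is hypothetical.

```python
import sqlite3

# In-memory star schema, trimmed to a few columns. All rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_line TEXT);
CREATE TABLE dealer (dealer_id INTEGER PRIMARY KEY, dealer_name TEXT,
                     sales_state TEXT);
CREATE TABLE time_by_day (time_id INTEGER PRIMARY KEY, the_year INTEGER,
                          quarter TEXT);
CREATE TABLE sales_fact (
    product_id INTEGER REFERENCES product(product_id),
    dealer_id INTEGER REFERENCES dealer(dealer_id),
    time_id INTEGER REFERENCES time_by_day(time_id),
    units_sold INTEGER,
    sales_amount REAL,
    PRIMARY KEY (product_id, dealer_id, time_id)
);
INSERT INTO product VALUES (1, 'Touring'), (2, 'Dyna');
INSERT INTO dealer VALUES (10, 'Dealer X', 'TX'), (11, 'Dealer Y', 'TX'),
                          (12, 'Dealer Z', 'CA');
INSERT INTO time_by_day VALUES (100, 2003, 'Q1'), (101, 2003, 'Q2');
INSERT INTO sales_fact VALUES
    (1, 10, 100, 2, 30000.0),
    (2, 10, 100, 1, 12000.0),
    (1, 11, 101, 1, 15000.0),
    (2, 12, 100, 3, 36000.0);
""")

# Levels of the dealer hierarchy that this sketch can roll up to.
ROLLUP_LEVELS = {"dealer": "d.dealer_name", "state": "d.sales_state"}


def total_sales_by(level, year=2003):
    """Join the fact table to its dimensions and aggregate at one level."""
    column = ROLLUP_LEVELS[level]  # allow-list, so the f-string is safe
    query = f"""
        SELECT {column}, SUM(f.units_sold), SUM(f.sales_amount)
        FROM sales_fact AS f
        JOIN dealer AS d ON d.dealer_id = f.dealer_id
        JOIN time_by_day AS t ON t.time_id = f.time_id
        WHERE t.the_year = ?
        GROUP BY {column}
        ORDER BY {column}
    """
    # The product dimension is not needed for this question, so it is not joined.
    return conn.execute(query, (year,)).fetchall()


per_dealer = total_sales_by("dealer")  # lowest level of the dealer hierarchy
per_state = total_sales_by("state")    # rolled up one level
```

Only the GROUP BY column changes between the two calls: the fact rows stay at the lowest level, and the dimension table supplies the level to summarize by.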
The following are the hierarchies for the Dealer and Product dimensions:

Figure: Location hierarchy (Dealer dimension)
    Levels: All > Country > Region > District
    Country members shown: Canada, USA, Europe
    Region members shown: Canada West, Central West, South West, North West, Western, Eastern
    District members shown: Vancouver, San Francisco, San Diego, Los Angeles, Seattle, London, Italy, Moscow

Figure: Product hierarchy
    Levels: All > Product Department > Product Line > Product Category
    Product Department members shown: Motorcycles, Motorclothes, …
    Product Line members shown: Sportster®, Touring, Jackets, Gloves, …
    Product Category members shown: 883/R, 1200, FLHR, FLHT, Leather, Knit/Nylon, Gauntlet, Open tip, …

Snowflake Schema

Because each dimension table in a star schema stores every level of its hierarchy, its data contain duplicate, redundant values; in other words, star-schema dimension tables are typically not normalized. There is no redundancy in the fact table, only in the dimensions. The snowflake schema is a variant of the star schema in which some dimension tables are normalized. The resulting schema forms a picture similar to a snowflake. The redundant attributes are removed from the flat, denormalized dimension tables and placed in normalized secondary tables.
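Moving redundant attributes out of a flat dimension table can be sketched in plain Python. The example assumes a Product dimension whose hierarchy columns are split off into a secondary Product_class table; the product rows, the function name snowflake, and the generated surrogate keys are all invented for illustration.

```python
# Flat (star-schema) product rows; the hierarchy columns repeat per product.
flat_products = [
    {"product_id": 1, "product_name": "Bike A",
     "product_subcategory": "S1", "product_category": "C1",
     "product_line": "Touring", "product_department": "Motorcycles"},
    {"product_id": 2, "product_name": "Bike B",
     "product_subcategory": "S1", "product_category": "C1",
     "product_line": "Touring", "product_department": "Motorcycles"},
    {"product_id": 3, "product_name": "Jacket C",
     "product_subcategory": "S2", "product_category": "Leather",
     "product_line": "Jackets", "product_department": "Motorclothes"},
]

# Columns that move to the normalized secondary table.
CLASS_COLUMNS = ("product_subcategory", "product_category",
                 "product_line", "product_department")


def snowflake(rows):
    """Split flat rows into slim product rows plus distinct class rows."""
    class_ids = {}      # tuple of class values -> surrogate product_class_id
    product_rows = []
    for row in rows:
        key = tuple(row[c] for c in CLASS_COLUMNS)
        # Reuse the id if this combination was seen, else assign the next one.
        class_id = class_ids.setdefault(key, len(class_ids) + 1)
        slim = {k: v for k, v in row.items() if k not in CLASS_COLUMNS}
        slim["product_class_id"] = class_id  # FK to the class table
        product_rows.append(slim)
    class_rows = [dict(zip(CLASS_COLUMNS, key), product_class_id=cid)
                  for key, cid in class_ids.items()]
    return product_rows, class_rows


product_rows, class_rows = snowflake(flat_products)
```

Three product rows collapse to two class rows here, because the first two products share the same subcategory, category, line, and department.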
The figure below shows a partial snowflake schema, obtained by expanding the Product and Dealer dimensions into multiple tables with their associated PK-FK relationships:

Figure: partial snowflake schema

Sales Fact
    Product ID (FK), Dealer ID (FK), Time ID (FK), Units Sold, Sales Amount

Product Dimension
    Product ID (PK), Product Class ID (FK), Product Name, Product Description, SKU Number, MSRP, Size, Length, Weight, Fuel Capacity, Oil Capacity, Online Purchase

Product Class Dimension
    Product Class ID (PK), Product Subcategory, Product Category, Product Line, Product Department

Dealer Dimension
    Dealer ID (PK), Region ID (FK), Dealer Number, Dealer Name, Dealer Manager, Address, City, State, Zip code, District, Region, Phone, Fax

Region Dimension
    Region ID (PK), Sales City, Sales District, Sales Region, Sales Country

Database Schema

Based on the snowflake schema and the relationships between the fact and the dimensions, we design the database schema. In the database design, dimension and fact tables become entities, which are tables in the ER diagram, and their attributes become the attributes of those tables. Any table that expresses a many-to-many relationship is a fact table; all other tables are dimension tables.
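The PK-FK relationships between the fact table and a snowflaked dimension can be sketched with Python's built-in sqlite3 module. This is a trimmed illustration, not the project's Access design: only the Region > Dealer > Sales_fact chain is modeled, the Time and Product dimensions are left out, the rows are invented, and the helper try_insert_sale is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE region (
    region_id INTEGER PRIMARY KEY,
    sales_state TEXT,
    sales_country TEXT
);
CREATE TABLE dealer (
    dealer_id INTEGER PRIMARY KEY,
    region_id INTEGER NOT NULL REFERENCES region(region_id),
    dealer_name TEXT
);
-- Time and Product dimensions are omitted to keep the sketch short,
-- so time_id and product_id are plain columns here.
CREATE TABLE sales_fact (
    dealer_id INTEGER NOT NULL REFERENCES dealer(dealer_id),
    time_id INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    units_sold INTEGER,
    sales_amount REAL,
    PRIMARY KEY (product_id, time_id, dealer_id)
);
INSERT INTO region VALUES (1, 'WA', 'USA');
INSERT INTO dealer VALUES (10, 1, 'Dealer X');
""")


def try_insert_sale(dealer_id):
    """Insert one invented sale; return False if a key constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sales_fact VALUES (?, 100, 1, 1, 15000.0)",
                (dealer_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert_sale(10)   # dealer 10 exists in the dealer table
rejected = try_insert_sale(99)   # no such dealer, so the FK check fails
```

A fact row is accepted only when its dealer key resolves to a row in the dimension table, which is what lets every fact be reached through the dimensions joined to it.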
The following figure shows the database schema:

Figure: database schema

Product
    PK: product_id
    FK: product_class_id
    product_name, SKU, MSRP, gender, color, size, length, weight, fuel_capacity, oil_capacity, online_purchase, desc

Product_class
    PK: product_class_id
    product_subcategory, product_category, product_line, product_department

Time_by_day
    PK: time_id
    the_date, the_day, the_month, the_year, day_of_month, week_of_year, month_of_year, quarter

Sales_fact
    PK, FK: product_id
    PK, FK: time_id
    PK, FK: dealer_id
    units_sold, sales_amount

Dealer
    PK: dealer_id
    FK: region_id
    dealer_number, dealer_name, dealer_addr, dealer_city, dealer_state, dealer_zipcode, dealer_country, dealer_manager, dealer_phone, dealer_fax

Region
    PK: region_id
    sales_city, sales_state, sales_district, sales_region, sales_country

Relationships
    Product, Time_by_day, and Dealer each relate to Sales_fact ("has" / "is of").
    Product_class relates to Product, and Region relates to Dealer ("has" / "belongs to").

Database Implementation

Microsoft Access Database (Origin of Data)

Figure: Design of Product Table
Figure: Design of Product_class Table
Figure: Product, Product_class Data
Figure: Dealer Table Design
Figure: Dealer Data
Figure: Region Table Design
Figure: Region Data
Figure: Time_by_day Table Design
Figure: Time Data
Figure: Sales_fact Table Design
Figure: Sales_fact Data

Relationships in Access

Figure: Relationships in Access

Cube Implementation and OLAP

Based on the Harley-Davidson database, we use SQL Server Analysis Services to perform online analytical processing (OLAP) and data mining. Analysis Services retrieves data from an existing database and then builds the schema and cubes from it. The basic steps of the process are:

- Create an OLAP database in SQL Server that uses the HarleyDavidson database in Access as its data source, and make a connection to this database.
- Build the data cube schema, named Sales, using the existing linked tables.
- Build the Sales_fact table and the Product, Time, and Dealer dimension tables.
- Process the Sales cube to populate data in the various hierarchies.

Linked Table from Microsoft Access

Figure: Linked Table from Microsoft Access

OLAP Database Schema

Figure: OLAP Database Schema

Dimensions Hierarchy and Schema

Figure: Dimensions Hierarchy and Schema: Dealer
Figure: Dimensions Hierarchy and Schema: Product

Storage Design and Process the Cube

Figure: Storage Design and Process the Cube
Figure: Resulting Cube

Browsing the Cube

The figure below shows the units sold, sales amount, and average sale per transaction of Harley-Davidson in 2002 and 2003, for all products and all countries.
Figure: Slice & Dice: Total sales of the Dyna Motorcycles product line in quarter 1 of 2003 for each state
Figure: Slice & Dice: Total sales of the Dyna Motorcycles FXDBI Street Bob product in quarter 1 of 2003 for each state
Figure: Drill-Down: Total sales of all products, per month, for a particular dealer in a state
Figure: Drill-Down: Total sales of each store, per state
Figure: Drill-Down: Total sales in a category of products, for each store of a state
Figure: Drill-Down: Total sales of each product for the whole country
Figure: Drill-Down: Total sales of each store, per quarter, for a state
Figure: Average sale per transaction, per store, for all states

Implementation of Data Mining

The visualization tools supplied with Analysis Services are well suited to evaluating data mining models. The Data Mining Model Browser displays the statistical information contained within a data mining model in an understandable graphic format. It is used to inspect the structure of a generated data mining model from the viewpoint of a single predictable attribute, providing insight into the effects that input variables have in predicting output variables.
Because the most significant input variables appear early within decision tree data mining models, generating a decision tree model and then viewing its structure can show which input variables are most significant and should be used in other data mining models. One of the main benefits of the decision tree algorithm is that it generates easily understandable rules. By following the nodes along a single path of branches, a rule can be constructed that assigns a case to a single class.

The following figures present the data mining model using the decision tree. In this decision tree, we use data from the relational tables, with “sales_fact” as the case key table, “sales_amount” as the case key column, and “sales_country” and “product_department” as the prediction columns.

Figure: decision tree data mining model

Conclusion and Discussion

A data warehouse is an information environment that provides an integrated, enterprise-wide view. To build a data warehouse, we need to collect data from multiple sources over a long period of time (current and historical data). The data in our project are not the actual data of Harley-Davidson Inc.; we used the company's data as a reference and built a pseudo-database.
After applying OLAP techniques (drill-down, roll-up, slicing and dicing) and data mining techniques (decision tree) to this pseudo-database, we came to the following conclusions:

- The average sale per transaction, per dealer, for all countries ranges from $4,367 to $4,612.
- Among the motorcycle product lines, Touring is the most popular (highest sales amount).
- Among the U.S. regions, the Northwest contributes the most to total sales.
- Based on the decision tree, sales of the Sportster® and Touring product lines are likely to increase in the future (probabilities of 24.93% and 24.65% of 16,116 cases, respectively).
- Customers in Western Europe are likely to become more and more interested in the brand. Therefore, this market should be expanded in the next few years.

This presentation demonstrates the process of building a data warehouse and designing a multidimensional model, and shows how OLAP and data mining techniques help managers in decision making. In reality, the task of implementing data mining on a data warehouse is much more complex. The process of data mining consists of three stages: (1) initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., applying the model to new data in order to generate predictions). In the second stage, different models are applied to the same data set and their performance is compared; the best model is then applied to new data to generate predictions.
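The model-comparison step in stage (2) can be sketched in plain Python. Everything in this sketch is an assumption made for illustration: the cases are invented, and the two "models" (a majority-class baseline and a one-attribute lookup rule) are deliberately trivial stand-ins, not the Analysis Services decision tree used in the project.

```python
from collections import Counter, defaultdict

# Invented cases: (sales_region, product_line purchased).
train_cases = [
    ("North West", "Touring"), ("North West", "Touring"),
    ("South West", "Sportster"), ("South West", "Sportster"),
    ("South West", "Touring"),
]
holdout_cases = [
    ("North West", "Touring"), ("South West", "Sportster"),
    ("South West", "Sportster"),
]


def fit_majority(cases):
    """Baseline: always predict the most frequent label in the training set."""
    label = Counter(y for _, y in cases).most_common(1)[0][0]
    return lambda x: label


def fit_one_attribute_rule(cases):
    """Predict the most frequent label seen for each attribute value."""
    by_value = defaultdict(Counter)
    for x, y in cases:
        by_value[x][y] += 1
    table = {x: counts.most_common(1)[0][0] for x, counts in by_value.items()}
    fallback = fit_majority(cases)  # used for values not seen in training
    return lambda x: table.get(x, fallback(x))


def accuracy(model, cases):
    """Fraction of cases whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in cases) / len(cases)


def choose_best(fitters, train, holdout):
    """Fit every candidate on the same data and keep the best on held-out cases."""
    scores = {name: accuracy(fit(train), holdout)
              for name, fit in fitters.items()}
    return max(scores, key=scores.get), scores


best_name, scores = choose_best(
    {"majority": fit_majority, "one_attribute_rule": fit_one_attribute_rule},
    train_cases,
    holdout_cases,
)
```

Scoring on cases that were held out of training is what makes the comparison a validation step; the winning model is the one that would then be deployed on new data in stage (3).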