Data Mining Course IT449 Text Book Data Mining: Concepts and Techniques 3rd Edition Chapter 4 : Data Warehousing and On-line Analytical Processing 1 Data Mining: Concepts and Techniques (3rd ed.) — Chapter 4 — Jiawei Han, Micheline Kamber, and Jian Pei ©2011 Han, Kamber & Pei. All rights reserved. 2 Chapter 4: Data Warehousing(DW) and On-line Analytical Processing n Data Warehouse: Basic Concepts n Data Warehouse Modeling: Data Cube and OLAP n Data Warehouse Design and Usage n Data Warehouse Implementation n Data Generalization by Attribute-Oriented Induction n Additional materials n Summary 4 What is a Data Warehouse? n Defined in many different ways, but not rigorously. n A decision support database that is maintained separately from the organization’s operational database n Support information processing by providing a solid platform of consolidated, historical data for analysis. n “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon n Data warehousing: n The process of constructing and using data warehouses 5 Barry Devlin, Definitions of DW A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context. 6 Other DW Definition Info/knowledge A process of transforming data into information and making it available to users in a timely enough manner to make a difference, it is suitable environment for knowledge mining Data 7 Data mining and OLAP OLAP: Online Analytical Processing n n n n n OLAP (on dataware house) is a data summarization/aggregation tool that helps simplify data analysis. OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling. Data mining involves more automated and deeper analysis than OLAP. It is expected to have broader applications. Data mining can help business managers find and reach more suitable customers, as well as gain critical business insights that may help drive market share and raise profits. OLAP is a data summarization/aggregation tool that helps simplify data analysis. December 27, 2022 Data Mining: Concepts and Techniques 8 Data Warehouse n “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon 9 Data Warehouse—Subject-Oriented n Organized around major subjects, such as customer, product, sales n Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing n Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process 10 Cont-Data Warehouse—Subject-Oriented Application-Orientation Operational Database Loans Credit Card Trust Subject-Orientation G Customer Product Savings December 27, 2022 Data Warehouse Vendor Activity Data Mining: Concepts and Techniques 11 Data Warehouse—Integrated n n Constructed by integrating multiple, heterogeneous data sources n relational databases, flat files, on-line transaction records Data cleaning and data integration techniques are applied. n Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources 12 Data Warehouse—Time Variant n The time horizon for the data warehouse is significantly longer than that of operational systems n n n Operational database: current value data no deal of in Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years), static data element of tin Every key structure in the data warehouse n n Contains an element of time, explicitly or implicitly But the key of operational data may or may not contain “time element” 13 Data Warehouse—Nonvolatile n A physically separate store of data transformed from the operational environment n Operational update of data does not occur in the data warehouse environment n Does not require transaction processing, recovery, and concurrency control mechanisms n Requires only two operations in data accessing: n initial loading of data and access of data 14 OLTP (on database) vs. OLAP (on data warehouse) OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data historical, summarized, multidimensional integrated, consolidated lots of scans unit of work current, up-to-date detailed, flat relational isolated read/write index/hash on prim. key short, simple transaction # records accessed tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response access complex query 15 Data Warehouse: A Multi-Tiered Architecture Metadata Other sources Operational DBs of QU Extract Transform Load Refresh Monitor & Integrator Data Warehouse Raped Miner Server Serve Analysis Query Reports Data mining Data Marts QU Data Sources Data Storage Raped Miner Front-End Tools Engine 16 Three Data Warehouse Models n n n Enterprise warehouse n collects all of the information about subjects spanning the entire organization such as QU Data Mart n a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart n Independent vs. dependent (directly from warehouse) data mart Virtual warehouse n A set of views over operational databases n Only some of the possible summary views may be materialized 17 Extraction, Transformation, and Loading (ETL) n n n n n Data extraction n get data from multiple, heterogeneous, and external sources Data cleaning n detect errors in the data and rectify them when possible Data transformation n convert data from legacy or host format to warehouse format Load n sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions Refresh n propagate the updates from the data sources to the warehouse 18 Chapter 4: Data Warehousing and On-line Analytical Processing n Data Warehouse: Basic Concepts n Data Warehouse Modeling: Data Cube and OLAP n Data Warehouse Design and Usage n Data Warehouse Implementation n Data Generalization by Attribute-Oriented Induction n Additional materials n Summary 19 Multidimensional Data Model n A data warehouse is based on a multidimensional data model which views data in the form of a data cube n A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions n Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) n Fact table contains measures (such as dollars_sold (sales amount in dollars and keys to each of the related dimension tables n The data cube summarizes the measure with respect to a set of n dimensions and provides summarizations for all subsets of them 20 Multidimensional Data Model 21 Conceptual Modeling of Data Warehouses n Modeling data warehouses: dimensions & measures n Star schema: A fact table in the middle connected to a set of dimension tables n Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake n Fact constellations: Multiple fact tables share ﻛو ﻛﺑ ﺔ dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation 22 Example of Star Schema time item time_key day day_of_the_week month quarter year Sales Fact Table time_key item_key branch_key branch branch_key branch_name branch_type location_key units_sold dollars_sold avg_sales item_key item_name brand type supplier_type location location_key street city state_or_province country Measures 23 Example of Snowflake Schema ﻧدﻓﺔ اﻟﺛﻠﺞ time time_key day day_of_the_week month quarter year item Sales Fact Table time_key item_key branch_key branch location_key branch_key branch_name branch_type units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_key supplier supplier_key supplier_type location location_key street city_key city city_key city state_or_province country 24 Example of Fact Constellation time time_key day day_of_the_week month quarter year item Sales Fact Table time_key item_key item_key item_name brand type supplier_type location_key branch_key branch_name branch_type units_sold dollars_sold avg_sales Measures time_key item_key shipper_key from_location branch_key branch Shipping Fact Table location to_location location_key street city province_or_state country dollars_cost units_shipped shipper shipper_key shipper_name location_key 25 shipper_type A Concept Hierarchy: Dimension (location) all all Europe region country city office Germany Frankfurt ... ... ... Spain North_America Canada Vancouver ... L. Chan ... Mexico Toronto ... M. Wind 26 Measures of Data Cubes: Three Categories n Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning n n Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function n n E.g., count(), sum(), min(), max() E.g., avg(), min_N(), standard_deviation() Holistic: if there is no constant bound on the storage size needed to describe a subaggregate. n E.g., median(), mode(), rank() Q. What is the main difference between these three measures? 27 Multidimensional Data Sales volume as a function of product, month, and region Re gi on Dimensions: Product, Location, Time Hierarchical summarization paths Industry Region Year Category Country Quarter Product n Product City Office Month Week Day Month 28 A Sample Data Cube • The data cube summarizes the fact (measure with respect to a set of n dimensions and provides summarizations for all Pr od TV PC VCR sum 1Qtr Date 2Qtr 3Qtr 4Qtr sum Total annual sales of TVs in U.S.A. U.S.A Country uc t subsets of them Canada Mexico sum 29 Typical OLAP Operations n Roll up (drill-up): summarize data n by climbing up hierarchy or by dimension reduction n Drill down (roll down): reverse of roll-up n from higher level summary to lower level summary or detailed data, or introducing new dimensions drill across: involving (across) more than one fact table n n n n Slice Dice Pivot (rotate): n reorient the cube, visualization, 3D to series of 2D planes 30 Typical OLAP Operations 31 Browsing a Data Cube n n n Visualization OLAP capabilities Interactive manipulation 32 Chapter 4: Data Warehousing and On-line Analytical Processing n Data Warehouse: Basic Concepts n Data Warehouse Modeling: Data Cube and OLAP n Data Warehouse Design and Usage n Data Warehouse Implementation n Data Generalization by Attribute-Oriented Induction n Additional materials n Summary 33 Design of Data Warehouse: A Business Analysis Framework n Four views regarding the design of a data warehouse n Top-down view n n Data source view n n Exposes the information being captured, stored, and managed by operational systems Data warehouse view n n The top-down view allows the selection of the relevant information necessary for the data warehouse. This information matches current and future business needs. consists of fact tables and dimension tables Business query view n Sees the perspectives of data in the warehouse from the view of end-user 34 Data Warehouse Design strategies n n Top-down, bottom-up approaches or a combination of both n Top-down: Starts with overall design and planning (mature) n Bottom-up: Starts with experiments and prototypes (rapid) From software engineering point of view n n n From the software engineering point of view, the design and construction of a data warehouse may consist of the following steps: planning, requirements study, problem analysis, warehouse design, data integration and testing, and finally deployment of the data warehouse. Large software systems can be developed using one of two methodologies: the waterfall method or the spiral method. 35 warehouse design process 1. Choose a business process to model (e.g., orders, invoices, shipments, inventory, account administration, sales, or the general ledger). 2. Choose the business process “grain”, which is the fundamental , atomic level of data to be represented in the fact table for this process (e.g., individual transactions, individual daily snapshots, and so on). 3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status. 4. Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold. 36 Data Warehouse Usage n Three kinds of data warehouse applications n Information processing n n n supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical processing n multidimensional analysis of data warehouse data n Eg: OLAP operations, slice-dice, drilling, pivoting Data mining n n knowledge discovery from hidden patterns supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools 37 Chapter 4: Data Warehousing and On-line Analytical Processing n Data Warehouse: Basic Concepts n Data Warehouse Modeling: Data Cube and OLAP n Data Warehouse Design and Usage n Data Warehouse Implementation n Data Generalization by Attribute-Oriented Induction n Additional materials n Summary 38 Data Warehouse Implementation December 27, 2022 Data Mining: Concepts and Techniques 39 Indexing OLAP Data: Bitmap Index n n n n Index on a particular column Each value in the column has a bit vector: bit-op is fast The length of the bit vector: # of records in the base table The i-th bit is set if the i-th row of the base table has the value for the indexed column Base table Index on Type Index on Region Cust Region Type C1 C2 C3 C4 C5 Asia Europe Asia America Europe Retail Dealer Dealer Retail Dealer RecID Asia 1 2 3 4 5 1 0 1 0 0 Europe America RecID Retail Dealer 0 1 0 0 1 0 0 0 1 0 1 2 3 4 5 1 0 0 1 0 0 1 1 0 1 40 Chapter 4: Data Warehousing and On-line Analytical Processing n Data Warehouse: Basic Concepts n Data Warehouse Modeling: Data Cube and OLAP n Data Warehouse Design and Usage n Data Warehouse Implementation n Data Generalization by Attribute-Oriented Induction n Additional materials n Summary 41 Attribute-Oriented Induction n n n Collect the task-relevant data (initial relation) using a relational database query Perform generalization by attribute removal or attribute generalization Apply aggregation by merging identical, generalized tuples and accumulating their respective counts n Interaction with users for knowledge presentation n Example: … 42 Attribute-Oriented Induction: An Example Example: Describe general characteristics of graduate students in the University database n Step 1. Fetch relevant set of data using an SQL statement, e.g., Select * (i.e., name, gender, major, birth_place, birth_date, residence, phone#, gpa) from student where student_status in {“Msc”, “MBA”, “PhD” } n n Step 2. Perform attribute-oriented induction Step 3. Present results in generalized relation, cross-tab, or rule forms 43 Class Characterization: An Example Name Gender Jim Initial Woodman Relation Scott M Birth_date Vancouver,BC, 8-12-76 Canada CS Montreal, Que, 28-7-75 Canada Physics Seattle, WA, USA 25-8-70 … … … M F … Removed Retained Sci,Eng, Bus Gender Major M F … Birth-Place CS Lachance Laura Lee … Prime Generalized Relation Major Science Science … Country Age range Residence Phone # GPA 3511 Main St., Richmond 345 1st Ave., Richmond 687-4598 3.67 253-9106 3.70 125 Austin Ave., Burnaby … 420-5232 … 3.83 … City Removed Excl, VG,.. Birth_region Age_range Residence GPA Canada Foreign … 20-25 25-30 … Richmond Burnaby … Very-good Excellent … Count 16 22 … Birth_Region Canada Foreign Total Gender M 16 14 30 F 10 22 32 Total 26 36 62 44 Other Data Mining Tools n WEKa: WEKA is a collection of machine learning algorithms for data mining tasks n Rapid Miner n KNIME: The Konstanz Information Miner is a modular environment, It is designed as a teaching, research and collaboration platform n TANAGRA:is open-source data analysis software for academic and research purposes which combines data mining techniques with statistical learning. n Other Tools such as OLAP December 27, 2022 Data Mining: Concepts and Techniques 45 Chapter 4: Data Warehousing and On-line Analytical Processing n Data Warehouse: Basic Concepts n Data Warehouse Modeling: Data Cube and OLAP n Data Warehouse Design and Usage n Data Warehouse Implementation n Data Generalization by Attribute-Oriented Induction n Additional materials n Summary 46 Additional materials: A producer wants to know Which are our lowest/highest margin customers ? Who are my customers and what products are they buying? What is the most effective distribution channel? What product prom-otions have the biggest effect on revenue? Which customers are most likely to go to the competition ? اﻻﻳﺮادات What impact will new products/services have on revenue and margins? How we can know that? 47 Very Large Data Bases n Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes n Petabytes -- 10^15 bytes: n Exabytes -- 10^18 bytes: Geographic Information Systems National Medical Records n Zettabytes -- 10^21 bytes: n Zottabytes -- 10^24 bytes: Weather images Intelligence Agency Videos 48 Data Mining works with Warehouse Data n Data Warehousing provides the Enterprise with a memory Data Mining provides the Enterprise with intelligence 49 What factors make data mining possible? n Advances in the following areas are making data mining deployable: n n n n n Better and more data (i.e., operational, behavioral, and demographic Ability to establish data warehousing The emergence of easily deployed data mining tools The advent/emergence of new data mining techniques. Experts and supports 50 So, what’s different? December 27, 2022 Data Mining: Concepts and Techniques 51 OLTP vs Data Warehouse n OLTP n n n n n n n Application Oriented Used to run business Detailed data Current up to date Isolated Data Repetitive access December 27, 2022 Data Warehouse n n n n n n Subject Oriented Used to analyze business Summarized and refined Snapshot data Integrated Data Knowledge User (Manager) Data Mining: Concepts and Techniques 52 OLTP vs Data Warehouse n OLTP n Performance Sensitive n Few Records accessed at a time (tens) n n n Read/Update Access No data redundancy Database Size 100MB -100 GB December 27, 2022 n Data Warehouse n Performance relaxed n Large volumes accessed at a time(millions) n Mostly Read (Batch Update) n Redundancy present n Database Size 100 GB few terabytes Data Mining: Concepts and Techniques اداء ﺧﻔﯾف 53 OLTP vs Data Warehouse n OLTP n n Data Warehouse Transaction n performance metric Thousands of users n throughput* is the n December 27, 2022 Query throughput is the performance metric Hundreds of users Data Mining: Concepts and Techniques 54 To summarize ... n OLTP Systems are used to “run” a business n December 27, 2022 The Data Warehouse helps to “optimize” the business Data Mining: Concepts and Techniques 55 Heart of the Data Warehouse n n n Heart of the data warehouse is the data itself! Single version of the truth Data is organized in a way that represents business -- subject orientation December 27, 2022 Data Mining: Concepts and Techniques 56 From the Data Warehouse to Data Marts Information Less Individually Structured History Normalized Detailed Departmentally Structured Organizationally Structured Data Warehouse More Data December 27, 2022 Data Mining: Concepts and Techniques 57 Summary n Data warehousing: A multi-dimensional model of a data warehouse n n n n A data cube consists of dimensions & measures Star schema, snowflake schema, fact constellations OLAP operations: drilling, rolling, slicing, dicing and pivoting Data Warehouse Architecture, Design, and Usage n Multi-tiered architecture n Business analysis design framework Information processing, analytical processing, data mining, OLAM (Online Analytical Mining) Implementation: Efficient computation of data cubes n Partial vs. full vs. no materialization n Indexing OALP data: Bitmap index and join index n OLAP query processing n OLAP servers: ROLAP, MOLAP, HOLAP n n n Data generalization: Attribute-oriented induction 58 December 27, 2022 Data Mining: Concepts and Techniques 59