Data Mining: Concepts and Techniques (3rd ed.) Unit-I Jiawei Han, Micheline Kamber, and Jian Pei 1 Unit-1 (a) : Data Warehousing 1. Basic Concepts 2. Data Warehouse Modeling 3. Data Warehouse Design and Usage 4. Data Generalization by Attribute-Oriented Induction 2 1. Basic Concepts i. What is a DWH? ii. Differences between OLTP and OLAP iii. Why a Separate Data Ware house? iv. A Multi-tiered Architecture v. DWH Models (Enterprise WH, Data Mart, Virtual WH) vi. Extraction, Transformation and Loading vii. Metadata Repository 3 (i) DWH Definitions Defined in many different ways, but not rigorously. A decision support database that is maintained separately from the organization’s operational database. Support information processing by providing a solid platform of consolidated, historical data for analysis. “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon Data warehousing: March 2, 2022 The process of constructing and using data warehouses Unit-I: DWH and OLAP Technology 4 DWH Characteristics—Subject-Oriented Organized around major subjects, such as customer, product, sales. Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. March 2, 2022 Unit-I: DWH and OLAP Technology 5 DWH Characteristics - Integrated Constructed by integrating multiple, heterogeneous data sources Relational Databases, Flat Files, On-line transaction records. Data cleaning and data integration techniques are applied. Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources March 2, 2022 E.g., Hotel price: currency, tax, breakfast covered, etc. When data is moved to the warehouse, it is converted. Unit-I: DWH and OLAP Technology 6 DWH Characteristics — Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems Operational database: current value data. Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) Every key structure in the data warehouse Contains an element of time, explicitly or implicitly. But the key of operational data may or may not contain “time element”. March 2, 2022 Unit-I: DWH and OLAP Technology 7 DWH Characteristics—Nonvolatile A physically separate store of data transformed from the operational environment Operational update of data does not occur in the data warehouse environment Does not require transaction processing, recovery, and concurrency control mechanisms Requires March 2, 2022 only two operations in data accessing: initial loading of data and access of data Unit-I: DWH and OLAP Technology 8 Data Warehouse Vs Heterogeneous DBMS Traditional heterogeneous DB integration: A query driven approach Build wrappers/mediators on top of heterogeneous databases When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set. Complex information filtering, compete for resources. Data warehouse: update-driven, High Performance Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis. March 2, 2022 Unit-I: DWH and OLAP Technology 9 ii. Data Warehouse Vs Operational DBMS OLTP (on-line transaction processing) Major task of traditional Relational DBMS Day-to-day operations: purchasing, inventory, manufacturing, payroll, registration, accounting, etc. banking, OLAP (on-line analytical processing) Major task of data warehouse system Data analysis and decision making Distinct features (OLTP Vs OLAP): User and system orientation: Customer Vs Market Data contents: Current, Detailed Vs Historical, Consolidated Database design: ER + Application Vs Star + Subject View: Current, Local Vs Evolutionary, Integrated Access patterns: Update Vs Read-only but Complex Queries March 2, 2022 Unit-I: DWH and OLAP Technology 10 March 2, 2022 Unit-I: DWH and OLAP Technology 11 DBMS, OLAP and Data Mining DBMS OLAP Data Mining Task Extraction of detailed and summary data Knowledge Summaries, trends discovery of and forecasts hidden patterns and insights Type of Result Information Analysis Insight and Prediction Method Deduction (Ask the question, verify with data) Multidimensional data modeling, Aggregation, Statistics Induction (Build the model, apply it to new data, get the result) Who purchased mutual funds in the last 3 years? What is the average income of mutual fund buyers by region by year? Who will buy a mutual fund in the next 6 months and why? Example question iii. Why Separate Data Warehouse? High performance for both systems Warehouse—tuned for OLAP: Complex Multidimensional view, Consolidation OLAP queries, Different functions and different data: DBMS— tuned for OLTP: Access methods, Indexing, Concurrency control, Recovery missing data: Decision support requires historical data which operational DBs do not typically maintain data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled Note: There are more and more systems which perform OLAP analysis directly on relational databases March 2, 2022 Unit-I: DWH and OLAP Technology 13 iv. Architecture of a DWH 14 v. Three Data Warehouse Models Enterprise warehouse collects all of the information about subjects spanning the entire organization Data Mart a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart Independent vs. dependent (directly from warehouse) data mart Virtual warehouse A set of views over operational databases Only some of the possible summary views may be materialized 15 vi. Extraction, Transformation, and Loading (ETL) Data Extraction get data from multiple, heterogeneous, and external sources Data Cleaning detect errors in the data and rectify them when possible Data Transformation convert data from legacy or host format to warehouse format Load sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions Refresh propagate the updates from the data sources to the warehouse 16 vii. Metadata Repository Meta data is the data defining warehouse objects. It stores: Description of the structure of the data warehouse schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents Operational meta-data data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails) The algorithms used for summarization The mapping from operational environment to the data warehouse Data related to system performance warehouse schema, view and derived data definitions Business data business terms and definitions, ownership of data, charging policies 17 Unit-1 (a) : Data Warehousing Data Warehouse: Basic Concepts Data Warehouse Modeling: Data Cube and OLAP Data Warehouse Design and Usage Data Generalization by Attribute-Oriented Induction Summary 18 2. Data Warehouse Modeling i. Data Cube: A Multidimensional Data Model ii. Schemas for Multidimensional Data Models: Stars, Snowflakes, and Fact Constellations iii. Dimensions: The Role of Concept Hierarchies iv. Measures: Their Categorization and Computation v. Typical OLAP Operations vi. A Starnet Query Model for Querying Multidimensional Databases 19 i. Data Cube: A Multidimensional Data Model A data warehouse is based on a multidimensional data model which views data in the form of a data cube. A data cube, allows data to be modeled and viewed in multiple dimensions (Eg: Sales) Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube. 20 March 2, 2022 Data Mining: Concepts and Techniques 21 March 2, 2022 Data Mining: Concepts and Techniques 22 March 2, 2022 Data Mining: Concepts and Techniques 23 Cube: A Lattice of Cuboids all time 0-D (apex) cuboid item time,location time,item location supplier item,location time,supplier 1-D cuboids location,supplier 2-D cuboids item,supplier time,location,supplier 3-D cuboids time,item,location time,item,supplier item,location,supplier 4-D (base) cuboid time, item, location, supplier 24 Schemas for Multidimensional Data Models Modeling data warehouses: dimensions & measures Star schema: A fact table in the middle connected to a set of dimension tables Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation 25 March 2, 2022 Data Mining: Concepts and Techniques 26 27 28 iii. Dimensions: The Role of Concept Hierarchies 29 Contd.. 30 iv. Measures: Their Categorization and Computation Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function E.g., count(), sum(), min(), max() E.g., avg(), min_N(), standard_deviation() Holistic: if there is no constant bound on the storage size needed to describe a subaggregate. E.g., median(), mode(), rank() 31 v. Typical OLAP Operations Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and dice: project and select Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes Other operations drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its back-end relational tables (using SQL) 32 OLAP Systems Vs Statistical Databases A statistical database is a database system that is designed to support statistical applications. SDBs tend to focus on socioeconomic applications and OLAP has been targeted for business applications. Privacy issues regarding concept hierarchies are a major concern for SDBs. OLAP systems are designed for efficiently handling huge amounts of data. 33 Typical OLAP Operations 34 vi. A Starnet Query Model for Querying MDBs Unit-1 (a) : Data Warehousing Data Warehouse: Basic Concepts Data Warehouse Modeling: Data Cube and OLAP Data Warehouse Design and Usage Data Generalization by Attribute-Oriented Induction Summary 36 3. Data Warehouse Design and Usage i. A Business Analysis Framework for Data Warehouse Design ii. Data Warehouse Design Process iii. Data Warehouse Usage for Information Processing iv. From Online Analytical Processing to Multidimensional Data Mining 37 i. A Business Analysis Framework Four views regarding the design of a data warehouse Top-down view Data source view exposes the information being captured, stored, and managed by operational systems Data warehouse view allows selection of the relevant information necessary for the data warehouse consists of fact tables and dimension tables Business query view sees the perspectives of data in the warehouse from the view of end-user 38 ii. Data Warehouse Design Process Top-down, bottom-up approaches or a combination of both Top-down: Starts with overall design and planning (mature) Bottom-up: Starts with experiments and prototypes (rapid) From software engineering point of view Waterfall: structured and systematic analysis at each step before proceeding to the next Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around Typical data warehouse design process Choose a business process to model, e.g., orders, invoices, etc. Choose the grain (atomic level of data) of the business process Choose the dimensions that will apply to each fact table record Choose the measure that will populate each fact table record 39 iii. Data Warehouse Usage Three kinds of data warehouse applications Information processing supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical processing multidimensional analysis of data warehouse data supports basic OLAP operations, slice-dice, drilling, pivoting Data mining knowledge discovery from hidden patterns supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools 40 iv. From On-Line Analytical Processing to On Line Analytical Mining Why online analytical mining? High quality of data in data warehouses DW contains integrated, consistent, cleaned data Available information processing structure surrounding data warehouses ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools OLAP-based exploratory data analysis Mining with drilling, dicing, pivoting, etc. On-line selection of data mining functions Integration and swapping of multiple mining functions, algorithms, and tasks 41 42 Unit-1 (a) : Data Warehousing Data Warehouse: Basic Concepts Data Warehouse Modeling: Data Cube and OLAP Data Warehouse Design and Usage Data Generalization by Attribute-Oriented Induction Summary 43 4. Data Generalization by Attribute-Oriented Induction i. Attribute-Oriented Induction for Data Characterization ii. Efficient Implementation of Attribute-Oriented Induction iii. Attribute-Oriented Induction for Class Comparisons 44 Introduction to Data Generalization The data cube can be viewed as a kind of multidimensional data generalization. Data generalization summarizes data by replacing relatively low-level values with higher-level concepts. Eg: Age -> Young, Middle-aged and Senior Reducing the number of dimensions to summarize data in concept space involving fewer dimensions Eg: Removing birth_date and telephone_no summarizing the behavior of a group of students). when 45 Introduction to Concept Description Allowing data sets to be generalized at multiple levels of abstraction facilitates users in examining the general behavior of the data. Eg: Given the AllElectronics database, instead of examining individual customer transactions, sales managers may refer to view the data generalized to higher levels, such as summarized by customer groups according to geographic regions, frequency of purchases per group, and customer income. 46 Characterization and Discrimination Concept Description, which is a form of data generalization. It generates descriptions for data characterization and comparison. Characterization provides a concise summarization of the given data collection. and succinct Concept or class comparison (discrimination) provides descriptions comparing two or more data collections. 47 i. Attribute-Oriented Induction Proposed in 1989, before data cube approach. Process steps: i. ii. iii. iv. Collect the task-relevant data (initial relation) using a relational database query Perform generalization by attribute removal or attribute generalization Apply aggregation by merging identical, generalized tuples and accumulating their respective counts Interaction with users for knowledge presentation 48 49 An Example Query: Describe general characteristics of graduate students in the Big_University Database. Given the attributes: Name gender major birth_place birth_date residence phone# and gpa (grade_point_average). 50 Steps in AOI Step 1: Fetch relevant set of data using an SQL statement. Step 2: Perform attribute-oriented induction Step 3: Present results in generalized relation, cross-tab, or rule forms 51 DMQL for characterization use Big_University_DB mine characteristics as “Science_Students” in relevance to name, gender, major, birth_place,birth_date, residence, phone#, gpa from student where status in “graduate” RDBMS Query for characterization use Big_University_DB select name, gender, major, birth_place, birth_date, residence, phone#, gpa from student where status in {“M.Sc.,” “M.A.,” “M.B.A.,” “Ph.D.”} 52 (Task-relevant) Initial Working Relation 53 Step 2: Perform attribute-oriented induction The essential operation of attribute-oriented induction is data generalization, can be performed in two ways on the initial working relation: attribute removal and attribute generalization. Rule for attribute removal: If there is a large set of distinct values for an attribute of the initial working relation, but either case 1: There is no generalization operator on the attribute (e.g., there is no concept hierarchy defined for the attribute) (or) (Name) case 2: its higher-level concepts are expressed in terms of other attributes, then the attribute should be removed from the working relation. (Street -> city, state, country) 54 Step 2: Perform attribute-oriented induction Rule for Attribute generalization If there is a large set of distinct values for an attribute in the initial working relation, and there exists a set of generalization operators on the attribute, then a generalization operator should be selected and applied to the attribute. (DOB ->Age) Attribute generalization control: If the number of distinct values in an attribute is greater than the attribute threshold, further attribute removal or attribute generalization should be performed Problems Over-generalization and under-generalization 55 Step 2: Perform attribute-oriented induction generalized relation threshold control If the number of (distinct) tuples in the generalized relation is greater than the threshold, further generalization should be performed. Otherwise, no further generalization should be performed. 56 Class Characterization: An Example Name Gender Jim Initial Woodman Relation Scott M Major M F … Removed Retained Residence Phone # GPA Vancouver,BC, 8-12-76 Canada CS Montreal, Que, 28-7-75 Canada Physics Seattle, WA, USA 25-8-70 … … … 3511 Main St., Richmond 345 1st Ave., Richmond 687-4598 3.67 253-9106 3.70 125 Austin Ave., Burnaby … 420-5232 … 3.83 … Sci,Eng, Bus City Removed Excl, VG,.. Gender Major M F … Birth_date CS Lachance Laura Lee … Prime Generalized Relation Birth-Place Science Science … Country Age range Birth_region Age_range Residence GPA Canada Foreign … 20-25 25-30 … Richmond Burnaby … Very-good Excellent … Count 16 22 … Birth_Region Canada Foreign Total Gender M 16 14 30 F 10 22 32 Total 26 36 62 57 Basic Principles of Attribute-Oriented Induction Data focusing: task-relevant data, including dimensions, and the result is the initial relation Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A Attribute-threshold control: typical 2-8, specified/default Generalized relation relation/rule size threshold control: control the final 58 Step3: Presentation of Generalized Results Generalized relation: Cross tabulation: Relations where some or all attributes are generalized, with counts or other aggregation values accumulated. Mapping results into cross tabulation form (similar to contingency tables). Visualization techniques: Pie charts, bar charts, curves, cubes, and other visual forms. Quantitative characteristic rules: Mapping generalized result into characteristic rules with quantitative information associated with it, e.g., grad( x) male( x) birth_ region( x) "Canada"[t :53%] birth_ region( x) " foreign"[t : 47%]. 59 60 61 62 Attribute-Oriented Induction: Basic Algorithm InitialRel: Query processing of task-relevant data, deriving the initial relation. PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize? PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts. Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations. 63 Presentation of Generalized Results Generalized relation: Cross tabulation: Relations where some or all attributes are generalized, with counts or other aggregation values accumulated. Mapping results into cross tabulation form (similar to contingency tables). Visualization techniques: Pie charts, bar charts, curves, cubes, and other visual forms. Quantitative characteristic rules: Mapping generalized result into characteristic rules with quantitative information associated with it, e.g., grad( x) male( x) birth_ region( x) "Canada"[t :53%] birth_ region( x) " foreign"[t : 47%]. 64 Mining Class Comparisons Comparison: Comparing two or more classes Method: Partition the set of relevant data into the target class and the contrasting class(es) Generalize both classes to the same high level concepts Compare tuples with the same high level descriptions Present for every tuple its description and two measures support - distribution within single class comparison - distribution between classes Highlight the tuples with strong discriminant features Relevance Analysis: Find attributes (features) which best distinguish different classes 65 66 67 68 Unit-1 (a) : Data Warehousing Data Warehouse: Basic Concepts Data Warehouse Modeling: Data Cube and OLAP Data Warehouse Design and Usage Data Generalization by Attribute-Oriented Induction Summary 69 Summary Data warehousing: A multi-dimensional model of a data warehouse A data cube consists of dimensions & measures Star schema, snowflake schema, fact constellations OLAP operations: drilling, rolling, slicing, dicing and pivoting Data Warehouse Architecture, Design, and Usage Multi-tiered architecture Business analysis design framework Information processing, analytical processing, data mining, OLAM (Online Analytical Mining) Implementation: Efficient computation of data cubes Partial vs. full vs. no materialization Indexing OALP data: Bitmap index and join index OLAP query processing OLAP servers: ROLAP, MOLAP, HOLAP Data generalization: Attribute-oriented induction 70 Unit-1 (a) : Data Warehousing Data Warehouse: Basic Concepts (a) What Is a Data Warehouse? (b) Data Warehouse: A Multi-Tiered Architecture (c) Three Data Warehouse Models: Enterprise Warehouse, Data Mart, ad Virtual Warehouse (d) Extraction, Transformation and Loading (e) Metadata Repository Data Warehouse Modeling: Data Cube and OLAP (a) Cube: A Lattice of Cuboids (b) Conceptual Modeling of Data Warehouses (c) Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases (d) Dimensions: The Role of Concept Hierarchy (e) Measures: Their Categorization and Computation (f) Cube Definitions in Database systems (g) Typical OLAP Operations (h) A Starnet Query Model for Querying Multidimensional Databases Data Warehouse Design and Usage (a) Data Warehouses Design Processes (b) Data Warehouse Usage (c) From OLAP to OLAM (d) Design of Data Warehouses: A Business Analysis Framework (e) Analytical Processing to On-Line Analytical Mining Data Generalization by Attribute-Oriented Induction (a) Attribute-Oriented Induction for Data Characterization (b) Efficient Implementation of Attribute-Oriented Induction (c) Attribute-Oriented Induction for Class Comparisons (d) Attribute-Oriented Induction vs. Cube-Based OLAP 71