Data Mining: Concepts and Techniques (3rd ed.) — Chapter 1 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved. 1 Chapter 1. Introduction ◼ Why Data Mining? ◼ What Is Data Mining? ◼ A Multi-Dimensional View of Data Mining ◼ What Kind of Data Can Be Mined? ◼ What Kinds of Patterns Can Be Mined? ◼ What Technology Are Used? ◼ What Kind of Applications Are Targeted? ◼ Major Issues in Data Mining ◼ A Brief History of Data Mining and Data Mining Society ◼ Summary 2 Why Data Mining? ◼ The Explosive Growth of Data: from terabytes to petabytes ◼ Data collection and data availability ◼ Automated data collection tools, database systems, Web, computerized society ◼ Major sources of abundant data ◼ Business: Web, e-commerce, transactions, stocks, … ◼ Science: Remote sensing, bioinformatics, scientific simulation, … ◼ Society and everyone: news, digital cameras, YouTube ◼ We are drowning in data, but starving for knowledge! ◼ “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 3 Evolution of Sciences ◼ Before 1600, empirical science ◼ 1600-1950s, theoretical science ◼ ◼ 1950s-1990s, computational science ◼ ◼ ◼ Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.) Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models. 1990-now, data science ◼ The flood of data from new scientific instruments and simulations ◼ The ability to economically store and manage petabytes of data online ◼ The Internet and computing Grid that makes all these archives universally accessible ◼ ◼ Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding. Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge! Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002 4 Evolution of Database Technology ◼ 1960s: ◼ ◼ 1970s: ◼ ◼ ◼ Relational data model, relational DBMS implementation 1980s: ◼ RDBMS, advanced data models (extended-relational, OO, deductive, etc.) ◼ Application-oriented DBMS (spatial, scientific, engineering, etc.) 1990s: ◼ ◼ Data collection, database creation, IMS and network DBMS Data mining, data warehousing, multimedia databases, and Web databases 2000s ◼ Stream data management and mining ◼ Data mining and its applications ◼ Web technology (XML, data integration) and global information systems 5 Chapter 1. Introduction ◼ Why Data Mining? ◼ What Is Data Mining? ◼ A Multi-Dimensional View of Data Mining ◼ What Kind of Data Can Be Mined? ◼ What Kinds of Patterns Can Be Mined? ◼ What Technology Are Used? ◼ What Kind of Applications Are Targeted? ◼ Major Issues in Data Mining ◼ A Brief History of Data Mining and Data Mining Society ◼ Summary 6 What Is Data Mining? ◼ Data mining (knowledge discovery from data) ◼ Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data ◼ ◼ Alternative names ◼ ◼ Data mining: a misnomer? 
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. Watch out: Is everything “data mining”? ◼ Simple search and query processing ◼ (Deductive) expert systems 7 Knowledge Discovery (KDD) Process ◼ ◼ This is a view from typical database systems and data Pattern Evaluation warehousing communities Data mining plays an essential role in the knowledge discovery Data Mining process Task-relevant Data Data Warehouse Selection Data Cleaning Data Integration Databases 8 Example: A Web Mining Framework ◼ Web mining usually involves ◼ Data cleaning ◼ Data integration from multiple sources ◼ Warehousing the data ◼ Data cube construction ◼ Data selection for data mining ◼ Data mining ◼ Presentation of the mining results ◼ Patterns and knowledge to be used or stored into knowledge-base 9 Data Mining in Business Intelligence Increasing potential to support business decisions Decision Making Data Presentation Visualization Techniques End User Business Analyst Data Mining Information Discovery Data Analyst Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems DBA 10 Example: Mining vs. Data Exploration ◼ ◼ ◼ ◼ ◼ Business intelligence view ◼ Warehouse, data cube, reporting but not much mining Business objects vs. data mining tools Supply chain example: tools Data presentation Exploration 11 KDD Process: A Typical View from ML and Statistics Input Data Data PreProcessing Data integration Normalization Feature selection Dimension reduction ◼ Data Mining Pattern discovery Association & correlation Classification Clustering Outlier analysis ………… PostProcessing Pattern Pattern Pattern Pattern evaluation selection interpretation visualization This is a view from typical machine learning and statistics communities 12 Example: Medical Data Mining ◼ ◼ Health care & medical data mining – often adopted such a view in statistics and machine learning Preprocessing of the data (including feature extraction and dimension reduction) ◼ Classification or/and clustering processes ◼ Post-processing for presentation 13 Chapter 1. Introduction ◼ Why Data Mining? ◼ What Is Data Mining? ◼ A Multi-Dimensional View of Data Mining ◼ What Kind of Data Can Be Mined? ◼ What Kinds of Patterns Can Be Mined? ◼ What Technology Are Used? ◼ What Kind of Applications Are Targeted? ◼ Major Issues in Data Mining ◼ A Brief History of Data Mining and Data Mining Society ◼ Summary 14 Multi-Dimensional View of Data Mining ◼ ◼ ◼ ◼ Data to be mined ◼ Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks Knowledge to be mined (or: Data mining functions) ◼ Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. ◼ Descriptive vs. predictive data mining ◼ Multiple/integrated functions and mining at multiple levels Techniques utilized ◼ Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc. Applications adapted ◼ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc. 15 Chapter 1. Introduction ◼ Why Data Mining? 
◼ What Is Data Mining? ◼ A Multi-Dimensional View of Data Mining ◼ What Kind of Data Can Be Mined? ◼ What Kinds of Patterns Can Be Mined? ◼ What Technology Are Used? ◼ What Kind of Applications Are Targeted? ◼ Major Issues in Data Mining ◼ A Brief History of Data Mining and Data Mining Society ◼ Summary 16 Data Mining: On What Kinds of Data? ◼ Database-oriented data sets and applications ◼ ◼ Relational database, data warehouse, transactional database Advanced data sets and advanced applications ◼ Data streams and sensor data ◼ Time-series data, temporal data, sequence data (incl. bio-sequences) ◼ Structure data, graphs, social networks and multi-linked data ◼ Object-relational databases ◼ Heterogeneous databases and legacy databases ◼ Spatial data and spatiotemporal data ◼ Multimedia database ◼ Text databases ◼ The World-Wide Web 17 Chapter 1. Introduction ◼ Why Data Mining? ◼ What Is Data Mining? ◼ A Multi-Dimensional View of Data Mining ◼ What Kind of Data Can Be Mined? ◼ What Kinds of Patterns Can Be Mined? ◼ What Technology Are Used? ◼ What Kind of Applications Are Targeted? ◼ Major Issues in Data Mining ◼ A Brief History of Data Mining and Data Mining Society ◼ Summary 18 Data Mining Function: (1) Generalization ◼ Information integration and data warehouse construction ◼ ◼ Data cube technology ◼ ◼ ◼ Data cleaning, transformation, integration, and multidimensional data model Scalable methods for computing (i.e., materializing) multidimensional aggregates OLAP (online analytical processing) Multidimensional concept description: Characterization and discrimination ◼ Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region 19 Data Mining Function: (2) Association and Correlation Analysis ◼ Frequent patterns (or frequent itemsets) ◼ ◼ What items are frequently purchased together in your Walmart? Association, correlation vs. causality ◼ A typical association rule ◼ ◼ ◼ ◼ Diaper → Beer [0.5%, 75%] (support, confidence) Are strongly associated items also strongly correlated? How to mine such patterns and rules efficiently in large datasets? How to use such patterns for classification, clustering, 20 Data Mining Function: (3) Classification ◼ Classification and label prediction ◼ Construct models (functions) based on some training examples ◼ Describe and distinguish classes or concepts for future prediction ◼ ◼ ◼ Predict some unknown class labels Typical methods ◼ ◼ E.g., classify countries based on (climate), or classify cars based on (gas mileage) Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, patternbased classification, logistic regression, … Typical applications: ◼ Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, … 21 Data Mining Function: (4) Cluster Analysis ◼ ◼ ◼ ◼ Unsupervised learning (i.e., Class label is unknown) Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns Principle: Maximizing intra-class similarity & minimizing interclass similarity Many methods and applications 22 Data Mining Function: (5) Outlier Analysis ◼ Outlier analysis ◼ ◼ Outlier: A data object that does not comply with the general behavior of the data Noise or exception? 
― One person’s garbage could be another person’s treasure ◼ Methods: by product of clustering or regression analysis, … ◼ Useful in fraud detection, rare events analysis 23 Time and Ordering: Sequential Pattern, Trend and Evolution Analysis ◼ Sequence, trend and evolution analysis ◼ ◼ Trend, time-series, and deviation analysis: e.g., regression and value prediction Sequential pattern mining ◼ ◼ ◼ Periodicity analysis Motifs and biological sequence analysis ◼ ◼ ◼ e.g., first buy digital camera, then buy large SD memory cards Approximate and consecutive motifs Similarity-based analysis Mining data streams 24 Structure and Network Analysis ◼ ◼ ◼ Graph mining ◼ Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments) Information network analysis ◼ Social networks: actors (objects, nodes) and relationships (edges) ◼ e.g., author networks in CS, terrorist networks ◼ Multiple heterogeneous networks ◼ A person could be multiple information networks: friends, family, classmates, … ◼ Links carry a lot of semantic information: Link mining Web mining ◼ Web is a big information network: from PageRank to Google ◼ Analysis of Web information networks ◼ Web community discovery, opinion mining, usage mining, … 25 Evaluation of Knowledge ◼ ◼ Are all mined knowledge interesting? ◼ One can mine tremendous amount of “patterns” and knowledge ◼ Some may fit only certain dimension space (time, location, …) ◼ Some may not be representative, may be transient, … Evaluation of mined knowledge → directly mine only interesting knowledge? ◼ Descriptive vs. predictive ◼ Coverage ◼ Typicality vs. novelty ◼ Accuracy ◼ Timeliness ◼ … 26 Chapter 1. Introduction ◼ Why Data Mining? ◼ What Is Data Mining? ◼ A Multi-Dimensional View of Data Mining ◼ What Kind of Data Can Be Mined? ◼ What Kinds of Patterns Can Be Mined? ◼ What Technology Are Used? ◼ What Kind of Applications Are Targeted? ◼ Major Issues in Data Mining ◼ A Brief History of Data Mining and Data Mining Society ◼ Summary 27 Data Mining: Confluence of Multiple Disciplines Machine Learning Applications Algorithm Pattern Recognition Data Mining Database Technology Statistics Visualization High-Performance Computing 28 Why Confluence of Multiple Disciplines? ◼ Tremendous amount of data ◼ ◼ High-dimensionality of data ◼ ◼ Micro-array may have tens of thousands of dimensions High complexity of data ◼ ◼ ◼ ◼ ◼ ◼ ◼ Algorithms must be highly scalable to handle such as tera-bytes of data Data streams and sensor data Time-series data, temporal data, sequence data Structure data, graphs, social networks and multi-linked data Heterogeneous databases and legacy databases Spatial, spatiotemporal, multimedia, text and Web data Software programs, scientific simulations New and sophisticated applications 29 Chapter 1. Introduction ◼ Why Data Mining? ◼ What Is Data Mining? ◼ A Multi-Dimensional View of Data Mining ◼ What Kind of Data Can Be Mined? ◼ What Kinds of Patterns Can Be Mined? ◼ What Technology Are Used? ◼ What Kind of Applications Are Targeted? 
◼ Major Issues in Data Mining ◼ A Brief History of Data Mining and Data Mining Society ◼ Summary 30 Applications of Data Mining ◼ Web page analysis: from web page classification, clustering to PageRank & HITS algorithms ◼ Collaborative analysis & recommender systems ◼ Basket data analysis to targeted marketing ◼ ◼ ◼ Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue) From major dedicated data mining systems/tools (e.g., SAS, MS SQLServer Analysis Manager, Oracle Data Mining Tools) to invisible data mining 31 Chapter 1. Introduction ◼ Why Data Mining? ◼ What Is Data Mining? ◼ A Multi-Dimensional View of Data Mining ◼ What Kind of Data Can Be Mined? ◼ What Kinds of Patterns Can Be Mined? ◼ What Technology Are Used? ◼ What Kind of Applications Are Targeted? ◼ Major Issues in Data Mining ◼ A Brief History of Data Mining and Data Mining Society ◼ Summary 32 Major Issues in Data Mining (1) ◼ ◼ Mining Methodology ◼ Mining various and new kinds of knowledge ◼ Mining knowledge in multi-dimensional space ◼ Data mining: An interdisciplinary effort ◼ Boosting the power of discovery in a networked environment ◼ Handling noise, uncertainty, and incompleteness of data ◼ Pattern evaluation and pattern- or constraint-guided mining User Interaction ◼ Interactive mining ◼ Incorporation of background knowledge ◼ Presentation and visualization of data mining results 33 Major Issues in Data Mining (2) ◼ ◼ ◼ Efficiency and Scalability ◼ Efficiency and scalability of data mining algorithms ◼ Parallel, distributed, stream, and incremental mining methods Diversity of data types ◼ Handling complex types of data ◼ Mining dynamic, networked, and global data repositories Data mining and society ◼ Social impacts of data mining ◼ Privacy-preserving data mining ◼ Invisible data mining 34 Chapter 1. Introduction ◼ Why Data Mining? ◼ What Is Data Mining? ◼ A Multi-Dimensional View of Data Mining ◼ What Kind of Data Can Be Mined? ◼ What Kinds of Patterns Can Be Mined? ◼ What Technology Are Used? ◼ What Kind of Applications Are Targeted? ◼ Major Issues in Data Mining ◼ A Brief History of Data Mining and Data Mining Society ◼ Summary 35 A Brief History of Data Mining Society ◼ 1989 IJCAI Workshop on Knowledge Discovery in Databases ◼ ◼ 1991-1994 Workshops on Knowledge Discovery in Databases ◼ ◼ Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991) Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996) 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98) ◼ Journal of Data Mining and Knowledge Discovery (1997) ◼ ACM SIGKDD conferences since 1998 and SIGKDD Explorations ◼ More conferences on data mining ◼ ◼ PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc. ACM Transactions on KDD starting in 2007 36 Conferences and Journals on Data Mining ◼ KDD Conferences ◼ ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD) ◼ SIAM Data Mining Conf. (SDM) ◼ (IEEE) Int. Conf. on Data Mining (ICDM) ◼ European Conf. on Machine Learning and Principles and practices of Knowledge Discovery and Data Mining (ECML-PKDD) ◼ Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD) ◼ Int. Conf. 
on Web Search and Data Mining (WSDM) ◼ Other related conferences ◼ ◼ ◼ DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, … Web and IR conferences: WWW, SIGIR, WSDM ◼ ML conferences: ICML, NIPS ◼ PR conferences: CVPR, Journals ◼ ◼ Data Mining and Knowledge Discovery (DAMI or DMKD) IEEE Trans. On Knowledge and Data Eng. (TKDE) ◼ KDD Explorations ◼ ACM Trans. on KDD 37 Where to Find References? DBLP, CiteSeer, Google ◼ Data mining and KDD (SIGKDD: CDROM) ◼ ◼ ◼ Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM) ◼ ◼ ◼ ◼ ◼ Conferences: SIGIR, WWW, CIKM, etc. Journals: WWW: Internet and Web Information Systems, Statistics ◼ ◼ ◼ Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc. Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc. Web and IR ◼ ◼ Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc. AI & Machine Learning ◼ ◼ Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc. Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD Conferences: Joint Stat. Meeting, etc. Journals: Annals of statistics, etc. Visualization ◼ ◼ Conference proceedings: CHI, ACM-SIGGraph, etc. Journals: IEEE Trans. visualization and computer graphics, etc. 38 Chapter 1. Introduction ◼ Why Data Mining? ◼ What Is Data Mining? ◼ A Multi-Dimensional View of Data Mining ◼ What Kind of Data Can Be Mined? ◼ What Kinds of Patterns Can Be Mined? ◼ What Technology Are Used? ◼ What Kind of Applications Are Targeted? ◼ Major Issues in Data Mining ◼ A Brief History of Data Mining and Data Mining Society ◼ Summary 39 Summary ◼ ◼ ◼ ◼ ◼ Data mining: Discovering interesting patterns and knowledge from massive amount of data A natural evolution of database technology, in great demand, with wide applications A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation Mining can be performed in a variety of data Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. ◼ Data mining technologies and applications ◼ Major issues in data mining 40 Recommended Reference Books ◼ S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002 ◼ R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000 ◼ T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003 ◼ U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996 ◼ U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001 ◼ J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011 ◼ D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001 ◼ T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009 ◼ B. Liu, Web Data Mining, Springer 2006. ◼ T. M. Mitchell, Machine Learning, McGraw Hill, 1997 ◼ G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991 ◼ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005 ◼ S. M. Weiss and N. 
Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998 ◼ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005 41 Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser University ©2011 Han, Kamber, and Pei. All rights reserved. 42 Chapter 2: Getting to Know Your Data ◼ Data Objects and Attribute Types ◼ Basic Statistical Descriptions of Data ◼ Data Visualization ◼ Measuring Data Similarity and Dissimilarity ◼ Summary 43 44 Types of Data Sets ◼ pla y ball score game wi n lost timeout season ◼ coach ◼ team ◼ Record ◼ Relational records ◼ Data matrix, e.g., numerical matrix, crosstabs ◼ Document data: text documents: termfrequency vector ◼ Transaction data Graph and network ◼ World Wide Web ◼ Social or information networks ◼ Molecular Structures Ordered ◼ Video data: sequence of images ◼ Temporal data: time-series ◼ Sequential Data: transaction sequences ◼ Genetic sequence data Spatial, image and multimedia: ◼ Spatial data: maps ◼ Image data: ◼ Video data: Document 1 3 0 5 0 2 6 0 2 0 2 Document 2 0 7 0 2 1 0 0 3 0 0 Document 3 0 1 0 0 1 2 2 0 3 0 TID Items 1 Bread, Coke, Milk 2 3 4 5 Beer, Bread Beer, Coke, Diaper, Milk Beer, Bread, Diaper, Milk Coke, Diaper, Milk Important Characteristics of Structured Data ◼ Dimensionality ◼ ◼ Sparsity ◼ ◼ Only presence counts Resolution ◼ ◼ Curse of dimensionality Patterns depend on the scale Distribution ◼ Centrality and dispersion 45 Data Objects ◼ Data sets are made up of data objects. ◼ A data object represents an entity. ◼ Examples: ◼ ◼ sales database: customers, store items, sales ◼ medical database: patients, treatments ◼ university database: students, professors, courses Also called samples , examples, instances, data points, objects, tuples. ◼ Data objects are described by attributes. ◼ Database rows -> data objects; columns ->attributes. 46 Attributes ◼ Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object. ◼ ◼ E.g., customer _ID, name, address Types: ◼ Nominal ◼ Binary ◼ Numeric: quantitative ◼ Interval-scaled ◼ Ratio-scaled 47 Attribute Types ◼ ◼ ◼ Nominal: categories, states, or “names of things” ◼ Hair_color = {auburn, black, blond, brown, grey, red, white} ◼ marital status, occupation, ID numbers, zip codes Binary ◼ Nominal attribute with only 2 states (0 and 1) ◼ Symmetric binary: both outcomes equally important ◼ e.g., gender ◼ Asymmetric binary: outcomes not equally important. ◼ e.g., medical test (positive vs. negative) ◼ Convention: assign 1 to most important outcome (e.g., HIV positive) Ordinal ◼ Values have a meaningful order (ranking) but magnitude between successive values is not known. ◼ Size = {small, medium, large}, grades, army rankings 48 Numeric Attribute Types ◼ ◼ ◼ Quantity (integer or real-valued) Interval ◼ Measured on a scale of equal-sized units ◼ Values have order ◼ E.g., temperature in C˚or F˚, calendar dates ◼ No true zero-point Ratio ◼ Inherent zero-point ◼ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K˚ is twice as high as 5 K˚). ◼ e.g., temperature in Kelvin, length, counts, monetary quantities 49 Discrete vs. 
Continuous Attributes
◼ Discrete Attribute
◼ Has only a finite or countably infinite set of values
◼ E.g., zip codes, profession, or the set of words in a collection of documents
◼ Sometimes represented as integer variables
◼ Note: Binary attributes are a special case of discrete attributes
◼ Continuous Attribute
◼ Has real numbers as attribute values
◼ E.g., temperature, height, or weight
◼ Practically, real values can only be measured and represented using a finite number of digits
◼ Continuous attributes are typically represented as floating-point variables 50

Chapter 2: Getting to Know Your Data ◼ Data Objects and Attribute Types ◼ Basic Statistical Descriptions of Data ◼ Data Visualization ◼ Measuring Data Similarity and Dissimilarity ◼ Summary 51

Basic Statistical Descriptions of Data
◼ Motivation
◼ To better understand the data: central tendency, variation, and spread
◼ Data dispersion characteristics
◼ Median, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
◼ Data dispersion: analyzed with multiple granularities of precision
◼ Boxplot or quantile analysis on sorted intervals
◼ Dispersion analysis on computed measures
◼ Folding measures into numerical dimensions
◼ Boxplot or quantile analysis on the transformed cube 52 53

Measuring the Central Tendency
◼ Mean (algebraic measure), sample vs. population (n is the sample size, N the population size):
◼ Sample mean: x̄ = (1/n) Σ x_i; population mean: μ = (Σ x) / N
◼ Weighted arithmetic mean: x̄ = (Σ w_i x_i) / (Σ w_i)
◼ Trimmed mean: chopping extreme values
◼ Median
◼ Middle value if odd number of values, or average of the middle two values otherwise
◼ Estimated by interpolation (for grouped data): median = L_1 + ((n/2 − (Σ freq)_l) / freq_median) × width
◼ Mode
◼ Value that occurs most frequently in the data
◼ Unimodal, bimodal, trimodal
◼ Empirical formula: mean − mode = 3 × (mean − median) 54

Symmetric vs. Skewed Data
◼ Median, mean, and mode of symmetric, positively skewed, and negatively skewed data (figure: positively skewed, symmetric, and negatively skewed distributions)

Measuring the Dispersion of Data 55
◼ Quartiles, outliers, and boxplots
◼ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
◼ Inter-quartile range: IQR = Q3 − Q1
◼ Five-number summary: min, Q1, median, Q3, max
◼ Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
◼ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
◼ Variance and standard deviation (sample: s, population: σ)
◼ Variance (algebraic, scalable computation):
◼ Sample: s² = (1/(n−1)) Σ (x_i − x̄)² = (1/(n−1)) [Σ x_i² − (1/n)(Σ x_i)²]
◼ Population: σ² = (1/N) Σ (x_i − μ)² = (1/N) Σ x_i² − μ²
◼ Standard deviation s (or σ) is the square root of the variance s² (or σ²) 56

Boxplot Analysis
◼ Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
◼ Boxplot
◼ Data is represented with a box
◼ The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
◼ The median is marked by a line within the box
◼ Whiskers: two lines outside the box extended to Minimum and Maximum
◼ Outliers: points beyond a specified outlier threshold, plotted individually

Visualization of Data Dispersion: 3-D Boxplots 57 58

Properties of Normal Distribution Curve
◼ The normal (distribution) curve
◼ From μ−σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
◼ From μ−2σ to μ+2σ: contains about 95% of them
◼ From μ−3σ to μ+3σ: contains about 99.7% of them

Graphic Displays of Basic Statistical Descriptions
◼ Boxplot: graphic display of the five-number summary
◼ Histogram: x-axis shows values, y-axis shows frequencies
◼ Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
◼ Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
◼ Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane 59

Histogram Analysis
◼ Histogram: graph display of tabulated frequencies, shown as bars
◼ It shows what proportion of cases fall into each of several categories
◼ Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts; a crucial distinction when the categories are not of uniform width
◼ The categories are usually specified as non-overlapping intervals of some variable; the categories (bars) must be adjacent (figure: example histogram over intervals from 10,000 to 90,000) 60

Histograms Often Tell More than Boxplots
◼ The two histograms shown on the left may have the same boxplot representation
◼ The same values for min, Q1, median, Q3, max
◼ But they have rather different data distributions 61

Quantile Plot
◼ Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
◼ Plots quantile information
◼ For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i 62

Quantile-Quantile (Q-Q) Plot
◼ Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
◼ View: is there a shift in going from one distribution to another?
◼ Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile.
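The numbers behind these displays are easy to compute directly. Below is a minimal sketch (plain Python, standard library only; the unit-price values are made up for illustration) that derives the five-number summary, the IQR, and the 1.5 × IQR outlier fences that a boxplot is drawn from:

```python
# Minimal sketch: five-number summary and IQR outlier fences
# for a small, made-up sample of unit prices.
prices = [40, 43, 47, 47, 51, 54, 55, 57, 60, 62, 66, 120]

def quantile(sorted_vals, q):
    """Simple linear-interpolation quantile, 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = int(pos), min(int(pos) + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

vals = sorted(prices)
q1, med, q3 = (quantile(vals, q) for q in (0.25, 0.5, 0.75))
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in vals if v < low_fence or v > high_fence]

print("five-number summary:", vals[0], q1, med, q3, vals[-1])
print("IQR:", iqr, "outliers:", outliers)   # 120 falls above the high fence
```

With real data from two branches, the same per-quantile values could be paired to draw the Q-Q plot described above.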
Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2. 63 Scatter plot ◼ ◼ Provides a first look at bivariate data to see clusters of points, outliers, etc Each pair of values is treated as a pair of coordinates and plotted as points in the plane 64 Positively and Negatively Correlated Data ◼ The left half fragment is positively correlated ◼ The right half is negative correlated 65 Uncorrelated Data 66 Chapter 2: Getting to Know Your Data ◼ Data Objects and Attribute Types ◼ Basic Statistical Descriptions of Data ◼ Data Visualization ◼ Measuring Data Similarity and Dissimilarity ◼ Summary 67 Data Visualization ◼ Why data visualization? ◼ ◼ ◼ ◼ ◼ ◼ Gain insight into an information space by mapping data onto graphical primitives Provide qualitative overview of large data sets Search for patterns, trends, structure, irregularities, relationships among data Help find interesting regions and suitable parameters for further quantitative analysis Provide a visual proof of computer representations derived Categorization of visualization methods: ◼ Pixel-oriented visualization techniques ◼ Geometric projection visualization techniques ◼ Icon-based visualization techniques ◼ Hierarchical visualization techniques ◼ Visualizing complex data and relations 68 Pixel-Oriented Visualization Techniques ◼ ◼ ◼ For a data set of m dimensions, create m windows on the screen, one for each dimension The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows The colors of the pixels reflect the corresponding values (a) Income (b) Credit Limit (c) transaction volume (d) age 69 Laying Out Pixels in Circle Segments ◼ To save space and show the connections among multiple dimensions, space filling is often done in a circle segment (a) Representing a data record in circle segment (b) Laying out pixels in circle segment 70 Geometric Projection Visualization Techniques ◼ ◼ Visualization of geometric transformations and projections of the data Methods ◼ Direct visualization ◼ Scatterplot and scatterplot matrices ◼ Landscapes ◼ Projection pursuit technique: Help users find meaningful projections of multidimensional data ◼ Prosection views ◼ Hyperslice ◼ Parallel coordinates 71 Direct Data Visualization Ribbons with Twists Based on Vorticity 72 Data Mining: Concepts and Techniques Used by ermission of M. Ward, Worcester Polytechnic Institute Scatterplot Matrices Matrix of scatterplots (x-y-diagrams) of the k-dim. data [total of (k2/2-k) scatterplots] 73 Used by permission of B. Wright, Visible Decisions Inc. Landscapes ◼ ◼ news articles visualized as a landscape Visualization of the data as perspective landscape The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data 74 Parallel Coordinates ◼ ◼ ◼ n equidistant axes which are parallel to one of the screen axes and correspond to the attributes The axes are scaled to the [minimum, maximum]: range of the corresponding attribute Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute • • • Attr. 1 Attr. 2 Attr. 3 Attr. 
k 75 Parallel Coordinates of a Data Set 76 Icon-Based Visualization Techniques ◼ Visualization of the data values as features of icons ◼ Typical visualization methods ◼ ◼ Chernoff Faces ◼ Stick Figures General techniques ◼ ◼ ◼ Shape coding: Use shape to represent certain information encoding Color icons: Use color icons to encode more information Tile bars: Use small icons to represent the relevant feature vectors in document retrieval 77 Chernoff Faces ◼ ◼ ◼ ◼ 78 A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc. The figure shows faces produced using 10 characteristics--head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening): Each assigned one of 10 possible values, generated using Mathematica (S. Dickson) REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993 Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html Stick Figure A census data figure showing age, income, gender, education, etc. A 5-piece stick figure (1 body and 4 limbs w. different angle/length) Two attributes mapped to axes, remaining attributes mapped to angle or length of limbs”. Look at texture pattern 79 Hierarchical Visualization Techniques ◼ ◼ Visualization of the data using a hierarchical partitioning into subspaces Methods ◼ Dimensional Stacking ◼ Worlds-within-Worlds ◼ Tree-Map ◼ Cone Trees ◼ InfoCube 80 Dimensional Stacking attribute 4 attribute 2 attribute 3 attribute 1 ◼ ◼ ◼ ◼ ◼ 81 Partitioning of the n-dimensional attribute space in 2-D subspaces, which are ‘stacked’ into each other Partitioning of the attribute value ranges into classes. The important attributes should be used on the outer levels. Adequate for data with ordinal attributes of low cardinality But, difficult to display more than nine dimensions Important to map dimensions appropriately Dimensional Stacking Used by permission of M. 
Ward, Worcester Polytechnic Institute Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes 82 Worlds-within-Worlds Assign the function and two most important parameters to innermost world ◼ Fix all other parameters at constant values - draw other (1 or 2 or 3 dimensional worlds choosing these as the axes) ◼ Software that uses this paradigm ◼ ◼ ◼ 83 N–vision: Dynamic interaction through data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer) Auto Visual: Static interaction by means of queries Tree-Map ◼ ◼ Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes) MSR Netscan Image Ack.: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg 84 Tree-Map of a File System (Schneiderman) 85 InfoCube ◼ ◼ A 3-D visualization technique where hierarchical information is displayed as nested semi-transparent cubes The outermost cubes correspond to the top level data, while the subnodes or the lower level data are represented as smaller cubes inside the outermost cubes, and so on 86 Three-D Cone Trees ◼ 3D cone tree visualization technique works well for up to a thousand nodes or so ◼ ◼ ◼ ◼ First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node Cannot avoid overlaps when projected to 2D G. Robertson, J. Mackinlay, S. Card. “Cone Trees: Animated 3D Visualizations of Hierarchical Information”, ACM SIGCHI'91 Graph from Nadeau Software Consulting website: Visualize a social network data set that models the way an infection spreads from one person to the next Ack.: http://nadeausoftware.com/articles/visualization 87 Visualizing Complex Data and Relations ◼ ◼ Visualizing non-numerical data: text and social networks Tag cloud: visualizing user-generated tags The importance of tag is represented by font size/color Besides text data, there are also methods to visualize relationships, such as visualizing social networks ◼ ◼ Newsmap: Google News Stories in 2005 Chapter 2: Getting to Know Your Data ◼ Data Objects and Attribute Types ◼ Basic Statistical Descriptions of Data ◼ Data Visualization ◼ Measuring Data Similarity and Dissimilarity ◼ Summary 89 Similarity and Dissimilarity ◼ ◼ ◼ Similarity ◼ Numerical measure of how alike two data objects are ◼ Value is higher when objects are more alike ◼ Often falls in the range [0,1] Dissimilarity (e.g., distance) ◼ Numerical measure of how different two data objects are ◼ Lower when objects are more alike ◼ Minimum dissimilarity is often 0 ◼ Upper limit varies Proximity refers to a similarity or dissimilarity 90 Data Matrix and Dissimilarity Matrix ◼ ◼ Data matrix ◼ n data points with p dimensions ◼ Two modes Dissimilarity matrix ◼ n data points, but registers only the distance ◼ A triangular matrix ◼ Single mode x11 ... x i1 ... x n1 ... x1f ... ... ... ... ... x if ... ... ... ... ... x nf ... 0 d(2,1) 0 d(3,1) d ( 3,2) 0 : : : d ( n,1) d ( n,2) ... x1p ... x ip ... x np ... 
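The two-mode data matrix and the single-mode, triangular dissimilarity matrix above are straightforward to build in code. Below is a minimal sketch (Python, standard library only; the four 2-D points are illustrative and chosen to match the worked Euclidean-distance example that appears a few slides later):

```python
import math

# Data matrix: n = 4 objects (rows) described by p = 2 attributes (columns).
data = [(1, 2), (3, 5), (2, 0), (4, 5)]   # x1, x2, x3, x4

def euclidean(a, b):
    # L2 distance between two p-dimensional points
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Dissimilarity matrix: single mode, lower triangular, zeros on the diagonal.
n = len(data)
dissim = [[0.0] * (i + 1) for i in range(n)]
for i in range(n):
    for j in range(i):
        dissim[i][j] = euclidean(data[i], data[j])

for i, row in enumerate(dissim, start=1):
    print("x%d" % i, ["%.2f" % v for v in row])
# Last row prints as: x4 ['4.24', '1.00', '5.39', '0.00']
```

Swapping euclidean for another measure (Manhattan, supremum, or one of the nominal/binary dissimilarities on the following slides) leaves the triangular structure unchanged.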
91

Proximity Measure for Nominal Attributes
◼ Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
◼ Method 1: Simple matching
◼ m: # of matches, p: total # of variables; d(i, j) = (p − m) / p
◼ Method 2: Use a large number of binary attributes
◼ Creating a new binary attribute for each of the M nominal states 92 93

Proximity Measure for Binary Attributes
◼ A contingency table for binary data: for objects i and j, q = # of attributes that are 1 for both, r = # that are 1 for i and 0 for j, s = # that are 0 for i and 1 for j, t = # that are 0 for both
◼ Distance measure for symmetric binary variables: d(i, j) = (r + s) / (q + r + s + t)
◼ Distance measure for asymmetric binary variables: d(i, j) = (r + s) / (q + r + s)
◼ Jaccard coefficient (similarity measure for asymmetric binary variables): sim(i, j) = q / (q + r + s)
◼ Note: the Jaccard coefficient is the same as "coherence"

Dissimilarity between Binary Variables
◼ Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

◼ Gender is a symmetric attribute; the remaining attributes are asymmetric binary
◼ Let the values Y and P be 1, and the value N be 0
◼ d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
◼ d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
◼ d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75 94 95

Standardizing Numeric Data
◼ Z-score: z = (x − μ) / σ
◼ x: raw score to be standardized, μ: mean of the population, σ: standard deviation
◼ The distance between the raw score and the population mean in units of the standard deviation
◼ Negative when the raw score is below the mean, positive when above
◼ An alternative way: calculate the mean absolute deviation s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|), where m_f = (1/n)(x_1f + x_2f + … + x_nf)
◼ Standardized measure (z-score): z_if = (x_if − m_f) / s_f
◼ Using the mean absolute deviation is more robust than using the standard deviation

Example: Data Matrix and Dissimilarity Matrix
◼ Data Matrix

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

◼ Dissimilarity Matrix (with Euclidean distance)

      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

96

Distance on Numeric Data: Minkowski Distance
◼ Minkowski distance: a popular distance measure, d(i, j) = (|x_i1 − x_j1|^h + |x_i2 − x_j2|^h + … + |x_ip − x_jp|^h)^(1/h), where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
◼ Properties
◼ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
◼ d(i, j) = d(j, i) (symmetry)
◼ d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
◼ A distance that satisfies these properties is a metric 97 98

Special Cases of Minkowski Distance
◼ h = 1: Manhattan (city block, L1 norm) distance: d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
◼ E.g., the Hamming distance: the number of bits that are different between two binary vectors
◼ h = 2: Euclidean (L2 norm) distance: d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)
◼ h → ∞: "supremum" (Lmax norm, L∞ norm) distance.
◼ This is the maximum difference between any component (attribute) of the vectors Example: Minkowski Distance Dissimilarity Matrices point x1 x2 x3 x4 attribute 1 attribute 2 1 2 3 5 2 0 4 5 Manhattan (L1) L x1 x2 x3 x4 x1 0 5 3 6 x2 x3 x4 0 6 1 0 7 0 x2 x3 x4 Euclidean (L2) L2 x1 x2 x3 x4 x1 0 3.61 2.24 4.24 0 5.1 1 0 5.39 0 Supremum L x1 x2 x3 x4 x1 x2 0 3 2 3 x3 0 5 1 x4 0 5 0 99 Ordinal Variables ◼ An ordinal variable can be discrete or continuous ◼ Order is important, e.g., rank ◼ Can be treated like interval-scaled rif {1,..., M f } ◼ replace xif by their rank ◼ map the range of each variable onto [0, 1] by replacing i-th object in the f-th variable by zif ◼ rif −1 = M f −1 compute the dissimilarity using methods for intervalscaled variables 100 101 Attributes of Mixed Type ◼ ◼ A database may contain all attribute types ◼ Nominal, symmetric binary, asymmetric binary, numeric, ordinal One may use a weighted formula to combine their effects pf = 1 ij( f ) dij( f ) d (i, j) = pf = 1 ij( f ) ◼ ◼ ◼ f is binary or nominal: dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise f is numeric: use the normalized distance f is ordinal ◼ Compute ranks rif and zif = r − 1 M −1 ◼ Treat zif as interval-scaled if f Cosine Similarity ◼ ◼ ◼ ◼ A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document. Other vector objects: gene features in micro-arrays, … Applications: information retrieval, biologic taxonomy, gene feature mapping, ... Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| , where • indicates vector dot product, ||d||: the length of vector d 102 Example: Cosine Similarity ◼ ◼ cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| , where • indicates vector dot product, ||d|: the length of vector d Ex: Find the similarity between documents 1 and 2. d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1) d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25 ||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481 ||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12 cos(d1, d2 ) = 0.94 103 Chapter 2: Getting to Know Your Data ◼ Data Objects and Attribute Types ◼ Basic Statistical Descriptions of Data ◼ Data Visualization ◼ Measuring Data Similarity and Dissimilarity ◼ Summary 104 Summary ◼ Data attribute types: nominal, binary, ordinal, interval-scaled, ratioscaled ◼ Many types of data sets, e.g., numerical, text, graph, Web, image. ◼ Gain insight into the data by: ◼ Basic statistical data description: central tendency, dispersion, graphical displays ◼ Data visualization: map data onto graphical primitives ◼ Measure data similarity ◼ Above steps are the beginning of data preprocessing. ◼ Many methods have been developed but still an active area of research. 105 References ◼ W. Cleveland, Visualizing Data, Hobart Press, 1993 ◼ T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003 ◼ U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001 ◼ L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990. ◼ H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Tech. Committee on Data Eng., 20(4), Dec. 1997 ◼ D. A. Keim. Information visualization and visual data mining, IEEE trans. 
on Visualization and Computer Graphics, 8(1), 2002 ◼ D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999 ◼ S. Santini and R. Jain,” Similarity measures”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(9), 1999 ◼ E. R. Tufte. The Visual Display of Quantitative Information, 2nd ed., Graphics Press, 2001 ◼ C. Yu , et al., Visual data mining of multimedia data for social and behavioral studies, Information Visualization, 8(1), 2009 106 Data Mining: Concepts and Techniques (3rd ed.) — Chapter 3 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved. 107 Chapter 3: Data Preprocessing ◼ Data Preprocessing: An Overview ◼ Data Quality ◼ Major Tasks in Data Preprocessing ◼ Data Cleaning ◼ Data Integration ◼ Data Reduction ◼ Data Transformation and Data Discretization ◼ Summary 108 Data Quality: Why Preprocess the Data? ◼ Measures for data quality: A multidimensional view ◼ Accuracy: correct or wrong, accurate or not ◼ Completeness: not recorded, unavailable, … ◼ Consistency: some modified but some not, dangling, … ◼ Timeliness: timely update? ◼ Believability: how trustable the data are correct? ◼ Interpretability: how easily the data can be understood? 109 Major Tasks in Data Preprocessing ◼ Data cleaning ◼ ◼ Data integration ◼ ◼ ◼ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Integration of multiple databases, data cubes, or files Data reduction ◼ Dimensionality reduction ◼ Numerosity reduction ◼ Data compression Data transformation and data discretization ◼ Normalization ◼ Concept hierarchy generation 110 Chapter 3: Data Preprocessing ◼ Data Preprocessing: An Overview ◼ Data Quality ◼ Major Tasks in Data Preprocessing ◼ Data Cleaning ◼ Data Integration ◼ Data Reduction ◼ Data Transformation and Data Discretization ◼ Summary 111 Data Cleaning ◼ Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty, human or computer error, transmission error ◼ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data ◼ ◼ noisy: containing noise, errors, or outliers ◼ ◼ ◼ e.g., Occupation=“ ” (missing data) e.g., Salary=“−10” (an error) inconsistent: containing discrepancies in codes or names, e.g., ◼ Age=“42”, Birthday=“03/07/2010” ◼ Was rating “1, 2, 3”, now rating “A, B, C” ◼ discrepancy between duplicate records Intentional (e.g., disguised missing data) ◼ Jan. 1 as everyone’s birthday? 112 Incomplete (Missing) Data ◼ Data is not always available ◼ ◼ Missing data may be due to ◼ equipment malfunction ◼ inconsistent with other recorded data and thus deleted ◼ data not entered due to misunderstanding ◼ ◼ ◼ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data certain data may not be considered important at the time of entry not register history or changes of the data Missing data may need to be inferred 113 How to Handle Missing Data? ◼ Ignore the tuple: usually done when class label is missing (when doing classification)—not effective when the % of missing values per attribute varies considerably ◼ Fill in the missing value manually: tedious + infeasible? ◼ Fill in it automatically with ◼ a global constant : e.g., “unknown”, a new class?! 
◼ the attribute mean ◼ ◼ the attribute mean for all samples belonging to the same class: smarter the most probable value: inference-based such as Bayesian formula or decision tree 114 Noisy Data ◼ ◼ ◼ Noise: random error or variance in a measured variable Incorrect attribute values may be due to ◼ faulty data collection instruments ◼ data entry problems ◼ data transmission problems ◼ technology limitation ◼ inconsistency in naming convention Other data problems which require data cleaning ◼ duplicate records ◼ incomplete data ◼ inconsistent data 115 How to Handle Noisy Data? ◼ ◼ ◼ ◼ Binning ◼ first sort data and partition into (equal-frequency) bins ◼ then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. Regression ◼ smooth by fitting the data into regression functions Clustering ◼ detect and remove outliers Combined computer and human inspection ◼ detect suspicious values and check by human (e.g., deal with possible outliers) 116 Data Cleaning as a Process ◼ ◼ ◼ Data discrepancy detection ◼ Use metadata (e.g., domain, range, dependency, distribution) ◼ Check field overloading ◼ Check uniqueness rule, consecutive rule and null rule ◼ Use commercial tools ◼ Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections ◼ Data auditing: by analyzing data to discover rules and relationship to detect violators (e.g., correlation and clustering to find outliers) Data migration and integration ◼ Data migration tools: allow transformations to be specified ◼ ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface Integration of the two processes ◼ Iterative and interactive (e.g., Potter’s Wheels) 117 Chapter 3: Data Preprocessing ◼ Data Preprocessing: An Overview ◼ Data Quality ◼ Major Tasks in Data Preprocessing ◼ Data Cleaning ◼ Data Integration ◼ Data Reduction ◼ Data Transformation and Data Discretization ◼ Summary 118 Data Integration ◼ Data integration: ◼ ◼ Schema integration: e.g., A.cust-id B.cust-# ◼ ◼ Combines data from multiple sources into a coherent store Integrate metadata from different sources Entity identification problem: ◼ Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton ◼ Detecting and resolving data value conflicts ◼ For the same real world entity, attribute values from different sources are different ◼ Possible reasons: different representations, different scales, e.g., metric vs. 
British units 119 Handling Redundancy in Data Integration ◼ Redundant data occur often when integration of multiple databases ◼ Object identification: The same attribute or object may have different names in different databases ◼ Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue ◼ ◼ Redundant attributes may be able to be detected by correlation analysis and covariance analysis Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality 120 Correlation Analysis (Nominal Data) ◼ Χ2 (chi-square) test 2 ( Observed − Expected ) 2 = Expected ◼ ◼ ◼ The larger the Χ2 value, the more likely the variables are related The cells that contribute the most to the Χ2 value are those whose actual count is very different from the expected count Correlation does not imply causality ◼ # of hospitals and # of car-theft in a city are correlated ◼ Both are causally linked to the third variable: population 121 Chi-Square Calculation: An Example ◼ Play chess Not play chess Sum (row) Like science fiction 250(90) 200(360) 450 Not like science fiction 50(210) 1000(840) 1050 Sum(col.) 300 1200 1500 Χ2 (chi-square) calculation (numbers in parenthesis are expected counts calculated based on the data distribution in the two categories) (250 − 90) 2 (50 − 210) 2 (200 − 360) 2 (1000 − 840) 2 = + + + = 507.93 90 210 360 840 2 ◼ It shows that like_science_fiction and play_chess are correlated in the group 122 Correlation Analysis (Numeric Data) ◼ Correlation coefficient (also called Pearson’s product moment coefficient) i=1 (ai − A)(bi − B) n rA, B = (n − 1) A B = n i =1 (ai bi ) − n AB (n − 1) A B where n is the number of tuples, A and B are the respective means of A and B, σA and σB are the respective standard deviation of A and B, and Σ(aibi) is the sum of the AB cross-product. ◼ ◼ If rA,B > 0, A and B are positively correlated (A’s values increase as B’s). The higher, the stronger correlation. rA,B = 0: independent; rAB < 0: negatively correlated 123 Visually Evaluating Correlation Scatter plots showing the similarity from –1 to 1. 124 Correlation (viewed as linear relationship) ◼ ◼ Correlation measures the linear relationship between objects To compute correlation, we standardize data objects, A and B, and then take their dot product a'k = (ak − mean( A)) / std ( A) b'k = (bk − mean( B)) / std ( B) correlatio n( A, B) = A'• B' 125 Covariance (Numeric Data) ◼ Covariance is similar to correlation Correlation coefficient: where n is the number of tuples, A and B are the respective mean or expected values of A and B, σA and σB are the respective standard deviation of A and B. ◼ ◼ ◼ Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected values. Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to be smaller than its expected value. Independence: CovA,B = 0 but the converse is not true: ◼ Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence126 Co-Variance: An Example ◼ It can be simplified in computation as ◼ Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14). ◼ Question: If the stocks are affected by the same industry trends, will their prices rise or fall together? 
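Before the worked numbers that follow, here is a minimal sketch of the same calculation (Python, standard library only), using the simplified form Cov(A, B) = E(A·B) − E(A)·E(B) applied to the five (A, B) price pairs just given:

```python
# Minimal sketch: covariance via the shortcut Cov(A,B) = E(AB) - E(A)E(B)
# on the five (A, B) stock-price pairs from the example above.
pairs = [(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)]

n = len(pairs)
mean_a = sum(a for a, _ in pairs) / n          # E(A) = 4.0
mean_b = sum(b for _, b in pairs) / n          # E(B) = 9.6
mean_ab = sum(a * b for a, b in pairs) / n     # E(AB) = 42.4

cov = mean_ab - mean_a * mean_b
print("Cov(A, B) =", cov)   # 4.0 > 0: the two prices tend to rise together
```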
◼ ◼ E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4 ◼ E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6 ◼ Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4 Thus, A and B rise together since Cov(A, B) > 0. Chapter 3: Data Preprocessing ◼ Data Preprocessing: An Overview ◼ Data Quality ◼ Major Tasks in Data Preprocessing ◼ Data Cleaning ◼ Data Integration ◼ Data Reduction ◼ Data Transformation and Data Discretization ◼ Summary 128 Data Reduction Strategies ◼ ◼ ◼ Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results Why data reduction? — A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set. Data reduction strategies ◼ Dimensionality reduction, e.g., remove unimportant attributes ◼ Wavelet transforms ◼ Principal Components Analysis (PCA) ◼ Feature subset selection, feature creation ◼ Numerosity reduction (some simply call it: Data Reduction) ◼ Regression and Log-Linear Models ◼ Histograms, clustering, sampling ◼ Data cube aggregation ◼ Data compression 129 Data Reduction 1: Dimensionality Reduction ◼ Curse of dimensionality ◼ ◼ ◼ ◼ ◼ When dimensionality increases, data becomes increasingly sparse Density and distance between points, which is critical to clustering, outlier analysis, becomes less meaningful The possible combinations of subspaces will grow exponentially Dimensionality reduction ◼ Avoid the curse of dimensionality ◼ Help eliminate irrelevant features and reduce noise ◼ Reduce time and space required in data mining ◼ Allow easier visualization Dimensionality reduction techniques ◼ Wavelet transforms ◼ Principal Component Analysis ◼ Supervised and nonlinear techniques (e.g., feature selection) 130 Mapping Data to a New Space ◼ ◼ Fourier transform Wavelet transform Two Sine Waves Two Sine Waves + Noise Frequency 131 What Is Wavelet Transform? ◼ Decomposes a signal into different frequency subbands ◼ ◼ ◼ ◼ Applicable to ndimensional signals Data are transformed to preserve relative distance between objects at different levels of resolution Allow natural clusters to become more distinguishable Used for image compression 132 Wavelet Transformation Haar2 ◼ ◼ ◼ ◼ Discrete wavelet transform (DWT) for linear signal processing, multi-resolution analysis Daubechie4 Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space Method: ◼ Length, L, must be an integer power of 2 (padding with 0’s, when necessary) ◼ Each transform has 2 functions: smoothing, difference ◼ Applies to pairs of data, resulting in two set of data of length L/2 ◼ Applies two functions recursively, until reaches the desired length 133 Wavelet Decomposition ◼ ◼ ◼ Wavelets: A math tool for space-efficient hierarchical decomposition of functions S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [23/4, -11/4, 1/2, 0, 0, -1, -1, 0] Compression: many small detail coefficients can be replaced by 0’s, and only the significant coefficients are retained 134 Haar Wavelet Coefficients Coefficient “Supports” Hierarchical 2.75 decomposition structure (a.k.a. + “error tree”) + -1.25 0.5 + + 2 0 + 2 0 + -1 -1 2 3 0.5 0 - - + + 0 0 - + 5 - + -1.25 - + + 2.75 4 Original frequency distribution - 0 4 -1 -1 0 + - + - + - + 135 Why Wavelet Transform? 
◼ ◼ ◼ ◼ ◼ Use hat-shape filters ◼ Emphasize region where points cluster ◼ Suppress weaker information in their boundaries Effective removal of outliers ◼ Insensitive to noise, insensitive to input order Multi-resolution ◼ Detect arbitrary shaped clusters at different scales Efficient ◼ Complexity O(N) Only applicable to low dimensional data 136 Principal Component Analysis (PCA) ◼ ◼ Find a projection that captures the largest amount of variation in data The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space x2 e x1 137 Principal Component Analysis (Steps) ◼ Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components) that can be best used to represent data ◼ Normalize input data: Each attribute falls within the same range ◼ Compute k orthonormal (unit) vectors, i.e., principal components ◼ ◼ ◼ ◼ Each input data (vector) is a linear combination of the k principal component vectors The principal components are sorted in order of decreasing “significance” or strength Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (i.e., using the strongest principal components, it is possible to reconstruct a good approximation of the original data) Works for numeric data only 138 Attribute Subset Selection ◼ Another way to reduce dimensionality of data ◼ Redundant attributes ◼ ◼ ◼ Duplicate much or all of the information contained in one or more other attributes E.g., purchase price of a product and the amount of sales tax paid Irrelevant attributes ◼ ◼ Contain no information that is useful for the data mining task at hand E.g., students' ID is often irrelevant to the task of predicting students' GPA 139 Heuristic Search in Attribute Selection ◼ ◼ There are 2d possible attribute combinations of d attributes Typical heuristic attribute selection methods: ◼ Best single attribute under the attribute independence assumption: choose by significance tests ◼ Best step-wise feature selection: ◼ The best single-attribute is picked first ◼ Then next best attribute condition to the first, ... 
◼ Step-wise attribute elimination: ◼ Repeatedly eliminate the worst attribute ◼ Best combined attribute selection and elimination ◼ Optimal branch and bound: ◼ Use attribute elimination and backtracking 140 Attribute Creation (Feature Generation) ◼ ◼ Create new attributes (features) that can capture the important information in a data set more effectively than the original ones Three general methodologies ◼ Attribute extraction ◼ Domain-specific ◼ Mapping data to new space (see: data reduction) ◼ E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered) ◼ Attribute construction ◼ Combining features (see: discriminative frequent patterns in Chapter 7) ◼ Data discretization 141 Data Reduction 2: Numerosity Reduction ◼ ◼ ◼ Reduce data volume by choosing alternative, smaller forms of data representation Parametric methods (e.g., regression) ◼ Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers) ◼ Ex.: Log-linear models—obtain value at a point in mD space as the product on appropriate marginal subspaces Non-parametric methods ◼ Do not assume models ◼ Major families: histograms, clustering, sampling, … 142 Parametric Data Reduction: Regression and Log-Linear Models ◼ ◼ ◼ Linear regression ◼ Data modeled to fit a straight line ◼ Often uses the least-square method to fit the line Multiple regression ◼ Allows a response variable Y to be modeled as a linear function of multidimensional feature vector Log-linear model ◼ Approximates discrete multidimensional probability distributions 143 y Regression Analysis Y1 ◼ Regression analysis: A collective name for techniques for the modeling and analysis Y1’ y=x+1 of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka. explanatory variables or predictors) ◼ ◼ The parameters are estimated so as to give a "best fit" of the data ◼ Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used X1 x Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships 144 Regress Analysis and Log-Linear Models ◼ Linear regression: Y = w X + b ◼ ◼ Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand Using the least squares criterion to the known values of Y1, Y2, …, X1, X2, …. 
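◼ For concreteness, a small sketch of the least-squares estimates of the two coefficients: the closed-form solution is w = cov(X, Y) / var(X) and b = mean(Y) − w · mean(X); the data points below are made up:

```python
# Sketch: least-squares estimates of the two coefficients in Y = w X + b.
# Data values are made up for illustration.

def least_squares_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx                  # slope
    b = mean_y - w * mean_x        # intercept
    return w, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]
w, b = least_squares_line(xs, ys)
print(round(w, 3), round(b, 3))    # approximately 0.99 1.05
```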
◼ Multiple regression: Y = b0 + b1 X1 + b2 X2 ◼ ◼ Many nonlinear functions can be transformed into the above Log-linear models: ◼ ◼ ◼ Approximate discrete multidimensional probability distributions Estimate the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations Useful for dimensionality reduction and data smoothing 145 Histogram Analysis ◼ ◼ Divide data into buckets and 40 store average (sum) for each 35 bucket Partitioning rules: 30 25 ◼ ◼ Equal-width: equal bucket 20 range Equal-frequency (or equal-15 depth) 10 5 0 10000 30000 50000 70000 90000 146 Clustering ◼ ◼ ◼ ◼ ◼ Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only Can be very effective if data is clustered but not if data is “smeared” Can have hierarchical clustering and be stored in multidimensional index tree structures There are many choices of clustering definitions and clustering algorithms Cluster analysis will be studied in depth in Chapter 10 147 Sampling ◼ ◼ ◼ Sampling: obtaining a small sample s to represent the whole data set N Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data Key principle: Choose a representative subset of the data ◼ ◼ ◼ Simple random sampling may have very poor performance in the presence of skew Develop adaptive sampling methods, e.g., stratified sampling: Note: Sampling may not reduce database I/Os (page at a time) 148 Types of Sampling ◼ ◼ ◼ ◼ Simple random sampling ◼ There is an equal probability of selecting any particular item Sampling without replacement ◼ Once an object is selected, it is removed from the population Sampling with replacement ◼ A selected object is not removed from the population Stratified sampling: ◼ Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data) ◼ Used in conjunction with skewed data 149 Sampling: With or without Replacement Raw Data 150 Sampling: Cluster or Stratified Sampling Raw Data Cluster/Stratified Sample 151 Data Cube Aggregation ◼ ◼ The lowest level of a data cube (base cuboid) ◼ The aggregated data for an individual entity of interest ◼ E.g., a customer in a phone calling data warehouse Multiple levels of aggregation in data cubes ◼ ◼ Reference appropriate levels ◼ ◼ Further reduce the size of data to deal with Use the smallest representation which is enough to solve the task Queries regarding aggregated information should be answered using data cube, when possible 152 Data Reduction 3: Data Compression ◼ ◼ ◼ ◼ String compression ◼ There are extensive theories and well-tuned algorithms ◼ Typically lossless, but only limited manipulation is possible without expansion Audio/video compression ◼ Typically lossy compression, with progressive refinement ◼ Sometimes small fragments of signal can be reconstructed without reconstructing the whole Time sequence is not audio ◼ Typically short and vary slowly with time Dimensionality and numerosity reduction may also be considered as forms of data compression 153 Data Compression Compressed Data Original Data lossless Original Data Approximated 154 Chapter 3: Data Preprocessing ◼ Data Preprocessing: An Overview ◼ Data Quality ◼ Major Tasks in Data Preprocessing ◼ Data Cleaning ◼ Data Integration ◼ Data Reduction ◼ Data Transformation and Data Discretization ◼ Summary 155 Data Transformation ◼ ◼ A function that maps the 
entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values ◼ Methods ◼ Smoothing: Remove noise from data ◼ Attribute/feature construction ◼ New attributes constructed from the given ones ◼ Aggregation: Summarization, data cube construction ◼ Normalization: Scaled to fall within a smaller, specified range ◼ min-max normalization ◼ z-score normalization ◼ normalization by decimal scaling ◼ Discretization: Concept hierarchy climbing 156 Normalization ◼ Min-max normalization: to [new_minA, new_maxA]: v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA ◼ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716 ◼ Z-score normalization (μA: mean, σA: standard deviation): v' = (v − μA) / σA ◼ Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225 ◼ Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1 157 Discretization ◼ Three types of attributes ◼ Nominal—values from an unordered set, e.g., color, profession ◼ Ordinal—values from an ordered set, e.g., military or academic rank ◼ Numeric—real numbers, e.g., integer or real values ◼ Discretization: Divide the range of a continuous attribute into intervals ◼ Interval labels can then be used to replace actual data values ◼ Reduce data size by discretization ◼ Supervised vs. unsupervised ◼ Split (top-down) vs. merge (bottom-up) ◼ Discretization can be performed recursively on an attribute ◼ Prepare for further analysis, e.g., classification 158 Data Discretization Methods ◼ Typical methods: All the methods can be applied recursively ◼ Binning ◼ Top-down split, unsupervised ◼ Histogram analysis ◼ Top-down split, unsupervised ◼ Clustering analysis (unsupervised, top-down split or bottom-up merge) ◼ Decision-tree analysis (supervised, top-down split) ◼ Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge) 159 Simple Discretization: Binning ◼ Equal-width (distance) partitioning ◼ Divides the range into N intervals of equal size: uniform grid ◼ If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N ◼ The most straightforward, but outliers may dominate the presentation ◼ Skewed data is not handled well ◼ Equal-depth (frequency) partitioning ◼ Divides the range into N intervals, each containing approximately the same number of samples ◼ Good data scaling ◼ Managing categorical attributes can be tricky 160 Binning Methods for Data Smoothing ◼ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into equal-frequency (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34 161 Labels (Binning vs. Clustering): comparison of the labels produced by equal-frequency binning, equal-interval-width binning, and K-means clustering on the same data; K-means clustering leads to better results 162 Discretization by Classification & Correlation Analysis ◼ Classification (e.g., decision tree analysis) ◼ Supervised: Given class labels, e.g., cancerous vs.
benign ◼ Using entropy to determine split point (discretization point) ◼ Top-down, recursive split ◼ Details to be covered in Chapter 7 Correlation analysis (e.g., Chi-merge: χ2-based discretization) ◼ Supervised: use class information ◼ Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ2 values) to merge ◼ Merge performed recursively, until a predefined stopping condition 163 Concept Hierarchy Generation ◼ ◼ ◼ ◼ ◼ Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse Concept hierarchies facilitate drilling and rolling in data warehouses to view data in multiple granularity Concept hierarchy formation: Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as youth, adult, or senior) Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers Concept hierarchy can be automatically formed for both numeric and nominal data. For numeric data, use discretization methods shown. 164 Concept Hierarchy Generation for Nominal Data ◼ Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts ◼ ◼ Specification of a hierarchy for a set of values by explicit data grouping ◼ ◼ {Urbana, Champaign, Chicago} < Illinois Specification of only a partial set of attributes ◼ ◼ street < city < state < country E.g., only street < city, not others Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values ◼ E.g., for a set of attributes: {street, city, state, country} 165 Automatic Concept Hierarchy Generation ◼ Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set ◼ The attribute with the most distinct values is placed at the lowest level of the hierarchy ◼ Exceptions, e.g., weekday, month, quarter, year country 15 distinct values province_or_ state 365 distinct values city 3567 distinct values street 674,339 distinct values 166 Chapter 3: Data Preprocessing ◼ Data Preprocessing: An Overview ◼ Data Quality ◼ Major Tasks in Data Preprocessing ◼ Data Cleaning ◼ Data Integration ◼ Data Reduction ◼ Data Transformation and Data Discretization ◼ Summary 167 Summary ◼ ◼ ◼ ◼ ◼ Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability Data cleaning: e.g. missing/noisy values, outliers Data integration from multiple sources: ◼ Entity identification problem ◼ Remove redundancies ◼ Detect inconsistencies Data reduction ◼ Dimensionality reduction ◼ Numerosity reduction ◼ Data compression Data transformation and data discretization ◼ Normalization ◼ Concept hierarchy generation 168 References ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of ACM, 42:73-78, 1999 A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. IEEE Spectrum, Oct 1996 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003 J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997. H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. VLDB'01 M. Hua and J. Pei. Cleaning disguised missing data: A heuristic approach. KDD'07 H. V. 
Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997 H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining Perspective. Kluwer Academic, 1998 J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003 D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999 V. Raman and J. Hellerstein. Potters Wheel: An Interactive Framework for Data Cleaning and Transformation, VLDB’2001 T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001 R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995 169 Data Mining: Concepts and Techniques (3rd ed.) — Chapter 4 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved. 170 Chapter 4: Data Warehousing and On-line Analytical Processing ◼ Data Warehouse: Basic Concepts ◼ Data Warehouse Modeling: Data Cube and OLAP ◼ Data Warehouse Design and Usage ◼ Data Warehouse Implementation ◼ Data Generalization by Attribute-Oriented Induction ◼ Summary 171 What is a Data Warehouse? ◼ Defined in many different ways, but not rigorously. ◼ A decision support database that is maintained separately from the organization’s operational database ◼ Support information processing by providing a solid platform of consolidated, historical data for analysis. ◼ “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon ◼ Data warehousing: ◼ The process of constructing and using data warehouses 172 Data Warehouse—Subject-Oriented ◼ Organized around major subjects, such as customer, product, sales ◼ Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing ◼ Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process 173 Data Warehouse—Integrated ◼ ◼ Constructed by integrating multiple, heterogeneous data sources ◼ relational databases, flat files, on-line transaction records Data cleaning and data integration techniques are applied. ◼ Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources ◼ ◼ E.g., Hotel price: currency, tax, breakfast covered, etc. When data is moved to the warehouse, it is converted. 174 Data Warehouse—Time Variant ◼ The time horizon for the data warehouse is significantly longer than that of operational systems ◼ ◼ ◼ Operational database: current value data Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) Every key structure in the data warehouse ◼ ◼ Contains an element of time, explicitly or implicitly But the key of operational data may or may not contain “time element” 175 Data Warehouse—Nonvolatile ◼ A physically separate store of data transformed from the operational environment ◼ Operational update of data does not occur in the data warehouse environment ◼ Does not require transaction processing, recovery, and concurrency control mechanisms ◼ Requires only two operations in data accessing: ◼ initial loading of data and access of data 176 OLTP vs. 
OLAP OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date detailed, flat relational isolated repetitive historical, summarized, multidimensional integrated, consolidated ad-hoc lots of scans unit of work read/write index/hash on prim. key short, simple transaction # records accessed tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response usage access complex query 177 Why a Separate Data Warehouse? ◼ High performance for both systems ◼ ◼ ◼ Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation Different functions and different data: ◼ ◼ ◼ ◼ DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery missing data: Decision support requires historical data which operational DBs do not typically maintain data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled Note: There are more and more systems which perform OLAP analysis directly on relational databases 178 Data Warehouse: A Multi-Tiered Architecture Other sources Operational DBs Metadata Extract Transform Load Refresh Monitor & Integrator Data Warehouse OLAP Server Serve Analysis Query Reports Data mining Data Marts Data Sources Data Storage OLAP Engine Front-End Tools 179 Three Data Warehouse Models ◼ ◼ Enterprise warehouse ◼ collects all of the information about subjects spanning the entire organization Data Mart ◼ a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart ◼ ◼ Independent vs. dependent (directly from warehouse) data mart Virtual warehouse ◼ A set of views over operational databases ◼ Only some of the possible summary views may be materialized 180 Extraction, Transformation, and Loading (ETL) ◼ ◼ ◼ ◼ ◼ Data extraction ◼ get data from multiple, heterogeneous, and external sources Data cleaning ◼ detect errors in the data and rectify them when possible Data transformation ◼ convert data from legacy or host format to warehouse format Load ◼ sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions Refresh ◼ propagate the updates from the data sources to the warehouse 181 Metadata Repository ◼ Meta data is the data defining warehouse objects. 
It stores: ◼ Description of the structure of the data warehouse ◼ ◼ schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents Operational meta-data ◼ data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails) ◼ The algorithms used for summarization ◼ The mapping from operational environment to the data warehouse ◼ ◼ Data related to system performance ◼ warehouse schema, view and derived data definitions Business data ◼ business terms and definitions, ownership of data, charging policies 182 Chapter 4: Data Warehousing and On-line Analytical Processing ◼ Data Warehouse: Basic Concepts ◼ Data Warehouse Modeling: Data Cube and OLAP ◼ Data Warehouse Design and Usage ◼ Data Warehouse Implementation ◼ Data Generalization by Attribute-Oriented Induction ◼ Summary 183 From Tables and Spreadsheets to Data Cubes ◼ A data warehouse is based on a multidimensional data model which views data in the form of a data cube ◼ A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions ◼ Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) ◼ Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables ◼ In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube. 184 Cube: A Lattice of Cuboids all time 0-D (apex) cuboid item time,location time,item location supplier item,location time,supplier 1-D cuboids location,supplier 2-D cuboids item,supplier time,location,supplier 3-D cuboids time,item,location time,item,supplier item,location,supplier 4-D (base) cuboid time, item, location, supplier 185 Conceptual Modeling of Data Warehouses ◼ Modeling data warehouses: dimensions & measures ◼ Star schema: A fact table in the middle connected to a set of dimension tables ◼ Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake ◼ Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation 186 Example of Star Schema time item time_key day day_of_the_week month quarter year Sales Fact Table time_key item_key branch_key branch location_key branch_key branch_name branch_type units_sold dollars_sold avg_sales item_key item_name brand type supplier_type location location_key street city state_or_province country Measures 187 Example of Snowflake Schema time time_key day day_of_the_week month quarter year item Sales Fact Table time_key item_key branch_key branch location_key branch_key branch_name branch_type units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_key supplier supplier_key supplier_type location location_key street city_key city city_key city state_or_province country 188 Example of Fact Constellation time time_key day day_of_the_week month quarter year item Sales Fact Table time_key item_key item_name brand type supplier_type item_key location_key branch_key branch_name branch_type units_sold dollars_sold avg_sales Measures time_key item_key shipper_key from_location branch_key branch Shipping Fact Table location to_location location_key street city 
province_or_state country dollars_cost units_shipped shipper shipper_key shipper_name location_key shipper_type 189 A Concept Hierarchy: Dimension (location) all all Europe region country city office Germany Frankfurt ... ... ... Spain North_America Canada Vancouver ... L. Chan ... ... Mexico Toronto M. Wind 190 Data Cube Measures: Three Categories ◼ Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning ◼ ◼ Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function ◼ ◼ E.g., count(), sum(), min(), max() E.g., avg(), min_N(), standard_deviation() Holistic: if there is no constant bound on the storage size needed to describe a subaggregate. ◼ E.g., median(), mode(), rank() 191 View of Warehouses and Hierarchies Specification of hierarchies ◼ Schema hierarchy day < {month < quarter; week} < year ◼ Set_grouping hierarchy {1..10} < inexpensive 192 Multidimensional Data Sales volume as a function of product, month, and region Dimensions: Product, Location, Time Hierarchical summarization paths Industry Region Year Category Country Quarter Product ◼ Product City Office Month Week Day Month 193 A Sample Data Cube 2Qtr 3Qtr 4Qtr sum U.S.A Canada Mexico Country TV PC VCR sum 1Qtr Date Total annual sales of TVs in U.S.A. sum 194 Cuboids Corresponding to the Cube all 0-D (apex) cuboid product product,date date country product,country 1-D cuboids date, country 2-D cuboids 3-D (base) cuboid product, date, country 195 Typical OLAP Operations ◼ Roll up (drill-up): summarize data ◼ by climbing up hierarchy or by dimension reduction ◼ Drill down (roll down): reverse of roll-up ◼ from higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and dice: project and select ◼ Pivot (rotate): ◼ ◼ ◼ reorient the cube, visualization, 3D to series of 2D planes Other operations ◼ ◼ drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its back-end relational tables (using SQL) 196 Fig. 
3.10 Typical OLAP Operations 197 A Star-Net Query Model Customer Orders Shipping Method Customer CONTRACTS AIR-EXPRESS ORDER TRUCK PRODUCT LINE Time Product ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP CITY SALES PERSON COUNTRY DISTRICT REGION Location Each circle is called a footprint DIVISION Promotion Organization 198 Browsing a Data Cube ◼ ◼ ◼ Visualization OLAP capabilities Interactive manipulation 199 Chapter 4: Data Warehousing and On-line Analytical Processing ◼ Data Warehouse: Basic Concepts ◼ Data Warehouse Modeling: Data Cube and OLAP ◼ Data Warehouse Design and Usage ◼ Data Warehouse Implementation ◼ Data Generalization by Attribute-Oriented Induction ◼ Summary 200 Design of Data Warehouse: A Business Analysis Framework ◼ Four views regarding the design of a data warehouse ◼ Top-down view ◼ ◼ Data source view ◼ ◼ exposes the information being captured, stored, and managed by operational systems Data warehouse view ◼ ◼ allows selection of the relevant information necessary for the data warehouse consists of fact tables and dimension tables Business query view ◼ sees the perspectives of data in the warehouse from the view of end-user 201 Data Warehouse Design Process ◼ ◼ Top-down, bottom-up approaches or a combination of both ◼ Top-down: Starts with overall design and planning (mature) ◼ Bottom-up: Starts with experiments and prototypes (rapid) From software engineering point of view ◼ ◼ ◼ Waterfall: structured and systematic analysis at each step before proceeding to the next Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around Typical data warehouse design process ◼ Choose a business process to model, e.g., orders, invoices, etc. ◼ Choose the grain (atomic level of data) of the business process ◼ Choose the dimensions that will apply to each fact table record ◼ Choose the measure that will populate each fact table record 202 Data Warehouse Development: A Recommended Approach Multi-Tier Data Warehouse Distributed Data Marts Data Mart Data Mart Model refinement Enterprise Data Warehouse Model refinement Define a high-level corporate data model 203 Data Warehouse Usage ◼ Three kinds of data warehouse applications ◼ Information processing ◼ ◼ ◼ supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical processing ◼ multidimensional analysis of data warehouse data ◼ supports basic OLAP operations, slice-dice, drilling, pivoting Data mining ◼ ◼ knowledge discovery from hidden patterns supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools 204 From On-Line Analytical Processing (OLAP) to On Line Analytical Mining (OLAM) ◼ Why online analytical mining? ◼ High quality of data in data warehouses ◼ DW contains integrated, consistent, cleaned data ◼ Available information processing structure surrounding data warehouses ◼ ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools ◼ OLAP-based exploratory data analysis ◼ Mining with drilling, dicing, pivoting, etc. 
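◼ Illustration: the roll-up, drill-down, slice, dice, and pivot operations described earlier can be mimicked on a small, made-up fact table, with pandas standing in for an OLAP engine:

```python
# Illustration of typical OLAP operations on a tiny fact table.
# All data are made up; pandas is only a stand-in for an OLAP server.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2009, 2009, 2010, 2010, 2010, 2010],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "city":    ["Chicago", "Chicago", "Urbana", "Chicago", "Urbana", "Chicago"],
    "item":    ["TV", "PC", "TV", "PC", "TV", "TV"],
    "dollars_sold": [400, 250, 300, 260, 310, 420],
})

# Roll up: climb the time hierarchy from quarter to year
print(sales.groupby("year")["dollars_sold"].sum())

# Drill down: from year to (year, quarter)
print(sales.groupby(["year", "quarter"])["dollars_sold"].sum())

# Slice: select on one dimension
print(sales[sales["city"] == "Chicago"])

# Dice: select on two or more dimensions
print(sales[(sales["city"] == "Chicago") & (sales["item"] == "TV")])

# Pivot: reorient the data as a 2-D city x item view
print(sales.pivot_table(values="dollars_sold", index="city",
                        columns="item", aggfunc="sum"))
```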
◼ On-line selection of data mining functions ◼ Integration and swapping of multiple mining functions, algorithms, and tasks 205 Chapter 4: Data Warehousing and On-line Analytical Processing ◼ Data Warehouse: Basic Concepts ◼ Data Warehouse Modeling: Data Cube and OLAP ◼ Data Warehouse Design and Usage ◼ Data Warehouse Implementation ◼ Data Generalization by Attribute-Oriented Induction ◼ Summary 206 Efficient Data Cube Computation ◼ Data cube can be viewed as a lattice of cuboids ◼ The bottom-most cuboid is the base cuboid ◼ The top-most cuboid (apex) contains only one cell ◼ How many cuboids in an n-dimensional cube with L n levels? T = ( Li +1) i =1 ◼ Materialization of data cube ◼ ◼ Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization) Selection of which cuboids to materialize ◼ Based on size, sharing, access frequency, etc. 207 The “Compute Cube” Operator ◼ Cube definition and computation in DMQL define cube sales [item, city, year]: sum (sales_in_dollars) compute cube sales ◼ Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96) () SELECT item, city, year, SUM (amount) FROM SALES ◼ CUBE BY item, city, year Need compute the following Group-Bys (city) (city, item) (item) (city, year) (date, product, customer), (date,product),(date, customer), (product, customer), (city, item, year) (date), (product), (customer) () (year) (item, year) 208 Indexing OLAP Data: Bitmap Index ◼ ◼ ◼ ◼ ◼ ◼ Index on a particular column Each value in the column has a bit vector: bit-op is fast The length of the bit vector: # of records in the base table The i-th bit is set if the i-th row of the base table has the value for the indexed column not suitable for high cardinality domains A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it work for high cardinality domain as well [Wu, et al. TODS’06] Base table Cust C1 C2 C3 C4 C5 Region Asia Europe Asia America Europe Index on Region Index on Type Type RecIDAsia Europe America RecID Retail Dealer Retail 1 1 0 1 1 0 0 Dealer 2 2 0 1 0 1 0 Dealer 3 1 0 0 3 0 1 4 0 0 1 4 1 0 Retail 0 1 0 5 0 1 Dealer 5 209 Indexing OLAP Data: Join Indices ◼ ◼ ◼ Join index: JI(R-id, S-id) where R (R-id, …) S (S-id, …) Traditional indices map the values to a list of record ids ◼ It materializes relational join in JI file and speeds up relational join In data warehouses, join index relates the values of the dimensions of a start schema to rows in the fact table. ◼ E.g. fact table: Sales and two dimensions city and product ◼ A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city ◼ Join indices can span multiple dimensions 210 Efficient Processing OLAP Queries ◼ Determine which operations should be performed on the available cuboids ◼ Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection ◼ Determine which materialized cuboid(s) should be selected for OLAP op. ◼ Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available: 1) {year, item_name, city} 2) {year, brand, country} 3) {year, brand, province_or_state} 4) {item_name, province_or_state} where year = 2004 Which should be selected to process the query? ◼ Explore indexing structures and compressed vs. 
dense array structs in MOLAP 211 OLAP Server Architectures ◼ Relational OLAP (ROLAP) ◼ ◼ ◼ ◼ ◼ Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services Greater scalability Multidimensional OLAP (MOLAP) ◼ Sparse array-based multidimensional storage engine ◼ Fast indexing to pre-computed summarized data Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer) ◼ ◼ Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware Flexibility, e.g., low level: relational, high-level: array Specialized SQL servers (e.g., Redbricks) ◼ Specialized support for SQL queries over star/snowflake schemas 212 Chapter 4: Data Warehousing and On-line Analytical Processing ◼ Data Warehouse: Basic Concepts ◼ Data Warehouse Modeling: Data Cube and OLAP ◼ Data Warehouse Design and Usage ◼ Data Warehouse Implementation ◼ Data Generalization by Attribute-Oriented Induction ◼ Summary 213 Attribute-Oriented Induction ◼ Proposed in 1989 (KDD ‘89 workshop) ◼ Not confined to categorical data nor particular measures ◼ How it is done? ◼ ◼ ◼ ◼ Collect the task-relevant data (initial relation) using a relational database query Perform generalization by attribute removal or attribute generalization Apply aggregation by merging identical, generalized tuples and accumulating their respective counts Interaction with users for knowledge presentation 214 Attribute-Oriented Induction: An Example Example: Describe general characteristics of graduate students in the University database ◼ Step 1. Fetch relevant set of data using an SQL statement, e.g., Select * (i.e., name, gender, major, birth_place, birth_date, residence, phone#, gpa) from student where student_status in {“Msc”, “MBA”, “PhD” } ◼ Step 2. Perform attribute-oriented induction ◼ Step 3. Present results in generalized relation, cross-tab, or rule forms 215 Class Characterization: An Example Name Gender Jim Initial Woodman Relation Scott M Major M F … Removed Retained Residence Phone # GPA Vancouver,BC, 8-12-76 Canada CS Montreal, Que, 28-7-75 Canada Physics Seattle, WA, USA 25-8-70 … … … 3511 Main St., Richmond 345 1st Ave., Richmond 687-4598 3.67 253-9106 3.70 125 Austin Ave., Burnaby … 420-5232 … 3.83 … Sci,Eng, Bus City Removed Excl, VG,.. Gender Major M F … Birth_date CS Lachance Laura Lee … Prime Generalized Relation Birth-Place Science Science … Country Age range Birth_region Age_range Residence GPA Canada Foreign … 20-25 25-30 … Richmond Burnaby … Very-good Excellent … Count 16 22 … Birth_Region Canada Foreign Total Gender M 16 14 30 F 10 22 32 Total 26 36 62 216 Basic Principles of Attribute-Oriented Induction ◼ ◼ ◼ ◼ ◼ Data focusing: task-relevant data, including dimensions, and the result is the initial relation Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A Attribute-threshold control: typical 2-8, specified/default Generalized relation threshold control: control the final relation/rule size 217 Attribute-Oriented Induction: Basic Algorithm ◼ ◼ ◼ ◼ InitialRel: Query processing of task-relevant data, deriving the initial relation. 
PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize? PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts. Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations. 218 Presentation of Generalized Results ◼ Generalized relation: ◼ ◼ Cross tabulation: ◼ ◼ Relations where some or all attributes are generalized, with counts or other aggregation values accumulated. Mapping results into cross tabulation form (similar to contingency tables). ◼ Visualization techniques: ◼ Pie charts, bar charts, curves, cubes, and other visual forms. Quantitative characteristic rules: Mapping generalized result into characteristic rules with quantitative information associated with it, e.g., grad( x) male( x) birth_ region( x) ="Canada"[t :53%] birth_ region( x) =" foreign"[t : 47%]. ◼ 219 Mining Class Comparisons ◼ Comparison: Comparing two or more classes ◼ Method: ◼ Partition the set of relevant data into the target class and the contrasting class(es) ◼ Generalize both classes to the same high level concepts ◼ Compare tuples with the same high level descriptions ◼ Present for every tuple its description and two measures ◼ ◼ ◼ support - distribution within single class ◼ comparison - distribution between classes Highlight the tuples with strong discriminant features Relevance Analysis: ◼ Find attributes (features) which best distinguish different classes 220 Concept Description vs. Cube-Based OLAP ◼ ◼ Similarity: ◼ Data generalization ◼ Presentation of data summarization at multiple levels of abstraction ◼ Interactive drilling, pivoting, slicing and dicing Differences: ◼ OLAP has systematic preprocessing, query independent, and can drill down to rather low level ◼ AOI has automated desired level allocation, and may perform dimension relevance analysis/ranking when there are many relevant dimensions ◼ AOI works on the data which are not in relational forms 221 Chapter 4: Data Warehousing and On-line Analytical Processing ◼ Data Warehouse: Basic Concepts ◼ Data Warehouse Modeling: Data Cube and OLAP ◼ Data Warehouse Design and Usage ◼ Data Warehouse Implementation ◼ Data Generalization by Attribute-Oriented Induction ◼ Summary 222 Summary ◼ Data warehousing: A multi-dimensional model of a data warehouse ◼ ◼ ◼ ◼ A data cube consists of dimensions & measures Star schema, snowflake schema, fact constellations OLAP operations: drilling, rolling, slicing, dicing and pivoting Data Warehouse Architecture, Design, and Usage ◼ Multi-tiered architecture ◼ Business analysis design framework Information processing, analytical processing, data mining, OLAM (Online Analytical Mining) Implementation: Efficient computation of data cubes ◼ Partial vs. full vs. no materialization ◼ Indexing OALP data: Bitmap index and join index ◼ OLAP query processing ◼ OLAP servers: ROLAP, MOLAP, HOLAP ◼ ◼ ◼ Data generalization: Attribute-oriented induction 223 References (I) ◼ S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB’96 ◼ D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97 ◼ R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97 ◼ S. Chaudhuri and U. Dayal. 
An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997 ◼ E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July 1993. J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997. ◼ ◼ A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999. ◼ J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107, 1998. ◼ V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD’96 ◼ J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97 224 References (II) ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and Dimensional Techniques. John Wiley, 2003 W. H. Inmon. Building the Data Warehouse. John Wiley, 1996 R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2ed. John Wiley, 2002 P. O’Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record, 24:8– 11, Sept. 1995. P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97 Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998 S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94 A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00. D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with aggregation using views. VLDB'96 P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987. J. Widom. Research problems in data warehousing. CIKM’95 K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans. on Database Systems (TODS), 31(1): 1-38, 2006 225 Surplus Slides 226 Compression of Bitmap Indices ◼ Bitmap indexes must be compressed to reduce I/O costs and minimize CPU usage—majority of the bits are 0’s ◼ ◼ Two compression schemes: ◼ Byte-aligned Bitmap Code (BBC) ◼ Word-Aligned Hybrid (WAH) code Time and space required to operate on compressed bitmap is proportional to the total size of the bitmap ◼ Optimal on attributes of low cardinality as well as those of high cardinality. ◼ WAH out performs BBC by about a factor of two 227 Data Mining: Concepts and Techniques (3rd ed.) — Chapter 5 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2010 Han, Kamber & Pei. All rights reserved. 
228 Chapter 5: Data Cube Technology ◼ Data Cube Computation: Preliminary Concepts ◼ Data Cube Computation Methods ◼ Processing Advanced Queries by Exploring Data Cube Technology ◼ Multidimensional Data Analysis in Cube Space ◼ Summary 229 Data Cube: A Lattice of Cuboids all time item time,location time,item 0-D(apex) cuboid location supplier item,location time,supplier 1-D cuboids location,supplier 2-D cuboids item,supplier time,location,supplier 3-D cuboids time,item,locationtime,item,supplier item,location,supplier 4-D(base) cuboid time, item, location, supplierc 230 Data Cube: A Lattice of Cuboids all time item 0-D(apex) cuboid location supplier 1-D cuboids time,item time,location item,location location,supplier item,supplier time,supplier 2-D cuboids time,location,supplier time,item,location time,item,supplier item,location,supplier time, item, location, supplier ◼ 3-D cuboids 4-D(base) cuboid Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells 1. (9/15, milk, Urbana, Dairy_land) 2. (9/15, milk, Urbana, *) 3. (*, milk, Urbana, *) 4. (*, milk, Urbana, *) 5. (*, milk, Chicago, *) 6. (*, milk, *, *) 231 Cube Materialization: Full Cube vs. Iceberg Cube ◼ Full cube vs. iceberg cube iceberg condition compute cube sales iceberg as select month, city, customer group, count(*) from salesInfo cube by month, city, customer group having count(*) >= min support ▪ Computing only the cuboid cells whose measure satisfies the iceberg condition ▪ Only a small portion of cells may be “above the water’’ in a sparse cube ◼ Avoid explosive growth: A cube with 100 dimensions ◼ 2 base cells: (a1, a2, …., a100), (b1, b2, …, b100) ◼ How many aggregate cells if “having count >= 1”? ◼ What about “having count >= 2”? 232 Iceberg Cube, Closed Cube & Cube Shell ◼ Is iceberg cube good enough? ◼ ◼ ◼ How many cells will the iceberg cube have if having count(*) >= 10? Hint: A huge but tricky number! Close cube: ◼ ◼ ◼ ◼ 2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10} Closed cell c: if there exists no cell d, s.t. d is a descendant of c, and d has the same measure value as c. Closed cube: a cube consisting of only closed cells What is the closed cube of the above base cuboid? Hint: only 3 cells Cube Shell ◼ ◼ Precompute only the cuboids involving a small # of dimensions, e.g., 3 For (A1, A2, … A10), how many combinations to compute? More dimension combinations will need to be computed on the fly 233 Roadmap for Efficient Computation ◼ General cube computation heuristics (Agarwal et al.’96) ◼ Computing full/iceberg cubes: 3 methodologies ◼ ◼ ◼ Bottom-Up: Multi-Way array aggregation (Zhao, Deshpande & Naughton, SIGMOD’97) Top-down: ◼ BUC (Beyer & Ramarkrishnan, SIGMOD’99) ◼ H-cubing technique (Han, Pei, Dong & Wang: SIGMOD’01) Integrating Top-Down and Bottom-Up: ◼ Star-cubing algorithm (Xin, Han, Li & Wah: VLDB’03) ◼ High-dimensional OLAP: A Minimal Cubing Approach (Li, et al. VLDB’04) ◼ Computing alternative kinds of cubes: ◼ Partial cube, closed cube, approximate cube, etc. 234 General Heuristics (Agarwal et al. 
VLDB’96) ◼ ◼ Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples Aggregates may be computed from previously computed aggregates, rather than from the base fact table ◼ ◼ ◼ ◼ ◼ Smallest-child: computing a cuboid from the smallest, previously computed cuboid Cache-results: caching results of a cuboid from which other cuboids are computed to reduce disk I/Os Amortize-scans: computing as many as possible cuboids at the same time to amortize disk reads Share-sorts: sharing sorting costs cross multiple cuboids when sort-based method is used Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used 235 Chapter 5: Data Cube Technology ◼ Data Cube Computation: Preliminary Concepts ◼ Data Cube Computation Methods ◼ Processing Advanced Queries by Exploring Data Cube Technology ◼ Multidimensional Data Analysis in Cube Space ◼ Summary 236 Data Cube Computation Methods ◼ Multi-Way Array Aggregation ◼ BUC ◼ Star-Cubing ◼ High-Dimensional OLAP 237 Multi-Way Array Aggregation ◼ Array-based “bottom-up” algorithm ◼ Using multi-dimensional chunks ◼ No direct tuple comparisons ◼ ◼ ◼ Simultaneous aggregation on multiple dimensions Intermediate aggregate values are reused for computing ancestor cuboids Cannot do Apriori pruning: No iceberg optimization 238 Multi-way Array Aggregation for Cube Computation (MOLAP) ◼ Partition arrays into chunks (a small subcube which fits in memory). ◼ Compressed sparse array addressing: (chunk_id, offset) ◼ Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost. C c3 61 62 63 64 c2 45 46 47 48 c1 29 30 31 32 c0 B b3 B13 b2 9 b1 5 b0 14 15 16 1 2 3 4 a0 a1 a2 a3 A 60 44 28 56 40 24 52 36 20 What is the best traversing order to do multi-way aggregation? 239 Multi-way Array Aggregation for Cube Computation (3-D to 2-D) all A B AB C AC BC ◼ ABC The best order is the one that minimizes the memory requirement and reduced I/Os 240 Multi-way Array Aggregation for Cube Computation (2-D to 1-D) 241 Multi-Way Array Aggregation for Cube Computation (Method Summary) ◼ Method: the planes should be sorted and computed according to their size in ascending order ◼ ◼ Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane Limitation of the method: computing well only for a small number of dimensions ◼ If there are a large number of dimensions, “top-down” computation and iceberg cube computation methods can be explored 242 Data Cube Computation Methods ◼ Multi-Way Array Aggregation ◼ BUC ◼ Star-Cubing ◼ High-Dimensional OLAP 243 Bottom-Up Computation (BUC) ◼ ◼ BUC (Beyer & Ramakrishnan, SIGMOD’99) Bottom-up cube computation (Note: top-down in our view!) Divides dimensions into partitions and facilitates iceberg pruning ◼ If a partition does not satisfy min_sup, its descendants can be pruned 3 AB ◼ If minsup = 1 compute full CUBE! 
4 ABC No simultaneous aggregation AB ABC ◼ ◼ all A AC B AD ABD C BC D CD BD ACD BCD ABCD 1 all 2A 7 AC 6 ABD 10 B 14 C 16 D 9 AD 11 BC 13 BD 8 ACD 15 CD 12 BCD 5 ABCD 244 BUC: Partitioning ◼ ◼ ◼ ◼ Usually, entire data set can’t fit in main memory Sort distinct values ◼ partition into blocks that fit Continue processing Optimizations ◼ Partitioning ◼ External Sorting, Hashing, Counting Sort ◼ Ordering dimensions to encourage pruning ◼ Cardinality, Skew, Correlation ◼ Collapsing duplicates ◼ Can’t do holistic aggregates anymore! 245 Data Cube Computation Methods ◼ Multi-Way Array Aggregation ◼ BUC ◼ Star-Cubing ◼ High-Dimensional OLAP 246 Star-Cubing: An Integrating Method ◼ ◼ ◼ D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration, VLDB'03 Explore shared dimensions ◼ E.g., dimension A is the shared dimension of ACD and AD ◼ ABD/AB means cuboid ABD has shared dimensions AB Allows for shared computations e.g., cuboid AB is computed simultaneously as ABD Aggregate in a top-down manner but with the bottom-up AC/AC AD/A BC/BC sub-layer underneath which will allow Apriori pruning ACD/A ◼ ◼ ◼ Shared dimensions grow in bottom-up fashion ABC/ABC ABD/AB C/C D BD/B CD BCD ABCD/all 247 Iceberg Pruning in Shared Dimensions ◼ Anti-monotonic property of shared dimensions ◼ ◼ ◼ If the measure is anti-monotonic, and if the aggregate value on a shared dimension does not satisfy the iceberg condition, then all the cells extended from this shared dimension cannot satisfy the condition either Intuition: if we can compute the shared dimensions before the actual cuboid, we can use them to do Apriori pruning Problem: how to prune while still aggregate simultaneously on multiple dimensions? 248 Cell Trees ◼ Use a tree structure similar to H-tree to represent cuboids ◼ Collapses common prefixes to save memory ◼ Keep count at node ◼ Traverse the tree to retrieve a particular tuple 249 Star Attributes and Star Nodes ◼ Intuition: If a single-dimensional aggregate on an attribute value p does not satisfy the iceberg condition, it is useless to distinguish them during the iceberg computation ◼ ◼ E.g., b2, b3, b4, c1, c2, c4, d1, d2, d3 A B C D Count a1 b1 c1 d1 1 a1 b1 c4 d3 1 a1 b2 c2 d2 1 a2 b3 c3 d4 1 a2 b4 c3 d4 1 Solution: Replace such attributes by a *. Such attributes are star attributes, and the corresponding nodes in the cell tree are star nodes 250 Example: Star Reduction ◼ ◼ ◼ ◼ Suppose minsup = 2 Perform one-dimensional aggregation. Replace attribute values whose count < 2 with *. 
And collapse all *’s together Resulting table has all such attributes replaced with the starattribute With regards to the iceberg computation, this new table is a lossless compression of the original table A B C D Count a1 b1 * * 1 a1 b1 * * 1 a1 * * * 1 a2 * c3 d4 1 a2 * c3 d4 1 A B C D Count a1 b1 * * 2 a1 * * * 1 a2 * c3 d4 2 251 Star Tree ◼ Given the new compressed table, it is possible to construct the corresponding A B C D Count a1 b1 * * 2 a1 * * * 1 a2 * c3 d4 2 cell tree—called star tree ◼ Keep a star table at the side for easy lookup of star attributes ◼ The star tree is a lossless compression of the original cell tree 252 Star-Cubing Algorithm—DFS on Lattice Tree all BCD: 51 b*: 33 A /A B/B C/C b1: 26 D/D root: 5 c*: 14 AB/AB d*: 15 ABC/ABC c3: 211 AC/AC d4: 212 ABD/AB c*: 27 AD/A BC/BC BD/B CD a1: 3 a2: 2 d*: 28 ACD/A BCD b*: 1 b1: 2 b*: 2 c*: 1 c*: 2 c3: 2 d*: 1 d*: 2 d4: 2 ABCD 253 BCD ACD/A ABD/AB ABC/ABC Multi-Way Aggregation ABCD 254 BCD ACD/A ABD/AB ABC/ABC Star-Cubing Algorithm—DFS on Star-Tree ABCD 255 BCD ACD/A ABD/AB ABC/ABC Multi-Way Star-Tree Aggregation ABCD ◼ Start depth-first search at the root of the base star tree ◼ At each new node in the DFS, create corresponding star tree that are descendants of the current tree according to the integrated traversal ordering ◼ E.g., in the base tree, when DFS reaches a1, the ACD/A tree is created ◼ When DFS reaches b*, the ABD/AD tree is created ◼ The counts in the base tree are carried over to the new trees ◼ When DFS reaches a leaf node (e.g., d*), start backtracking ◼ ◼ On every backtracking branch, the count in the corresponding trees are output, the tree is destroyed, and the node in the base tree is destroyed Example ◼ ◼ ◼ When traversing from d* back to c*, the a1b*c*/a1b*c* tree is output and destroyed When traversing from c* back to b*, the a1b*D/a1b* tree is output and destroyed When at b*, jump to b1 and repeat similar process 256 Data Cube Computation Methods ◼ Multi-Way Array Aggregation ◼ BUC ◼ Star-Cubing ◼ High-Dimensional OLAP 257 The Curse of Dimensionality ◼ ◼ None of the previous cubing method can handle high dimensionality! A database of 600k tuples. Each dimension has cardinality of 100 and zipf of 2. 258 Motivation of High-D OLAP ◼ ◼ ◼ X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP: A Minimal Cubing Approach, VLDB'04 Challenge to current cubing methods: ◼ The “curse of dimensionality’’ problem ◼ Iceberg cube and compressed cubes: only delay the inevitable explosion ◼ Full materialization: still significant overhead in accessing results on disk High-D OLAP is needed in applications ◼ Science and engineering analysis ◼ Bio-data analysis: thousands of genes ◼ Statistical surveys: hundreds of variables 259 Fast High-D OLAP with Minimal Cubing ◼ Observation: OLAP occurs only on a small subset of dimensions at a time ◼ Semi-Online Computational Model 1. Partition the set of dimensions into shell fragments 2. Compute data cubes for each shell fragment while retaining inverted indices or value-list indices 3. 
Given the pre-computed fragment cubes, dynamically compute cube cells of the highdimensional data cube online 260 Properties of Proposed Method ◼ Partitions the data vertically ◼ Reduces high-dimensional cube into a set of lower dimensional cubes ◼ Online re-construction of original high-dimensional space ◼ Lossless reduction ◼ Offers tradeoffs between the amount of pre-processing and the speed of online computation 261 Example Computation ◼ ◼ Let the cube aggregation function be count tid A B C D E 1 a1 b1 c1 d1 e1 2 a1 b2 c1 d2 e1 3 a1 b2 c1 d1 e2 4 a2 b1 c1 d1 e2 5 a2 b1 c1 d1 e3 Divide the 5 dimensions into 2 shell fragments: ◼ (A, B, C) and (D, E) 262 1-D Inverted Indices ◼ Build traditional invert index or RID list Attribute Value TID List List Size a1 123 3 a2 45 2 b1 145 3 b2 23 2 c1 12345 5 d1 1345 4 d2 2 1 e1 12 2 e2 34 2 e3 5 1 263 Shell Fragment Cubes: Ideas ◼ ◼ ◼ ◼ Generalize the 1-D inverted indices to multi-dimensional ones in the data cube sense Compute all cuboids for data cubes ABC and DE while retaining the inverted indices Cell For example, shell fragment cube ABC a1 b1 contains 7 cuboids: a1 b2 ◼ A, B, C a2 b1 ◼ AB, AC, BC a2 b2 ◼ ABC This completes the offline computation stage Intersection TID List List Size 1 2 3 1 4 5 1 1 1 2 3 2 3 23 2 4 5 1 4 5 45 2 4 52 3 0 264 Shell Fragment Cubes: Size and Design ◼ Given a database of T tuples, D dimensions, and F shell fragment size, the fragment cubes’ space requirement is: ◼ For F < 5, the growth is sub-linear D F OT (2 −1) F ◼ Shell fragments do not have to be disjoint ◼ Fragment groupings can be arbitrary to allow for maximum online performance ◼ ◼ Known common combinations (e.g.,<city, state>) should be grouped together. Shell fragment sizes can be adjusted for optimal balance between offline and online computation 265 ID_Measure Table ◼ If measures other than count are present, store in ID_measure table separate from the shell fragments tid count sum 1 5 70 2 3 10 3 8 20 4 5 40 5 2 30 266 The Frag-Shells Algorithm 1. Partition set of dimension (A1,…,An) into a set of k fragments (P1,…,Pk). 2. Scan base table once and do the following 3. insert <tid, measure> into ID_measure table. 4. for each attribute value ai of each dimension Ai 5. build inverted index entry <ai, tidlist> 6. 7. For each fragment partition Pi build local fragment cube Si by intersecting tid-lists in bottomup fashion. 267 Frag-Shells (2) Dimensions D Cuboid EF Cuboid DE Cuboid A B C D E F … ABC Cube Cell Tuple-ID List d1 e1 {1, 3, 8, 9} d1 e2 {2, 4, 6, 7} d2 e1 {5, 10} … … DEF Cube 268 Online Query Computation: Query ◼ A query has the general form ◼ Each ai has 3 possible values 1. ▪ a1,a2 , ,an : M Instantiated value 2. Aggregate * function 3. Inquire ? function For example, 3 ? ? * 1: count returns a 2-D data cube. 269 Online Query Computation: Method Given the fragment cubes, process a query as ◼ follows 1. Divide the query into fragment, same as the shell 2. Fetch the corresponding TID list for each fragment from the fragment cube 3. Intersect the TID lists from each fragment to construct instantiated base table 4. Compute the data cube using the base table with any cubing algorithm 270 Online Query Computation: Sketch A B C D E F G H I J K L M N … Instantiated Base Table Online Cube 271 Experiment: Size vs. Dimensionality (50 and 100 cardinality) ◼ ◼ (50-C): 106 tuples, 0 skew, 50 cardinality, fragment size 3. (100-C): 106 tuples, 2 skew, 100 cardinality, fragment size 2. 
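◼ Illustration: the shell-fragment scheme on the 5-tuple example table above can be sketched in a few lines: build the 1-D inverted (TID-list) indices, then answer an instantiated query online by intersecting TID lists. A full implementation would also materialize the cuboids within each fragment, as described on the earlier slides; the names below are illustrative:

```python
# Sketch of inverted (TID-list) indices and online query answering
# for the 5-tuple example table with dimensions A..E.
from functools import reduce

rows = {  # tid -> (A, B, C, D, E), from the example table
    1: ("a1", "b1", "c1", "d1", "e1"),
    2: ("a1", "b2", "c1", "d2", "e1"),
    3: ("a1", "b2", "c1", "d1", "e2"),
    4: ("a2", "b1", "c1", "d1", "e2"),
    5: ("a2", "b1", "c1", "d1", "e3"),
}

# 1-D inverted indices: attribute value -> set of TIDs
inverted = {}
for tid, values in rows.items():
    for v in values:
        inverted.setdefault(v, set()).add(tid)

print(sorted(inverted["a1"]))   # [1, 2, 3]
print(sorted(inverted["b1"]))   # [1, 4, 5]

# Online step: instantiate a1 and b1, intersect their TID lists,
# then aggregate (here: count) over the instantiated base table.
query = ["a1", "b1"]
tids = reduce(set.intersection, (inverted[v] for v in query))
print(sorted(tids), "count =", len(tids))   # [1] count = 1
```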
272 Experiments on Real World Data ◼ UCI Forest CoverType data set ◼ ◼ ◼ ◼ 54 dimensions, 581K tuples Shell fragments of size 2 took 33 seconds and 325MB to compute 3-D subquery with 1 instantiate D: 85ms~1.4 sec. Longitudinal Study of Vocational Rehab. Data ◼ ◼ ◼ 24 dimensions, 8818 tuples Shell fragments of size 3 took 0.9 seconds and 60MB to compute 5-D query with 0 instantiated D: 227ms~2.6 sec. 273 Chapter 5: Data Cube Technology ◼ Data Cube Computation: Preliminary Concepts ◼ Data Cube Computation Methods ◼ Processing Advanced Queries by Exploring Data Cube Technology ◼ Sampling Cube ◼ Ranking Cube ◼ Multidimensional Data Analysis in Cube Space ◼ Summary 274 Processing Advanced Queries by Exploring Data Cube Technology ◼ Sampling Cube ◼ ◼ Ranking Cube ◼ ◼ X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling Cube: A Framework for Statistical OLAP over Sampling Data”, SIGMOD’08 D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k queries with multi-dimensional selections: The ranking cube approach. VLDB’06 Other advanced cubes for processing data and queries ◼ Stream cube, spatial cube, multimedia cube, text cube, RFID cube, etc. — to be studied in volume 2 275 Statistical Surveys and OLAP ▪ ▪ ▪ ▪ ▪ Statistical survey: A popular tool to collect information about a population based on a sample ▪ Ex.: TV ratings, US Census, election polls A common tool in politics, health, market research, science, and many more An efficient way of collecting information (Data collection is expensive) Many statistical tools available, to determine validity ▪ Confidence intervals ▪ Hypothesis tests OLAP (multidimensional analysis) on survey data ▪ highly desirable but can it be done well? 276 Surveys: Sample vs. Whole Population Data is only a sample of population Age\Education High-school College Graduate 18 19 20 … 277 Problems for Drilling in Multidim. Space Data is only a sample of population but samples could be small when drilling to certain multidimensional space Age\Education High-school College Graduate 18 19 20 … 278 OLAP on Survey (i.e., Sampling) Data ▪ ▪ Semantics of query is unchanged Input data has changed Age/Education High-school College Graduate 18 19 20 … 279 Challenges for OLAP on Sampling Data ▪ ▪ ▪ Computing confidence intervals in OLAP context No data? ▪ Not exactly. No data in subspaces in cube ▪ Sparse data ▪ Causes include sampling bias and query selection bias Curse of dimensionality ▪ Survey data can be high dimensional ▪ Over 600 dimensions in real world example ▪ Impossible to fully materialize 280 Example 1: Confidence Interval What is the average income of 19-year-old high-school students? Return not only query result but also confidence interval Age/Education High-school College Graduate 18 19 20 … 281 Confidence Interval ▪ Confidence interval at ▪ ▪ ◼ ▪ ▪ : x is a sample of data set; is the mean of sample tc is the critical t-value, calculated by a look-up is the estimated standard error of the mean Example: $50,000 ± $3,000 with 95% confidence ▪ Treat points in cube cell as samples ▪ Compute confidence interval as traditional sample set Return answer in the form of confidence interval ▪ Indicates quality of query answer ▪ User selects desired confidence interval 282 Efficient Computing Confidence Interval Measures ▪ Efficient computation in all cells in data cube ▪ Both mean and confidence interval are algebraic ▪ Why confidence interval measure is algebraic? 
is algebraic where both s and l (count) are algebraic ▪ Thus one can calculate cells efficiently at more general cuboids without having to start at the base cuboid each time 283 Example 2: Query Expansion What is the average income of 19-year-old college students? Age/Education High-school College Graduate 18 19 20 … 284 Boosting Confidence by Query Expansion ▪ ▪ ▪ From the example: The queried cell “19-year-old college students” contains only 2 samples Confidence interval is large (i.e., low confidence). why? ▪ Small sample size ▪ High standard deviation with samples Small sample sizes can occur at relatively low dimensional selections ▪ Collect more data?― expensive! ▪ Use data in other cells? Maybe, but have to be careful 285 Intra-Cuboid Expansion: Choice 1 Expand query to include 18 and 20 year olds? Age/Education High-school College Graduate 18 19 20 … 286 Intra-Cuboid Expansion: Choice 2 Expand query to include high-school and graduate students? Age/Education High-school College Graduate 18 19 20 … 287 Query Expansion 288 Intra-Cuboid Expansion Combine other cells’ data into own to “boost” confidence ▪ If share semantic and cube similarity ▪ Use only if necessary ▪ Bigger sample size will decrease confidence interval ◼ Cell segment similarity ◼ Some dimensions are clear: Age ◼ Some are fuzzy: Occupation ◼ May need domain knowledge ◼ Cell value similarity ◼ How to determine if two cells’ samples come from the same population? ◼ Two-sample t-test (confidence-based) ▪ 289 Inter-Cuboid Expansion If a query dimension is ▪ ▪ ▪ ▪ ▪ ▪ Not correlated with cube value But is causing small sample size by drilling down too much Remove dimension (i.e., generalize to *) and move to a more general cuboid Can use two-sample t-test to determine similarity between two cells across cuboids Can also use a different method to be shown later 290 Query Expansion Experiments ▪ ▪ Real world sample data: 600 dimensions and 750,000 tuples 0.05% to simulate “sample” (allows error checking) 291 Chapter 5: Data Cube Technology ◼ Data Cube Computation: Preliminary Concepts ◼ Data Cube Computation Methods ◼ Processing Advanced Queries by Exploring Data Cube Technology ◼ Sampling Cube ◼ Ranking Cube ◼ Multidimensional Data Analysis in Cube Space ◼ Summary 292 Ranking Cubes – Efficient Computation of Ranking queries ◼ ◼ ◼ Data cube helps not only OLAP but also ranked search (top-k) ranking query: only returns the best k results according to a user-specified preference, consisting of (1) a selection condition and (2) a ranking function Ex.: Search for apartments with expected price 1000 and expected square feet 800 ◼ ◼ ◼ ◼ Select top 1 from Apartment where City = “LA” and Num_Bedroom = 2 order by [price – 1000]^2 + [sq feet - 800]^2 asc Efficiency question: Can we only search what we need? 
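Before the ranking-cube discussion continues, here is a minimal sketch of the two sampling-cube measures described above: the per-cell confidence interval (mean plus or minus the critical t-value times the estimated standard error) and the two-sample t-test used to decide whether a neighboring cell is similar enough for intra-cuboid expansion. The income values below are invented for illustration, the function names are mine, and Welch's variant of the t-test is one reasonable choice; only the formulas follow the slides.

```python
import numpy as np
from scipy import stats

def cell_confidence_interval(values, confidence=0.95):
    """Mean +/- t_c * (estimated standard error), treating the cell's tuples as a sample."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    se = stats.sem(x)                                        # estimated standard error of the mean
    t_c = stats.t.ppf(0.5 + confidence / 2, df=len(x) - 1)   # critical t-value (table look-up)
    return mean, t_c * se

def similar_populations(cell_a, cell_b, alpha=0.05):
    """Two-sample t-test (Welch variant): treat the two cells' samples as coming
    from the same population unless the test rejects that at level alpha."""
    _, p_value = stats.ttest_ind(cell_a, cell_b, equal_var=False)
    return p_value > alpha

# Hypothetical incomes: the queried cell (19, college) has only two samples.
query_cell = [28000, 41000]
neighbour  = [30000, 33000, 29500, 35000, 31000]             # e.g. the (20, college) cell

mean, half = cell_confidence_interval(query_cell)
print(f"query cell: {mean:,.0f} +/- {half:,.0f} (95%)")      # wide interval, low confidence

if similar_populations(query_cell, neighbour):               # intra-cuboid expansion test
    mean, half = cell_confidence_interval(query_cell + neighbour)
    print(f"expanded:   {mean:,.0f} +/- {half:,.0f} (95%)")  # bigger sample, tighter interval
```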
◼ Build a ranking cube on both selection dimensions and ranking dimensions 293 Ranking Cube: Partition Data on Both Selection and Ranking Dimensions One single data partition as the template Partition for all data Slice the data partition by selection conditions Sliced Partition for city=“LA” Sliced Partition for BR=2 294 Materialize Ranking-Cube Step 1: Partition Data on Ranking Dimensions tid t1 t2 t3 t4 t5 t6 t7 t8 City SEA CLE SEA CLE LA LA LA CLE BR 1 2 1 3 1 2 2 3 Price 500 700 800 1000 1100 1200 1200 1350 Sq feet 600 800 900 1000 200 500 560 1120 Step 2: Group data by Selection Dimensions City SEA LA CLE Block ID 5 5 2 6 15 11 11 4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Step 3: Compute Measures for each group For the cell (LA) Block-level: {11, 15} Data-level: {11: t6, t7; 15: t5} City & BR BR 1 2 3 4 295 Search with Ranking-Cube: Simultaneously Push Selection and Ranking Select top 1 from Apartment where city = “LA” order by [price – 1000]^2 + [sq feet - 800]^2 asc Bin boundary for price [500, 600, 800, 1100,1350] Bin boundary for sq feet [200, 400, 600, 800, 1120] Given the bin boundaries, locate the block with top score 800 11 15 1000 Without ranking-cube: start search from here With ranking-cube: start search from here Measure for LA: {11, 15} {11: t6,t7; 15:t5} 296 Processing Ranking Query: Execution Trace Select top 1 from Apartment where city = “LA” order by [price – 1000]^2 + [sq feet - 800]^2 asc Bin boundary for price [500, 600, 800, 1100,1350] Bin boundary for sq feet [200, 400, 600, 800, 1120] f=[price-1000]^2 + [sq feet – 800]^2 Execution Trace: 800 1. Retrieve High-level measure for LA {11, 15} 2. Estimate lower bound score for block 11, 15 11 f(block 11) = 40,000, f(block 15) = 160,000 15 1000 3. Retrieve block 11 4. Retrieve low-level measure for block 11 5. f(t6) = 130,000, f(t7) = 97,600 With rankingcube: start search from here Measure for LA: {11, 15} {11: t6,t7; 15:t5} Output t7, done! 297 Ranking Cube: Methodology and Extension ◼ ◼ Ranking cube methodology ◼ Push selection and ranking simultaneously ◼ It works for many sophisticated ranking functions How to support high-dimensional data? 
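The execution trace above can be reproduced with a few lines of code. In the sketch below the block identifiers are simply pairs of bin indexes rather than the single numbers drawn in the figure, so the block ids and the exact lower-bound values differ slightly from the slide (they depend on how bin edges are treated), but the search behaves the same way: it visits only the most promising LA block and returns t7 with score 97,600. The apartment tuples, bin boundaries, and ranking function come from the slides; the helper names are illustrative.

```python
import bisect
from collections import defaultdict

apartments = {                      # tid: (city, bedrooms, price, sq_feet)
    "t1": ("SEA", 1, 500, 600),  "t2": ("CLE", 2, 700, 800),
    "t3": ("SEA", 1, 800, 900),  "t4": ("CLE", 3, 1000, 1000),
    "t5": ("LA", 1, 1100, 200),  "t6": ("LA", 2, 1200, 500),
    "t7": ("LA", 2, 1200, 560),  "t8": ("CLE", 3, 1350, 1120),
}
price_bins = [500, 600, 800, 1100, 1350]
sqft_bins  = [200, 400, 600, 800, 1120]

def bin_index(bounds, v):
    """Index of the bin [bounds[i], bounds[i+1]) containing v (last bin is closed)."""
    return min(bisect.bisect_right(bounds, v) - 1, len(bounds) - 2)

# Offline: group tids by (selection value, block of ranking-bin indexes).
blocks = defaultdict(list)
for tid, (city, br, price, sqft) in apartments.items():
    blocks[(city, (bin_index(price_bins, price), bin_index(sqft_bins, sqft)))].append(tid)

def lower_bound(block, p0, s0):
    """Smallest possible score over the block's rectangle (clamp the query point)."""
    pi, si = block
    p = min(max(p0, price_bins[pi]), price_bins[pi + 1])
    s = min(max(s0, sqft_bins[si]), sqft_bins[si + 1])
    return (p - p0) ** 2 + (s - s0) ** 2

def top1(city, p0, s0):
    f = lambda tid: (apartments[tid][2] - p0) ** 2 + (apartments[tid][3] - s0) ** 2
    candidates = sorted((b for c, b in blocks if c == city),
                        key=lambda b: lower_bound(b, p0, s0))
    best_tid, best_score = None, float("inf")
    for b in candidates:
        if lower_bound(b, p0, s0) >= best_score:
            break                          # no remaining block can beat the best tuple
        for tid in blocks[(city, b)]:
            if f(tid) < best_score:
                best_tid, best_score = tid, f(tid)
    return best_tid, best_score

print(top1("LA", 1000, 800))               # ('t7', 97600), as in the trace above
```

The key design point, as the slide puts it, is that selection and ranking are pushed simultaneously: slicing by city first and sorting the surviving blocks by their lower bound means only one block of LA tuples is ever examined.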
◼ Materialize only those atomic cuboids that contain single selection dimensions ◼ ◼ Uses the idea similar to high-dimensional OLAP Achieves low space overhead and high performance in answering ranking queries with a high number of selection dimensions 298 Chapter 5: Data Cube Technology ◼ Data Cube Computation: Preliminary Concepts ◼ Data Cube Computation Methods ◼ Processing Advanced Queries by Exploring Data Cube Technology ◼ Multidimensional Data Analysis in Cube Space ◼ Summary 299 Multidimensional Data Analysis in Cube Space ◼ Prediction Cubes: Data Mining in MultiDimensional Cube Space ◼ Multi-Feature Cubes: Complex Aggregation at Multiple Granularities ◼ Discovery-Driven Exploration of Data Cubes 300 Data Mining in Cube Space ◼ ◼ Data cube greatly increases the analysis bandwidth Four ways to interact OLAP-styled analysis and data mining ◼ Using cube space to define data space for mining ◼ Using OLAP queries to generate features and targets for mining, e.g., multi-feature cube ◼ Using data-mining models as building blocks in a multistep mining process, e.g., prediction cube ◼ Using data-cube computation techniques to speed up repeated model construction ◼ Cube-space data mining may require building a model for each candidate data space ◼ Sharing computation across model-construction for different candidates may lead to efficient mining 301 Prediction Cubes ◼ ◼ Prediction cube: A cube structure that stores prediction models in multidimensional data space and supports prediction in OLAP manner Prediction models are used as building blocks to define the interestingness of subsets of data, i.e., to answer which subsets of data indicate better prediction 302 How to Determine the Prediction Power of an Attribute? ◼ ◼ ◼ Ex. A customer table D: ◼ Two dimensions Z: Time (Month, Year ) and Location (State, Country) ◼ Two features X: Gender and Salary ◼ One class-label attribute Y: Valued Customer Q: “Are there times and locations in which the value of a customer depended greatly on the customers gender (i.e., Gender: predictiveness attribute V)?” Idea: ◼ Compute the difference between the model built on that using X to predict Y and that built on using X – V to predict Y ◼ If the difference is large, V must play an important role at predicting Y 303 Efficient Computation of Prediction Cubes ◼ ◼ Naïve method: Fully materialize the prediction cube, i.e., exhaustively build models and evaluate them for each cell and for each granularity Better approach: Explore score function decomposition that reduces prediction cube computation to data cube computation 304 Multidimensional Data Analysis in Cube Space ◼ Prediction Cubes: Data Mining in MultiDimensional Cube Space ◼ Multi-Feature Cubes: Complex Aggregation at Multiple Granularities ◼ Discovery-Driven Exploration of Data Cubes 305 Complex Aggregation at Multiple Granularities: Multi-Feature Cubes ◼ ◼ ◼ Multi-feature cubes (Ross, et al. 1998): Compute complex queries involving multiple dependent aggregates at multiple granularities Ex. 
Grouping by all subsets of {item, region, month}, find the maximum price in 2010 for each group, and the total sales among all maximum price tuples select item, region, month, max(price), sum(R.sales) from purchases where year = 2010 cube by item, region, month: R such that R.price = max(price) Continuing the last example, among the max price tuples, find the min and max shelf live, and find the fraction of the total sales due to tuple that have min shelf life within the set of all max price tuples 306 Multidimensional Data Analysis in Cube Space ◼ Prediction Cubes: Data Mining in MultiDimensional Cube Space ◼ Multi-Feature Cubes: Complex Aggregation at Multiple Granularities ◼ Discovery-Driven Exploration of Data Cubes 307 Discovery-Driven Exploration of Data Cubes ◼ Hypothesis-driven ◼ ◼ exploration by user, huge search space Discovery-driven (Sarawagi, et al.’98) ◼ ◼ ◼ ◼ Effective navigation of large OLAP data cubes pre-compute measures indicating exceptions, guide user in the data analysis, at all levels of aggregation Exception: significantly different from the value anticipated, based on a statistical model Visual cues such as background color are used to reflect the degree of exception of each cell 308 Kinds of Exceptions and their Computation ◼ Parameters ◼ ◼ ◼ ◼ ◼ SelfExp: surprise of cell relative to other cells at same level of aggregation InExp: surprise beneath the cell PathExp: surprise beneath cell for each drill-down path Computation of exception indicator (modeling fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction Exception themselves can be stored, indexed and retrieved like precomputed aggregates 309 Examples: Discovery-Driven Data Cubes 310 Chapter 5: Data Cube Technology ◼ Data Cube Computation: Preliminary Concepts ◼ Data Cube Computation Methods ◼ Processing Advanced Queries by Exploring Data Cube Technology ◼ Multidimensional Data Analysis in Cube Space ◼ Summary 311 Data Cube Technology: Summary ◼ Data Cube Computation: Preliminary Concepts ◼ Data Cube Computation Methods ◼ ◼ ◼ MultiWay Array Aggregation ◼ BUC ◼ Star-Cubing ◼ High-Dimensional OLAP with Shell-Fragments Processing Advanced Queries by Exploring Data Cube Technology ◼ Sampling Cubes ◼ Ranking Cubes Multidimensional Data Analysis in Cube Space ◼ Discovery-Driven Exploration of Data Cubes ◼ Multi-feature Cubes ◼ Prediction Cubes 312 Ref.(I) Data Cube Computation Methods ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB’96 D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97 K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs.. SIGMOD’99 M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB’98 J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29–54, 1997. J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures. SIGMOD’01 L. V. S. Lakshmanan, J. Pei, and J. Han, Quotient Cube: How to Summarize the Semantics of a Data Cube, VLDB'02 X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP: A Minimal Cubing Approach, VLDB'04 Y. Zhao, P. M. Deshpande, and J. F. Naughton. 
An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD’97 K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB’97 D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration, VLDB'03 D. Xin, J. Han, Z. Shao, H. Liu, C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking, ICDE'06 313 Ref. (II) Advanced Applications with Data Cubes ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ D. Burdick, P. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over uncertain and imprecise data. VLDB’05 X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling Cube: A Framework for Statistical OLAP over Sampling Data”, SIGMOD’08 C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. Text Cube: Computing IR measures for multidimensional text database analysis. ICDM’08 D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP operations in spatial data warehouses. SSTD’01 N. Stefanovic, J. Han, and K. Koperski. Object-based selective materialization for efficient implementation of spatial data cubes. IEEE Trans. Knowledge and Data Engineering, 12:938– 958, 2000. T. Wu, D. Xin, Q. Mei, and J. Han. Promotion analysis in multidimensional space. VLDB’09 T. Wu, D. Xin, and J. Han. ARCube: Supporting ranking aggregate queries in partially materialized data cubes. SIGMOD’08 D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k queries with multi-dimensional selections: The ranking cube approach. VLDB’06 J. S. Vitter, M. Wang, and B. R. Iyer. Data cube approximation and histograms via wavelets. CIKM’98 D. Zhang, C. Zhai, and J. Han. Topic cube: Topic modeling for OLAP on multi-dimensional text databases. SDM’09 314 Ref. (III) Knowledge Discovery with Data Cubes ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97 B.-C. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. Prediction cubes. VLDB’05 B.-C. Chen, R. Ramakrishnan, J.W. Shavlik, and P. Tamma. Bellwether analysis: Predicting global aggregates from local regions. VLDB’06 Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-Dimensional Regression Analysis of Time-Series Data Streams, VLDB'02 G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained Gradients in Data Cubes. VLDB’ 01 R. Fagin, R. V. Guha, R. Kumar, J. Novak, D. Sivakumar, and A. Tomkins. Multi-structural databases. PODS’05 J. Han. Towards on-line analytical mining in large databases. SIGMOD Record, 27:97–107, 1998 T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Data Mining & Knowledge Discovery, 6:219–258, 2002. R. Ramakrishnan and B.-C. Chen. Exploratory mining in cube space. Data Mining and Knowledge Discovery, 15:29–54, 2007. K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. EDBT'98 S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98 G. Sathe and S. Sarawagi. Intelligent Rollups in Multidimensional OLAP Data. 
VLDB'01 315 Surplus Slides 316 Chapter 5: Data Cube Technology ◼ ◼ ◼ ◼ Efficient Methods for Data Cube Computation ◼ Preliminary Concepts and General Strategies for Cube Computation ◼ Multiway Array Aggregation for Full Cube Computation ◼ BUC: Computing Iceberg Cubes from the Apex Cuboid Downward ◼ H-Cubing: Exploring an H-Tree Structure ◼ Star-cubing: Computing Iceberg Cubes Using a Dynamic Star-tree Structure ◼ Precomputing Shell Fragments for Fast High-Dimensional OLAP Data Cubes for Advanced Applications ◼ Sampling Cubes: OLAP on Sampling Data ◼ Ranking Cubes: Efficient Computation of Ranking Queries Knowledge Discovery with Data Cubes ◼ Discovery-Driven Exploration of Data Cubes ◼ Complex Aggregation at Multiple Granularity: Multi-feature Cubes ◼ Prediction Cubes: Data Mining in Multi-Dimensional Cube Space Summary 317 H-Cubing: Using H-Tree Structure all ◼ ◼ ◼ ◼ Bottom-up computation Exploring an H-tree structure If the current computation of an H-tree cannot pass min_sup, do not proceed further (pruning) A AB ABC AC ABD B AD ACD C BC D BD CD BCD ABCD No simultaneous aggregation 318 H-tree: A Prefix Hyper-tree Header table Attr. Val. Edu Hhd Bus … Jan Feb … Tor Van Mon … Quant-Info Sum:2285 … … … … … … … … … … … Side-link root bus hhd edu Jan Mar Tor Van Tor Mon Quant-Info Q.I. Q.I. Q.I. Month City Cust_grp Prod Cost Price Jan Tor Edu Printer 500 485 Jan Tor Hhd TV 800 1200 Jan Tor Edu Camera 1160 1280 Feb Mon Bus Laptop 1500 2500 Sum: 1765 Cnt: 2 Mar Van Edu HD 540 520 bins … … … … … … Jan Feb 319 Computing Cells Involving “City” Header Table HTor Attr. Val. Edu Hhd Bus … Jan Feb … Tor Van Mon … Attr. Val. Edu Hhd Bus … Jan Feb … Quant-Info Sum:2285 … … … … … … … … … … … Q.I. … … … … … … … Side-link From (*, *, Tor) to (*, Jan, Tor) root Hhd. Edu. Jan. Side-link Tor. Quant-Info Mar. Jan. Bus. Feb. Van. Tor. Mon. Q.I. Q.I. Q.I. Sum: 1765 Cnt: 2 bins 320 Computing Cells Involving Month But No City 1. Roll up quant-info 2. Compute cells involving month but no city Attr. Val. Edu. Hhd. Bus. … Jan. Feb. Mar. … Tor. Van. Mont. … Quant-Info Sum:2285 … … … … … … … … … … … … Side-link root Jan. Q.I. Tor. Hhd. Edu. Mar. Jan. Q.I. Q.I. Van. Tor. Bus. Feb. Q.I. Mont. Top-k OK mark: if Q.I. in a child passes top-k avg threshold, so does its parents. No binning is needed! 321 Computing Cells Involving Only Cust_grp root Check header table directly Attr. Val. Edu Hhd Bus … Jan Feb Mar … Tor Van Mon … Quant-Info Sum:2285 … … … … … … … … … … … … hhd edu Side-link Tor bus Jan Mar Jan Feb Q.I. Q.I. Q.I. Q.I. Van Tor Mon 322 Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved. 323 Chapter 5: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods ◼ Basic Concepts ◼ Frequent Itemset Mining Methods ◼ Which Patterns Are Interesting?—Pattern Evaluation Methods ◼ Summary 324 What Is Frequent Pattern Analysis? ◼ Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set ◼ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining ◼ ◼ Motivation: Finding inherent regularities in data ◼ What products were often purchased together?— Beer and diapers?! ◼ What are the subsequent purchases after buying a PC? ◼ What kinds of DNA are sensitive to this new drug? ◼ Can we automatically classify web documents? 
Applications ◼ Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis. 325 Why Is Freq. Pattern Mining Important? ◼ ◼ Freq. pattern: An intrinsic and important property of datasets Foundation for many essential data mining tasks ◼ Association, correlation, and causality analysis ◼ Sequential, structural (e.g., sub-graph) patterns ◼ Pattern analysis in spatiotemporal, multimedia, timeseries, and stream data ◼ Classification: discriminative, frequent pattern analysis ◼ Cluster analysis: frequent pattern-based clustering ◼ Data warehousing: iceberg cube and cube-gradient ◼ Semantic data compression: fascicles ◼ Broad applications 326 Basic Concepts: Frequent Patterns Tid Items bought 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk Customer buys both Customer buys diaper ◼ ◼ ◼ ◼ ◼ Customer buys beer itemset: A set of one or more items k-itemset X = {x1, …, xk} (absolute) support, or, support count of X: Frequency or occurrence of an itemset X (relative) support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X) An itemset X is frequent if X’s support is no less than a minsup threshold 327 Basic Concepts: Association Rules Tid Items bought 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper 30 Beer, Diaper, Eggs 40 50 Nuts, Eggs, Milk ◼ Nuts, Coffee, Diaper, Eggs, Milk Customer buys both Customer buys beer Customer buys diaper Find all the rules X → Y with minimum support and confidence ◼ support, s, probability that a transaction contains X Y ◼ confidence, c, conditional probability that a transaction having X also contains Y Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 ◼ Association rules: (many more!) ◼ Beer → Diaper (60%, 100%) ◼ Diaper → Beer (60%, 75%) 328 Closed Patterns and MaxPatterns ◼ ◼ ◼ ◼ ◼ A long pattern contains a combinatorial number of subpatterns, e.g., {a1, …, a100} contains (1001) + (1002) + … + (110000) = 2100 – 1 = 1.27*1030 sub-patterns! Solution: Mine closed patterns and max-patterns instead An itemset X is closed if X is frequent and there exists no super-pattern Y כX, with the same support as X (proposed by Pasquier, et al. @ ICDT’99) An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y כX (proposed by Bayardo @ SIGMOD’98) Closed pattern is a lossless compression of freq. patterns ◼ Reducing the # of patterns and rules 329 Closed Patterns and MaxPatterns ◼ Exercise. DB = {<a1, …, a100>, < a1, …, a50>} ◼ ◼ ◼ What is the set of closed itemset? ◼ <a1, …, a100>: 1 ◼ < a1, …, a50>: 2 What is the set of max-pattern? ◼ ◼ Min_sup = 1. <a1, …, a100>: 1 What is the set of all patterns? ◼ !! 330 Computational Complexity of Frequent Itemset Mining ◼ How many itemsets are potentially to be generated in the worst case? ◼ ◼ ◼ ◼ The number of frequent itemsets to be generated is senstive to the minsup threshold When minsup is low, there exist potentially an exponential number of frequent itemsets The worst case: MN where M: # distinct items, and N: max length of transactions The worst case complexty vs. the expected probability ◼ Ex. Suppose Walmart has 104 kinds of products ◼ The chance to pick up one product 10-4 ◼ The chance to pick up a particular set of 10 products: ~10-40 ◼ What is the chance this particular set of 10 products to be frequent 103 times in 109 transactions? 
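A brute-force sketch of the basic concepts on the five-transaction example above (minsup = minconf = 50%). This is only a minimal illustration of support and confidence; the scalable algorithms follow in the next sections.

```python
from itertools import combinations

transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}
n = len(transactions)

def support_count(itemset):
    """Absolute support: number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions.values())

# All frequent itemsets for minsup = 50% (support count >= 3), by enumeration.
items = sorted(set().union(*transactions.values()))
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        c = support_count(set(cand))
        if c / n >= 0.5:
            frequent[frozenset(cand)] = c

for itemset, c in frequent.items():
    print(sorted(itemset), c)
# Beer:3, Diaper:4, Eggs:3, Nuts:3, and {Beer, Diaper}:3, as on the slide

# Association rules X -> Y derived from the frequent 2-itemset {Beer, Diaper}.
for x, y in [({"Beer"}, {"Diaper"}), ({"Diaper"}, {"Beer"})]:
    sup = support_count(x | y) / n
    conf = support_count(x | y) / support_count(x)
    print(f"{sorted(x)} -> {sorted(y)}: support={sup:.0%}, confidence={conf:.0%}")
# Beer -> Diaper: 60%, 100%;  Diaper -> Beer: 60%, 75%
```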
331 Chapter 5: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods ◼ Basic Concepts ◼ Frequent Itemset Mining Methods ◼ Which Patterns Are Interesting?—Pattern Evaluation Methods ◼ Summary 332 Scalable Frequent Itemset Mining Methods ◼ Apriori: A Candidate Generation-and-Test Approach ◼ Improving the Efficiency of Apriori ◼ FPGrowth: A Frequent Pattern-Growth Approach ◼ ECLAT: Frequent Pattern Mining with Vertical Data Format 333 The Downward Closure Property and Scalable Mining Methods ◼ ◼ The downward closure property of frequent patterns ◼ Any subset of a frequent itemset must be frequent ◼ If {beer, diaper, nuts} is frequent, so is {beer, diaper} ◼ i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper} Scalable mining methods: Three major approaches ◼ Apriori (Agrawal & Srikant@VLDB’94) ◼ Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00) ◼ Vertical data format approach (Charm—Zaki & Hsiao @SDM’02) 334 Apriori: A Candidate Generation & Test Approach ◼ ◼ Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94) Method: ◼ ◼ ◼ ◼ Initially, scan DB once to get frequent 1-itemset Generate length (k+1) candidate itemsets from length k frequent itemsets Test the candidates against DB Terminate when no frequent or candidate set can be generated 335 The Apriori Algorithm—An Example Database TDB Tid Items 10 A, C, D 20 B, C, E 30 A, B, C, E 40 B, E Supmin = 2 Itemset {A, C} {B, C} {B, E} {C, E} sup {A} 2 {B} 3 {C} 3 {D} 1 {E} 3 C1 1st scan C2 L2 Itemset sup 2 2 3 2 Itemset {A, B} {A, C} {A, E} {B, C} {B, E} {C, E} sup 1 2 1 2 3 2 Itemset sup {A} 2 {B} 3 {C} 3 {E} 3 L1 C2 2nd scan Itemset {A, B} {A, C} {A, E} {B, C} {B, E} {C, E} C3 Itemset {B, C, E} 3rd scan L3 Itemset sup {B, C, E} 2 336 The Apriori Algorithm (PseudoCode) Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for (k = 1; Lk !=; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end return k Lk; 337 Implementation of Apriori ◼ ◼ How to generate candidates? ◼ Step 1: self-joining Lk ◼ Step 2: pruning Example of Candidate-generation ◼ ◼ L3={abc, abd, acd, ace, bcd} Self-joining: L3*L3 ◼ ◼ ◼ Pruning: ◼ ◼ abcd from abc and abd acde from acd and ace acde is removed because ade is not in L3 C4 = {abcd} 338 How to Count Supports of Candidates? ◼ Why counting supports of candidates a problem? 
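The Apriori pseudocode above translates almost directly into a runnable sketch. The version below uses a simplified self-join (union any two frequent k-itemsets that differ in exactly one item) followed by the subset-pruning step, and reproduces L1, L2, and L3 for the TDB example with min_sup = 2. Function names are illustrative; support counting is done by a plain scan rather than a hash tree.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    def count(cands):
        counts = {c: 0 for c in cands}
        for t in transactions:                 # one DB scan per level
            for c in cands:
                if c <= t:
                    counts[c] += 1
        return {c: s for c, s in counts.items() if s >= min_sup}

    items = {frozenset([i]) for t in transactions for i in t}
    level = count(items)                       # L1: frequent 1-itemsets
    frequent = dict(level)
    while level:
        prev = set(level)
        # self-join: union pairs of k-itemsets that differ in exactly one item
        cands = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
        # prune: every (k-1)-subset of a candidate must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}
        level = count(cands)
        frequent.update(level)
    return frequent

tdb = [frozenset(t) for t in ({1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5})]
for itemset, sup in sorted(apriori(tdb, 2).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
# [1] 2, [2] 3, [3] 3, [5] 3, [1,3] 2, [2,3] 2, [2,5] 3, [3,5] 2, [2,3,5] 2
```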
◼ ◼ ◼ The total number of candidates can be very huge One transaction may contain many candidates Method: ◼ Candidate itemsets are stored in a hash-tree ◼ Leaf node of hash-tree contains a list of itemsets and counts ◼ ◼ Interior node contains a hash table Subset function: finds all the candidates contained in a transaction 339 Counting Supports of Candidates Using Hash Tree Subset function 3,6,9 1,4,7 Transaction: 1 2 3 5 6 2,5,8 1+2356 234 567 13+56 145 136 12+356 124 457 125 458 345 356 357 689 367 368 159 340 Candidate Generation: An SQL Implementation ◼ SQL Implementation of candidate generation ◼ Suppose the items in Lk-1 are listed in an order ◼ Step 1: self-joining Lk-1 insert into Ck select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1 Step 2: pruning forall itemsets c in Ck do forall (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck Use object-relational extensions like UDFs, BLOBs, and Table functions for efficient implementation [See: S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD’98] ◼ ◼ 341 Scalable Frequent Itemset Mining Methods ◼ Apriori: A Candidate Generation-and-Test Approach ◼ Improving the Efficiency of Apriori ◼ FPGrowth: A Frequent Pattern-Growth Approach ◼ ECLAT: Frequent Pattern Mining with Vertical Data Format ◼ Mining Close Frequent Patterns and Maxpatterns 342 Further Improvement of the Apriori Method ◼ ◼ Major computational challenges ◼ Multiple scans of transaction database ◼ Huge number of candidates ◼ Tedious workload of support counting for candidates Improving Apriori: general ideas ◼ Reduce passes of transaction database scans ◼ Shrink number of candidates ◼ Facilitate support counting of candidates 343 Partition: Scan Database Only Twice ◼ ◼ Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB ◼ Scan 1: partition database and find local frequent patterns ◼ Scan 2: consolidate global frequent patterns A. Savasere, E. Omiecinski and S. Navathe, VLDB’95 DB1 sup1(i) < σDB1 + DB2 sup2(i) < σDB2 + + DBk supk(i) < σDBk = DB sup(i) < σDB DHP: Reduce the Number of Candidates A k-itemset whose corresponding hashing bucket count is below the ◼ Candidates: a, b, c, d, e ◼ Hash entries ◼ {ab, ad, ae} ◼ {bd, be, de} ◼ … count itemsets 35 88 {ab, ad, ae} . . . threshold cannot be frequent 102 {bd, be, de} . . . ◼ {yz, qs, wt} Hash Table ◼ Frequent 1-itemset: a, b, d, e ◼ ab is not a candidate 2-itemset if the sum of count of {ab, ad, ae} is below support threshold ◼ J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD’95 345 Sampling for Frequent Patterns ◼ Select a sample of original database, mine frequent patterns within sample using Apriori ◼ Scan database once to verify frequent itemsets found in sample, only borders of closure of frequent patterns are checked ◼ Example: check abcd instead of ab, ac, …, etc. ◼ Scan database again to find missed frequent patterns ◼ H. Toivonen. Sampling large databases for association rules. In VLDB’96 346 DIC: Reduce Number of Scans ABCD ◼ ABC ABD ACD BCD AB AC BC AD BD ◼ Once both A and D are determined frequent, the counting of AD begins Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins CD Transactions A B C D Apriori {} Itemset lattice S. Brin R. Motwani, J. Ullman, DIC and S. Tsur. 
Dynamic itemset counting and implication rules for market basket data. SIGMOD’97 1-itemsets 2-itemsets … 1-itemsets 2-items 3-items 347 Scalable Frequent Itemset Mining Methods ◼ Apriori: A Candidate Generation-and-Test Approach ◼ Improving the Efficiency of Apriori ◼ FPGrowth: A Frequent Pattern-Growth Approach ◼ ECLAT: Frequent Pattern Mining with Vertical Data Format ◼ Mining Close Frequent Patterns and Maxpatterns 348 Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation ◼ Bottlenecks of the Apriori approach ◼ Breadth-first (i.e., level-wise) search ◼ Candidate generation and test ◼ ◼ ◼ Often generates a huge number of candidates The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00) ◼ Depth-first search ◼ Avoid explicit candidate generation Major philosophy: Grow long patterns from short ones using local frequent items only ◼ “abc” is a frequent pattern ◼ Get all transactions having “abc”, i.e., project DB on abc: DB|abc ◼ “d” is a local frequent item in DB|abc → abcd is a frequent pattern 349 Construct FP-tree from a Transaction Database TID 100 200 300 400 500 Items bought (ordered) frequent items {f, a, c, d, g, i, m, p} {f, c, a, m, p} {a, b, c, f, l, m, o} {f, c, a, b, m} min_support = 3 {b, f, h, j, o, w} {f, b} {b, c, k, s, p} {c, b, p} {a, f, c, e, l, p, m, n} {f, c, a, m, p} {} Header Table 1. Scan DB once, find f:4 c:1 Item frequency head frequent 1-itemset (single f 4 item pattern) c 4 c:3 b:1 b:1 2. Sort frequent items in a 3 b 3 frequency descending a:3 p:1 m 3 order, f-list p 3 m:2 b:1 3. Scan DB again, construct FP-tree p:2 m:1 F-list = f-c-a-b-m-p 350 Partition Patterns and Databases ◼ ◼ Frequent patterns can be partitioned into subsets according to f-list ◼ F-list = f-c-a-b-m-p ◼ Patterns containing p ◼ Patterns having m but no p ◼ … ◼ Patterns having c but no a nor b, m, p ◼ Pattern f Completeness and non-redundency 351 Find Patterns Having P From P-conditional Database ◼ ◼ ◼ Starting at the frequent item header table in the FP-tree Traverse the FP-tree by following the link of each frequent item p Accumulate all of transformed prefix paths of item p to form p’s conditional pattern base {} Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3 f:4 c:3 c:1 b:1 a:3 Conditional pattern bases item cond. pattern base b:1 c f:3 p:1 a fc:3 b fca:1, f:1, c:1 m:2 b:1 m fca:2, fcab:1 p:2 m:1 p fcam:2, cb:1 352 From Conditional Pattern-bases to Conditional FP-trees ◼ For each pattern-base ◼ Accumulate the count for each item in the base ◼ Construct the FP-tree for the frequent items of the pattern base Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3 {} f:4 c:3 c:1 b:1 a:3 b:1 p:1 m:2 b:1 p:2 m:1 m-conditional pattern base: fca:2, fcab:1 All frequent patterns relate to m {} m, f:3 fm, cm, am, fcm, fam, cam, c:3 fcam a:3 m-conditional FP-tree 353 Recursion: Mining Each Conditional FPtree {} {} Cond. pattern base of “am”: (fc:3) c:3 f:3 c:3 a:3 f:3 am-conditional FP-tree Cond. pattern base of “cm”: (f:3) {} f:3 m-conditional FP-tree cm-conditional FP-tree {} Cond. 
pattern base of “cam”: (f:3) f:3 cam-conditional FP-tree 354 A Special Case: Single Prefix Path in FPtree ◼ ◼ {} a1:n1 a2:n2 Suppose a (conditional) FP-tree T has a shared single prefix-path P Mining can be decomposed into two parts ◼ ◼ Reduction of the single prefix path into one node Concatenation of the mining results of the two parts a3:n3 b1:m1 C2:k2 r1 {} C1:k1 C3:k3 r1 = a1:n1 a2:n2 a3:n3 + b1:m1 C2:k2 C1:k1 C3:k3 355 Benefits of the FP-tree Structure ◼ Completeness ◼ ◼ ◼ Preserve complete information for frequent pattern mining Never break a long pattern of any transaction Compactness ◼ ◼ ◼ Reduce irrelevant info—infrequent items are gone Items in frequency descending order: the more frequently occurring, the more likely to be shared Never be larger than the original database (not count node-links and the count field) 356 The Frequent Pattern Growth Mining Method ◼ ◼ Idea: Frequent pattern growth ◼ Recursively grow frequent patterns by pattern and database partition Method ◼ For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree ◼ Repeat the process on each newly created conditional FP-tree ◼ Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern 357 Scaling FP-growth by Database Projection ◼ What about if FP-tree cannot fit in memory? ◼ DB projection ◼ First partition a database into a set of projected DBs ◼ Then construct and mine FP-tree for each projected DB ◼ Parallel projection vs. partition projection techniques ◼ ◼ Parallel projection ◼ Project the DB in parallel for each frequent item ◼ Parallel projection is space costly ◼ All the partitions can be processed in parallel Partition projection ◼ Partition the DB based on the ordered frequent items ◼ Passing the unprocessed parts to the subsequent partitions 358 Partition-Based Projection ◼ ◼ Parallel projection needs a lot of disk space Partition projection saves it p-proj DB fcam cb fcam m-proj DB fcab fca fca am-proj DB fc fc fc Tran. DB fcamp fcabm fb cbp fcamp b-proj DB f cb … a-proj DB fc … cm-proj DB f f f c-proj DB f … f-proj DB … … 359 Performance of FPGrowth in Large Datasets 100 140 120 D1 Apriori runtime 80 Runtime (sec.) 70 Run time(sec.) D2 FP-growth D1 FP-grow th runtime 90 60 Data set T25I20D10K 50 40 30 20 D2 TreeProjection 100 80 Data set T25I20D100K 60 40 20 10 0 0 0 0.5 1 1.5 2 Support threshold(%) 2.5 FP-Growth vs. Apriori 3 0 0.5 1 1.5 2 Support threshold (%) FP-Growth vs. Tree-Projection 360 Advantages of the Pattern Growth Approach ◼ Divide-and-conquer: ◼ ◼ ◼ Lead to focused search of smaller databases Other factors ◼ No candidate generation, no candidate test ◼ Compressed database: FP-tree structure ◼ No repeated scan of entire database ◼ ◼ Decompose both the mining task and DB according to the frequent patterns obtained so far Basic ops: counting local freq items and building sub FP-tree, no pattern search and matching A good open-source implementation and refinement of FPGrowth ◼ FPGrowth+ (Grahne and J. Zhu, FIMI'03) 361 Further Improvements of Mining Methods ◼ AFOPT (Liu, et al. @ KDD’03) ◼ A “push-right” method for mining condensed frequent pattern (CFP) tree ◼ ◼ Carpenter (Pan, et al. @ KDD’03) ◼ Mine data sets with small rows but numerous columns ◼ Construct a row-enumeration tree for efficient mining FPgrowth+ (Grahne and Zhu, FIMI’03) ◼ Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. ICDM'03 Int. 
Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, FL, Nov. 2003 ◼ TD-Close (Liu, et al, SDM’06) 362 Extension of Pattern Growth Mining Methodology ◼ ◼ ◼ ◼ ◼ ◼ ◼ Mining closed frequent itemsets and max-patterns ◼ CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, Fimi’03) Mining sequential patterns ◼ PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04) Mining graph patterns ◼ gSpan (ICDM’02), CloseGraph (KDD’03) Constraint-based mining of frequent patterns ◼ Convertible constraints (ICDE’01), gPrune (PAKDD’03) Computing iceberg data cubes with complex measures ◼ H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03) Pattern-growth-based Clustering ◼ MaPle (Pei, et al., ICDM’03) Pattern-Growth-Based Classification ◼ Mining frequent and discriminative patterns (Cheng, et al, ICDE’07) 363 Scalable Frequent Itemset Mining Methods ◼ Apriori: A Candidate Generation-and-Test Approach ◼ Improving the Efficiency of Apriori ◼ FPGrowth: A Frequent Pattern-Growth Approach ◼ ECLAT: Frequent Pattern Mining with Vertical Data Format ◼ Mining Close Frequent Patterns and Maxpatterns 364 ECLAT: Mining by Exploring Vertical Data Format ◼ Vertical format: t(AB) = {T11, T25, …} ◼ ◼ ◼ ◼ ◼ tid-list: list of trans.-ids containing an itemset Deriving frequent patterns based on vertical intersections ◼ t(X) = t(Y): X and Y always happen together ◼ t(X) t(Y): transaction having X always has Y Using diffset to accelerate mining ◼ Only keep track of differences of tids ◼ t(X) = {T1, T2, T3}, t(XY) = {T1, T3} ◼ Diffset (XY, X) = {T2} Eclat (Zaki et al. @KDD’97) Mining Closed patterns using vertical format: CHARM (Zaki & Hsiao@SDM’02) 365 Scalable Frequent Itemset Mining Methods ◼ Apriori: A Candidate Generation-and-Test Approach ◼ Improving the Efficiency of Apriori ◼ FPGrowth: A Frequent Pattern-Growth Approach ◼ ECLAT: Frequent Pattern Mining with Vertical Data Format ◼ Mining Close Frequent Patterns and Maxpatterns 366 Mining Frequent Closed Patterns: CLOSET ◼ Flist: list of all frequent items in support ascending order ◼ ◼ ◼ Divide search space ◼ Patterns having d ◼ Patterns having d but no a, etc. Find frequent closed pattern recursively ◼ ◼ Flist: d-a-f-e-c Min_sup=2 TID 10 20 30 40 50 Items a, c, d, e, f a, b, e c, e, f a, c, d, f c, e, f Every transaction having d also has cfa → cfad is a frequent closed pattern J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00. CLOSET+: Mining Closed Itemsets by PatternGrowth ◼ ◼ ◼ ◼ ◼ Itemset merging: if Y appears in every occurrence of X, then Y is merged with X Sub-itemset pruning: if Y כX, and sup(X) = sup(Y), X and all of X’s descendants in the set enumeration tree can be pruned Hybrid tree projection ◼ Bottom-up physical tree-projection ◼ Top-down pseudo tree-projection Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header table at higher levels Efficient subset checking MaxMiner: Mining Max-Patterns ◼ 1st scan: find frequent items ◼ ◼ ◼ ◼ A, B, C, D, E 2nd scan: find support for ◼ AB, AC, AD, AE, ABCDE ◼ BC, BD, BE, BCDE ◼ CD, CE, CDE, DE Tid Items 10 A, B, C, D, E 20 B, C, D, E, 30 A, C, D, F Potential max-patterns Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in later scan R. Bayardo. Efficiently mining long patterns from databases. 
SIGMOD’98 CHARM: Mining by Exploring Vertical Data Format ◼ Vertical format: t(AB) = {T11, T25, …} ◼ ◼ ◼ ◼ tid-list: list of trans.-ids containing an itemset Deriving closed patterns based on vertical intersections ◼ t(X) = t(Y): X and Y always happen together ◼ t(X) t(Y): transaction having X always has Y Using diffset to accelerate mining ◼ Only keep track of differences of tids ◼ t(X) = {T1, T2, T3}, t(XY) = {T1, T3} ◼ Diffset (XY, X) = {T2} Eclat/MaxEclat (Zaki et al. @KDD’97), VIPER(P. Shenoy et al.@SIGMOD’00), CHARM (Zaki & Hsiao@SDM’02) Visualization of Association Rules: Plane Graph 371 Visualization of Association Rules: Rule Graph 372 Visualization of Association Rules (SGI/MineSet 3.0) 373 Chapter 5: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods ◼ Basic Concepts ◼ Frequent Itemset Mining Methods ◼ Which Patterns Are Interesting?—Pattern Evaluation Methods ◼ Summary 374 Interestingness Measure: Correlations (Lift) ◼ play basketball eat cereal [40%, 66.7%] is misleading ◼ ◼ The overall % of students eating cereal is 75% > 66.7%. play basketball not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence ◼ Measure of dependent/correlated events: lift P ( A B ) lift = P ( A) P ( B ) lift ( B, C ) = 2000 / 5000 = 0.89 3000 / 5000 * 3750 / 5000 lift ( B, C ) = Basketball Not basketball Sum (row) Cereal 2000 1750 3750 Not cereal 1000 250 1250 Sum(col.) 3000 2000 5000 1000 / 5000 = 1.33 3000 / 5000 *1250 / 5000 375 Are lift and 2 Good Measures of Correlation? ◼ “Buy walnuts buy milk [1%, 80%]” is misleading if 85% of customers buy milk ◼ Support and confidence are not good to indicate correlations ◼ Over 20 interestingness measures have been proposed (see Tan, Kumar, Sritastava @KDD’02) ◼ Which are good ones? 376 Null-Invariant Measures 377 Comparison of Interestingness Measures ◼ ◼ ◼ Null-(transaction) invariance is crucial for correlation analysis Lift and 2 are not null-invariant 5 null-invariant measures Milk No Milk Sum (row) Coffee m, c ~m, c c No Coffee m, ~c ~m, ~c ~c Sum(col.) m ~m Null-transactions w.r.t. m and c June 7, 2020 Kulczynski measure (1927) Data Mining: Concepts and Techniques Null-invariant Subtle: They disagree378 Analysis of DBLP Coauthor Relationships Recent DB conferences, removing balanced associations, low sup, etc. Advisor-advisee relation: Kulc: high, coherence: low, cosine: middle ◼ Tianyi Wu, Yuguo Chen and Jiawei Han, “Association Mining in Large Databases: A Re-Examination of Its Measures”, Proc. 2007 Int. Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD'07), Sept. 2007 379 Which Null-Invariant Measure Is Better? ◼ ◼ IR (Imbalance Ratio): measure the imbalance of two itemsets A and B in rule implications Kulczynski and Imbalance Ratio (IR) together present a clear picture for all the three datasets D4 through D6 ◼ D4 is balanced & neutral ◼ D5 is imbalanced & neutral ◼ D6 is very imbalanced & neutral Chapter 5: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods ◼ Basic Concepts ◼ Frequent Itemset Mining Methods ◼ Which Patterns Are Interesting?—Pattern Evaluation Methods ◼ Summary 381 Summary ◼ ◼ Basic concepts: association rules, supportconfident framework, closed and max-patterns Scalable frequent pattern mining methods ◼ Apriori (Candidate generation & test) ◼ Projection-based (FPgrowth, CLOSET+, ...) ◼ Vertical format approach (ECLAT, CHARM, ...) ▪ Which patterns are interesting? 
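As a worked footnote to the pattern-evaluation discussion, the sketch below recomputes lift and the Kulczynski measure for the basketball/cereal contingency table and then adds a large batch of hypothetical null transactions (containing neither item). Lift changes substantially while Kulc does not, which is exactly the null-invariance property argued above; the 10,000 extra transactions are an illustrative what-if, not data from the slides.

```python
def measures(both, only_a, only_b, neither):
    """Lift and Kulczynski for a 2x2 contingency table of items A and B."""
    n = both + only_a + only_b + neither
    sup_a, sup_b, sup_ab = (both + only_a) / n, (both + only_b) / n, both / n
    lift = sup_ab / (sup_a * sup_b)
    kulc = 0.5 * (sup_ab / sup_a + sup_ab / sup_b)   # (P(B|A) + P(A|B)) / 2
    return lift, kulc

# A = plays basketball, B = eats cereal (table from the slide, n = 5,000)
base = dict(both=2000, only_a=1000, only_b=1750, neither=250)
print(measures(**base))                               # lift ~ 0.89, kulc = 0.60
print(measures(**{**base, "neither": 250 + 10000}))   # lift ~ 2.67, kulc = 0.60
```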
▪ Pattern evaluation methods 382 Ref: Basic Concepts of Frequent Pattern Mining ◼ (Association Rules) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93 ◼ (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98 ◼ (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99 ◼ (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95 383 Ref: Apriori and Its Improvements ◼ R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94 ◼ H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94 ◼ A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95 ◼ J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95 ◼ H. Toivonen. Sampling large databases for association rules. VLDB'96 ◼ S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97 ◼ S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98 384 Ref: Depth-First, Projection-Based FP Mining ◼ R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Distributed Computing, 2002. ◼ G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. FIMI'03 ◼ B. Goethals and M. Zaki. An introduction to workshop on frequent itemset mining implementations. Proc. ICDM’03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI’03), Melbourne, FL, Nov. 2003 ◼ J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD’ 00 ◼ J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. KDD'02 ◼ J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. ICDM'02 ◼ J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. KDD'03 385 Ref: Vertical Format and Row Enumeration Methods ◼ M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. DAMI:97. ◼ M. J. Zaki and C. J. Hsiao. CHARM: An Efficient Algorithm for Closed Itemset Mining, SDM'02. ◼ C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. KDD’02. ◼ F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki , CARPENTER: Finding Closed Patterns in Long Biological Datasets. KDD'03. ◼ H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach, SDM'06. 386 Ref: Mining Correlations and Interesting Rules ◼ S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97. ◼ M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94. ◼ R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of Interest. Kluwer Academic, 2001. ◼ C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98. ◼ P.-N. Tan, V. Kumar, and J. 
Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02. ◼ E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03. ◼ T. Wu, Y. Chen, and J. Han, “Re-Examination of Interestingness Measures in Pattern Mining: A Unified Framework", Data Mining and Knowledge Discovery, 21(3):371397, 2010 387 Data Mining: Concepts and Techniques (3rd ed.) — Chapter 7 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2010 Han, Kamber & Pei. All rights reserved. 388 June 7, 2020 Data Mining: Concepts and Techniques 389 Chapter 7 : Advanced Frequent Pattern Mining ◼ Pattern Mining: A Road Map ◼ Pattern Mining in Multi-Level, Multi-Dimensional Space ◼ Constraint-Based Frequent Pattern Mining ◼ Mining High-Dimensional Data and Colossal Patterns ◼ Mining Compressed or Approximate Patterns ◼ Pattern Exploration and Application ◼ Summary 390 Research on Pattern Mining: A Road Map 391 Chapter 7 : Advanced Frequent Pattern Mining ◼ Pattern Mining: A Road Map ◼ Pattern Mining in Multi-Level, Multi-Dimensional Space Mining Multi-Level Association ◼ Mining Multi-Dimensional Association ◼ Mining Quantitative Association Rules ◼ Mining Rare Patterns and Negative Patterns ◼ Constraint-Based Frequent Pattern Mining ◼ ◼ Mining High-Dimensional Data and Colossal Patterns ◼ Mining Compressed or Approximate Patterns ◼ Pattern Exploration and Application ◼ Summary 392 Mining Multiple-Level Association Rules ◼ ◼ ◼ Items often form hierarchies Flexible support settings ◼ Items at the lower level are expected to have lower support Exploration of shared multi-level mining (Agrawal & Srikant@VLB’95, Han & Fu@VLDB’95) uniform support Level 1 min_sup = 5% Level 2 min_sup = 5% reduced support Milk [support = 10%] 2% Milk [support = 6%] Skim Milk [support = 4%] Level 1 min_sup = 5% Level 2 min_sup = 3% 393 Multi-level Association: Flexible Support and Redundancy filtering ◼ Flexible min-support thresholds: Some items are more valuable but less frequent ◼ ◼ Use non-uniform, group-based min-support ◼ E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; … Redundancy Filtering: Some rules may be redundant due to “ancestor” relationships between items ◼ milk wheat bread [support = 8%, confidence = 70%] ◼ 2% milk wheat bread [support = 2%, confidence = 72%] The first rule is an ancestor of the second rule ◼ A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor 394 Chapter 7 : Advanced Frequent Pattern Mining ◼ Pattern Mining: A Road Map ◼ Pattern Mining in Multi-Level, Multi-Dimensional Space Mining Multi-Level Association ◼ Mining Multi-Dimensional Association ◼ Mining Quantitative Association Rules ◼ Mining Rare Patterns and Negative Patterns ◼ Constraint-Based Frequent Pattern Mining ◼ ◼ Mining High-Dimensional Data and Colossal Patterns ◼ Mining Compressed or Approximate Patterns ◼ Pattern Exploration and Application ◼ Summary 395 Mining Multi-Dimensional Association ◼ Single-dimensional rules: buys(X, “milk”) buys(X, “bread”) ◼ Multi-dimensional rules: 2 dimensions or predicates ◼ Inter-dimension assoc. rules (no repeated predicates) age(X,”19-25”) occupation(X,“student”) buys(X, “coke”) ◼ hybrid-dimension assoc. 
rules (repeated predicates) age(X,”19-25”) buys(X, “popcorn”) buys(X, “coke”) ◼ ◼ Categorical Attributes: finite number of possible values, no ordering among values—data cube approach Quantitative Attributes: Numeric, implicit ordering among values—discretization, clustering, and gradient approaches 396 Chapter 7 : Advanced Frequent Pattern Mining ◼ Pattern Mining: A Road Map ◼ Pattern Mining in Multi-Level, Multi-Dimensional Space Mining Multi-Level Association ◼ Mining Multi-Dimensional Association ◼ Mining Quantitative Association Rules ◼ Mining Rare Patterns and Negative Patterns ◼ Constraint-Based Frequent Pattern Mining ◼ ◼ Mining High-Dimensional Data and Colossal Patterns ◼ Mining Compressed or Approximate Patterns ◼ Pattern Exploration and Application ◼ Summary 397 Mining Quantitative Associations Techniques can be categorized by how numerical attributes, such as age or salary are treated 1. Static discretization based on predefined concept hierarchies (data cube methods) 2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant@SIGMOD96) 3. Clustering: Distance-based association (e.g., Yang & Miller@SIGMOD97) ◼ One dimensional clustering then association 4. Deviation: (such as Aumann and Lindell@KDD99) Sex = female => Wage: mean=$7/hr (overall mean = $9) 398 Static Discretization of Quantitative Attributes ◼ Discretized prior to mining using concept hierarchy. ◼ Numeric values are replaced by ranges ◼ In relational database, finding all frequent k-predicate sets will require k or k+1 table scans ◼ Data cube is well suited for mining ◼ The cells of an n-dimensional cuboid correspond to the (age) () (income) (buys) predicate sets ◼ Mining from data cubes can be much faster (age, income) (age,buys) (income,buys) (age,income,buys) 399 Quantitative Association Rules Based on Statistical Inference Theory [Aumann and Lindell@DMKD’03] ◼ Finding extraordinary and therefore interesting phenomena, e.g., (Sex = female) => Wage: mean=$7/hr (overall mean = $9) ◼ ◼ ◼ LHS: a subset of the population ◼ RHS: an extraordinary behavior of this subset The rule is accepted only if a statistical test (e.g., Z-test) confirms the inference with high confidence Subrule: highlights the extraordinary behavior of a subset of the pop. 
of the super rule ◼ ◼ ◼ E.g., (Sex = female) ^ (South = yes) => mean wage = $6.3/hr Two forms of rules ◼ Categorical => quantitative rules, or Quantitative => quantitative rules ◼ E.g., Education in [14-18] (yrs) => mean wage = $11.64/hr Open problem: Efficient methods for LHS containing two or more quantitative attributes 400 Chapter 7 : Advanced Frequent Pattern Mining ◼ Pattern Mining: A Road Map ◼ Pattern Mining in Multi-Level, Multi-Dimensional Space Mining Multi-Level Association ◼ Mining Multi-Dimensional Association ◼ Mining Quantitative Association Rules ◼ Mining Rare Patterns and Negative Patterns ◼ Constraint-Based Frequent Pattern Mining ◼ ◼ Mining High-Dimensional Data and Colossal Patterns ◼ Mining Compressed or Approximate Patterns ◼ Pattern Exploration and Application ◼ Summary 401 Negative and Rare Patterns ◼ Rare patterns: Very low support but interesting ◼ ◼ ◼ Mining: Setting individual-based or special group-based support threshold for valuable items Negative patterns ◼ ◼ E.g., buying Rolex watches Since it is unlikely that one buys Ford Expedition (an SUV car) and Toyota Prius (a hybrid car) together, Ford Expedition and Toyota Prius are likely negatively correlated patterns Negatively correlated patterns that are infrequent tend to be more interesting than those that are frequent 402 Defining Negative Correlated Patterns (I) ◼ Definition 1 (support-based) ◼ If itemsets X and Y are both frequent but rarely occur together, i.e., sup(X U Y) < sup (X) * sup(Y) ◼ ◼ Then X and Y are negatively correlated Problem: A store sold two needle 100 packages A and B, only one transaction containing both A and B. ◼ When there are in total 200 transactions, we have s(A U B) = 0.005, s(A) * s(B) = 0.25, s(A U B) < s(A) * s(B) ◼ When there are 105 transactions, we have s(A U B) = 1/105, s(A) * s(B) = 1/103 * 1/103, s(A U B) > s(A) * s(B) ◼ Where is the problem? —Null transactions, i.e., the support-based definition is not null-invariant! 403 Defining Negative Correlated Patterns (II) ◼ Definition 2 (negative itemset-based) ◼ ◼ ◼ ◼ ◼ X is a negative itemset if (1) X = Ā U B, where B is a set of positive items, and Ā is a set of negative items, |Ā|≥ 1, and (2) s(X) ≥ μ Itemsets X is negatively correlated, if This definition suffers a similar null-invariant problem Definition 3 (Kulzynski measure-based) If itemsets X and Y are frequent, but (P(X|Y) + P(Y|X))/2 < є, where є is a negative pattern threshold, then X and Y are negatively correlated. Ex. For the same needle package problem, when no matter there are 200 or 105 transactions, if є = 0.01, we have (P(A|B) + P(B|A))/2 = (0.01 + 0.01)/2 < є 404 Chapter 7 : Advanced Frequent Pattern Mining ◼ Pattern Mining: A Road Map ◼ Pattern Mining in Multi-Level, Multi-Dimensional Space ◼ Constraint-Based Frequent Pattern Mining ◼ Mining High-Dimensional Data and Colossal Patterns ◼ Mining Compressed or Approximate Patterns ◼ Pattern Exploration and Application ◼ Summary 405 Constraint-based (Query-Directed) Mining ◼ Finding all the patterns in a database autonomously? — unrealistic! ◼ ◼ Data mining should be an interactive process ◼ ◼ The patterns could be too many but not focused! 
User directs what to be mined using a data mining query language (or a graphical user interface) Constraint-based mining ◼ ◼ ◼ User flexibility: provides constraints on what to be mined Optimization: explores such constraints for efficient mining — constraint-based mining: constraint-pushing, similar to push selection first in DB query processing Note: still find all the answers satisfying constraints, not finding some answers in “heuristic search” 406 Constraints in Data Mining ◼ ◼ ◼ ◼ ◼ Knowledge type constraint: ◼ classification, association, etc. Data constraint — using SQL-like queries ◼ find product pairs sold together in stores in Chicago this year Dimension/level constraint ◼ in relevance to region, price, brand, customer category Rule (or pattern) constraint ◼ small sales (price < $10) triggers big sales (sum > $200) Interestingness constraint ◼ strong rules: min_support 3%, min_confidence 60% 407 Meta-Rule Guided Mining ◼ Meta-rule can be in the rule form with partially instantiated predicates and constants P1(X, Y) ^ P2(X, W) => buys(X, “iPad”) ◼ The resulting rule derived can be age(X, “15-25”) ^ profession(X, “student”) => buys(X, “iPad”) ◼ In general, it can be in the form of P1 ^ P2 ^ … ^ Pl => Q1 ^ Q2 ^ … ^ Qr ◼ Method to find meta-rules ◼ ◼ ◼ Find frequent (l+r) predicates (based on min-support threshold) Push constants deeply when possible into the mining process (see the remaining discussions on constraint-push techniques) Use confidence, correlation, and other measures when possible 408 Constraint-Based Frequent Pattern Mining ◼ Pattern space pruning constraints ◼ ◼ ◼ ◼ ◼ Anti-monotonic: If constraint c is violated, its further mining can be terminated Monotonic: If c is satisfied, no need to check c again Succinct: c must be satisfied, so one can start with the data sets satisfying c Convertible: c is not monotonic nor anti-monotonic, but it can be converted into it if items in the transaction can be properly ordered Data space pruning constraint ◼ ◼ Data succinct: Data space can be pruned at the initial pattern mining process Data anti-monotonic: If a transaction t does not satisfy c, t can be pruned from its further mining 409 Pattern Space Pruning with Anti-Monotonicity Constraints ◼ ◼ A constraint C is anti-monotone if the super pattern satisfies C, all of its sub-patterns do so too In other words, anti-monotonicity: If an itemset S violates the constraint, so does any of its superset ◼ Ex. 1. sum(S.price) v is anti-monotone ◼ Ex. 2. range(S.profit) 15 is anti-monotone ◼ ◼ ◼ Itemset ab violates C ◼ So does every superset of ab Ex. 3. sum(S.Price) v is not anti-monotone Ex. 4. support count is anti-monotone: core property used in Apriori TDB (min_sup=2) TID Transaction 10 a, b, c, d, f 20 30 40 b, c, d, f, g, h a, c, d, e, f c, e, f, g Item Profit a 40 b 0 c -20 d 10 e -30 f 30 g 20 h -10 410 Pattern Space Pruning with Monotonicity Constraints TDB (min_sup=2) ◼ ◼ A constraint C is monotone if the pattern satisfies C, we do not need to check C in subsequent mining Alternatively, monotonicity: If an itemset S satisfies the constraint, so does any of its superset TID Transaction 10 a, b, c, d, f 20 b, c, d, f, g, h 30 a, c, d, e, f 40 c, e, f, g Item Profit ◼ Ex. 1. sum(S.Price) v is monotone a 40 ◼ Ex. 2. min(S.Price) v is monotone b 0 ◼ Ex. 3. 
C: range(S.profit) 15 c -20 d 10 e -30 f 30 g 20 h -10 ◼ Itemset ab satisfies C ◼ So does every superset of ab 411 Data Space Pruning with Data Antimonotonicity TDB (min_sup=2) ◼ ◼ A constraint c is data anti-monotone if for a pattern p cannot satisfy a transaction t under c, p’s superset cannot satisfy t under c either TID Transaction 10 a, b, c, d, f, h 20 b, c, d, f, g, h The key for data anti-monotone is recursive data 30 b, c, d, f, g reduction 40 c, e, f, g ◼ Ex. 1. sum(S.Price) v is data anti-monotone Item Profit ◼ Ex. 2. min(S.Price) v is data anti-monotone a 40 Ex. 3. C: range(S.profit) 25 is data antimonotone b 0 c -20 d -15 e -30 f -10 g 20 h -5 ◼ ◼ Itemset {b, c}’s projected DB: ◼ ◼ T10’: {d, f, h}, T20’: {d, f, g, h}, T30’: {d, f, g} since C cannot satisfy T10’, T10’ can be pruned 412 Pattern Space Pruning with Succinctness ◼ Succinctness: ◼ ◼ ◼ Given A1, the set of items satisfying a succinctness constraint C, then any set S satisfying C is based on A1 , i.e., S contains a subset belonging to A1 Idea: Without looking at the transaction database, whether an itemset S satisfies constraint C can be determined based on the selection of items ◼ min(S.Price) v is succinct ◼ sum(S.Price) v is not succinct Optimization: If C is succinct, C is pre-counting pushable 413 Naïve Algorithm: Apriori + Constraint Database D TID 100 200 300 400 itemset sup. C1 {1} 2 {2} 3 Scan D {3} 3 {4} 1 {5} 3 Items 134 235 1235 25 C2 itemset sup L2 itemset sup 2 2 3 2 {1 {1 {1 {2 {2 {3 C3 itemset {2 3 5} Scan D {1 3} {2 3} {2 5} {3 5} 2} 3} 5} 3} 5} 5} 1 2 1 2 3 2 L1 itemset sup. {1} {2} {3} {5} 2 3 3 3 C2 itemset {1 2} Scan D L3 itemset sup {2 3 5} 2 {1 {1 {2 {2 {3 3} 5} 3} 5} 5} Constraint: Sum{S.price} < 5 414 Constrained Apriori : Push a Succinct Constraint Deep Database D TID 100 200 300 400 itemset sup. C1 {1} 2 {2} 3 Scan D {3} 3 {4} 1 {5} 3 Items 134 235 1235 25 C2 itemset sup L2 itemset sup 2 2 3 2 {1 {1 {1 {2 {2 {3 C3 itemset {2 3 5} Scan D {1 3} {2 3} {2 5} {3 5} 2} 3} 5} 3} 5} 5} 1 2 1 2 3 2 L1 itemset sup. {1} {2} {3} {5} 2 3 3 3 C2 itemset {1 2} Scan D L3 itemset sup {2 3 5} 2 {1 {1 {2 {2 {3 3} 5} 3} 5} 5} not immediately to be used Constraint: min{S.price } <= 1 415 Constrained FP-Growth: Push a Succinct Constraint Deep TID 100 200 300 400 Items 134 235 1235 25 Remove infrequent length 1 TID 100 200 300 400 Items 13 235 1235 25 FP-Tree 1-Projected DB TID Items 100 3 4 300 2 3 5 No Need to project on 2, 3, or 5 Constraint: min{S.price } <= 1 416 Data Anti-monotonic Constraint Deep Remove from data TID 100 200 300 400 Items 134 235 1235 25 TID Items 100 1 3 300 1 3 FP-Tree Single branch, we are done Constraint: min{S.price } <= 1 417 Constrained FP-Growth: Push a Data Anti-monotonic Constraint Deep TID Transaction 10 a, b, c, d, f, h 20 b, c, d, f, g, h 30 b, c, d, f, g 40 a, c, e, f, g B-Projected DB TID Transaction 10 20 30 a, c, d, f, h c, d, f, g, h c, d, f, g Single branch: bcdfg: 2 FP-Tree Recursive Data Pruning B FP-Tree TID Transaction 10 a, b, c, d, f, h 20 b, c, d, f, g, h 30 b, c, d, f, g 40 a, c, e, f, g Item Profit a 40 b 0 c -20 d -15 e -30 f -10 g 20 h -5 Constraint: range{S.price } > 25 min_sup >= 2 418 Convertible Constraints: Ordering Data in Transactions TDB (min_sup=2) ◼ ◼ Convert tough constraints into antimonotone or monotone by properly ordering items Examine C: avg(S.profit) 25 ◼ Order items in value-descending order ◼ ◼ <a, f, g, d, b, h, c, e> If an itemset afb violates C ◼ So does afbh, afb* ◼ It becomes anti-monotone! 
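A minimal sketch of pushing this convertible constraint into prefix growth follows, assuming the profit table above and ignoring support counting for brevity. Because items are extended in value-descending order R, any extension appends a value no larger than the prefix's minimum, so a violating prefix can never recover and its whole branch is pruned.

```python
profit = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}
R = sorted(profit, key=profit.get, reverse=True)   # ['a', 'f', 'g', 'd', 'b', 'h', 'c', 'e']

def grow(prefix, start, min_avg=25):
    """Enumerate only the prefixes (w.r.t. R) whose average profit satisfies C."""
    for i in range(start, len(R)):
        cand = prefix + [R[i]]
        if sum(profit[x] for x in cand) / len(cand) < min_avg:
            continue            # convertible anti-monotone: prune the whole branch
        yield cand
        yield from grow(cand, i + 1, min_avg)

print(["".join(p) for p in grow([], 0)])
# ['a', 'af', 'afg', 'afgd', 'afd', 'ag', 'ad', 'f', 'fg']; a branch such as afb is
# never extended, since avg(afb) = 23.3 < 25 already rules out all its extensions
```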
TID Transaction 10 a, b, c, d, f 20 b, c, d, f, g, h 30 a, c, d, e, f 40 c, e, f, g Item Profit a b c d e f g h 40 0 -20 10 -30 30 20 -10 419 Strongly Convertible Constraints ◼ avg(X) 25 is convertible anti-monotone w.r.t. item value descending order R: <a, f, g, d, b, h, c, e> ◼ If an itemset af violates a constraint C, so does every itemset with af as prefix, such as afd ◼ ◼ avg(X) 25 is convertible monotone w.r.t. item value ascending order R-1: <e, c, h, b, d, g, f, a> ◼ If an itemset d satisfies a constraint C, so does itemsets df and dfa, which having d as a prefix Thus, avg(X) 25 is strongly convertible Item Profit a 40 b 0 c -20 d 10 e -30 f 30 g 20 h -10 420 Can Apriori Handle Convertible Constraints? ◼ A convertible, not monotone nor anti-monotone nor succinct constraint cannot be pushed deep into the an Apriori mining algorithm ◼ ◼ ◼ ◼ Within the level wise framework, no direct pruning based on the constraint can be made Itemset df violates constraint C: avg(X) >= 25 Since adf satisfies C, Apriori needs df to assemble adf, df cannot be pruned But it can be pushed into frequent-pattern growth framework! Item Value a 40 b 0 c -20 d 10 e -30 f 30 g 20 h -10 421 Pattern Space Pruning w. Convertible Constraints ◼ ◼ ◼ ◼ Item Value C: avg(X) >= 25, min_sup=2 a 40 List items in every transaction in value f 30 descending order R: <a, f, g, d, b, h, c, e> g 20 ◼ C is convertible anti-monotone w.r.t. R d 10 Scan TDB once b 0 ◼ remove infrequent items h -10 ◼ Item h is dropped c -20 e -30 ◼ Itemsets a and f are good, … TDB (min_sup=2) Projection-based mining TID Transaction ◼ Imposing an appropriate order on item 10 a, f, d, b, c projection 20 f, g, d, b, c ◼ Many tough constraints can be converted into 30 a, f, d, c, e (anti)-monotone 40 f, g, h, c, e 422 Handling Multiple Constraints ◼ ◼ ◼ Different constraints may require different or even conflicting item-ordering If there exists an order R s.t. both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints If there exists conflict on order of items ◼ ◼ Try to satisfy one constraint first Then using the order for the other constraint to mine frequent itemsets in the corresponding projected database 423 What Constraints Are Convertible? Constraint Convertible antimonotone Convertible monotone Strongly convertible avg(S) , v Yes Yes Yes median(S) , v Yes Yes Yes sum(S) v (items could be of any value, v 0) Yes No No sum(S) v (items could be of any value, v 0) No Yes No sum(S) v (items could be of any value, v 0) No Yes No sum(S) v (items could be of any value, v 0) Yes No No …… 424 Constraint-Based Mining — A General Picture Constraint Anti-monotone Monotone Succinct vS no yes yes SV no yes yes SV yes no yes min(S) v no yes yes min(S) v yes no yes max(S) v yes no yes max(S) v no yes yes count(S) v yes no weakly count(S) v no yes weakly sum(S) v ( a S, a 0 ) yes no no sum(S) v ( a S, a 0 ) no yes no range(S) v yes no no range(S) v no yes no avg(S) v, { =, , } convertible convertible no support(S) yes no no support(S) no yes no 425 Chapter 7 : Advanced Frequent Pattern Mining ◼ Pattern Mining: A Road Map ◼ Pattern Mining in Multi-Level, Multi-Dimensional Space ◼ Constraint-Based Frequent Pattern Mining ◼ Mining High-Dimensional Data and Colossal Patterns ◼ Mining Compressed or Approximate Patterns ◼ Pattern Exploration and Application ◼ Summary 426 Mining Colossal Frequent Patterns ◼ ◼ ◼ F. Zhu, X. Yan, J. Han, P. S. Yu, and H. 
Cheng, “Mining Colossal Frequent Patterns by Core Pattern Fusion”, ICDE'07. We have many algorithms, but can we mine large (i.e., colossal) patterns? ― such as just size around 50 to 100? Unfortunately, not! Why not? ― the curse of “downward closure” of frequent patterns ◼ The “downward closure” property ◼ ◼ ◼ ◼ Any sub-pattern of a frequent pattern is frequent. Example. If (a1, a2, …, a100) is frequent, then a1, a2, …, a100, (a1, a2), (a1, a3), …, (a1, a100), (a1, a2, a3), … are all frequent! There are about 2100 such frequent itemsets! No matter using breadth-first search (e.g., Apriori) or depth-first search (FPgrowth), we have to examine so many patterns Thus the downward closure property leads to explosion! 427 Colossal Patterns: A Motivating Example Let’s make a set of 40 transactions T1 = 1 2 3 4 ….. 39 40 T2 = 1 2 3 4 ….. 39 40 : . : . : . : . T40=1 2 3 4 ….. 39 40 Then delete the items on the diagonal T1 = 2 3 4 ….. 39 40 T2 = 1 3 4 ….. 39 40 : . : . : . : . T40=1 2 3 4 …… 39 Closed/maximal patterns may partially alleviate the problem but not really solve it: We often need to mine scattered large patterns! Let the minimum support threshold σ= 20 40 There are frequent patterns of 20 size 20 Each is closed and maximal # patterns = n 2n 2 / n n / 2 The size of the answer set is exponential to n 428 Colossal Pattern Set: Small but Interesting ◼ ◼ It is often the case that only a small number of patterns are colossal, i.e., of large size Colossal patterns are usually attached with greater importance than those of small pattern sizes 429 Mining Colossal Patterns: Motivation and Philosophy ◼ ◼ Motivation: Many real-world tasks need mining colossal patterns ◼ Micro-array analysis in bioinformatics (when support is low) ◼ Biological sequence patterns ◼ Biological/sociological/information graph pattern mining No hope for completeness ◼ ◼ Jumping out of the swamp of the mid-sized results ◼ ◼ If the mining of mid-sized patterns is explosive in size, there is no hope to find colossal patterns efficiently by insisting “complete set” mining philosophy What we may develop is a philosophy that may jump out of the swamp of mid-sized results that are explosive in size and jump to reach colossal patterns Striving for mining almost complete colossal patterns ◼ The key is to develop a mechanism that may quickly reach colossal patterns and discover most of them 430 Alas, A Show of Colossal Pattern Mining! T1 = 2 3 4 ….. 39 40 T2 = 1 3 4 ….. 39 40 : . : . : . : . T40=1 2 3 4 …… 39 T41= 41 42 43 ….. 79 T42= 41 42 43 ….. 79 : . : . 
T60= 41 42 43 … 79 Let the min-support threshold σ= 20 40 20 closed/maximal Then there are frequent patterns of size 20 However, there is only one with size greater than 20, (i.e., colossal): α= {41,42,…,79} of size 39 The existing fastest mining algorithms (e.g., FPClose, LCM) fail to complete running Our algorithm outputs this colossal pattern in seconds 431 Methodology of Pattern-Fusion Strategy ◼ Pattern-Fusion traverses the tree in a bounded-breadth way ◼ Always pushes down a frontier of a bounded-size candidate pool ◼ Only a fixed number of patterns in the current candidate pool will be used as the starting nodes to go down in the pattern tree ― thus avoids the exponential search space ◼ Pattern-Fusion identifies “shortcuts” whenever possible ◼ Pattern growth is not performed by single-item addition but by leaps and bounded: agglomeration of multiple patterns in the pool ◼ These shortcuts will direct the search down the tree much more rapidly towards the colossal patterns 432 Observation: Colossal Patterns and Core Patterns Transaction Database D A colossal pattern α α α1 D Dαk α2 D Dα1 α Dα2 αk Subpatterns α1 to αk cluster tightly around the colossal pattern α by sharing a similar support. We call such subpatterns core patterns of α 433 Robustness of Colossal Patterns ◼ Core Patterns Intuitively, for a frequent pattern α, a subpattern β is a τ-core pattern of α if β shares a similar support set with α, i.e., | D | | D | 0 1 where τ is called the core ratio ◼ Robustness of Colossal Patterns A colossal pattern is robust in the sense that it tends to have much more core patterns than small patterns 434 Example: Core Patterns ◼ ◼ ◼ ◼ A colossal pattern has far more core patterns than a small-sized pattern A colossal pattern has far more core descendants of a smaller size c A random draw from a complete set of pattern of size c would more likely to pick a core descendant of a colossal pattern A colossal pattern can be generated by merging a set of core patterns Transaction (# of Ts) Core Patterns (τ = 0.5) (abe) (100) (abe), (ab), (be), (ae), (e) (bcf) (100) (bcf), (bc), (bf) (acf) (100) (acf), (ac), (af) (abcef) (100) (ab), (ac), (af), (ae), (bc), (bf), (be) (ce), (fe), (e), (abc), (abf), (abe), (ace), (acf), (afe), (bcf), (bce), (bfe), (cfe), (abcf), (abce), (bcfe), (acfe), (abfe), (abcef) 435 Colossal Patterns Correspond to Dense Balls ◼ Due to their robustness, colossal patterns correspond to dense balls ◼ ◼ Ω( 2^d) in population A random draw in the pattern space will hit somewhere in the ball with high probability 437 Idea of Pattern-Fusion Algorithm ◼ ◼ ◼ ◼ Generate a complete set of frequent patterns up to a small size Randomly pick a pattern β, and β has a high probability to be a core-descendant of some colossal pattern α Identify all α’s descendants in this complete set, and merge all of them ― This would generate a much larger core-descendant of α In the same fashion, we select K patterns. This set of larger core-descendants will be the candidate pool for the next iteration 438 Pattern-Fusion: The Algorithm ◼ ◼ ◼ Initialization (Initial pool): Use an existing algorithm to mine all frequent patterns up to a small size, e.g., 3 Iteration (Iterative Pattern Fusion): ◼ At each iteration, k seed patterns are randomly picked from the current pattern pool ◼ For each seed pattern thus picked, we find all the patterns within a bounding ball centered at the seed pattern ◼ All these patterns found are fused together to generate a set of super-patterns. 
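As a brief aside before finishing the iteration step, the τ-core-pattern test can be made concrete: β is a τ-core pattern of α if |Dα| / |Dβ| ≥ τ, where DX is the set of transactions containing X. Below is a minimal sketch using the transaction mix of the example table above; the helper names are assumptions.

```python
from itertools import combinations

db = [set("abe")] * 100 + [set("bcf")] * 100 + [set("acf")] * 100 + [set("abcef")] * 100

def support_count(pattern):
    p = set(pattern)
    return sum(1 for t in db if p <= t)

def core_patterns(alpha, tau=0.5):
    """All non-empty sub-patterns beta of alpha with |D_alpha| / |D_beta| >= tau."""
    d_alpha = support_count(alpha)
    return ["".join(beta)
            for k in range(1, len(alpha) + 1)
            for beta in combinations(sorted(alpha), k)
            if d_alpha / support_count(beta) >= tau]

# The colossal-like pattern has far more core patterns than a small one, which is
# what makes a random draw likely to hit one of its core descendants.
print(len(core_patterns("abcef")))   # 26, matching the last row of the table above
print(len(core_patterns("ce")))      # 2
```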
All the super-patterns thus generated form a new pool for the next iteration Termination: when the current pool contains no more than K patterns at the beginning of an iteration 439 Why Is Pattern-Fusion Efficient? ◼ ◼ A bounded-breadth pattern tree traversal ◼ It avoids explosion in mining mid-sized ones ◼ Randomness comes to help to stay on the right path Ability to identify “short-cuts” and take “leaps” ◼ fuse small patterns together in one step to generate new patterns of significant sizes ◼ Efficiency 440 Pattern-Fusion Leads to Good Approximation ◼ Gearing toward colossal patterns ◼ ◼ The larger the pattern, the greater the chance it will be generated Catching outliers ◼ The more distinct the pattern, the greater the chance it will be generated 441 Experimental Setting ◼ Synthetic data set ◼ ◼ Diagn an n x (n-1) table where ith row has integers from 1 to n except i. Each row is taken as an itemset. min_support is n/2. Real data set ◼ ◼ Replace: A program trace data set collected from the “replace” program, widely used in software engineering research ALL: A popular gene expression data set, a clinical data on ALL-AML leukemia (www.broad.mit.edu/tools/data.html). ◼ ◼ Each item is a column, representing the activitiy level of gene/protein in the same Frequent pattern would reveal important correlation between gene expression patterns and disease outcomes 442 Experiment Results on Diagn ◼ ◼ ◼ LCM run time increases exponentially with pattern size n Pattern-Fusion finishes efficiently The approximation error of Pattern-Fusion (with min-sup 20) in comparison with the complete set) is rather close to uniform sampling (which randomly picks K patterns from the complete answer set) 443 Experimental Results on ALL ◼ ALL: A popular gene expression data set with 38 transactions, each with 866 columns ◼ There are 1736 items in total ◼ The table shows a high frequency threshold of 30 444 Experimental Results on REPLACE ◼ REPLACE ◼ A program trace data set, recording 4395 calls and transitions ◼ The data set contains 4395 transactions with 57 items in total ◼ With support threshold of 0.03, the largest patterns are of size 44 ◼ They are all discovered by Pattern-Fusion with different settings of K and τ, when started with an initial pool of 20948 patterns of size <=3 445 Experimental Results on REPLACE ◼ ◼ ◼ Approximation error when compared with the complete mining result Example. Out of the total 98 patterns of size >=42, when K=100, Pattern-Fusion returns 80 of them A good approximation to the colossal patterns in the sense that any pattern in the complete set is on average at most 0.17 items away from one of these 80 patterns 446 Chapter 7 : Advanced Frequent Pattern Mining ◼ Pattern Mining: A Road Map ◼ Pattern Mining in Multi-Level, Multi-Dimensional Space ◼ Constraint-Based Frequent Pattern Mining ◼ Mining High-Dimensional Data and Colossal Patterns ◼ Mining Compressed or Approximate Patterns ◼ Pattern Exploration and Application ◼ Summary 447 Mining Compressed Patterns: δclustering ◼ ◼ ◼ ◼ ◼ Why compressed patterns? 
◼ too many, but less meaningful Pattern distance measure ID Item-Sets Support P1 {38,16,18,12} 205227 P2 {38,16,18,12,17} 205211 P3 {39,38,16,18,12,17} 101758 P4 {39,16,18,12,17} 161563 P5 {39,16,18,12} 161576 δ-clustering: For each pattern P, ◼ Closed frequent pattern find all patterns which can be ◼ Report P1, P2, P3, P4, P5 expressed by P and their distance ◼ Emphasize too much on to P are within δ (δ-cover) support All patterns in the cluster can be ◼ no compression represented by P ◼ Max-pattern, P3: info loss Xin et al., “Mining Compressed ◼ A desirable output: P2, P3, P4 Frequent-Pattern Sets”, VLDB’05 448 Redundancy-Award Top-k Patterns ◼ ◼ ◼ ◼ Why redundancy-aware top-k patterns? Desired patterns: high significance & low redundancy Propose the MMS (Maximal Marginal Significance) for measuring the combined significance of a pattern set Xin et al., Extracting Redundancy-Aware Top-K Patterns, KDD’06 449 Chapter 7 : Advanced Frequent Pattern Mining ◼ Pattern Mining: A Road Map ◼ Pattern Mining in Multi-Level, Multi-Dimensional Space ◼ Constraint-Based Frequent Pattern Mining ◼ Mining High-Dimensional Data and Colossal Patterns ◼ Mining Compressed or Approximate Patterns ◼ Pattern Exploration and Application ◼ Summary 450 How to Understand and Interpret Patterns? ◼ diaper beer ◼ ◼ Do they all make sense? What do they mean? How are they useful? female sterile (2) tekele morphological info. and simple statistics Semantic Information Not all frequent patterns are useful, only meaningful ones … Annotate patterns with semantic information A Dictionary Analogy Word: “pattern” – from Merriam-Webster Non-semantic info. Definitions indicating semantics Synonyms Related Words Examples of Usage Semantic Analysis with Context Models ◼ Task1: Model the context of a frequent pattern Based on the Context Model… ◼ Task2: Extract strongest context indicators ◼ Task3: Extract representative transactions ◼ Task4: Extract semantically similar patterns Annotating DBLP Co-authorship & Title Pattern Database: Frequent Patterns Authors Title X.Yan, P. Yu, J. Han Substructure Similarity Search in Graph Databases … … … … P1: { x_yan, j_han } Frequent Itemset P2: “substructure search” Semantic Annotations Pattern { x_yan, j_han} Non Sup = … CI {p_yu}, graph pattern, … Trans. 
gSpan: graph-base…… SSPs { j_wang }, {j_han, p_yu}, … Pattern = {xifeng_yan, jiawei_han} Context Units < { p_yu, j_han}, { d_xin }, … , “graph pattern”, … “substructure similarity”, … > Annotation Results: Context Indicator (CI) graph; {philip_yu}; mine close; graph pattern; sequential pattern; … Representative Transactions (Trans) > gSpan: graph-base substructure pattern mining; > mining close relational graph connect constraint; … Semantically Similar Patterns (SSP) {jiawei_han, philip_yu}; {jian_pei, jiawei_han}; {jiong_yang, philip_yu, wei_wang}; … Chapter 7 : Advanced Frequent Pattern Mining ◼ Pattern Mining: A Road Map ◼ Pattern Mining in Multi-Level, Multi-Dimensional Space ◼ Constraint-Based Frequent Pattern Mining ◼ Mining High-Dimensional Data and Colossal Patterns ◼ Mining Compressed or Approximate Patterns ◼ Pattern Exploration and Application ◼ Summary 455 Summary ◼ Roadmap: Many aspects & extensions on pattern mining ◼ Mining patterns in multi-level, multi dimensional space ◼ Mining rare and negative patterns ◼ Constraint-based pattern mining ◼ Specialized methods for mining high-dimensional data and colossal patterns ◼ Mining compressed or approximate patterns ◼ Pattern exploration and understanding: Semantic annotation of frequent patterns 456 Ref: Mining Multi-Level and Quantitative Rules ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association Rules, KDD'99 T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96. J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95. R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97. R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95. R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96. K. Wang, Y. He, and J. Han. Mining frequent itemsets using support constraints. VLDB'00 K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97. 457 Ref: Mining Other Kinds of Rules ◼ ◼ ◼ ◼ ◼ ◼ ◼ F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98 Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE’98. H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB'99 B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97. R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96. A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98. D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98. 458 Ref: Constraint-Based Pattern Mining ◼ R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97 ◼ R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. Exploratory mining and pruning optimizations of constrained association rules. SIGMOD’98 ◼ G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00 ◼ J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with Convertible Constraints. ICDE'01 ◼ J. Pei, J. Han, and W. 
Wang, Mining Sequential Patterns with Constraints in Large Databases, CIKM'02 ◼ F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. ExAnte: Anticipated Data Reduction in Constrained Pattern Mining, PKDD'03 ◼ F. Zhu, X. Yan, J. Han, and P. S. Yu, “gPrune: A Constraint Pushing Framework for Graph Pattern Mining”, PAKDD'07 459 Ref: Mining Sequential Patterns ◼ X. Ji, J. Bailey, and G. Dong. Mining minimal distinguishing subsequence patterns with gap constraints. ICDM'05 ◼ H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97. ◼ J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01. ◼ R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT’96. ◼ X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03. ◼ M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning:01. 460 Mining Graph and Structured Patterns ◼ ◼ ◼ ◼ ◼ ◼ A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. PKDD'00 M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01. X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. ICDM'02 X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. KDD'03 X. Yan, P. S. Yu, and J. Han. Graph indexing based on discriminative frequent structure analysis. ACM TODS, 30:960–993, 2005 X. Yan, F. Zhu, P. S. Yu, and J. Han. Feature-based substructure similarity search. ACM Trans. Database Systems, 31:1418–1453, 2006 461 Ref: Mining Spatial, Spatiotemporal, Multimedia Data ◼ H. Cao, N. Mamoulis, and D. W. Cheung. Mining frequent spatiotemporal sequential patterns. ICDM'05 ◼ D. Gunopulos and I. Tsoukatos. Efficient Mining of Spatiotemporal Patterns. SSTD'01 ◼ K. Koperski and J. Han, Discovery of Spatial Association Rules in Geographic Information Databases, SSD’95 ◼ H. Xiong, S. Shekhar, Y. Huang, V. Kumar, X. Ma, and J. S. Yoo. A framework for discovering co-location patterns in data sets with extended spatial objects. SDM'04 ◼ J. Yuan, Y. Wu, and M. Yang. Discovery of collocation patterns: From visual words to visual phrases. CVPR'07 O. R. Zaiane, J. Han, and H. Zhu, Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00 ◼ 462 Ref: Mining Frequent Patterns in Time-Series Data ◼ B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98. ◼ J. Han, G. Dong and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series Database, ICDE'99. ◼ J. Shieh and E. Keogh. iSAX: Indexing and mining terabyte sized time series. KDD'08 ◼ B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online Data Mining for Co-Evolving Time Sequences. ICDE'00. ◼ W. Wang, J. Yang, R. Muntz. TAR: Temporal Association Rules on Evolving Numerical Attributes. ICDE’01. ◼ J. Yang, W. Wang, P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data. TKDE’03 ◼ L. Ye and E. Keogh. Time series shapelets: A new primitive for data mining. KDD'09 463 Ref: FP for Classification and Clustering ◼ G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99. ◼ B. Liu, W. Hsu, Y. Ma. Integrating Classification and Association Rule Mining. KDD’98. ◼ W. Li, J. Han, and J. Pei. 
CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM'01. ◼ H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data sets. SIGMOD’ 02. ◼ J. Yang and W. Wang. CLUSEQ: efficient and effective sequence clustering. ICDE’03. ◼ X. Yin and J. Han. CPAR: Classification based on Predictive Association Rules. SDM'03. ◼ H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for Effective Classification”, ICDE'07 464 Ref: Privacy-Preserving FP Mining ◼ A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Privacy Preserving Mining of Association Rules. KDD’02. ◼ A. Evfimievski, J. Gehrke, and R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS’03 ◼ J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD’02 465 Mining Compressed Patterns ◼ ◼ ◼ D. Xin, H. Cheng, X. Yan, and J. Han. Extracting redundancyaware top-k patterns. KDD'06 D. Xin, J. Han, X. Yan, and H. Cheng. Mining compressed frequent-pattern sets. VLDB'05 X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset patterns: A profile-based approach. KDD'05 466 Mining Colossal Patterns ◼ ◼ F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng. Mining colossal frequent patterns by core pattern fusion. ICDE'07 F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han. P. S. Yu, Mining Top-K Large Structural Patterns in a Massive Network. VLDB’11 467 Ref: FP Mining from Data Streams ◼ Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB'02. ◼ R. M. Karp, C. H. Papadimitriou, and S. Shenker. A simple algorithm for finding frequent elements in streams and bags. TODS 2003. ◼ G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB’02. ◼ A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. ICDT'05 468 Ref: Freq. Pattern Mining Applications ◼ ◼ ◼ ◼ ◼ ◼ ◼ T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or How to Build a Data Quality Browser. SIGMOD'02 M. Khan, H. Le, H. Ahmadi, T. Abdelzaher, and J. Han. DustMiner: Troubleshooting interactive complexity bugs in sensor networks., SenSys'08 Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: A tool for finding copy-paste and related bugs in operating system code. In Proc. 2004 Symp. Operating Systems Design and Implementation (OSDI'04) Z. Li and Y. Zhou. PR-Miner: Automatically extracting implicit programming rules and detecting violations in large software code. FSE'05 D. Lo, H. Cheng, J. Han, S. Khoo, and C. Sun. Classification of software behaviors for failure detection: A discriminative pattern mining approach. KDD'09 Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai. Semantic annotation of frequent patterns. ACM TKDD, 2007. K. Wang, S. Zhou, J. Han. Profit Mining: From Patterns to Actions. EDBT’02. 469 Data Mining: Concepts and Techniques (3rd ed.) — Chapter 8 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved. 470 Chapter 8. Classification: Basic Concepts ◼ Classification: Basic Concepts ◼ Decision Tree Induction ◼ Bayes Classification Methods ◼ Rule-Based Classification ◼ Model Evaluation and Selection ◼ Techniques to Improve Classification Accuracy: Ensemble Methods ◼ Summary 472 Supervised vs. 
Unsupervised Learning ◼ Supervised learning (classification) ◼ Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations ◼ ◼ New data is classified based on the training set Unsupervised learning (clustering) ◼ The class labels of training data is unknown ◼ Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data 473 Prediction Problems: Classification vs. Numeric Prediction ◼ ◼ ◼ Classification ◼ predicts categorical class labels (discrete or nominal) ◼ classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data Numeric Prediction ◼ models continuous-valued functions, i.e., predicts unknown or missing values Typical applications ◼ Credit/loan approval: ◼ Medical diagnosis: if a tumor is cancerous or benign ◼ Fraud detection: if a transaction is fraudulent ◼ Web page categorization: which category it is 474 Classification—A Two-Step Process ◼ ◼ ◼ Model construction: describing a set of predetermined classes ◼ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute ◼ The set of tuples used for model construction is training set ◼ The model is represented as classification rules, decision trees, or mathematical formulae Model usage: for classifying future or unknown objects ◼ Estimate accuracy of the model ◼ The known label of test sample is compared with the classified result from the model ◼ Accuracy rate is the percentage of test set samples that are correctly classified by the model ◼ Test set is independent of training set (otherwise overfitting) ◼ If the accuracy is acceptable, use the model to classify new data Note: If the test set is used to select models, it is called validation (test) set 475 Process (1): Model Construction Training Data NAME M ike M ary B ill Jim D ave A nne RANK YEARS TENURED A ssistant P rof 3 no A ssistant P rof 7 yes P rofessor 2 yes A ssociate P rof 7 yes A ssistant P rof 6 no A ssociate P rof 3 no Classification Algorithms Classifier (Model) IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ 476 Process (2): Using the Model in Prediction Classifier Testing Data Unseen Data (Jeff, Professor, 4) NAME Tom M erlisa G eo rg e Jo sep h RANK YEARS TENURED A ssistan t P ro f 2 no A sso ciate P ro f 7 no P ro fesso r 5 yes A ssistan t P ro f 7 yes Tenured? 477 Chapter 8. Classification: Basic Concepts ◼ Classification: Basic Concepts ◼ Decision Tree Induction ◼ Bayes Classification Methods ◼ Rule-Based Classification ◼ Model Evaluation and Selection ◼ Techniques to Improve Classification Accuracy: Ensemble Methods ◼ Summary 478 Decision Tree Induction: An Example Training data set: Buys_computer ❑ The data set follows an example of Quinlan’s ID3 (Playing Tennis) ❑ Resulting tree: age? ❑ <=30 31..40 overcast student? no no yes yes yes >40 age <=30 <=30 31…40 >40 >40 >40 31…40 <=30 <=30 >40 <=30 31…40 31…40 >40 income student credit_rating buys_computer high no fair no high no excellent no high no fair yes medium no fair yes low yes fair yes low yes excellent no low yes excellent yes medium no fair no low yes fair yes medium yes fair yes medium yes excellent yes medium no excellent yes high yes fair yes medium no excellent no credit rating? 
excellent fair yes 479 Algorithm for Decision Tree Induction ◼ ◼ Basic algorithm (a greedy algorithm) ◼ Tree is constructed in a top-down recursive divide-andconquer manner ◼ At start, all the training examples are at the root ◼ Attributes are categorical (if continuous-valued, they are discretized in advance) ◼ Examples are partitioned recursively based on selected attributes ◼ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning ◼ All samples for a given node belong to the same class ◼ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf ◼ There are no samples left 480 Brief Review of Entropy ◼ m=2 481 Attribute Selection Measure: Information Gain (ID3/C4.5) ◼ Select the attribute with the highest information gain ◼ Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci, D|/|D| ◼ Expected information (entropy) needed to classify a tuple in D: m Info( D) = − pi log 2 ( pi ) ◼ ◼ i =1 Information needed (after using A to split D into v partitions) to v | D | classify D: j Info A ( D) = Info( D j ) j =1 | D | Information gained by branching on attribute A Gain(A) = Info(D) − InfoA(D) 482 Attribute Selection: Information Gain Class P: buys_computer = “yes” Class N: buys_computer = “no” Info( D) = I (9,5) = − age <=30 31…40 >40 age <=30 <=30 31…40 >40 >40 >40 31…40 <=30 <=30 >40 <=30 31…40 31…40 >40 483 Infoage ( D) = 9 9 5 5 log 2 ( ) − log 2 ( ) =0.940 14 14 14 14 pi 2 4 3 ni I(pi, ni) 3 0.971 0 0 2 0.971 income student credit_rating high no fair high no excellent high no fair medium no fair low yes fair low yes excellent low yes excellent medium no fair low yes fair medium yes fair medium yes excellent medium no excellent high yes fair medium no excellent buys_computer no no yes yes yes no yes no yes yes yes yes yes no + 5 4 I (2,3) + I (4,0) 14 14 5 I (3,2) = 0.694 14 5 I (2,3) means “age <=30” has 5 14 out of 14 samples, with 2 yes’es and 3 no’s. Hence Gain (age) = Info( D ) − Infoage ( D ) = 0.246 Similarly, Gain(income) = 0.029 Gain( student ) = 0.151 Gain(credit _ rating ) = 0.048 Computing Information-Gain for Continuous-Valued Attributes ◼ Let attribute A be a continuous-valued attribute ◼ Must determine the best split point for A ◼ ◼ Sort the value A in increasing order Typically, the midpoint between each pair of adjacent values is considered as a possible split point ◼ ◼ ◼ (ai+ai+1)/2 is the midpoint between the values of ai and ai+1 The point with the minimum expected information requirement for A is selected as the split-point for A Split: ◼ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point 484 485 ◼ ◼ Gain Ratio for Attribute Selection (C4.5) Information gain measure is biased towards attributes with a large number of values C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain) v SplitInfo A ( D) = − j =1 | Dj | |D| log 2 ( | Dj | |D| ) GainRatio(A) = Gain(A)/SplitInfo(A) Ex. 
◼ ◼ gain_ratio(income) = 0.029/1.557 = 0.019 The attribute with the maximum gain ratio is selected as the splitting attribute ◼ ◼ 486 ◼ Gini Index (CART, IBM IntelligentMiner) If a data set D contains examples from n classes, gini n index, gini(D) is defined as 2 gini( D) = 1− p j j =1 where pj is the relative frequency of class j in D ◼ If a data set D is split on A into two subsets D1 and D2, the gini index gini(D) is defined as gini A ( D) = ◼ ◼ Reduction in Impurity: |D1| |D | gini( D1) + 2 gini( D 2) |D| |D| gini( A) = gini(D) − giniA (D) The attribute provides the smallest ginisplit(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute) 487 Computation of Gini Index ◼ ◼ Ex. D has 9 tuples in buys_computer2 = “yes” and 5 in “no” 2 9 5 gini ( D) = 1 − − = 0.459 14 14 Suppose the attribute income partitions D into 10 in D1: {low, medium} and 4gini inincome D2{low,medium} ( D) = 10 Gini( D1 ) + 4 Gini( D2 ) 14 ◼ ◼ 14 Gini{low,high} is 0.458; Gini{medium,high} is 0.450. Thus, split on the {low,medium} (and {high}) since it has the lowest Gini index All attributes are assumed continuous-valued May need other tools, e.g., clustering, to get the possible split values Comparing Attribute Selection Measures ◼ The three measures, in general, return good results but ◼ Information gain: ◼ ◼ Gain ratio: ◼ ◼ biased towards multivalued attributes tends to prefer unbalanced splits in which one partition is much smaller than the others Gini index: ◼ biased to multivalued attributes ◼ has difficulty when # of classes is large ◼ tends to favor tests that result in equal-sized partitions and purity in both partitions 488 Other Attribute Selection Measures ◼ CHAID: a popular decision tree algorithm, measure based on χ2 test for independence ◼ C-SEP: performs better than info. gain and gini index in certain cases ◼ G-statistic: has a close approximation to χ2 distribution ◼ MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): ◼ The best tree as the one that requires the fewest # of bits to both (1) encode the tree, and (2) encode the exceptions to the tree ◼ Multivariate splits (partition based on multiple variable combinations) ◼ ◼ CART: finds multivariate splits based on a linear comb. of attrs. Which attribute selection measure is the best? 
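As a concrete check on the numbers above, here is a minimal sketch computing information gain, gain ratio, and Gini index on the buys_computer training data of the running example; the tuple layout and helper names are assumptions, not the book's code.

```python
from collections import Counter
from math import log2

# (age, income, student, credit_rating, buys_computer), one tuple per training example
D = [("<=30", "high", "no", "fair", "no"),     ("<=30", "high", "no", "excellent", "no"),
     ("31..40", "high", "no", "fair", "yes"),  (">40", "medium", "no", "fair", "yes"),
     (">40", "low", "yes", "fair", "yes"),     (">40", "low", "yes", "excellent", "no"),
     ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
     ("<=30", "low", "yes", "fair", "yes"),    (">40", "medium", "yes", "fair", "yes"),
     ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
     ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no")]
ATTR = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def entropy(tuples):
    counts, n = Counter(t[-1] for t in tuples), len(tuples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def gini(tuples):
    counts, n = Counter(t[-1] for t in tuples), len(tuples)
    return 1 - sum((c / n) ** 2 for c in counts.values())

def partitions(tuples, attr):
    parts = {}
    for t in tuples:
        parts.setdefault(t[ATTR[attr]], []).append(t)
    return list(parts.values())

def info_gain(tuples, attr):
    n = len(tuples)
    return entropy(tuples) - sum(len(p) / n * entropy(p) for p in partitions(tuples, attr))

def gain_ratio(tuples, attr):
    n = len(tuples)
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in partitions(tuples, attr))
    return info_gain(tuples, attr) / split_info

print(f"Info(D) = {entropy(D):.3f}")                         # 0.940
print(f"Gain(age) = {info_gain(D, 'age'):.3f}")              # 0.247 (the 0.940 - 0.694 above, up to rounding)
print(f"GainRatio(income) = {gain_ratio(D, 'income'):.3f}")  # 0.019
print(f"Gini(D) = {gini(D):.3f}")                            # 0.459
```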
◼ Most give good results, none is significantly superior than others 489 Overfitting and Tree Pruning ◼ ◼ Overfitting: An induced tree may overfit the training data ◼ Too many branches, some may reflect anomalies due to noise or outliers ◼ Poor accuracy for unseen samples Two approaches to avoid overfitting ◼ Prepruning: Halt tree construction early ̵ do not split a node if this would result in the goodness measure falling below a threshold ◼ Difficult to choose an appropriate threshold ◼ Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees ◼ Use a set of data different from the training data to decide which is the “best pruned tree” 490 Enhancements to Basic Decision Tree Induction ◼ Allow for continuous-valued attributes ◼ ◼ ◼ Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals Handle missing attribute values ◼ Assign the most common value of the attribute ◼ Assign probability to each of the possible values Attribute construction ◼ ◼ Create new attributes based on existing ones that are sparsely represented This reduces fragmentation, repetition, and replication 491 Classification in Large Databases ◼ ◼ ◼ ◼ Classification—a classical problem extensively studied by statisticians and machine learning researchers Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed Why is decision tree induction popular? ◼ relatively faster learning speed (than other classification methods) ◼ convertible to simple and easy to understand classification rules ◼ can use SQL queries for accessing databases ◼ comparable classification accuracy with other methods RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti) ◼ Builds an AVC-list (attribute, value, class label) 492 Scalability Framework for RainForest ◼ Separates the scalability aspects from the criteria that determine the quality of the tree ◼ Builds an AVC-list: AVC (Attribute, Value, Class_label) ◼ AVC-set (of an attribute X ) ◼ Projection of training dataset onto the attribute X and class label where counts of individual class label are aggregated ◼ AVC-group (of a node n ) ◼ Set of AVC-sets of all predictor attributes at the node n 493 Rainforest: Training Set and Its AVC Sets Training Examples age <=30 <=30 31…40 >40 >40 >40 31…40 <=30 <=30 >40 <=30 31…40 31…40 >40 AVC-set on Age income studentcredit_rating buys_computerAge Buy_Computer high no fair no yes no high no excellent no <=30 2 3 high no fair yes 31..40 4 0 medium no fair yes >40 3 2 low yes fair yes low yes excellent no low yes excellent yes AVC-set on Student medium no fair no low yes fair yes student Buy_Computer medium yes fair yes yes no medium yes excellent yes medium no excellent yes yes 6 1 high yes fair yes no 3 4 medium no excellent no AVC-set on income income Buy_Computer yes no high 2 2 medium 4 2 low 3 1 AVC-set on credit_rating Buy_Computer Credit rating yes no fair 6 2 excellent 3 3 494 BOAT (Bootstrapped Optimistic Algorithm for Tree Construction) ◼ Use a statistical technique called bootstrapping to create several smaller samples (subsets), each fits in memory ◼ Each subset is used to create a tree, resulting in several trees ◼ These trees are examined and used to construct a new tree T’ ◼ It turns out that T’ is very close to the tree that would be generated using the whole data set together ◼ Adv: requires only two scans of DB, an incremental alg. 
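Since an AVC-set is just an aggregation of (attribute value, class label) counts, a short sketch suffices; the column layout below is assumed, and the output should match the AVC-sets on age and student shown above.

```python
from collections import defaultdict

# one entry per training tuple, in the same order as the buys_computer table above
age_col     = ["<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40",
               "<=30", "<=30", ">40", "<=30", "31..40", "31..40", ">40"]
student_col = ["no", "no", "no", "no", "yes", "yes", "yes",
               "no", "yes", "yes", "yes", "no", "yes", "no"]
label_col   = ["no", "no", "yes", "yes", "yes", "no", "yes",
               "no", "yes", "yes", "yes", "yes", "yes", "no"]

def avc_set(attr_col, labels):
    """Counts of each class label, grouped by attribute value."""
    counts = defaultdict(lambda: defaultdict(int))
    for value, cls in zip(attr_col, labels):
        counts[value][cls] += 1
    return {v: dict(c) for v, c in counts.items()}

print(avc_set(age_col, label_col))
# {'<=30': {'no': 3, 'yes': 2}, '31..40': {'yes': 4}, '>40': {'yes': 3, 'no': 2}}
print(avc_set(student_col, label_col))
# {'no': {'no': 4, 'yes': 3}, 'yes': {'yes': 6, 'no': 1}}
```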
495 Presentation of Classification Results June 7, 2020 Data Mining: Concepts and Techniques 496 SGI/MineSet 3.0 June 7, 2020 Data Mining: Concepts and Techniques 497 Perception-Based Classification (PBC) 498 Data Mining: Concepts and Techniques Chapter 8. Classification: Basic Concepts ◼ Classification: Basic Concepts ◼ Decision Tree Induction ◼ Bayes Classification Methods ◼ Rule-Based Classification ◼ Model Evaluation and Selection ◼ Techniques to Improve Classification Accuracy: Ensemble Methods ◼ Summary 499 Bayesian Classification: Why? ◼ ◼ ◼ ◼ ◼ A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities Foundation: Based on Bayes’ Theorem. Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods 500 Bayes’ Theorem: Basics M ◼ Total probability Theorem:P(B) = ◼ Bayes’ Theorem: P(H | X) = P(X | H )P(H ) = P(X | H ) P(H ) / P(X) i =1 P( B | A ) P ( A ) i i P(X) ◼ ◼ ◼ ◼ ◼ ◼ Let X be a data sample (“evidence”): class label is unknown Let H be a hypothesis that X belongs to class C Classification is to determine P(H|X), (i.e., posteriori probability): the probability that the hypothesis holds given the observed data sample X P(H) (prior probability): the initial probability ◼ E.g., X will buy computer, regardless of age, income, … P(X): probability that sample data is observed P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds ◼ E.g., Given that X will buy computer, the prob. that X is 31..40, medium income 501 Prediction Based on Bayes’ Theorem ◼ Given training data X, posteriori probability of a hypothesis H, P(H|X), follows the Bayes’ theorem P(H | X) = P(X | H )P(H ) = P(X | H ) P(H ) / P(X) P(X) ◼ Informally, this can be viewed as posteriori = likelihood x prior/evidence ◼ ◼ Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes Practical difficulty: It requires initial knowledge of many probabilities, involving significant computational cost 502 503Classification ◼ ◼ ◼ ◼ Is to Derive the Maximum Posteriori Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn) Suppose there are m classes C1, C2, …, Cm. Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X) This can be derived from Bayes’ theorem P(X | C )P(C ) i i P(C | X) = i P(X) ◼ Since P(X) is constant for all classes, only P(C | X) = P(X | C )P(C ) i i i needs to be maximized 504 ◼ Naïve Bayes Classifier A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between n attributes): P(X | C i) = P( x | C i) = P( x | C i) P( x | C i) ... 
P( x | C i) k 1 2 n k =1 ◼ ◼ ◼ This greatly reduces the computation cost: Only counts the class distribution If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci, D| (# of tuples of Ci in D) If Ak is continous-valued, P(xk|Ci) is usually computed based on Gaussian distribution with a mean μ and ( x− ) standard deviation σ − 1 2 g ( x, , ) = and P(xk|Ci) is 2 e 2 2 P ( X | C i ) = g ( xk , Ci , Ci ) Naïve Bayes Classifier: Training Dataset Class: C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’ Data to be classified: X = (age <=30, Income = medium, Student = yes Credit_rating = Fair) age <=30 <=30 31…40 >40 >40 >40 31…40 <=30 <=30 >40 <=30 31…40 31…40 >40 income studentcredit_rating buys_compu high no fair no high no excellent no high no fair yes medium no fair yes low yes fair yes low yes excellent no low yes excellent yes medium no fair no low yes fair yes medium yes fair yes medium yes excellent yes medium no excellent yes high yes fair yes medium no excellent no 505 Naïve Bayes Classifier: An Example income studentcredit_rating buys_comp high no fair no high no excellent no high no fair yes medium no fair yes low yes fair yes low yes excellent no low yes excellent yes medium no fair no low yes fair yes medium yes fair yes medium yes excellent yes medium no excellent yes high yes fair yes medium no excellent no P(buys_computer = “yes”) = 9/14 = 0.643 P(buys_computer = “no”) = 5/14= 0.357 ◼ Compute P(X|Ci) for each class P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222 P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6 P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444 P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4 P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667 P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2 P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667 P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4 ◼ X = (age <= 30 , income = medium, student = yes, credit_rating = fair) P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019 P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) 506 ◼ P(Ci): age <=30 <=30 31…40 >40 >40 >40 31…40 <=30 <=30 >40 <=30 31…40 31…40 >40 Avoiding the Zero-Probability Problem 507 ◼ Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise, the predicted prob. will be zero P( X | C i ) ◼ ◼ = n P( x k | C i ) k =1 Ex. Suppose a dataset with 1000 tuples, income=low (0), income= medium (990), and income = high (10) Use Laplacian correction (or Laplacian estimator) ◼ ◼ Adding 1 to each case Prob(income = low) = 1/1003 Prob(income = medium) = 991/1003 Prob(income = high) = 11/1003 The “corrected” prob. estimates are close to their “uncorrected” counterparts Naïve Bayes Classifier: Comments ◼ ◼ ◼ Advantages ◼ Easy to implement ◼ Good results obtained in most of the cases Disadvantages ◼ Assumption: class conditional independence, therefore loss of accuracy ◼ Practically, dependencies exist among variables ◼ E.g., hospitals: patients: Profile: age, family history, etc. Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc. ◼ Dependencies among these cannot be modeled by Naïve Bayes Classifier How to deal with these dependencies? Bayesian Belief 508 Chapter 8. 
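Before moving on, the naïve Bayes computation worked through above can be reproduced with a short sketch; the attribute order, helper names, and data layout are assumptions rather than the book's code, and setting laplace=1 applies the Laplacian correction.

```python
from collections import Counter, defaultdict

# (age, income, student, credit_rating, buys_computer)
D = [("<=30", "high", "no", "fair", "no"),     ("<=30", "high", "no", "excellent", "no"),
     ("31..40", "high", "no", "fair", "yes"),  (">40", "medium", "no", "fair", "yes"),
     (">40", "low", "yes", "fair", "yes"),     (">40", "low", "yes", "excellent", "no"),
     ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
     ("<=30", "low", "yes", "fair", "yes"),    (">40", "medium", "yes", "fair", "yes"),
     ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
     ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no")]

def train(data):
    class_counts = Counter(t[-1] for t in data)
    cond_counts = defaultdict(Counter)        # (attribute index, class) -> value counts
    n_values = [len({t[i] for t in data}) for i in range(len(data[0]) - 1)]
    for t in data:
        for i, v in enumerate(t[:-1]):
            cond_counts[(i, t[-1])][v] += 1
    return class_counts, cond_counts, n_values

def classify(x, model, laplace=0):
    class_counts, cond_counts, n_values = model
    n = sum(class_counts.values())
    scores = {}
    for c, count_c in class_counts.items():
        p = count_c / n                       # prior P(Ci)
        for i, v in enumerate(x):             # class-conditional independence assumption
            p *= (cond_counts[(i, c)][v] + laplace) / (count_c + laplace * n_values[i])
        scores[c] = p                         # unnormalized P(X|Ci) * P(Ci)
    return scores

model = train(D)
X = ("<=30", "medium", "yes", "fair")
print(classify(X, model))             # {'no': ~0.007, 'yes': ~0.028}: predict buys_computer = 'yes'
print(classify(X, model, laplace=1))  # with the Laplacian correction; the ranking is unchanged here
```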
Classification: Basic Concepts ◼ Classification: Basic Concepts ◼ Decision Tree Induction ◼ Bayes Classification Methods ◼ Rule-Based Classification ◼ Model Evaluation and Selection ◼ Techniques to Improve Classification Accuracy: Ensemble Methods ◼ Summary 509 Using IF-THEN Rules for Classification ◼ ◼ ◼ Represent the knowledge in the form of IF-THEN rules R: IF age = youth AND student = yes THEN buys_computer = yes ◼ Rule antecedent/precondition vs. rule consequent Assessment of a rule: coverage and accuracy ◼ ncovers = # of tuples covered by R ◼ ncorrect = # of tuples correctly classified by R coverage(R) = ncovers /|D| /* D: training data set */ accuracy(R) = ncorrect / ncovers If more than one rule are triggered, need conflict resolution ◼ Size ordering: assign the highest priority to the triggering rules that has the “toughest” requirement (i.e., with the most attribute tests) ◼ Class-based ordering: decreasing order of prevalence or misclassification cost per class ◼ Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by 510 Rule Extraction from a Decision Tree ◼ ◼ ◼ ◼ ◼ Rules are easier to understand than large trees age? One rule is created for each path from the <=30 31..40 root to a leaf student? yes Each attribute-value pair along a path forms a no yes conjunction: the leaf holds the class no yes prediction Rules are mutually exclusive and exhaustive >40 credit rating? excellent fair yes Example: Rule extraction from our buys_computer decisiontree IF age = young AND student = no no IF age = young AND student = yes yes IF age = mid-age THEN buys_computer = THEN buys_computer = THEN buys_computer = yes 511 Rule Induction: Sequential Covering Method ◼ ◼ ◼ ◼ Sequential covering algorithm: Extracts rules directly from training data Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER Rules are learned sequentially, each for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes Steps: ◼ Rules are learned one at a time ◼ Each time a rule is learned, the tuples covered by the rules are removed ◼ Repeat the process on the remaining tuples until termination condition, e.g., when no more training examples or when the quality of a rule returned is below a user-specified threshold 512 Sequential Covering Algorithm while (enough target tuples left) generate a rule remove positive target tuples satisfying this rule Examples covered Examples coveredby Rule 2 Examples covered by Rule 1 by Rule 3 Positive example s 513 Rule Generation ◼ To generate a rule while(true) find the best predicate p if foil-gain(p) > threshold then add p to current rule else break A3=1&&A1=2 A3=1&&A1=2 &&A8=5 A3=1 Positiv e exampl Negativ e exampl 514 515 How to Learn-One-Rule? ◼ ◼ ◼ Start with the most general rule possible: condition = empty Adding new attributes by adopting a greedy depth-first strategy ◼ Picks the one that most improves the rule quality Rule-Quality measures: consider both coverage and accuracy pos ' pos FOIL _ Gain = pos '(log 2 − log 2 ) ◼ Foil-gain (in FOIL & RIPPER): pos assesses info_gain '+ neg ' pos + neg by extending condition ◼ ◼ favors rules that have high accuracy and cover many positive pos − neg FOIL _ Prune ( R ) = tuples pos + neg Rule pruning based on an independent set of test tuples Chapter 8. 
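A small sketch of the rule-quality measures just defined (coverage, accuracy, and FOIL gain) follows; the toy rule, data, and helper names are hypothetical assumptions for illustration.

```python
from math import log2

def coverage_accuracy(rule, data):
    """rule = (condition, predicted_class); data = list of (tuple, class_label)."""
    condition, predicted = rule
    covered = [(t, y) for t, y in data if condition(t)]
    correct = sum(1 for _, y in covered if y == predicted)
    return len(covered) / len(data), (correct / len(covered) if covered else 0.0)

def foil_gain(pos, neg, pos2, neg2):
    """Gain of extending a rule: (pos, neg) covered before, (pos2, neg2) after the new predicate."""
    return pos2 * (log2(pos2 / (pos2 + neg2)) - log2(pos / (pos + neg)))

# hypothetical rule "IF student = yes THEN buys_computer = yes" on four toy tuples
data = [(("<=30", "yes"), "yes"), (("<=30", "no"), "no"),
        ((">40", "yes"), "yes"), ((">40", "yes"), "no")]
rule = (lambda t: t[1] == "yes", "yes")
print(coverage_accuracy(rule, data))            # (0.75, 0.667): covers 3 of 4 tuples, 2 of them correctly
print(foil_gain(pos=5, neg=5, pos2=4, neg2=1))  # > 0: the added predicate concentrates positive tuples
```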
Classification: Basic Concepts ◼ Classification: Basic Concepts ◼ Decision Tree Induction ◼ Bayes Classification Methods ◼ Rule-Based Classification ◼ Model Evaluation and Selection ◼ Techniques to Improve Classification Accuracy: Ensemble Methods ◼ Summary 516 Model Evaluation and Selection ◼ ◼ ◼ ◼ Evaluation metrics: How can we measure accuracy? Other metrics to consider? Use validation test set of class-labeled tuples instead of training set when assessing accuracy Methods for estimating a classifier’s accuracy: ◼ Holdout method, random subsampling ◼ Cross-validation ◼ Bootstrap Comparing classifiers: ◼ Confidence intervals ◼ Cost-benefit analysis and ROC Curves 517 Classifier Evaluation Metrics: Confusion Matrix Confusion Matrix: Actual class\Predicted class C1 ¬ C1 C1 True Positives (TP) False Negatives (FN) ¬ C1 False Positives (FP) True Negatives (TN) Example of Confusion Matrix: Actual class\Predicted buy_computer buy_computer class = yes = no ◼ ◼ Total buy_computer = yes 6954 46 7000 buy_computer = no 412 2588 3000 Total 7366 2634 10000 Given m classes, an entry, CMi,j in a confusion matrix indicates # of tuples in class i that were labeled by the classifier as class j May have extra rows/columns to provide totals 518 Accuracy, Error Rate, Sensitivity and Specificity A\P ◼ ◼ C ¬C Class Imbalance Problem: C TP FN P ◼ One class may be rare, e.g. ¬C FP TN N fraud, or HIV-positive P’ N’ All ◼ Significant majority of the negative class and minority of Classifier Accuracy, or recognition rate: percentage of the positive class ◼ Sensitivity: True Positive test set tuples that are recognition rate correctly classified ◼ Sensitivity = TP/P Accuracy = (TP + TN)/All Error rate: 1 – accuracy, or ◼ Specificity: True Negative Error rate = (FP + FN)/All recognition rate ◼ Specificity = TN/N ◼ 519 Precision and Recall, and Fmeasures ◼ ◼ ◼ ◼ ◼ ◼ Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive Recall: completeness – what % of positive tuples did the classifier label as positive? Perfect score is 1.0 Inverse relationship between precision & recall F measure (F1 or F-score): harmonic mean of precision and recall, Fß: weighted measure of precision and recall ◼ assigns ß times as much weight to recall as to precision 520 Classifier Evaluation Metrics: Example ◼ Actual Class\Predicted class cancer = yes cancer = no Total Recognition(%) cancer = yes 90 210 300 30.00 (sensitivity cancer = no 140 9560 9700 98.56 (specificity) Total 230 9770 10000 96.40 (accuracy) Precision = 90/230 = 39.13% 30.00% Recall = 90/300 = 521 Holdout & Cross-Validation Methods ◼ ◼ Holdout method ◼ Given data is randomly partitioned into two independent sets ◼ Training set (e.g., 2/3) for model construction ◼ Test set (e.g., 1/3) for accuracy estimation ◼ Random sampling: a variation of holdout ◼ Repeat holdout k times, accuracy = avg. 
of the accuracies obtained Cross-validation (k-fold, where k = 10 is most popular) ◼ Randomly partition the data into k mutually exclusive subsets, each approximately equal size ◼ At i-th iteration, use Di as test set and others as training set ◼ Leave-one-out: k folds where k = # of tuples, for small sized data ◼ *Stratified cross-validation*: folds are stratified so 522 Evaluating Classifier Accuracy: Bootstrap ◼ Bootstrap ◼ Works well with small data sets ◼ Samples the given training tuples uniformly with replacement ◼ ◼ i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set Several bootstrap methods, and a common one is .632 boostrap ◼ ◼ A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap, and the remaining 36.8% form the test set (since (1 – 1/d)d ≈ e-1 = 0.368) Repeat the sampling procedure k times, overall accuracy of the model: 523 Estimating Confidence Intervals: Classifier Models M1 vs. M2 ◼ Suppose we have 2 classifiers, M1 and M2, which one is better? ◼ Use 10-fold cross-validation to obtain and ◼ These mean error rates are just estimates of error on the true population of future data cases ◼ What if the difference between the 2 error rates is just attributed to chance? ◼ Use a test of statistical significance ◼ Obtain confidence limits for our error estimates 524 Estimating Confidence Intervals: Null Hypothesis ◼ Perform 10-fold cross-validation ◼ Assume samples follow a t distribution with k–1 degrees of freedom (here, k=10) ◼ Use t-test (or Student’s t-test) ◼ Null Hypothesis: M1 & M2 are the same ◼ If we can reject null hypothesis, then ◼ we conclude that the difference between M1 & M2 is statistically significant ◼ Chose model with lower error rate 525 Estimating Confidence Intervals: t-test ◼ If only 1 test set available: pairwise comparison ◼ ◼ ◼ ◼ For ith round of 10-fold cross-validation, the same cross and partitioning is used to obtain err(M1)i and err(M2)i Average over 10 rounds to get t-test computes t-statistic with k-1 degrees where of freedom: whesets available: use non-paired t-test If two test re where k1 & k2 are # of cross-validation samples used for M1 & 526 Estimating Confidence Intervals: Table for t-distribution ◼ ◼ Symmetric Significance level, e.g., sig = 0.05 or 5% means M1 & M2 are significantly different for 95% ◼ of population Confidence limit, z = sig/2 527 Estimating Confidence Intervals: Statistical Significance ◼ Are M1 & M2 significantly different? ◼ Compute t. Select significance level (e.g. 
sig = 5%) ◼ Consult table for t-distribution: Find t value corresponding to k-1 degrees of freedom (here, 9) ◼ t-distribution is symmetric: typically upper % points of distribution shown → look up value for confidence limit z=sig/2 (here, 0.025) ◼ If t > z or t < -z, then t value lies in rejection region: ◼ Reject null hypothesis that mean error rates of M1 & M2 are same ◼ Conclude: statistically significant difference between M1 & M2 ◼ Otherwise, conclude that any difference is chance 528 Model Selection: ROC Curves ◼ ◼ ◼ ◼ ◼ ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models Originated from signal detection theory Shows the trade-off between the true positive rate and the false positive rate The area under the ROC curve is a measure of the accuracy of the model Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list ◼ ◼ ◼ ◼ Vertical axis represents the true positive rate Horizontal axis rep. the false positive rate The plot also shows a diagonal line A model with perfect accuracy will have an area of 1.0 529 Issues Affecting Model Selection ◼ Accuracy ◼ ◼ classifier accuracy: predicting class label Speed ◼ time to construct the model (training time) ◼ time to use the model (classification/prediction time) ◼ Robustness: handling noise and missing values ◼ Scalability: efficiency in disk-resident databases ◼ Interpretability ◼ ◼ understanding and insight provided by the model Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules 530 Chapter 8. Classification: Basic Concepts ◼ Classification: Basic Concepts ◼ Decision Tree Induction ◼ Bayes Classification Methods ◼ Rule-Based Classification ◼ Model Evaluation and Selection ◼ Techniques to Improve Classification Accuracy: Ensemble Methods ◼ Summary 531 Ensemble Methods: Increasing the Accuracy ◼ ◼ Ensemble methods ◼ Use a combination of models to increase accuracy ◼ Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M* Popular ensemble methods ◼ Bagging: averaging the prediction over a collection of classifiers ◼ Boosting: weighted vote with a collection of classifiers ◼ Ensemble: combining a set of heterogeneous classifiers 532 Bagging: Boostrap Aggregation ◼ ◼ ◼ ◼ ◼ Analogy: Diagnosis based on multiple doctors’ majority vote Training ◼ Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap) ◼ A classifier model Mi is learned for each training set Di Classification: classify an unknown sample X ◼ Each classifier Mi returns its class prediction ◼ The bagged classifier M* counts the votes and assigns the class with the most votes to X Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple Accuracy ◼ Often significantly better than a single classifier derived from D ◼ For noise data: not considerably worse, more robust ◼ Proved improved accuracy in prediction 533 Boosting ◼ ◼ ◼ ◼ Analogy: Consult several doctors, based on a combination of weighted diagnoses—weight assigned based on the previous diagnosis accuracy How boosting works? 
◼ Weights are assigned to each training tuple ◼ A series of k classifiers is iteratively learned ◼ After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi ◼ The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy Boosting algorithm can be extended for numeric prediction Comparing with bagging: Boosting tends to have greater 534 535 Adaboost (Freund and Schapire, 1997) ◼ ◼ ◼ ◼ Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd) Initially, all the weights of tuples are set the same (1/d) Generate k classifiers in k rounds. At round i, ◼ Tuples from D are sampled (with replacement) to form a training set Di of the same size ◼ Each tuple’s chance of being selected is based on its weight ◼ A classification model Mi is derived from Di ◼ Its error rate is calculated using Di as a test set ◼ If a tuple is misclassified, its weight is increased, o.w. it is decreased Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi error rate is the sum of the dweights of the misclassified tuples: error ( M i ) = w j err ( X j ) j ◼ The weight of classifier Mi’s vote is log 1 − error ( M i ) error (M i ) Random Forest (Breiman 2001) ◼ ◼ ◼ ◼ Random Forest: ◼ Each classifier in the ensemble is a decision tree classifier and is generated using a random selection of attributes at each node to determine the split ◼ During classification, each tree votes and the most popular class is returned Two Methods to construct Random Forest: ◼ Forest-RI (random input selection): Randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size ◼ Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear combination of the existing attributes (reduces the correlation between individual classifiers) Comparable in accuracy to Adaboost, but more robust to errors and outliers Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting 536 Classification of Class-Imbalanced Data Sets ◼ ◼ ◼ Class-imbalance problem: Rare positive example but numerous negative ones, e.g., medical diagnosis, fraud, oilspill, fault, etc. Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for classimbalanced data Typical methods for imbalance data in 2-class classification: ◼ Oversampling: re-sampling of data from positive class ◼ Under-sampling: randomly eliminate tuples from negative class ◼ Threshold-moving: moves the decision threshold, t, so that the rare class tuples are easier to classify, and hence, less chance of costly false negative errors ◼ Ensemble techniques: Ensemble multiple classifiers introduced above 537 Chapter 8. Classification: Basic Concepts ◼ Classification: Basic Concepts ◼ Decision Tree Induction ◼ Bayes Classification Methods ◼ Rule-Based Classification ◼ Model Evaluation and Selection ◼ Techniques to Improve Classification Accuracy: Ensemble Methods ◼ Summary 538 Summary (I) ◼ ◼ ◼ ◼ Classification is a form of data analysis that extracts models describing important data classes. Effective and scalable methods have been developed for decision tree induction, Naive Bayesian classification, rulebased classification, and many other classification methods. 
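The AdaBoost loop described above (per-tuple weights, a per-round error rate, and a vote weight of log((1 − error)/error)) can be sketched in a few lines of Python. This is a minimal illustration only: it uses the reweighting form with scikit-learn decision stumps as the weak learner and assumes binary labels in {−1, +1}, whereas the slides' pseudocode resamples the training set Di according to the weights.

```python
# Minimal AdaBoost-style sketch (reweighting variant; the slides sample D_i
# with replacement according to the tuple weights instead).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, k=10):
    """y must be in {-1, +1}; returns a list of (classifier, vote_weight)."""
    d = len(y)
    w = np.full(d, 1.0 / d)                # initially every tuple has weight 1/d
    ensemble = []
    for _ in range(k):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)   # weak learner M_i trained on weighted data
        miss = stump.predict(X) != y
        error = np.sum(w[miss])            # error(M_i) = sum of weights of misclassified tuples
        if error >= 0.5:                   # stop if the weak learner is no better than random
            break
        alpha = np.log((1 - error) / (error + 1e-12))  # weight of M_i's vote
        # increase the weights of misclassified tuples, then renormalize
        w *= np.exp(alpha * miss)
        w /= w.sum()
        ensemble.append((stump, alpha))
    return ensemble

def adaboost_predict(ensemble, X):
    votes = sum(alpha * clf.predict(X) for clf, alpha in ensemble)
    return np.sign(votes)
```

Reweighting and weighted resampling are interchangeable here in expectation; the resampling form in the slides is convenient when the base learner cannot accept instance weights directly.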
Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure, and Fß measure. Stratified k-fold cross-validation is recommended for accuracy estimation. Bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models. 539 Summary (II) ◼ Significance tests and ROC curves are useful for model selection. ◼ There have been numerous comparisons of the different classification methods; the matter remains a research topic ◼ No single method has been found to be superior over all others for all data sets ◼ Issues such as accuracy, training time, robustness, scalability, and interpretability must be considered and can involve trade-offs, further complicating the quest for an overall superior method 540 References (1) ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997 C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995 L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984 C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998 P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. KDD'95 H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for Effective Classification, ICDE'07 H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for Effective Classification, ICDE'08 W. Cohen. Fast effective rule induction. ICML'95 G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. SIGMOD'05 541 References (2) ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990. G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001 U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94. Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 1997. J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree construction of large datasets. VLDB’98. J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree Construction. SIGMOD'99. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001. D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian 542 References (3) ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000. J. Magidson. The Chaid approach to segmentation modeling: Chisquared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994. M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining. EDBT'96. T. M. Mitchell. Machine Learning. McGraw Hill, 1997. S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345389, 1998 J. R. Quinlan. 
Induction of decision trees. Machine Learning, 1:81-106, 1986. J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. 543 References (4) ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and pruning. VLDB’98. J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data mining. VLDB’96. J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990. P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005. S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufman, 1991. S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997. I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2ed. Morgan Kaufmann, 2005. X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03 H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with 544 CS412 Midterm Exam Statistics ◼ ◼ ◼ Opinion Question Answering: ◼ Like the style: 70.83%, dislike: 29.16% ◼ Exam is hard: 55.75%, easy: 0.6%, just right: 43.63% ◼ Time: plenty:3.03%, enough: 36.96%, not: 60% ◼ <40: 2 ◼ 60-69: 37 Score distribution: # of students (Total: 180) ◼ 50-59: 15 ◼ >=90: 24 ◼ 40-49: 2 ◼ 80-89: 54 ◼ 70-79: 46 Final grading are based on overall score 546 Issues: Evaluating Classification Methods ◼ ◼ ◼ ◼ ◼ ◼ Accuracy ◼ classifier accuracy: predicting class label ◼ predictor accuracy: guessing value of predicted attributes Speed ◼ time to construct the model (training time) ◼ time to use the model (classification/prediction time) Robustness: handling noise and missing values Scalability: efficiency in disk-resident databases Interpretability ◼ understanding and insight provided by the model Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules 547 Predictor Error Measures ◼ ◼ ◼ Measure predictor accuracy: measure how far off the predicted value is from the actual known value Loss function: measures the error betw. yi and the predicted value yi’ ◼ Absolute error: | yi – yi’| ◼ Squared error: (yi – yi’)2 Test error (generalization error): the average loss over the test set d d ◼ Mean absolute error: | y i =1 i − yi ' | Mean squared error: (y i =1 d Relative absolute error: | y i =1 d i | y i =1 − yi ' ) 2 d ( yi − yi ' ) 2 d d ◼ i i − yi ' | −y| Relative squared error: i =1 d ( y − y) The mean squared-error exaggerates the presence of outliers i =1 2 i Popularly use (square) root mean-square error, similarly, root relative squared error 548 Scalable Decision Tree Induction Methods ◼ ◼ ◼ ◼ ◼ SLIQ (EDBT’96 — Mehta et al.) ◼ Builds an index for each attribute and only class list and the current attribute list reside in memory SPRINT (VLDB’96 — J. Shafer et al.) 
◼ Constructs an attribute list data structure PUBLIC (VLDB’98 — Rastogi & Shim) ◼ Integrates tree splitting and tree pruning: stop growing the tree earlier RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti) ◼ Builds an AVC-list (attribute, value, class label) BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh) ◼ Uses bootstrapping to create several small samples 549 Data Cube-Based Decision-Tree Induction ◼ ◼ Integration of generalization with decision-tree induction (Kamber et al.’97) Classification at primitive concept levels ◼ ◼ ◼ ◼ E.g., precise temperature, humidity, outlook, etc. Low-level concepts, scattered classes, bushy classification-trees Semantic interpretation problems Cube-based multi-level classification ◼ Relevance analysis at multi-levels ◼ Information-gain analysis with dimension + level 550 Chapter 10. Cluster Analysis: Basic Concepts and Methods ◼ Cluster Analysis: Basic Concepts ◼ Partitioning Methods ◼ Hierarchical Methods ◼ Density-Based Methods ◼ Grid-Based Methods ◼ Evaluation of Clustering ◼ Summary 551 Data Mining: Concepts and Techniques (3rd ed.) — Chapter 9 — Classification: Advanced Methods Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved. 552 Chapter 9. Classification: Advanced Methods ◼ Bayesian Belief Networks ◼ Classification by Backpropagation ◼ Support Vector Machines ◼ Classification by Using Frequent Patterns ◼ Lazy Learners (or Learning from Your Neighbors) ◼ Other Classification Methods ◼ Additional Topics Regarding Classification ◼ Summary 553 Bayesian Belief Networks ◼ Bayesian belief networks (also known as Bayesian networks, probabilistic networks): allow class conditional independencies between subsets of variables ◼ A (directed acyclic) graphical model of causal relationships ◼ ◼ Represents dependency among the variables Gives a specification of joint probability distribution ❑ Nodes: random variables ❑ Links: dependency Y X Z ❑ X and Y are the parents of Z, and Y is the parent of P P ❑ No dependency between Z and P ❑ Has no loops/cycles 554 Bayesian Belief Network: An Example Family History (FH) CPT: Conditional Probability Table Smoker (S) for variable LungCancer: (FH, S) (FH, ~S) (~FH, S) (~FH, ~S) LungCancer (LC) Emphysema LC 0.8 0.5 0.7 0.1 ~LC 0.2 0.5 0.3 0.9 shows the conditional probability for each possible combination of its parents PositiveXRay Dyspnea Bayesian Belief Network Derivation of the probability of a particular combination of values of X, from CPT: n P( x1 ,..., xn ) = P( xi | Parents(Y i )) i =1 555 Training Bayesian Networks: Several Scenarios ◼ ◼ ◼ ◼ ◼ Scenario 1: Given both the network structure and all variables observable: compute only the CPT entries Scenario 2: Network structure known, some variables hidden: gradient descent (greedy hill-climbing) method, i.e., search for a solution along the steepest descent of a criterion function ◼ Weights are initialized to random probability values ◼ At each iteration, it moves towards what appears to be the best solution at the moment, w.o. backtracking ◼ Weights are updated at each iteration & converge to local optimum Scenario 3: Network structure unknown, all variables observable: search through the model space to reconstruct network topology Scenario 4: Unknown structure, all hidden variables: No good algorithms known for this purpose D. Heckerman. A Tutorial on Learning with Bayesian Networks. In Learning in Graphical Models, M. Jordan, ed.. MIT Press, 1999. 
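The factorization on the slide above, P(x1, …, xn) = ∏i P(xi | Parents(xi)), is easy to see on the FamilyHistory/Smoker/LungCancer fragment. In the sketch below the LungCancer CPT values come from the example slide, while the priors for FamilyHistory and Smoker are invented numbers used purely so the example runs.

```python
# Toy illustration of P(x1,...,xn) = prod_i P(x_i | Parents(x_i)) for part of the
# FamilyHistory / Smoker / LungCancer network. The LC CPT matches the slide;
# the priors P(FH) and P(S) are assumptions made for this sketch only.
P_FH = {True: 0.1, False: 0.9}        # assumed prior, not from the slide
P_S  = {True: 0.3, False: 0.7}        # assumed prior, not from the slide
P_LC_given = {                         # CPT for LungCancer given (FH, S), from the slide
    (True,  True):  0.8, (True,  False): 0.5,
    (False, True):  0.7, (False, False): 0.1,
}

def joint(fh, s, lc):
    """P(FH=fh, S=s, LC=lc) via the chain-rule factorization over the network."""
    p_lc = P_LC_given[(fh, s)] if lc else 1.0 - P_LC_given[(fh, s)]
    return P_FH[fh] * P_S[s] * p_lc

# e.g., probability of (no family history, smoker, lung cancer)
print(joint(False, True, True))   # 0.9 * 0.3 * 0.7 = 0.189
```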
556 Chapter 9. Classification: Advanced Methods ◼ Bayesian Belief Networks ◼ Classification by Backpropagation ◼ Support Vector Machines ◼ Classification by Using Frequent Patterns ◼ Lazy Learners (or Learning from Your Neighbors) ◼ Other Classification Methods ◼ Additional Topics Regarding Classification ◼ Summary 557 Classification by Backpropagation ◼ ◼ ◼ ◼ ◼ Backpropagation: A neural network learning algorithm Started by psychologists and neurobiologists to develop and test computational analogues of neurons A neural network: A set of connected input/output units where each connection has a weight associated with it During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples Also referred to as connectionist learning due to the connections between units 558 Neural Network as a Classifier ◼ Weakness ◼ ◼ ◼ ◼ Long training time Require a number of parameters typically best determined empirically, e.g., the network topology or “structure.” Poor interpretability: Difficult to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network Strength ◼ High tolerance to noisy data ◼ Ability to classify untrained patterns ◼ Well-suited for continuous-valued inputs and outputs ◼ Successful on an array of real-world data, e.g., hand-written letters ◼ Algorithms are inherently parallel ◼ Techniques have recently been developed for the extraction of rules from trained neural networks 559 A Multi-Layer Feed-Forward Neural Network Output vector w(jk +1) = w(jk ) + ( yi − yˆi( k ) ) xij Output layer Hidden layer wij Input layer Input vector: X 560 How A Multi-Layer Neural Network Works ◼ ◼ The inputs to the network correspond to the attributes measured for each training tuple Inputs are fed simultaneously into the units making up the input layer ◼ They are then weighted and fed simultaneously to a hidden layer ◼ The number of hidden layers is arbitrary, although usually only one ◼ ◼ ◼ The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction The network is feed-forward: None of the weights cycles back to an input unit or to an output unit of a previous layer From a statistical point of view, networks perform nonlinear regression: Given enough hidden units and enough training samples, they can closely approximate any function 561 Defining a Network Topology ◼ ◼ ◼ ◼ ◼ Decide the network topology: Specify # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer Normalize the input values for each attribute measured in the training tuples to [0.0—1.0] One input unit per domain value, each initialized to 0 Output, if for classification and more than two classes, one output unit per class is used Once a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights 562 Backpropagation ◼ Iteratively process a set of training tuples & compare the network's prediction with the actual known target value ◼ For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value ◼ Modifications are made in the “backwards” direction: from the output layer, through each hidden layer down to the first hidden layer, hence “backpropagation” ◼ Steps ◼ Initialize weights to small random numbers, 
associated with biases ◼ Propagate the inputs forward (by applying activation function) ◼ Backpropagate the error (by updating weights and biases) ◼ Terminating condition (when error is very small, etc.) 563 Neuron: A Hidden/Output Layer Unit bias x0 w0 x1 w1 xn k f wn output y For Example n Input weight vector x vector w ◼ ◼ weighted sum Activation function y = sign( wi xi − k ) i =0 An n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping The inputs to unit are outputs from the previous layer. They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with unit. Then a nonlinear activation function is applied to it. 564 Efficiency and Interpretability ◼ ◼ Efficiency of backpropagation: Each epoch (one iteration through the training set) takes O(|D| * w), with |D| tuples and w weights, but # of epochs can be exponential to n, the number of inputs, in worst case For easier comprehension: Rule extraction by network pruning ◼ ◼ ◼ ◼ Simplify the network structure by removing weighted links that have the least effect on the trained network Then perform link, unit, or activation value clustering The set of input and activation values are studied to derive rules describing the relationship between the input and hidden unit layers Sensitivity analysis: assess the impact that a given input variable has on a network output. The knowledge gained from this analysis can be represented in rules 565 Chapter 9. Classification: Advanced Methods ◼ Bayesian Belief Networks ◼ Classification by Backpropagation ◼ Support Vector Machines ◼ Classification by Using Frequent Patterns ◼ Lazy Learners (or Learning from Your Neighbors) ◼ Other Classification Methods ◼ Additional Topics Regarding Classification ◼ Summary 566 Classification: A Mathematical Mapping ◼ ◼ ◼ Classification: predicts categorical class labels ◼ E.g., Personal homepage classification ◼ xi = (x1, x2, x3, …), yi = +1 or –1 ◼ x1 : # of word “homepage” x ◼ x2 : # of word “welcome” x x x x n Mathematically, x X = , y Y = {+1, –1}, x o x x x ◼ We want to derive a function f: X → Y o o o x Linear Classification ooo o o ◼ Binary Classification problem o o o o ◼ Data above the red line belongs to class ‘x’ ◼ Data below red line belongs to class ‘o’ ◼ Examples: SVM, Perceptron, Probabilistic Classifiers 567 Discriminative Classifiers ◼ ◼ Advantages ◼ Prediction accuracy is generally high ◼ As compared to Bayesian methods – in general ◼ Robust, works when training examples contain errors ◼ Fast evaluation of the learned target function ◼ Bayesian networks are normally slow Criticism ◼ Long training time ◼ Difficult to understand the learned function (weights) ◼ Bayesian networks can be used easily for pattern discovery ◼ Not easy to incorporate domain knowledge ◼ Easy in the form of priors on the data or distributions 568 SVM—Support Vector Machines ◼ ◼ ◼ ◼ ◼ A relatively new classification method for both linear and nonlinear data It uses a nonlinear mapping to transform the original training data into a higher dimension With the new dimension, it searches for the linear optimal separating hyperplane (i.e., “decision boundary”) With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors) 569 SVM—History and Applications ◼ Vapnik and colleagues 
(1992)—groundwork from Vapnik & Chervonenkis’ statistical learning theory in 1960s ◼ Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization) ◼ Used for: classification and numeric prediction ◼ Applications: ◼ handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests 570 SVM—General Philosophy Small Margin Large Margin Support Vectors 571 SVM—Margins and Support Vectors Data Mining: Concepts and Techniques June 7, 2020 572 SVM—When Data Is Linearly Separable m Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples associated with the class labels yi There are infinite lines (hyperplanes) separating the two classes but we want to find the best one (the one that minimizes classification error on unseen data) SVM searches for the hyperplane with the largest margin, 573 SVM—Linearly Separable ◼ A separating hyperplane can be written as W●X+b=0 where W={w1, w2, …, wn} is a weight vector and b a scalar (bias) ◼ For 2-D it can be written as w0 + w1 x1 + w2 x2 = 0 ◼ The hyperplane defining the sides of the margin: H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1 ◼ ◼ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors This becomes a constrained (convex) quadratic optimization problem: Quadratic objective function and linear constraints → Quadratic Programming (QP) → Lagrangian multipliers 574 Why Is SVM Effective on High Dimensional Data? ◼ The complexity of trained classifier is characterized by the # of support vectors rather than the dimensionality of the data ◼ The support vectors are the essential or critical training examples — they lie closest to the decision boundary (MMH) ◼ If all other training examples are removed and the training is repeated, the same separating hyperplane would be found ◼ The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality ◼ Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high 575 SVM—Linearly Inseparable ◼ ◼ A2 Transform the original input data into a higher dimensional space A1 Search for a linear separating hyperplane in the new space 576 SVM: Different Kernel functions ◼ ◼ ◼ Instead of computing the dot product on the transformed data, it is math. equivalent to applying a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) Φ(Xj) Typical Kernel Functions SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters) 577 Scaling SVM by Hierarchical Micro-Clustering ◼ SVM is not scalable to the number of data objects in terms of training time and memory usage ◼ H. Yu, J. Yang, and J. 
Han, “Classifying Large Data Sets Using SVM with Hierarchical Clusters”, KDD'03) ◼ CB-SVM (Clustering-Based SVM) ◼ Given limited amount of system resources (e.g., memory), maximize the SVM performance in terms of accuracy and the training speed ◼ ◼ Use micro-clustering to effectively reduce the number of points to be considered At deriving support vectors, de-cluster micro-clusters near “candidate vector” to ensure high classification accuracy 578 CF-Tree: Hierarchical Micro-cluster ◼ ◼ Read the data set once, construct a statistical summary of the data (i.e., hierarchical clusters) given a limited amount of memory Micro-clustering: Hierarchical indexing structure ◼ provide finer samples closer to the boundary and coarser samples farther from the boundary 579 Selective Declustering: Ensure High Accuracy ◼ CF tree is a suitable base structure for selective declustering ◼ De-cluster only the cluster Ei such that ◼ ◼ Di – Ri < Ds, where Di is the distance from the boundary to the center point of Ei and Ri is the radius of Ei Decluster only the cluster whose subclusters have possibilities to be the support cluster of the boundary ◼ “Support cluster”: The cluster whose centroid is a support vector 580 CB-SVM Algorithm: Outline ◼ ◼ ◼ ◼ ◼ Construct two CF-trees from positive and negative data sets independently ◼ Need one scan of the data set Train an SVM from the centroids of the root entries De-cluster the entries near the boundary into the next level ◼ The children entries de-clustered from the parent entries are accumulated into the training set with the non-declustered parent entries Train an SVM again from the centroids of the entries in the training set Repeat until nothing is accumulated 581 Accuracy and Scalability on Synthetic Dataset ◼ Experiments on large synthetic data sets shows better accuracy than random sampling approaches and far more scalable than the original SVM algorithm 582 SVM vs. Neural Network ◼ SVM ◼ ◼ ◼ ◼ Deterministic algorithm Nice generalization properties Hard to learn – learned in batch mode using quadratic programming techniques Using kernels can learn very complex functions ◼ Neural Network ◼ ◼ ◼ ◼ Nondeterministic algorithm Generalizes well but doesn’t have strong mathematical foundation Can easily be learned in incremental fashion To learn complex functions—use multilayer perceptron (nontrivial) 583 SVM Related Links ◼ SVM Website: http://www.kernel-machines.org/ ◼ Representative implementations ◼ LIBSVM: an efficient implementation of SVM, multi- class classifications, nu-SVM, one-class SVM, including also various interfaces with java, python, etc. ◼ SVM-light: simpler but performance is not better than LIBSVM, support only binary classification and only in C ◼ SVM-torch: another recent implementation also written in C 584 Chapter 9. Classification: Advanced Methods ◼ Bayesian Belief Networks ◼ Classification by Backpropagation ◼ Support Vector Machines ◼ Classification by Using Frequent Patterns ◼ Lazy Learners (or Learning from Your Neighbors) ◼ Other Classification Methods ◼ Additional Topics Regarding Classification ◼ Summary 585 Associative Classification ◼ Associative classification: Major steps ◼ Mine data to find strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels ◼ Association rules are generated in the form of P1 ^ p2 … ^ pl → “Aclass = C” (conf, sup) ◼ ◼ Organize the rules to form a rule-based classifier Why effective? 
◼ It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time ◼ Associative classification has been found to be often more accurate than some traditional classification methods, such as C4.5 586 Typical Associative Classification Methods ◼ CBA (Classification Based on Associations: Liu, Hsu & Ma, KDD’98) ◼ Mine possible association rules in the form of ◼ ◼ ◼ Build classifier: Organize rules according to decreasing precedence based on confidence and then support CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01) ◼ ◼ Cond-set (a set of attribute-value pairs) → class label Classification: Statistical analysis on multiple rules CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM’03) ◼ Generation of predictive rules (FOIL-like analysis) but allow covered rules to retain with reduced weight ◼ Prediction using best k rules ◼ High efficiency, accuracy similar to CMAR 587 Frequent Pattern-Based Classification ◼ ◼ ◼ H. Cheng, X. Yan, J. Han, and C.-W. Hsu, “Discriminative Frequent Pattern Analysis for Effective Classification”, ICDE'07 Accuracy issue ◼ Increase the discriminative power ◼ Increase the expressive power of the feature space Scalability issue ◼ It is computationally infeasible to generate all feature combinations and filter them with an information gain threshold ◼ Efficient method (DDPMine: FPtree pruning): H. Cheng, X. Yan, J. Han, and P. S. Yu, "Direct Discriminative Pattern Mining for Effective Classification", ICDE'08 588 Frequent Pattern vs. Single Feature The discriminative power of some frequent patterns is higher than that of single features. (a) Austral (b) Cleve (c) Sonar Fig. 1. Information Gain vs. Pattern Length 589 Empirical Results 1 InfoGain IG_UpperBnd 0.9 0.8 Information Gain 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 100 200 300 400 500 600 700 Support (a) Austral (b) Breast (c) Sonar Fig. 2. Information Gain vs. Pattern Frequency 590 Feature Selection ◼ ◼ ◼ Given a set of frequent patterns, both non-discriminative and redundant patterns exist, which can cause overfitting We want to single out the discriminative patterns and remove redundant ones The notion of Maximal Marginal Relevance (MMR) is borrowed ◼ A document has high marginal relevance if it is both relevant to the query and contains minimal marginal similarity to previously selected documents 591 Experimental Results 592 592 Scalability Tests 593 DDPMine: Branch-and-Bound Search sup( child ) sup( parent ) sup( b) sup( a ) a b a: constant, a parent node b: variable, a descendent Association between information gain and frequency 594 DDPMine Efficiency: Runtime PatClass Harmony PatClass: ICDE’07 Pattern Classification Alg. DDPMine 595 Chapter 9. Classification: Advanced Methods ◼ Bayesian Belief Networks ◼ Classification by Backpropagation ◼ Support Vector Machines ◼ Classification by Using Frequent Patterns ◼ Lazy Learners (or Learning from Your Neighbors) ◼ Other Classification Methods ◼ Additional Topics Regarding Classification ◼ Summary 596 Lazy vs. Eager Learning ◼ ◼ ◼ Lazy vs. 
eager learning ◼ Lazy learning (e.g., instance-based learning): Simply stores training data (or only minor processing) and waits until it is given a test tuple ◼ Eager learning (the above discussed methods): Given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify Lazy: less time in training but more time in predicting Accuracy ◼ Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form an implicit global approximation to the target function ◼ Eager: must commit to a single hypothesis that covers the entire instance space 597 Lazy Learner: Instance-Based Methods ◼ ◼ Instance-based learning: ◼ Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified Typical approaches ◼ k-nearest neighbor approach ◼ Instances represented as points in a Euclidean space. ◼ Locally weighted regression ◼ Constructs local approximation ◼ Case-based reasoning ◼ Uses symbolic representations and knowledgebased inference 598 The k-Nearest Neighbor Algorithm ◼ ◼ ◼ ◼ ◼ All instances correspond to points in the n-D space The nearest neighbor are defined in terms of Euclidean distance, dist(X1, X2) Target function could be discrete- or real- valued For discrete-valued, k-NN returns the most common value among the k training examples nearest to xq Vonoroi diagram: the decision surface induced by 1NN for a typical set of training examples . _ _ _ _ + .+ _ xq + _ + . . . . 599 Discussion on the k-NN Algorithm ◼ ◼ k-NN for real-valued prediction for a given unknown tuple ◼ Returns the mean values of the k nearest neighbors Distance-weighted nearest neighbor algorithm ◼ Weight the contribution of each of the k neighbors according to their distance to the query xq 1 ◼ ◼ ◼ Give greater weight to closer neighbors w d ( xq , x )2 i Robust to noisy data by averaging k-nearest neighbors Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes ◼ To overcome it, axes stretch or elimination of the least relevant attributes 600 Case-Based Reasoning (CBR) ◼ ◼ CBR: Uses a database of problem solutions to solve new problems Store symbolic description (tuples or cases)—not points in a Euclidean space ◼ Applications: Customer-service (product-related diagnosis), legal ruling ◼ Methodology ◼ ◼ ◼ ◼ Instances represented by rich symbolic descriptions (e.g., function graphs) Search for similar cases, multiple retrieved cases may be combined Tight coupling between case retrieval, knowledge-based reasoning, and problem solving Challenges ◼ ◼ Find a good similarity metric Indexing based on syntactic similarity measure, and when failure, backtracking, and adapting to additional cases 601 Chapter 9. 
Classification: Advanced Methods ◼ Bayesian Belief Networks ◼ Classification by Backpropagation ◼ Support Vector Machines ◼ Classification by Using Frequent Patterns ◼ Lazy Learners (or Learning from Your Neighbors) ◼ Other Classification Methods ◼ Additional Topics Regarding Classification ◼ Summary 602 Genetic Algorithms (GA) ◼ Genetic Algorithm: based on an analogy to biological evolution ◼ An initial population is created consisting of randomly generated rules ◼ ◼ ◼ Each rule is represented by a string of bits ◼ E.g., if A1 and ¬A2 then C2 can be encoded as 100 ◼ If an attribute has k > 2 values, k bits can be used Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring The fitness of a rule is represented by its classification accuracy on a set of training examples ◼ Offspring are generated by crossover and mutation ◼ The process continues until a population P evolves when each rule in P satisfies a prespecified threshold ◼ Slow but easily parallelizable 603 Rough Set Approach ◼ ◼ ◼ Rough sets are used to approximately or “roughly” define equivalent classes A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C) Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) is used to reduce the computation intensity 604 Fuzzy Set Approaches ◼ ◼ Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph) Attribute values are converted to fuzzy values. Ex.: ◼ ◼ ◼ ◼ Income, x, is assigned a fuzzy membership value to each of the discrete categories {low, medium, high}, e.g. $49K belongs to “medium income” with fuzzy value 0.15 but belongs to “high income” with fuzzy value 0.96 Fuzzy membership values do not have to sum to 1. Each applicable rule contributes a vote for membership in the categories Typically, the truth values for each predicted category are summed, and these sums are combined 605 Chapter 9. Classification: Advanced Methods ◼ Bayesian Belief Networks ◼ Classification by Backpropagation ◼ Support Vector Machines ◼ Classification by Using Frequent Patterns ◼ Lazy Learners (or Learning from Your Neighbors) ◼ Other Classification Methods ◼ Additional Topics Regarding Classification ◼ Summary 606 Multiclass Classification ◼ Classification involving more than two classes (i.e., > 2 Classes) ◼ Method 1. One-vs.-all (OVA): Learn a classifier one at a time ◼ ◼ Given m classes, train m classifiers: one for each class ◼ Classifier j: treat tuples in class j as positive & all others as negative ◼ To classify a tuple X, the set of classifiers vote as an ensemble Method 2. All-vs.-all (AVA): Learn a classifier for each pair of classes ◼ Given m classes, construct m(m-1)/2 binary classifiers ◼ A classifier is trained using tuples of the two classes ◼ ◼ To classify a tuple X, each classifier votes. 
X is assigned to the class with maximal vote Comparison ◼ ◼ All-vs.-all tends to be superior to one-vs.-all Problem: Binary classifier is sensitive to errors, and errors affect vote count 607 Error-Correcting Codes for Multiclass Classification ◼ ◼ Originally designed to correct errors during data transmission for communication tasks by exploring data redundancy Example ◼ A 7-bit codeword associated with classes 1-4 Class Error-Corr. Codeword C1 1 1 1 1 1 1 1 C2 0 0 0 0 1 1 1 C3 0 0 1 1 0 0 1 C4 0 1 0 1 0 1 0 Given a unknown tuple X, the 7-trained classifiers output: 0001010 ◼ Hamming distance: # of different bits between two codewords ◼ H(X, C1) = 5, by checking # of bits between [1111111] & [0001010] ◼ H(X, C2) = 3, H(X, C3) = 3, H(X, C4) = 1, thus C4 as the label for X Error-correcting codes can correct up to (h-1)/h 1-bit error, where h is the minimum Hamming distance between any two codewords If we use 1-bit per class, it is equiv. to one-vs.-all approach, the code are insufficient to self-correct When selecting error-correcting codes, there should be good row-wise and col.-wise separation between the codewords ◼ ◼ ◼ ◼ 608 Semi-Supervised Classification ◼ ◼ ◼ ◼ Semi-supervised: Uses labeled and unlabeled data to build a classifier Self-training: ◼ Build a classifier using the labeled data ◼ Use it to label the unlabeled data, and those with the most confident label prediction are added to the set of labeled data ◼ Repeat the above process ◼ Adv: easy to understand; disadv: may reinforce errors Co-training: Use two or more classifiers to teach each other ◼ Each learner uses a mutually independent set of features of each tuple to train a good classifier, say f1 ◼ Then f1 and f2 are used to predict the class label for unlabeled data X ◼ Teach each other: The tuple having the most confident prediction from f1 is added to the set of labeled data for f2, & vice versa Other methods, e.g., joint probability distribution of features and labels 609 Active Learning ◼ ◼ ◼ ◼ ◼ Class labels are expensive to obtain Active learner: query human (oracle) for labels Pool-based approach: Uses a pool of unlabeled data ◼ L: a small subset of D is labeled, U: a pool of unlabeled data in D ◼ Use a query function to carefully select one or more tuples from U and request labels from an oracle (a human annotator) ◼ The newly labeled samples are added to L, and learn a model ◼ Goal: Achieve high accuracy using as few labeled data as possible Evaluated using learning curves: Accuracy as a function of the number of instances queried (# of tuples to be queried should be small) Research issue: How to choose the data tuples to be queried? ◼ Uncertainty sampling: choose the least certain ones ◼ Reduce version space, the subset of hypotheses consistent w. 
the training data ◼ Reduce expected entropy over U: Find the greatest reduction in the total number of incorrect predictions 610 Transfer Learning: Conceptual Framework ◼ ◼ ◼ Transfer learning: Extract knowledge from one or more source tasks and apply the knowledge to a target task Traditional learning: Build a new classifier for each new task Transfer learning: Build new classifier by applying existing knowledge learned from source tasks Traditional Learning Transfer Learning 611 Transfer Learning: Methods and Applications ◼ ◼ ◼ ◼ Applications: Especially useful when data is outdated or distribution changes, e.g., Web document classification, e-mail spam filtering Instance-based transfer learning: Reweight some of the data from source tasks and use it to learn the target task TrAdaBoost (Transfer AdaBoost) ◼ Assume source and target data each described by the same set of attributes (features) & class labels, but rather diff. distributions ◼ Require only labeling a small amount of target data ◼ Use source data in training: When a source tuple is misclassified, reduce the weight of such tupels so that they will have less effect on the subsequent classifier Research issues ◼ Negative transfer: When it performs worse than no transfer at all ◼ Heterogeneous transfer learning: Transfer knowledge from different feature space or multiple source domains ◼ Large-scale transfer learning 612 Chapter 9. Classification: Advanced Methods ◼ Bayesian Belief Networks ◼ Classification by Backpropagation ◼ Support Vector Machines ◼ Classification by Using Frequent Patterns ◼ Lazy Learners (or Learning from Your Neighbors) ◼ Other Classification Methods ◼ Additional Topics Regarding Classification ◼ Summary 613 Summary ◼ Effective and advanced classification methods ◼ Bayesian belief network (probabilistic networks) ◼ Backpropagation (Neural networks) ◼ Support Vector Machine (SVM) ◼ Pattern-based classification ◼ ◼ Other classification methods: lazy learners (KNN, case-based reasoning), genetic algorithms, rough set and fuzzy set approaches Additional Topics on Classification ◼ Multiclass classification ◼ Semi-supervised classification ◼ Active learning ◼ Transfer learning 614 References ◼ Please see the references of Chapter 8 615 Surplus Slides What Is Prediction? ◼ ◼ ◼ ◼ (Numerical) prediction is similar to classification ◼ construct a model ◼ use model to predict continuous or ordered value for a given input Prediction is different from classification ◼ Classification refers to predict categorical class label ◼ Prediction models continuous-valued functions Major method for prediction: regression ◼ model the relationship between one or more independent or predictor variables and a dependent or response variable Regression analysis ◼ Linear and multiple regression ◼ Non-linear regression ◼ Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees 617 Linear Regression ◼ Linear regression: involves a response variable y and a single predictor variable x y = w0 + w1 x where w0 (y-intercept) and w1 (slope) are regression coefficients ◼ Method of least squares: estimates the best-fitting straight line | D| (x − x )( yi − y ) w = 1 (x − x) i =1 i | D| i =1 ◼ 2 w = y −w x 0 1 i Multiple linear regression: involves more than one predictor variable ◼ Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|) ◼ Ex. 
For 2-D data, we may have: y = w0 + w1 x1+ w2 x2 ◼ Solvable by extension of least square method or using SAS, S-Plus ◼ Many nonlinear functions can be transformed into the above 618 Nonlinear Regression ◼ ◼ ◼ ◼ Some nonlinear models can be modeled by a polynomial function A polynomial regression model can be transformed into linear regression model. For example, y = w0 + w1 x + w2 x2 + w3 x3 convertible to linear with new variables: x2 = x2, x3= x3 y = w0 + w1 x + w2 x2 + w3 x3 Other functions, such as power function, can also be transformed to linear model Some models are intractable nonlinear (e.g., sum of exponential terms) ◼ possible to obtain least square estimates through extensive calculation on more complex formulae 619 Other Regression-Based Models ◼ Generalized linear model: ◼ ◼ ◼ ◼ ◼ ◼ Foundation on which linear regression can be applied to modeling categorical response variables Variance of y is a function of the mean value of y, not a constant Logistic regression: models the prob. of some event occurring as a linear function of a set of predictor variables Poisson regression: models the data that exhibit a Poisson distribution Log-linear models: (for categorical data) ◼ Approximate discrete multidimensional prob. distributions ◼ Also useful for data compression and smoothing Regression trees and model trees ◼ Trees to predict continuous values rather than class labels 620 Regression Trees and Model Trees ◼ Regression tree: proposed in CART system (Breiman et al. 1984) ◼ CART: Classification And Regression Trees ◼ Each leaf stores a continuous-valued prediction ◼ It is the average value of the predicted attribute for the training tuples that reach the leaf ◼ Model tree: proposed by Quinlan (1992) ◼ Each leaf holds a regression model—a multivariate linear equation for the predicted attribute ◼ ◼ A more general case than regression tree Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model 621 Predictive Modeling in Multidimensional Databases ◼ ◼ ◼ ◼ ◼ Predictive modeling: Predict data values or construct generalized linear models based on the database data One can only predict value ranges or category distributions Method outline: ◼ Minimal generalization ◼ Attribute relevance analysis ◼ Generalized linear model construction ◼ Prediction Determine the major factors which influence the prediction ◼ Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc. Multi-level prediction: drill-down and roll-up analysis 622 Prediction: Numerical Data 623 Prediction: Categorical Data 624 SVM—Introductory Literature ◼ “Statistical Learning Theory” by Vapnik: extremely hard to understand, containing many errors too. ◼ C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998. ◼ Better than the Vapnik’s book, but still written too hard for introduction, and the examples are so not-intuitive ◼ The book “An Introduction to Support Vector Machines” by N. Cristianini and J. Shawe-Taylor ◼ Also written hard for introduction, but the explanation about the mercer’s theorem is better than above literatures ◼ The neural network book by Haykins ◼ Contains one nice chapter of SVM introduction 625 Notes about SVM— Introductory Literature ◼ “Statistical Learning Theory” by Vapnik: difficult to understand, containing many errors. ◼ C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. 
Knowledge Discovery and Data Mining, 2(2), 1998. ◼ Easier than Vapnik’s book, but still not introductory level; the examples are not so intuitive ◼ The book An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor ◼ Not introductory level, but the explanation about Mercer’s Theorem is better than above literatures ◼ Neural Networks and Learning Machines by Haykin ◼ Contains a nice chapter on SVM introduction 626 Associative Classification Can Achieve High Accuracy and Efficiency (Cong et al. SIGMOD05) 627 A Closer Look at CMAR ◼ ◼ ◼ ◼ CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01) Efficiency: Uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset Rule pruning whenever a rule is inserted into the tree ◼ Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then prune R2 ◼ Prunes rules for which the rule antecedent and class are not positively correlated, based on a χ2 test of statistical significance Classification based on generated/pruned rules ◼ If only one rule satisfies tuple X, assign the class label of the rule ◼ If a rule set S satisfies X, CMAR ◼ divides S into groups according to class labels 2 ◼ uses a weighted χ measure to find the strongest group of rules, based on the statistical correlation of rules within a group ◼ assigns X the class label of the strongest group 628 Perceptron & Winnow • Vector: x, w x2 • Scalar: x, y, w Input: {(x1, y1), …} Output: classification function f(x) f(xi) > 0 for yi = +1 x1 f(xi) < 0 for yi = -1 • Perceptron: f(x) update => +b=0 Wwx additively or w1x1+w 2x2+b = 0 • Winnow: update W multiplicatively 629 What is Cluster Analysis? ◼ ◼ ◼ ◼ Cluster: A collection of data objects ◼ similar (or related) to one another within the same group ◼ dissimilar (or unrelated) to the objects in other groups Cluster analysis (or clustering, data segmentation, …) ◼ Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised) Typical applications ◼ As a stand-alone tool to get insight into data distribution ◼ As a preprocessing step for other algorithms 630 Clustering for Data Understanding and Applications ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species Information retrieval: document clustering Land use: Identification of areas of similar land use in an earth observation database Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs City-planning: Identifying groups of houses according to their house type, value, and geographical location Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults Climate: understanding earth climate, find patterns of atmospheric and ocean Economic Science: market resarch 631 Clustering as a Preprocessing Tool (Utility) ◼ Summarization: ◼ ◼ Compression: ◼ ◼ Image processing: vector quantization Finding K-nearest Neighbors ◼ ◼ Preprocessing for regression, PCA, classification, and association analysis Localizing search to one or a small number of clusters Outlier detection ◼ Outliers are often viewed as those “far away” from any cluster 632 Quality: What Is Good Clustering? 
◼ A good clustering method will produce high quality clusters ◼ ◼ high intra-class similarity: cohesive within clusters ◼ low inter-class similarity: distinctive between clusters The quality of a clustering method depends on ◼ the similarity measure used by the method ◼ its implementation, and ◼ Its ability to discover some or all of the hidden patterns 633 Measure the Quality of Clustering ◼ ◼ Dissimilarity/Similarity metric ◼ Similarity is expressed in terms of a distance function, typically metric: d(i, j) ◼ The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal ratio, and vector variables ◼ Weights should be associated with different variables based on applications and data semantics Quality of clustering: ◼ There is usually a separate “quality” function that measures the “goodness” of a cluster. ◼ It is hard to define “similar enough” or “good enough” ◼ The answer is typically highly subjective 634 Considerations for Cluster Analysis ◼ Partitioning criteria ◼ ◼ Separation of clusters ◼ ◼ Exclusive (e.g., one customer belongs to only one region) vs. nonexclusive (e.g., one document may belong to more than one class) Similarity measure ◼ ◼ Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable) Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity) Clustering space ◼ Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering) 635 Requirements and Challenges ◼ ◼ ◼ ◼ ◼ Scalability ◼ Clustering all the data instead of only on samples Ability to deal with different types of attributes ◼ Numerical, binary, categorical, ordinal, linked, and mixture of these Constraint-based clustering ◼ User may give inputs on constraints ◼ Use domain knowledge to determine input parameters Interpretability and usability Others ◼ Discovery of clusters with arbitrary shape ◼ Ability to deal with noisy data ◼ Incremental clustering and insensitivity to input order ◼ High dimensionality 636 Major Clustering Approaches (I) ◼ ◼ ◼ ◼ Partitioning approach: ◼ Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors ◼ Typical methods: k-means, k-medoids, CLARANS Hierarchical approach: ◼ Create a hierarchical decomposition of the set of data (or objects) using some criterion ◼ Typical methods: Diana, Agnes, BIRCH, CAMELEON Density-based approach: ◼ Based on connectivity and density functions ◼ Typical methods: DBSACN, OPTICS, DenClue Grid-based approach: ◼ based on a multiple-level granularity structure ◼ Typical methods: STING, WaveCluster, CLIQUE 637 Major Clustering Approaches (II) ◼ ◼ ◼ ◼ Model-based: ◼ A model is hypothesized for each of the clusters and tries to find the best fit of that model to each other ◼ Typical methods: EM, SOM, COBWEB Frequent pattern-based: ◼ Based on the analysis of frequent patterns ◼ Typical methods: p-Cluster User-guided or constraint-based: ◼ Clustering by considering user-specified or application-specific constraints ◼ Typical methods: COD (obstacles), constrained clustering Link-based clustering: ◼ Objects are often linked together in various ways ◼ Massive links can be used to cluster objects: SimRank, LinkClus 638 Chapter 10. 
Cluster Analysis: Basic Concepts and Methods ◼ Cluster Analysis: Basic Concepts ◼ Partitioning Methods ◼ Hierarchical Methods ◼ Density-Based Methods ◼ Grid-Based Methods ◼ Evaluation of Clustering ◼ Summary 639

Partitioning Algorithms: Basic Concept
◼ Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci):
E = Σ_{i=1..k} Σ_{p ∈ Ci} (p − ci)²
◼ Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
◼ Global optimal: exhaustively enumerate all partitions
◼ Heuristic methods: k-means and k-medoids algorithms
◼ k-means (MacQueen'67, Lloyd'57/'82): Each cluster is represented by the center of the cluster
◼ k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster 640

The K-Means Clustering Method
◼ Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when the assignment does not change (a minimal code sketch of this loop is given below) 641

An Example of K-Means Clustering (K = 2)
◼ Start from the initial data set and arbitrarily partition the objects into k groups (k nonempty subsets)
◼ Repeat
◼ Compute the centroid (i.e., mean point) for each partition and update the cluster centroids
◼ Reassign each object to the cluster of its nearest centroid
◼ Until no change (loop if needed) 642

Comments on the K-Means Method
◼ Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
◼ Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
◼ Comment: Often terminates at a local optimum.
◼ Weakness
◼ Applicable only to objects in a continuous n-dimensional space
◼ Use the k-modes method for categorical data; in comparison, k-medoids can be applied to a wide range of data
◼ Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
◼ Sensitive to noisy data and outliers 643

Variations of the K-Means Method
◼ Most variants of k-means differ in
◼ Selection of the initial k means
◼ Dissimilarity calculations
◼ Strategies to calculate cluster means
◼ Handling categorical data: k-modes
◼ Replacing means of clusters with modes
◼ Using new dissimilarity measures to deal with categorical objects
◼ Using a frequency-based method to update modes of clusters
◼ A mixture of categorical and numerical data: the k-prototype method 644
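The four-step k-means loop above can be written directly in NumPy. This is a bare-bones sketch (seeds chosen at random from the data, Euclidean distance, no handling of empty clusters), not the book's algorithm listing.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Bare-bones k-means: X is an (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # arbitrary initial seeds
    for _ in range(n_iter):
        # assign each object to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean point of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # stop when nothing changes
            break
        centroids = new_centroids
    return centroids, labels

# usage: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
```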
What Is the Problem of the K-Means Method? ◼ The k-means algorithm is sensitive to outliers! ◼ Since an object with an extremely large value may substantially distort the distribution of the data ◼ K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster (Figure: two plots of the same data contrasting a cluster mean, which is pulled toward an extreme value, with a medoid, which remains at a central object.) 645
PAM: A Typical K-Medoids Algorithm (K = 2) ◼ Arbitrarily choose k objects as the initial medoids ◼ Assign each remaining object to the nearest medoid (total cost = 20 in the illustration) ◼ Do loop, until no change: ◼ Randomly select a non-medoid object Orandom ◼ Compute the total cost of swapping a medoid O with Orandom (total cost = 26 in the illustration) ◼ Swap O and Orandom if the quality is improved 646
The K-Medoid Clustering Method ◼ K-medoids clustering: find representative objects (medoids) in clusters ◼ PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987) ◼ Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering ◼ PAM works effectively for small data sets, but does not scale well for large data sets (due to its computational complexity) ◼ Efficiency improvements on PAM ◼ CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples ◼ CLARANS (Ng & Han, 1994): randomized re-sampling 647
Chapter 10. Cluster Analysis: Basic Concepts and Methods ◼ Cluster Analysis: Basic Concepts ◼ Partitioning Methods ◼ Hierarchical Methods ◼ Density-Based Methods ◼ Grid-Based Methods ◼ Evaluation of Clustering ◼ Summary 648
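A compact sketch in the spirit of the PAM procedure illustrated above: start from arbitrary medoids, assign every object to its nearest medoid, and accept a medoid/non-medoid swap whenever it lowers the total cost. This is a simplified greedy illustration with invented names, not the exact algorithm of Kaufmann and Rousseeuw; evaluating all swaps costs O(k(n − k)²) per pass, matching the figure quoted earlier.

```python
import numpy as np

def total_cost(X, medoid_idx):
    # Sum of distances from every object to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam_like(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))  # arbitrary initial medoids
    best = total_cost(X, medoids)
    improved = True
    while improved:                       # loop until no change
        improved = False
        for m in range(k):
            for o in range(len(X)):       # candidate non-medoid object
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[m] = o              # compute total cost of the swap
                cost = total_cost(X, trial)
                if cost < best:           # swap only if quality is improved
                    medoids, best = trial, cost
                    improved = True
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return medoids, d.argmin(axis=1), best

X = np.vstack([np.random.default_rng(3).normal(0, 1, (30, 2)),
               np.random.default_rng(4).normal(8, 1, (30, 2))])
medoids, labels, cost = pam_like(X, k=2)
print("medoid indices:", medoids, "total cost:", round(cost, 2))
```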
Hierarchical Clustering ◼ Use a distance matrix as the clustering criterion; this method does not require the number of clusters k as an input, but needs a termination condition (Figure: objects a, b, c, d, e are merged step by step into ab, de, cde, and finally abcde; reading the steps left to right gives agglomerative clustering (AGNES), reading them right to left gives divisive clustering (DIANA).) 649
AGNES (Agglomerative Nesting) ◼ Introduced in Kaufmann and Rousseeuw (1990) ◼ Implemented in statistical packages, e.g., Splus ◼ Uses the single-link method and the dissimilarity matrix ◼ Merge nodes that have the least dissimilarity ◼ Go on in a non-descending fashion ◼ Eventually all nodes belong to the same cluster (Figure: three scatter plots showing clusters being merged step by step.) 650
Dendrogram: Shows How Clusters Are Merged ◼ Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram ◼ A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster 651
DIANA (Divisive Analysis) ◼ Introduced in Kaufmann and Rousseeuw (1990) ◼ Implemented in statistical analysis packages, e.g., Splus ◼ Inverse order of AGNES ◼ Eventually each node forms a cluster on its own (Figure: three scatter plots showing one cluster being split step by step.) 652
Distance between Clusters ◼ Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq) ◼ Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq) ◼ Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq) ◼ Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj) ◼ Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj) ◼ Medoid: a chosen, centrally located object in the cluster 653
Centroid, Radius and Diameter of a Cluster (for numerical data sets) ◼ Centroid: the "middle" of a cluster: Cm = (Σ_{i=1}^{N} ti) / N ◼ Radius: square root of the average distance from any point of the cluster to its centroid: Rm = √(Σ_{i=1}^{N} (ti − Cm)² / N) ◼ Diameter: square root of the average mean squared distance between all pairs of points in the cluster: Dm = √(Σ_{i=1}^{N} Σ_{j=1}^{N} (ti − tj)² / (N(N − 1))) 654
Extensions to Hierarchical Clustering ◼ Major weaknesses of agglomerative clustering methods ◼ Can never undo what was done previously ◼ Do not scale well: time complexity of at least O(n²), where n is the total number of objects ◼ Integration of hierarchical and distance-based clustering ◼ BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters ◼ CHAMELEON (1999): hierarchical clustering using dynamic modeling 655
BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) ◼ Zhang, Ramakrishnan & Livny, SIGMOD'96 ◼ Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering ◼ Phase 1: scan the DB to build an initial in-memory CF-tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data) ◼ Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree ◼ Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans ◼ Weakness: handles only numeric data, and is sensitive to the order of the data records 656
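The AGNES-style bottom-up merging and the dendrogram cut described above can be reproduced with SciPy's hierarchical-clustering utilities; the sketch below is illustrative and assumes SciPy is available.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(5, 0.5, (15, 2))])

# Dissimilarity matrix (condensed form), then single-link agglomerative merging:
# repeatedly merge the two clusters with the least dissimilarity (AGNES style).
d = pdist(X, metric="euclidean")
Z = linkage(d, method="single")       # "complete" and "average" are other linkage choices

# Cutting the dendrogram at the desired level yields one cluster per connected component.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```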
Clustering Feature Vector in BIRCH ◼ Clustering Feature (CF): CF = (N, LS, SS) ◼ N: number of data points ◼ LS: linear sum of the N points: Σ_{i=1}^{N} Xi ◼ SS: square sum of the N points: Σ_{i=1}^{N} Xi² ◼ Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)) 657
CF-Tree in BIRCH ◼ Clustering feature: ◼ Summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view ◼ Registers crucial measurements for computing clusters and utilizes storage efficiently ◼ A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering ◼ A nonleaf node in the tree has descendants or "children" ◼ The nonleaf nodes store sums of the CFs of their children ◼ A CF-tree has two parameters ◼ Branching factor: max # of children ◼ Threshold: max diameter of sub-clusters stored at the leaf nodes 658
The CF-Tree Structure (Figure: a CF-tree with branching factor B = 7 and leaf capacity L = 6; the root and non-leaf nodes hold entries CF1, CF2, … with child pointers, and the leaf nodes hold CF entries linked by prev/next pointers.) 659
The BIRCH Algorithm ◼ Cluster diameter: √((1 / (n(n − 1))) Σ_{i≠j} (xi − xj)²) ◼ For each point in the input ◼ Find the closest leaf entry ◼ Add the point to the leaf entry and update the CF ◼ If the entry diameter > max_diameter, then split the leaf, and possibly the parents ◼ The algorithm is O(n) ◼ Concerns ◼ Sensitive to the insertion order of the data points ◼ Since we fix the size of leaf nodes, the clusters may not be very natural ◼ Clusters tend to be spherical given the radius and diameter measures 660
CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999) ◼ CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999 ◼ Measures the similarity based on a dynamic model ◼ Two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters ◼ Graph-based, two-phase algorithm: 1. Use a graph-partitioning algorithm: cluster objects into a large number of relatively small sub-clusters 2.
Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters 661 Overall Framework of CHAMELEON Construct (K-NN) Partition the Graph Sparse Graph Data Set K-NN Graph P and q are connected if q is among the top k closest neighbors of p Merge Partition Relative interconnectivity: connectivity of c1 and c2 over internal connectivity Final Clusters Relative closeness: closeness of c1 and c2 over internal closeness 662 CHAMELEON (Clustering Complex Objects) 663 Probabilistic Hierarchical Clustering ◼ ◼ Algorithmic hierarchical clustering ◼ Nontrivial to choose a good distance measure ◼ Hard to handle missing attribute values ◼ Optimization goal not clear: heuristic, local search Probabilistic hierarchical clustering ◼ ◼ ◼ ◼ Use probabilistic models to measure distances between clusters Generative model: Regard the set of data objects to be clustered as a sample of the underlying data generation mechanism to be analyzed Easy to understand, same efficiency as algorithmic agglomerative clustering method, can handle partially observed data In practice, assume the generative models adopt common distributions functions, e.g., Gaussian distribution or Bernoulli distribution, governed by parameters 664 Generative Model ◼ ◼ ◼ ◼ Given a set of 1-D points X = {x1, …, xn} for clustering analysis & assuming they are generated by a Gaussian distribution: The probability that a point xi ∈ X is generated by the model The likelihood that X is generated by the model: The task of learning the generative model: find the the maximum likelihood parameters μ and σ2 such that 665 A Probabilistic Hierarchical Clustering Algorithm ◼ ◼ ◼ For a set of objects partitioned into m clusters C1, . . . ,Cm, the quality can be measured by, where P() is the maximum likelihood Distance between clusters C1 and C2: Algorithm: Progressively merge points and clusters Input: D = {o1, ..., on}: a data set containing n objects Output: A hierarchy of clusters Method Create a cluster for each object Ci = {oi}, 1 ≤ i ≤ n; For i = 1 to n { Find pair of clusters Ci and Cj such that Ci,Cj = argmaxi ≠ j {log (P(Ci∪Cj )/(P(Ci)P(Cj ))}; If log (P(Ci∪Cj )/(P(Ci)P(Cj )) > 0 then merge Ci and Cj } 666 Chapter 10. Cluster Analysis: Basic Concepts and Methods ◼ Cluster Analysis: Basic Concepts ◼ Partitioning Methods ◼ Hierarchical Methods ◼ Density-Based Methods ◼ Grid-Based Methods ◼ Evaluation of Clustering ◼ Summary 667 Density-Based Clustering Methods ◼ ◼ ◼ Clustering based on density (local cluster criterion), such as density-connected points Major features: ◼ Discover clusters of arbitrary shape ◼ Handle noise ◼ One scan ◼ Need density parameters as termination condition Several interesting studies: ◼ DBSCAN: Ester, et al. (KDD’96) ◼ OPTICS: Ankerst, et al (SIGMOD’99). ◼ DENCLUE: Hinneburg & D. Keim (KDD’98) ◼ CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based) 668 Density-Based Clustering: Basic Concepts ◼ ◼ ◼ Two parameters: ◼ Eps: Maximum radius of the neighbourhood ◼ MinPts: Minimum number of points in an Epsneighbourhood of that point NEps(p): {q belongs to D | dist(p,q) ≤ Eps} Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if ◼ p belongs to NEps(q) ◼ core point condition: |NEps (q)| ≥ MinPts p q MinPts = 5 Eps = 1 cm 669 Density-Reachable and Density-Connected ◼ Density-reachable: ◼ ◼ A point p is density-reachable from a point q w.r.t. 
Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi ◼ Density-connected ◼ A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts 670
DBSCAN: Density-Based Spatial Clustering of Applications with Noise ◼ Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points ◼ Discovers clusters of arbitrary shape in spatial databases with noise (Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5.) 671
DBSCAN: The Algorithm ◼ Arbitrarily select a point p ◼ Retrieve all points density-reachable from p w.r.t. Eps and MinPts ◼ If p is a core point, a cluster is formed ◼ If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database ◼ Continue the process until all of the points have been processed 672
DBSCAN: Sensitive to Parameters 673
OPTICS: A Cluster-Ordering Method (1999) ◼ OPTICS: Ordering Points To Identify the Clustering Structure ◼ Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99) ◼ Produces a special order of the database w.r.t. its density-based clustering structure ◼ This cluster ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings ◼ Good for both automatic and interactive cluster analysis, including finding the intrinsic clustering structure ◼ Can be represented graphically or using visualization techniques 674
OPTICS: Some Extension from DBSCAN ◼ Index-based: k = number of dimensions, N = 20, p = 75%, M = N(1 − p) = 5; complexity: O(N log N) ◼ Core distance of an object o: the minimum eps such that o is a core point ◼ Reachability distance of p from o: max(core-distance(o), d(o, p)) ◼ Example (MinPts = 5, eps = 3 cm): r(p1, o) = 2.8 cm, r(p2, o) = 4 cm 675
(Figure: reachability plot; the reachability distance, undefined for the first object, is plotted against the cluster order of the objects.) 676
Density-Based Clustering: OPTICS & Its Applications 677
DENCLUE: Using Statistical Density Functions ◼ DENsity-based CLUstEring by Hinneburg & Keim (KDD'98) ◼ Using statistical density functions: ◼ Influence of y on x: f_Gaussian(x, y) = e^(−d(x, y)² / (2σ²)) ◼ Total influence on x: f^D_Gaussian(x) = Σ_{i=1}^{N} e^(−d(x, xi)² / (2σ²)) ◼ Gradient of x in the direction of xi: ∇f^D_Gaussian(x, xi) = Σ_{i=1}^{N} (xi − x) · e^(−d(x, xi)² / (2σ²)) ◼ Major features ◼ Solid mathematical foundation ◼ Good for data sets with large amounts of noise ◼ Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets ◼ Significantly faster than existing algorithms (e.g., DBSCAN) ◼ But needs a large number of parameters 678
DENCLUE: Technical Essence ◼ Uses grid cells, but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure ◼ Influence function: describes the impact of a data point within its neighborhood ◼ The overall density of the data space can be calculated as the sum of the influence functions of all data points ◼ Clusters can be determined mathematically by identifying density attractors ◼ Density attractors are local maxima of the overall density function ◼ Center-defined clusters: assign to each density attractor the points density-attracted to it ◼ Arbitrarily shaped clusters: merge density attractors that are connected through paths of high density (> threshold) 679
Density Attractor 680
Center-Defined and Arbitrary 681
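A didactic O(n²) sketch of the DBSCAN procedure just described, using the Eps-neighborhood and core-point condition from the preceding slides. Names are illustrative and there is no spatial index, so this is not the original efficient implementation.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    # Precompute Eps-neighborhoods: N_Eps(p) = {q in D | dist(p, q) <= Eps}
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)          # -1 = noise / not yet assigned
    cluster = 0
    for p in range(n):
        if labels[p] != -1 or len(neighbors[p]) < min_pts:
            continue                  # skip assigned points and non-core points
        cluster += 1                  # p is an unassigned core point: start a cluster
        labels[p] = cluster
        queue = list(neighbors[p])    # expand over density-reachable points
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if len(neighbors[q]) >= min_pts:   # q is also a core point
                    queue.extend(neighbors[q])
        # border points receive the cluster label but do not expand it further
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(4, 0.3, (40, 2)),
               rng.uniform(-2, 6, (5, 2))])        # two dense blobs plus scattered noise
print(dbscan(X, eps=0.5, min_pts=5))
```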
Cluster Analysis: Basic Concepts and Methods ◼ Cluster Analysis: Basic Concepts ◼ Partitioning Methods ◼ Hierarchical Methods ◼ Density-Based Methods ◼ Grid-Based Methods ◼ Evaluation of Clustering ◼ Summary 682 Grid-Based Clustering Method ◼ ◼ Using multi-resolution grid data structure Several interesting methods ◼ STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997) ◼ WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98) ◼ ◼ A multi-resolution clustering approach using wavelet method CLIQUE: Agrawal, et al. (SIGMOD’98) ◼ Both grid-based and subspace clustering 683 STING: A Statistical Information Grid Approach ◼ ◼ ◼ Wang, Yang and Muntz (VLDB’97) The spatial area is divided into rectangular cells There are several levels of cells corresponding to different levels of resolution 684 The STING Clustering Method ◼ ◼ ◼ ◼ ◼ ◼ Each cell at a high level is partitioned into a number of smaller cells in the next lower level Statistical info of each cell is calculated and stored beforehand and is used to answer queries Parameters of higher level cells can be easily calculated from parameters of lower level cell ◼ count, mean, s, min, max ◼ type of distribution—normal, uniform, etc. Use a top-down approach to answer spatial data queries Start from a pre-selected layer—typically with a small number of cells For each cell in the current level compute the confidence interval 685 STING Algorithm and Its Analysis ◼ ◼ ◼ ◼ ◼ Remove the irrelevant cells from further consideration When finish examining the current layer, proceed to the next lower level Repeat this process until the bottom layer is reached Advantages: ◼ Query-independent, easy to parallelize, incremental update ◼ O(K), where K is the number of grid cells at the lowest level Disadvantages: ◼ All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected 686 CLIQUE (Clustering In QUEst) ◼ ◼ ◼ Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98) Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space CLIQUE can be considered as both density-based and grid-based ◼ ◼ ◼ ◼ It partitions each dimension into the same number of equal length interval It partitions an m-dimensional data space into non-overlapping rectangular units A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter A cluster is a maximal set of connected dense units within a subspace 687 CLIQUE: The Major Steps ◼ ◼ ◼ Partition the data space and find the number of points that lie inside each cell of the partition. Identify the subspaces that contain clusters using the Apriori principle Identify clusters ◼ ◼ ◼ Determine dense units in all subspaces of interests Determine connected dense units in all subspaces of interests. 
Generate a minimal description for the clusters ◼ Determine the maximal regions that cover a cluster of connected dense units for each cluster ◼ Determine the minimal cover for each cluster 688
(Figure: CLIQUE example with density threshold τ = 3; dense units found in the (salary, age) and (vacation, age) subspaces, with salary in units of $10,000, vacation in weeks, and age ranging from 20 to 60, intersect to identify a candidate dense region in the 3-D salary-vacation-age space.) 689
Strength and Weakness of CLIQUE ◼ Strength ◼ Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces ◼ Insensitive to the order of records in the input and does not presume some canonical data distribution ◼ Scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases ◼ Weakness ◼ The accuracy of the clustering result may be degraded at the expense of the simplicity of the method 690
Chapter 10. Cluster Analysis: Basic Concepts and Methods ◼ Cluster Analysis: Basic Concepts ◼ Partitioning Methods ◼ Hierarchical Methods ◼ Density-Based Methods ◼ Grid-Based Methods ◼ Evaluation of Clustering ◼ Summary 691
Assessing Clustering Tendency ◼ Assess whether non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution ◼ Test spatial randomness with a statistical test: the Hopkins statistic ◼ Given a dataset D regarded as a sample of a random variable o, determine how far away o is from being uniformly distributed in the data space ◼ Sample n points, p1, …, pn, uniformly from D. For each pi, find its nearest neighbor in D: xi = min{dist(pi, v)} where v in D ◼ Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D − {qi}: yi = min{dist(qi, v)} where v in D and v ≠ qi ◼ Calculate the Hopkins statistic: H = Σ_{i=1}^{n} yi / (Σ_{i=1}^{n} xi + Σ_{i=1}^{n} yi) ◼ If D is uniformly distributed, Σ xi and Σ yi will be close to each other and H is close to 0.5; if D is highly skewed, H is close to 0 692
Determine the Number of Clusters ◼ Empirical method ◼ # of clusters ≈ √(n/2) for a dataset of n points ◼ Elbow method ◼ Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters ◼ Cross-validation method ◼ Divide a given data set into m parts ◼ Use m − 1 parts to obtain a clustering model ◼ Use the remaining part to test the quality of the clustering ◼ E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set ◼ For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k's, and find the # of clusters that fits the data best 693
Measuring Clustering Quality ◼ Two methods: extrinsic vs. intrinsic ◼ Extrinsic: supervised, i.e., the ground truth is available ◼ Compare a clustering against the ground truth using a certain clustering quality measure ◼ Ex. BCubed precision and recall metrics ◼ Intrinsic: unsupervised, i.e., the ground truth is unavailable ◼ Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are ◼ Ex. Silhouette coefficient 694
Measuring Clustering Quality: Extrinsic Methods ◼ Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg.
Q is good if it satisfies the following 4 essential criteria ◼ Cluster homogeneity: the purer, the better ◼ Cluster completeness: should assign objects belong to the same category in the ground truth to the same cluster ◼ Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., “miscellaneous” or “other” category) ◼ Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces 695 Chapter 10. Cluster Analysis: Basic Concepts and Methods ◼ Cluster Analysis: Basic Concepts ◼ Partitioning Methods ◼ Hierarchical Methods ◼ Density-Based Methods ◼ Grid-Based Methods ◼ Evaluation of Clustering ◼ Summary 696 Summary ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ Cluster analysis groups objects based on their similarity and has wide applications Measure of similarity can be computed for various types of data Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods K-means and K-medoids algorithms are popular partitioning-based clustering algorithms Birch and Chameleon are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms DBSCAN, OPTICS, and DENCLU are interesting density-based algorithms STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm Quality of clustering results can be evaluated in various ways 697 CS512-Spring 2011: An Introduction ◼ Coverage ◼ Cluster Analysis: Chapter 11 ◼ Outlier Detection: Chapter 12 ◼ Mining Sequence Data: BK2: Chapter 8 ◼ Mining Graphs Data: BK2: Chapter 9 ◼ Social and Information Network Analysis ◼ ◼ ◼ ◼ ◼ ◼ BK2: Chapter 9 Partial coverage: Mark Newman: “Networks: An Introduction”, Oxford U., 2010 Scattered coverage: Easley and Kleinberg, “Networks, Crowds, and Markets: Reasoning About a Highly Connected World”, Cambridge U., 2010 Recent research papers Mining Data Streams: BK2: Chapter 8 Requirements ◼ One research project ◼ One class presentation (15 minutes) ◼ Two homeworks (no programming assignment) ◼ Two midterm exams (no final exam) 698 References (1) ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973. M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering structure, SIGMOD’99. Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02 M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD 2000. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96. M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95. D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987. D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB’98. V. Ganti, J. Gehrke, R. Ramakrishan. CACTUS Clustering Categorical Data Using Summaries. KDD'99. 699 References (2) ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. In Proc. VLDB’98. S. Guha, R. 
Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98. S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March 1999. A. Hinneburg, D.l A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD’98. A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Printice Hall, 1988. G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999. L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990. E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98. 700 References (3) ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ G. J. McLachlan and K.E. Bkasford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988. R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94. L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A Review, SIGKDD Explorations, 6(1), June 2004 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB’98. A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering in Large Databases, ICDT'01. A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles, ICDE'01 H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data sets, SIGMOD’02 W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to Spatial Data Mining, VLDB’97 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering method for very large databases. SIGMOD'96 X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic Links”, VLDB'06 701 Slides unused in class 702 A Typical K-Medoids Algorithm (PAM) Total Cost = 20 10 10 10 9 9 9 8 8 8 Arbitrary choose k object as initial medoids 7 6 5 4 3 2 1 7 6 5 4 3 2 1 0 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Assign each remainin g object to nearest medoids 7 6 5 4 3 2 1 0 0 K=2 Until no change 10 If quality is improved. 3 4 5 6 7 8 9 10 10 Compute total cost of swapping 9 Swapping O and Oramdom 2 Randomly select a nonmedoid object,Oramdom Total Cost = 26 Do loop 1 8 7 6 9 8 7 6 5 5 4 4 3 3 2 2 1 1 0 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 703 PAM (Partitioning Around Medoids) (1987) ◼ PAM (Kaufman and Rousseeuw, 1987), built in Splus ◼ Use real object to represent the cluster ◼ ◼ ◼ Select k representative objects arbitrarily For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih For each pair of i and h, ◼ ◼ ◼ If TCih < 0, i is replaced by h Then assign each non-selected object to the most similar representative object repeat steps 2-3 until there is no change 704 PAM Clustering: Finding the Best Cluster Center ◼ Case 1: p currently belongs to oj. If oj is replaced by orandom as a representative object and p is the closest to one of the other representative object oi, then p is reassigned to oi 705 What Is the Problem with PAM? 
◼ ◼ Pam is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean Pam works efficiently for small data sets but does not scale well for large data sets. ◼ O(k(n-k)2 ) for each iteration where n is # of data,k is # of clusters ➔Sampling-based method CLARA(Clustering LARge Applications) 706 CLARA (Clustering Large Applications) (1990) ◼ CLARA (Kaufmann and Rousseeuw in 1990) ◼ ◼ Built in statistical analysis packages, such as SPlus It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output ◼ Strength: deals with larger data sets than PAM ◼ Weakness: ◼ ◼ Efficiency depends on the sample size A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased 707 CLARANS (“Randomized” CLARA) (1994) ◼ ◼ ◼ CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94) ◼ Draws sample of neighbors dynamically ◼ The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids ◼ If the local optimum is found, it starts with new randomly selected node in search for a new local optimum Advantages: More efficient and scalable than both PAM and CLARA Further improvement: Focusing techniques and spatial access structures (Ester et al.’95) 708 ROCK: Clustering Categorical Data ◼ ◼ ◼ ◼ ROCK: RObust Clustering using linKs ◼ S. Guha, R. Rastogi & K. Shim, ICDE’99 Major ideas ◼ Use links to measure similarity/proximity ◼ Not distance-based Algorithm: sampling-based clustering ◼ Draw random sample ◼ Cluster with links ◼ Label data in disk Experiments ◼ Congressional voting, mushroom data 709 Similarity Measure in ROCK ◼ ◼ ◼ ◼ Traditional measures for categorical data may not work well, e.g., Jaccard coefficient Example: Two groups (clusters) of transactions ◼ C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e} ◼ C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g} Jaccard co-efficient may lead to wrong clustering result ◼ C1: 0.2 ({a, b, c}, {b, d, e}} to 0.5 ({a, b, c}, {a, b, d}) ◼ C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f}) Jaccard co-efficient-based similarity function: T1 T2 Sim( T1 , T2 ) = T1 T2 ◼ Ex. 
Let T1 = {a, b, c}, T2 = {c, d, e} Sim (T 1, T 2) = {c} {a, b, c, d , e} = 1 = 0.2 5 710 Link Measure in ROCK ◼ ◼ Clusters ◼ C1:<a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e} ◼ C2: <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g} Neighbors ◼ Two transactions are neighbors if sim(T1,T2) > threshold Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f} ◼ T1 connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e}, {a,b,f}, {a,b,g} ◼ T2 connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d} ◼ T3 connected to: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,g}, {a,f,g}, {b,f,g} Link Similarity ◼ Link similarity between two transactions is the # of common neighbors ◼ ◼ ◼ link(T1, T2) = 4, since they have 4 common neighbors ◼ ◼ {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e} link(T1, T3) = 3, since they have 3 common neighbors ◼ {a, b, d}, {a, b, e}, {a, b, g} 711 Aggregation-Based Similarity Computation 0.2 4 0.9 1.0 0.8 10 11 ST2 5 0.9 1.0 13 14 12 a b ST1 For each node nk ∈ {n10, n11, n12} and nl ∈ {n13, n14}, their pathbased similarity simp(nk, nl) = s(nk, n4)·s(n4, n5)·s(n5, nl). sim (na , nb ) = k =10 s(nk , n4 ) 12 3 s(n , n ) 14 l =13 4 5 s(nl , n5 ) 2 = 0.171 takes O(3+2) time After aggregation, we reduce quadratic time computation to linear time computation. 713 Computing Similarity with Aggregation Average similarity and total weight sim(na, nb) can be computed from aggregated similarities a:(0.9,3) 0.2 4 10 11 12 a b:(0.95,2) 5 13 14 b sim(na, nb) = avg_sim(na,n4) x s(n4, n5) x avg_sim(nb,n5) = 0.9 x 0.2 x 0.95 = 0.171 To compute sim(na,nb): ◼ ◼ ◼ Find all pairs of sibling nodes ni and nj, so that na linked with ni and nb with nj. Calculate similarity (and weight) between na and nb w.r.t. ni and nj. Calculate weighted average similarity between na and nb w.r.t. all such pairs. 714 Chapter 10. Cluster Analysis: Basic Concepts and Methods ◼ Cluster Analysis: Basic Concepts ◼ Overview of Clustering Methods ◼ Partitioning Methods ◼ Hierarchical Methods ◼ Density-Based Methods ◼ Grid-Based Methods ◼ Summary 715 Link-Based Clustering: Calculate Similarities Based On Links Authors Tom Proceedings sigmod03 sigmod04 Mike Cathy John Mary Conferences ◼ sigmod sigmod05 vldb03 vldb04 vldb05 aaai04 aaai05 vldb The similarity between two objects x and y is defined as the average similarity between objects linked with x and those with y: C sim (a, b ) = I (a ) I (b ) aaai Jeh & Widom, KDD’2002: SimRank Two objects are similar if they are linked with the same or similar objects ◼ I ( a ) I (b ) sim (I (a ), I (b)) i =1 j =1 i j Issue: Expensive to compute: ◼ For a dataset of N objects and M links, it takes O(N2) space and O(M2) time to compute all similarities. 
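The SimRank recursion described above ("two objects are similar if they are linked with the same or similar objects") can be approximated by a simple fixed-point iteration. The sketch below is my own toy illustration over a small author-conference link list; the link data and the damping constant C = 0.8 are assumptions, not taken from the slides.

```python
import numpy as np

# Toy bipartite links (author -> conference), loosely in the spirit of the example above.
links = [("Tom", "sigmod"), ("Mike", "sigmod"), ("Cathy", "sigmod"),
         ("Cathy", "vldb"), ("Mary", "vldb"), ("John", "vldb"), ("Mary", "aaai")]
authors = sorted({a for a, _ in links})
confs = sorted({c for _, c in links})
I_a = {a: {c for x, c in links if x == a} for a in authors}   # I(a): conferences linked to a
I_c = {c: {a for a, x in links if x == c} for c in confs}     # I(c): authors linked to c

C = 0.8   # damping constant (assumed)
sim_a = {(a, b): float(a == b) for a in authors for b in authors}
sim_c = {(c, d): float(c == d) for c in confs for d in confs}

for _ in range(10):   # fixed-point iteration of the SimRank equations
    new_a, new_c = {}, {}
    for a in authors:
        for b in authors:
            pairs = [sim_c[(c, d)] for c in I_a[a] for d in I_a[b]]
            new_a[(a, b)] = 1.0 if a == b else (C * np.mean(pairs) if pairs else 0.0)
    for c in confs:
        for d in confs:
            pairs = [sim_a[(x, y)] for x in I_c[c] for y in I_c[d]]
            new_c[(c, d)] = 1.0 if c == d else (C * np.mean(pairs) if pairs else 0.0)
    sim_a, sim_c = new_a, new_c

print(round(sim_a[("Tom", "Mike")], 3), round(sim_a[("Tom", "Mary")], 3))
```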
716 Observation 1: Hierarchical Structures ◼ Hierarchical structures often exist naturally among objects (e.g., taxonomy of animals) A hierarchical structure of products in Walmart Relationships between articles and words (Chakrabarti, Papadimitriou, Modha, Faloutsos, 2004) grocery TV electronics DVD apparel Articles All camera Words 717 Observation 2: Distribution of Similarity portion of entries 0.4 Distribution of SimRank similarities among DBLP authors 0.3 0.2 0.1 0.24 0.22 0.2 0.18 0.16 0.14 0.12 0.1 0.08 0.06 0.04 0.02 0 0 similarity value ◼ Power law distribution exists in similarities ◼ 56% of similarity entries are in [0.005, 0.015] ◼ 1.4% of similarity entries are larger than 0.1 ◼ Can we design a data structure that stores the significant similarities and compresses insignificant ones? 718 A Novel Data Structure: SimTree Each non-leaf node represents a group of similar lower-level nodes Each leaf node represents an object Similarities between siblings are stored Canon A40 digital camera Digital Sony V3 digital Cameras Consumer camera electronics Apparels TVs 719 Similarity Defined by SimTree Similarity between two sibling nodes n1 and n2 n1 Adjustment ratio for node n7 0.8 n7 Path-based node similarity ◼ ◼ ◼ 0.9 0.8 n8 n3 0.9 n6 n40.3 n5 0.9 ◼ n2 0.2 1.0 n9 simp(n7,n8) = s(n7, n4) x s(n4, n5) x s(n5, n8) Similarity between two nodes is the average similarity between objects linked with them in other SimTrees Adjust/ ratio for x =Average similarity between x and all other nodes Average similarity between x’s parent and all other nodes 720 LinkClus: Efficient Clustering via Heterogeneous Semantic Links Method ◼ Initialize a SimTree for objects of each type ◼ Repeat until stable ◼ For each SimTree, update the similarities between its nodes using similarities in other SimTrees ◼ Similarity between two nodes x and y is the average similarity between objects linked with them ◼ Adjust the structure of each SimTree ◼ Assign each node to the parent node that it is most similar to For details: X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic Links”, VLDB'06 721 Initialization of SimTrees ◼ ◼ Initializing a SimTree ◼ Repeatedly find groups of tightly related nodes, which are merged into a higher-level node Tightness of a group of nodes ◼ For a group of nodes {n1, …, nk}, its tightness is defined as the number of leaf nodes in other SimTrees that are connected to all of {n1, …, nk} Nodes n1 n2 Leaf nodes in another SimTree 1 2 3 4 5 The tightness of {n1, n2} is 3 722 Finding Tight Groups by Freq. Pattern Mining ◼ Finding tight groups Frequent pattern mining Reduced to The tightness of a g1 group of nodes is the support of a frequent pattern g2 ◼ n1 n2 n3 n4 1 2 3 4 5 6 7 8 9 Transactions {n1} {n1, n2} {n2} {n1, n2} {n1, n2} {n2, n3, n4} {n4} {n3, n4} {n3, n4} Procedure of initializing a tree ◼ Start from leaf nodes (level-0) ◼ At each level l, find non-overlapping groups of similar nodes with frequent pattern mining 723 Adjusting SimTree Structures n1 n2 0.9 n4 0.8 n7 ◼ n6 n5 n7 n8 n3 n9 After similarity changes, the tree structure also needs to be changed ◼ If a node is more similar to its parent’s sibling, then move it to be a child of that sibling ◼ Try to move each node to its parent’s sibling that it is most similar to, under the constraint that each parent node can have at most c children 724 Complexity For two types of objects, N in each, and M linkages between them. 
Time Space Updating similarities O(M(logN)2) O(M+N) Adjusting tree structures O(N) O(N) LinkClus O(M(logN)2) O(M+N) SimRank O(M2) O(N2) 725 Experiment: Email Dataset ◼ ◼ ◼ ◼ ◼ F. Nielsen. Email dataset. Approach www.imm.dtu.dk/~rem/data/Email-1431.zip LinkClus 370 emails on conferences, 272 on jobs, and 789 spam emails SimRank Accuracy: measured by manually labeled ReCom data F-SimRank Accuracy of clustering: % of pairs of objects in the same cluster that share common label CLARANS Accuracy time (s) 0.8026 1579.6 0.7965 39160 0.5711 74.6 0.3688 479.7 0.4768 8.55 Approaches compared: ◼ SimRank (Jeh & Widom, KDD 2002): Computing pair-wise similarities ◼ SimRank with FingerPrints (F-SimRank): Fogaras & R´acz, WWW 2005 ◼ ◼ pre-computes a large sample of random paths from each object and uses samples of two objects to estimate SimRank similarity ReCom (Wang et al. SIGIR 2003) ◼ Iteratively clustering objects using cluster labels of linked objects 726 WaveCluster: Clustering by Wavelet Analysis (1998) ◼ ◼ ◼ Sheikholeslami, Chatterjee, and Zhang (VLDB’98) A multi-resolution clustering approach which applies wavelet transform to the feature space; both grid-based and density-based Wavelet transform: A signal processing technique that decomposes a signal into different frequency sub-band ◼ Data are transformed to preserve relative distance between objects at different levels of resolution ◼ Allows natural clusters to become more distinguishable 727 The WaveCluster Algorithm ◼ ◼ How to apply wavelet transform to find clusters ◼ Summarizes the data by imposing a multidimensional grid structure onto data space ◼ These multidimensional spatial data objects are represented in a n-dimensional feature space ◼ Apply wavelet transform on feature space to find the dense regions in the feature space ◼ Apply wavelet transform multiple times which result in clusters at different scales from fine to coarse Major features: ◼ Complexity O(N) ◼ Detect arbitrary shaped clusters at different scales ◼ Not sensitive to noise, not sensitive to input order ◼ Only applicable to low dimensional data 728 Quantization & Transformation ◼ Quantize data into m-D grid structure, then wavelet transform a) scale 1: high resolution b) scale 2: medium resolution c) scale 3: low resolution 729 Data Mining: Concepts and Techniques (3rd ed.) — Chapter 12 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved. 730 Chapter 12. Outlier Analysis ◼ Outlier and Outlier Analysis ◼ Outlier Detection Methods ◼ Statistical Approaches ◼ Proximity-Base Approaches ◼ Clustering-Base Approaches ◼ Classification Approaches ◼ Mining Contextual and Collective Outliers ◼ Outlier Detection in High Dimensional Data ◼ Summary 731 What Are Outliers? ◼ ◼ ◼ ◼ ◼ Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism ◼ Ex.: Unusual credit card purchase, sports: Michael Jordon, Wayne Gretzky, ... Outliers are different from the noise data ◼ Noise is random error or variance in a measured variable ◼ Noise should be removed before outlier detection Outliers are interesting: It violates the mechanism that generates the normal data Outlier detection vs. 
novelty detection: early stage, outlier; but later merged into the model Applications: ◼ Credit card fraud detection ◼ Telecom fraud detection ◼ Customer segmentation 732 Types of Outliers (I) ◼ ◼ ◼ Three kinds: global, contextual and collective outliers Global Outlier Global outlier (or point anomaly) ◼ Object is Og if it significantly deviates from the rest of the data set ◼ Ex. Intrusion detection in computer networks ◼ Issue: Find an appropriate measurement of deviation Contextual outlier (or conditional outlier) ◼ Object is Oc if it deviates significantly based on a selected context o ◼ Ex. 80 F in Urbana: outlier? (depending on summer or winter?) ◼ Attributes of data objects should be divided into two groups ◼ Contextual attributes: defines the context, e.g., time & location ◼ Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature ◼ Can be viewed as a generalization of local outliers—whose density significantly deviates from its local area ◼ Issue: How to define or formulate meaningful context? 733 Types of Outliers (II) ◼ Collective Outliers ◼ ◼ A subset of data objects collectively deviate significantly from the whole data set, even if the individual data objects may not be outliers Applications: E.g., intrusion detection: ◼ Collective Outlier When a number of computers keep sending denial-of-service packages to each other Detection of collective outliers ◼ Consider not only behavior of individual objects, but also that of groups of objects ◼ Need to have the background knowledge on the relationship among data objects, such as a distance or similarity measure on objects. A data set may have multiple types of outlier One object may belong to more than one type of outlier ◼ ◼ ◼ 734 Challenges of Outlier Detection ◼ ◼ ◼ ◼ Modeling normal objects and outliers properly ◼ Hard to enumerate all possible normal behaviors in an application ◼ The border between normal and outlier objects is often a gray area Application-specific outlier detection ◼ Choice of distance measure among objects and the model of relationship among objects are often application-dependent ◼ E.g., clinic data: a small deviation could be an outlier; while in marketing analysis, larger fluctuations Handling noise in outlier detection ◼ Noise may distort the normal objects and blur the distinction between normal objects and outliers. It may help hide outliers and reduce the effectiveness of outlier detection Understandability ◼ Understand why these are outliers: Justification of the detection ◼ Specify the degree of an outlier: the unlikelihood of the object being generated by a normal mechanism 735 Chapter 12. Outlier Analysis ◼ Outlier and Outlier Analysis ◼ Outlier Detection Methods ◼ Statistical Approaches ◼ Proximity-Base Approaches ◼ Clustering-Base Approaches ◼ Classification Approaches ◼ Mining Contextual and Collective Outliers ◼ Outlier Detection in High Dimensional Data ◼ Summary 736 Outlier Detection I: Supervised Methods ◼ ◼ Two ways to categorize outlier detection methods: ◼ Based on whether user-labeled examples of outliers can be obtained: ◼ Supervised, semi-supervised vs. 
unsupervised methods ◼ Based on assumptions about normal data and outliers: ◼ Statistical, proximity-based, and clustering-based methods Outlier Detection I: Supervised Methods ◼ Modeling outlier detection as a classification problem ◼ Samples examined by domain experts used for training & testing ◼ Methods for Learning a classifier for outlier detection effectively: ◼ Model normal objects & report those not matching the model as outliers, or ◼ Model outliers and treat those not matching the model as normal ◼ Challenges ◼ Imbalanced classes, i.e., outliers are rare: Boost the outlier class and make up some artificial outliers ◼ Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e., not mislabeling normal objects as outliers) 737 Outlier Detection II: Unsupervised Methods ◼ ◼ ◼ ◼ ◼ Assume the normal objects are somewhat ``clustered'‘ into multiple groups, each having some distinct features An outlier is expected to be far away from any groups of normal objects Weakness: Cannot detect collective outlier effectively ◼ Normal objects may not share any strong patterns, but the collective outliers may share high similarity in a small area Ex. In some intrusion or virus detection, normal activities are diverse ◼ Unsupervised methods may have a high false positive rate but still miss many real outliers. ◼ Supervised methods can be more effective, e.g., identify attacking some key resources Many clustering methods can be adapted for unsupervised methods ◼ Find clusters, then outliers: not belonging to any cluster ◼ Problem 1: Hard to distinguish noise from outliers ◼ Problem 2: Costly since first clustering: but far less outliers than normal objects ◼ Newer methods: tackle outliers directly 738 Outlier Detection III: Semi-Supervised Methods ◼ Situation: In many applications, the number of labeled data is often small: Labels could be on outliers only, normal objects only, or both ◼ Semi-supervised outlier detection: Regarded as applications of semisupervised learning ◼ If some labeled normal objects are available ◼ Use the labeled examples and the proximate unlabeled objects to train a model for normal objects ◼ Those not fitting the model of normal objects are detected as outliers ◼ If only some labeled outliers are available, a small number of labeled outliers many not cover the possible outliers well ◼ To improve the quality of outlier detection, one can get help from models for normal objects learned from unsupervised methods 739 Outlier Detection (1): Statistical Methods ◼ Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model (a stochastic model) ◼ ◼ The data not following the model are outliers. Example (right figure): First use Gaussian distribution to model the normal data ◼ For each object y in region R, estimate gD(y), the probability of y fits the Gaussian distribution ◼ If gD(y) is very low, y is unlikely generated by the Gaussian model, thus an outlier ◼ Effectiveness of statistical methods: highly depends on whether the assumption of statistical model holds in the real data ◼ There are rich alternatives to use various statistical models ◼ E.g., parametric vs. 
non-parametric 740 Outlier Detection (2): Proximity-Based Methods ◼ ◼ ◼ ◼ ◼ ◼ An object is an outlier if the nearest neighbors of the object are far away, i.e., the proximity of the object is significantly deviates from the proximity of most of the other objects in the same data set Example (right figure): Model the proximity of an object using its 3 nearest neighbors ◼ Objects in region R are substantially different from other objects in the data set. ◼ Thus the objects in R are outliers The effectiveness of proximity-based methods highly relies on the proximity measure. In some applications, proximity or distance measures cannot be obtained easily. Often have a difficulty in finding a group of outliers which stay close to each other Two major types of proximity-based outlier detection ◼ Distance-based vs. density-based 741 Outlier Detection (3): Clustering-Based Methods Normal data belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any clusters ◼ ◼ ◼ ◼ Example (right figure): two clusters ◼ All points not in R form a large cluster ◼ The two points in R form a tiny cluster, thus are outliers Since there are many clustering methods, there are many clustering-based outlier detection methods as well Clustering is expensive: straightforward adaption of a clustering method for outlier detection can be costly and does not scale up well for large data sets 742 Chapter 12. Outlier Analysis ◼ Outlier and Outlier Analysis ◼ Outlier Detection Methods ◼ Statistical Approaches ◼ Proximity-Base Approaches ◼ Clustering-Base Approaches ◼ Classification Approaches ◼ Mining Contextual and Collective Outliers ◼ Outlier Detection in High Dimensional Data ◼ Summary 743 Statistical Approaches ◼ ◼ ◼ ◼ ◼ Statistical approaches assume that the objects in a data set are generated by a stochastic process (a generative model) Idea: learn a generative model fitting the given data set, and then identify the objects in low probability regions of the model as outliers Methods are divided into two categories: parametric vs. non- parametric Parametric method ◼ Assumes that the normal data is generated by a parametric distribution with parameter θ ◼ The probability density function of the parametric distribution f(x, θ) gives the probability that object x is generated by the distribution ◼ The smaller this value, the more likely x is an outlier Non-parametric method ◼ Not assume an a-priori statistical model and determine the model from the input data ◼ Not completely parameter free but consider the number and nature of the parameters are flexible and not fixed in advance 744 Univariate Outliers Based on Normal Distribution ◼ ◼ ◼ Univariate data: A data set involving only one attribute or variable Often assume that data are generated from a normal distribution, learn the parameters from the input data, and identify the points with low probability as outliers Ex: Avg. 
temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4} ◼ ◼ ◼ ◼ Use the maximum likelihood method to estimate μ and σ Taking derivatives with respect to μ and σ2, we derive the following maximum likelihood estimates For the above data with n = 10, we have Then (24 – 28.61) /1.51 = – 3.04 < –3, 24 is an outlier since 745 Parametric Methods I: The Grubb’s Test ◼ Univariate outlier detection: The Grubb's test (maximum normed residual test) ─ another statistical method under normal distribution ◼ For each object x in a data set, compute its z-score: x is an outlier if where is the value taken by a t-distribution at a significance level of α/(2N), and N is the # of objects in the data set 746 Parametric Methods II: Detection of Multivariate Outliers ◼ Multivariate data: A data set involving two or more attributes or variables ◼ Transform the multivariate outlier detection task into a univariate outlier detection problem ◼ Method 1. Compute Mahalaobis distance ◼ Let ō be the mean vector for a multivariate data set. Mahalaobis distance for an object o to ō is MDist(o, ō) = (o – ō )T S –1(o – ō) where S is the covariance matrix ◼ ◼ Use the Grubb's test on this measure to detect outliers Method 2. Use χ2 –statistic: ◼ where Ei is the mean of the i-dimension among all objects, and n is the dimensionality ◼ If χ2 –statistic is large, then object oi is an outlier 747 Parametric Methods III: Using Mixture of Parametric Distributions ◼ Assuming data generated by a normal distribution could be sometimes overly simplified ◼ Example (right figure): The objects between the two clusters cannot be captured as outliers since they are close to the estimated mean ◼ To overcome this problem, assume the normal data is generated by two normal distributions. For any object o in the data set, the probability that o is generated by the mixture of the two distributions is given by where fθ1 and fθ2 are the probability density functions of θ1 and θ2 ◼ Then use EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from data ◼ An object o is an outlier if it does not belong to any cluster 748 Non-Parametric Methods: Detection Using Histogram ◼ ◼ ◼ ◼ ◼ The model of normal data is learned from the input data without any a priori structure. Often makes fewer assumptions about the data, and thus can be applicable in more scenarios Outlier detection using histogram: ◼ Figure shows the histogram of purchase amounts in transactions ◼ A transaction in the amount of $7,500 is an outlier, since only 0.2% transactions have an amount higher than $5,000 Problem: Hard to choose an appropriate bin size for histogram ◼ Too small bin size → normal objects in empty/rare bins, false positive ◼ Too big bin size → outliers in some frequent bins, false negative Solution: Adopt kernel density estimation to estimate the probability density distribution of the data. If the estimated density function is high, the object is likely normal. Otherwise, it is likely an outlier. 749 Chapter 12. Outlier Analysis ◼ Outlier and Outlier Analysis ◼ Outlier Detection Methods ◼ Statistical Approaches ◼ Proximity-Base Approaches ◼ Clustering-Base Approaches ◼ Classification Approaches ◼ Mining Contextual and Collective Outliers ◼ Outlier Detection in High Dimensional Data ◼ Summary 750 Proximity-Based Approaches: Distance-Based vs. 
Density-Based Outlier Detection ◼ Intuition: objects that are far away from the others are outliers ◼ Assumption of proximity-based approaches: the proximity of an outlier deviates significantly from that of most of the other objects in the data set ◼ Two types of proximity-based outlier detection methods ◼ Distance-based outlier detection: an object o is an outlier if its neighborhood does not have enough other points ◼ Density-based outlier detection: an object o is an outlier if its density is relatively much lower than that of its neighbors 751
Distance-Based Outlier Detection ◼ For each object o, examine the # of other objects in the r-neighborhood of o, where r is a user-specified distance threshold ◼ An object o is an outlier if most (taking π as a fraction threshold) of the objects in D are far away from o, i.e., not in the r-neighborhood of o ◼ An object o is a DB(r, π) outlier if ||{o′ | dist(o, o′) ≤ r}|| / ||D|| ≤ π ◼ Equivalently, one can check the distance between o and its k-th nearest neighbor ok, where k = ⌈π · ||D||⌉; o is an outlier if dist(o, ok) > r ◼ Efficient computation: nested loop algorithm ◼ For any object oi, calculate its distance from the other objects and count the # of other objects in the r-neighborhood ◼ If π·n other objects are within distance r, terminate the inner loop 752
Distance-Based Outlier Detection: A Grid-Based Method ◼ Why is efficiency still a concern? When the complete set of objects cannot be held in main memory, there is costly I/O swapping ◼ The major cost: (1) each object is tested against the whole data set; why not only against its close neighbors? (2) objects are checked one by one; why not group by group? ◼ Grid-based method (CELL): the data space is partitioned into a multi-dimensional grid; each cell is a hypercube with diagonal length r/2 ◼ Pruning using the level-1 and level-2 cell properties: ◼ For any possible point x in cell C and any possible point y in a level-1 cell, dist(x, y) ≤ r ◼ For any possible point x in cell C and any point y such that dist(x, y) ≥ r, y is in a level-2 cell ◼ Thus we only need to check the objects that cannot be pruned, and even for such an object o, we only need to compute the distance between o and the objects in the level-2 cells (since beyond level-2, the distance from o is more than r) 753
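A minimal sketch of the nested-loop test for DB(r, π) outliers just described: for each object, count how many objects fall in its r-neighborhood and stop the inner loop early once the π·n threshold is reached. The data and function name are illustrative assumptions.

```python
import numpy as np

def db_outliers(X, r, pi):
    """Return indices of DB(r, pi) outliers: objects with at most roughly a
    pi fraction of the data set inside their r-neighborhood."""
    n = len(X)
    need = int(np.ceil(pi * n))          # enough neighbors to be considered "normal"
    outliers = []
    for i in range(n):                   # outer loop over candidate objects
        count = 0
        for j in range(n):               # inner loop: count r-neighbors
            if np.linalg.norm(X[i] - X[j]) <= r:
                count += 1
                if count >= need:        # early termination: o_i cannot be an outlier
                    break
        if count < need:
            outliers.append(i)
    return outliers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), np.array([[9.0, 9.0]])])  # one far-away point
print(db_outliers(X, r=3.0, pi=0.1))    # the isolated point at index 100 is reported
```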
Density-Based Outlier Detection ◼ Local outliers: outliers compared to their local neighborhoods, instead of to the global data distribution ◼ In the figure, o1 and o2 are local outliers relative to C1, o3 is a global outlier, but o4 is not an outlier; however, proximity-based methods cannot identify o1 and o2 as outliers (e.g., when comparing them with o4) ◼ Intuition (density-based outlier detection): the density around an outlier object is significantly different from the density around its neighbors ◼ Method: use the relative density of an object against its neighbors as the indicator of the degree to which the object is an outlier ◼ k-distance of an object o, distk(o): the distance between o and its k-th nearest neighbor ◼ k-distance neighborhood of o: Nk(o) = {o′ | o′ in D, dist(o, o′) ≤ distk(o)} ◼ Nk(o) could be bigger than k since multiple objects may have identical distances to o 754
Local Outlier Factor: LOF ◼ Reachability distance from o′ to o (where k is a user-specified parameter): reachdist_k(o ← o′) = max{distk(o), dist(o, o′)} ◼ Local reachability density of o: lrd_k(o) = ||Nk(o)|| / Σ_{o′∈Nk(o)} reachdist_k(o′ ← o) ◼ LOF (local outlier factor) of an object o is the average, over o's k-nearest neighbors o′, of the ratio of the local reachability density of o′ to that of o: LOF_k(o) = (1 / ||Nk(o)||) Σ_{o′∈Nk(o)} lrd_k(o′) / lrd_k(o) ◼ The lower the local reachability density of o, and the higher the local reachability densities of the kNN of o, the higher the LOF ◼ This captures a local outlier whose local density is relatively low compared to the local densities of its kNN 755
Chapter 12. Outlier Analysis ◼ Outlier and Outlier Analysis ◼ Outlier Detection Methods ◼ Statistical Approaches ◼ Proximity-Based Approaches ◼ Clustering-Based Approaches ◼ Classification Approaches ◼ Mining Contextual and Collective Outliers ◼ Outlier Detection in High Dimensional Data ◼ Summary 756
Clustering-Based Outlier Detection (1 & 2): Not Belonging to Any Cluster, or Far from the Closest One ◼ An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster, or (3) it belongs to a small or sparse cluster ◼ Case 1: not belonging to any cluster ◼ Identify animals not part of a flock: use a density-based clustering method such as DBSCAN ◼ Case 2: far from its closest cluster ◼ Using k-means, partition the data points into clusters ◼ For each object o, assign an outlier score based on its distance from its closest center ◼ If dist(o, co)/avg_dist(co) is large, o is likely an outlier ◼ Ex. intrusion detection: consider the similarity between data points and the clusters in a training data set ◼ Use a training set to find patterns of "normal" data, e.g., frequent itemsets in each segment, and cluster similar connections into groups ◼ Compare new data points with the clusters mined; outliers are possible attacks 757
Clustering-Based Outlier Detection (3): Detecting Outliers in Small Clusters ◼ FindCBLOF: detect outliers in small clusters ◼ Find clusters, and sort them in decreasing size ◼ To each data point, assign a cluster-based local outlier factor (CBLOF): ◼ If object p belongs to a large cluster, CBLOF = cluster size × similarity between p and the cluster ◼ If p belongs to a small one, CBLOF = cluster size × similarity between p and the closest large cluster ◼ Ex. In the figure, o is an outlier since its closest large cluster is C1, but the similarity between o and C1 is small.
Chapter 12. Outlier Analysis
◼ Outlier and Outlier Analysis
◼ Outlier Detection Methods
◼ Statistical Approaches
◼ Proximity-Based Approaches
◼ Clustering-Based Approaches
◼ Classification Approaches
◼ Mining Contextual and Collective Outliers
◼ Outlier Detection in High Dimensional Data
◼ Summary 756
Clustering-Based Outlier Detection (1 & 2): Not Belonging to Any Cluster, or Far from the Closest One
◼ An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster, or (3) it belongs to a small or sparse cluster
◼ Case 1: Does not belong to any cluster
◼ Identify animals not part of a flock: use a density-based clustering method such as DBSCAN
◼ Case 2: Far from its closest cluster
◼ Using k-means, partition the data points into clusters
◼ For each object o, assign an outlier score based on its distance from its closest center
◼ If dist(o, co)/avg_dist(co) is large, o is likely an outlier (see the sketch below)
◼ Ex. Intrusion detection: Consider the similarity between data points and the clusters in a training data set
◼ Use a training set to find patterns of "normal" data, e.g., frequent itemsets in each segment, and cluster similar connections into groups
◼ Compare new data points with the clusters mined: outliers are possible attacks 757
Clustering-Based Outlier Detection (3): Detecting Outliers in Small Clusters
◼ FindCBLOF: Detect outliers in small clusters
◼ Find clusters, and sort them in decreasing size
◼ To each data point, assign a cluster-based local outlier factor (CBLOF):
◼ If a point p belongs to a large cluster, CBLOF = cluster size × similarity between p and its cluster
◼ If p belongs to a small cluster, CBLOF = cluster size × similarity between p and the closest large cluster
◼ Ex. In the figure, o is an outlier since its closest large cluster is C1, but the similarity between o and C1 is small. For any point in C3, its closest large cluster is C2, but its similarity to C2 is low; in addition, |C3| = 3 is small 758
Clustering-Based Method: Strengths and Weaknesses
◼ Strengths
◼ Detect outliers without requiring any labeled data
◼ Work for many types of data
◼ Clusters can be regarded as summaries of the data
◼ Once the clusters are obtained, we need only compare any object against the clusters to determine whether it is an outlier (fast)
◼ Weaknesses
◼ Effectiveness depends highly on the clustering method used, which may not be optimized for outlier detection
◼ High computational cost: need to find the clusters first
◼ A method to reduce the cost: fixed-width clustering
◼ A point is assigned to a cluster if the center of the cluster is within a pre-defined distance threshold from the point
◼ If a point cannot be assigned to any existing cluster, a new cluster is created and the distance threshold may be learned
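As a concrete version of Case 2 above (distance to the closest k-means center, normalized by the average distance within that cluster), here is a short sketch using scikit-learn's KMeans; the helper name, the number of clusters, and the small epsilon guard are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_outlier_ratio(X, n_clusters=3, random_state=0):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    d_center = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)     # dist(o, c_o)
    ratio = np.empty(len(X))
    for c in range(n_clusters):
        members = km.labels_ == c
        ratio[members] = d_center[members] / (d_center[members].mean() + 1e-12)  # dist(o, c_o) / avg_dist(c_o)
    return ratio    # large values suggest objects far from their closest cluster

Thresholding the returned ratio (e.g., flagging values several times larger than 1) gives a simple clustering-based detector; FindCBLOF refines this idea by also taking cluster size into account.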
Chapter 12. Outlier Analysis
◼ Outlier and Outlier Analysis
◼ Outlier Detection Methods
◼ Statistical Approaches
◼ Proximity-Based Approaches
◼ Clustering-Based Approaches
◼ Classification Approaches
◼ Mining Contextual and Collective Outliers
◼ Outlier Detection in High Dimensional Data
◼ Summary 760
Classification-Based Method I: One-Class Model
◼ Idea: Train a classification model that can distinguish "normal" data from outliers
◼ A brute-force approach: Consider a training set that contains samples labeled as "normal" and others labeled as "outlier"
◼ But the training set is typically heavily biased: the # of "normal" samples likely far exceeds the # of outlier samples
◼ Cannot detect unseen anomalies
◼ One-class model: A classifier is built to describe only the normal class
◼ Learn the decision boundary of the normal class using classification methods such as SVM (see the sketch below)
◼ Any samples that do not belong to the normal class (not within the decision boundary) are declared outliers
◼ Advantage: can detect new outliers that may not appear close to any outlier objects in the training set
◼ Extension: Normal objects may belong to multiple classes 761
Classification-Based Method II: Semi-Supervised Learning
◼ Semi-supervised learning: Combine classification-based and clustering-based methods
◼ Method
◼ Using a clustering-based approach, find a large cluster, C, and a small cluster, C1
◼ Since some objects in C carry the label "normal", treat all objects in C as normal
◼ Use the one-class model of this cluster to identify normal objects in outlier detection
◼ Since some objects in cluster C1 carry the label "outlier", declare all objects in C1 outliers
◼ Any object that does not fall into the model for C (such as a) is considered an outlier as well
◼ Comments on classification-based outlier detection methods
◼ Strength: Outlier detection is fast
◼ Bottleneck: Quality heavily depends on the availability and quality of the training set; it is often difficult to obtain representative and high-quality training data 762
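A minimal sketch of the one-class model idea described above, using scikit-learn's OneClassSVM to learn a boundary around the normal class; the synthetic data and the parameter settings (nu, the RBF kernel) are assumptions for illustration only.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_normal = rng.normal(0, 1, (200, 2))                        # training data assumed to be (mostly) normal
X_new = np.vstack([rng.normal(0, 1, (5, 2)), [[6.0, 6.0]]])  # new samples; the last one is far from the normal region

# nu roughly bounds the fraction of training points allowed outside the learned boundary
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal)
print(clf.predict(X_new))   # +1 = inside the normal class boundary, -1 = declared an outlier

Because only the normal class is modeled, a sample like the last one can be flagged even though nothing similar was ever labeled as an outlier during training.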
Chapter 12. Outlier Analysis
◼ Outlier and Outlier Analysis
◼ Outlier Detection Methods
◼ Statistical Approaches
◼ Proximity-Based Approaches
◼ Clustering-Based Approaches
◼ Classification Approaches
◼ Mining Contextual and Collective Outliers
◼ Outlier Detection in High Dimensional Data
◼ Summary 763
Mining Contextual Outliers I: Transform into Conventional Outlier Detection
◼ If the contexts can be clearly identified, transform the problem into conventional outlier detection:
1. Identify the context of the object using the contextual attributes
2. Calculate the outlier score for the object in that context using a conventional outlier detection method
◼ Ex. Detect outlier customers in the context of customer groups
◼ Contextual attributes: age group, postal code
◼ Behavioral attributes: # of transactions/yr, annual total transaction amount
◼ Steps: (1) locate customer c's context, (2) compare c with the other customers in the same group, and (3) use a conventional outlier detection method
◼ If a context contains very few customers, generalize contexts
◼ Ex. Learn a mixture model U of the data on the contextual attributes, and another mixture model V of the data on the behavior attributes
◼ Learn a mapping p(Vi|Uj): the probability that a data object o belonging to cluster Uj on the contextual attributes is generated by cluster Vi on the behavior attributes
◼ Outlier score: how unlikely o's behavior attribute values are, given the behavior clusters associated (through p(Vi|Uj)) with o's contextual clusters 764
Mining Contextual Outliers II: Modeling Normal Behavior with Respect to Contexts
◼ In some applications, one cannot clearly partition the data into contexts
◼ Ex. If a customer suddenly purchases a product that is unrelated to those she recently browsed, it is unclear how many of the products browsed earlier should be considered as the context
◼ Model the "normal" behavior with respect to contexts
◼ Using a training data set, train a model that predicts the expected behavior attribute values with respect to the contextual attribute values
◼ An object is a contextual outlier if its behavior attribute values significantly deviate from the values predicted by the model
◼ By using a prediction model that links the contexts and the behavior, these methods avoid the explicit identification of specific contexts
◼ Methods: A number of classification and prediction techniques can be used to build such models, such as regression, Markov models, and finite state automata 765
Mining Collective Outliers I: On the Set of "Structured Objects"
◼ Collective outlier: a group of objects that, as a whole, deviates significantly from the entire data set
◼ Need to examine the structure of the data set, i.e., the relationships between multiple data objects
◼ Each of these structures is inherent to its respective type of data
◼ For temporal data (such as time series and sequences), we explore the structures formed by time, which occur in segments of the time series or in subsequences
◼ For spatial data, we explore local areas
◼ For graph and network data, we explore subgraphs
◼ Difference from contextual outlier detection: the structures are often not explicitly defined and have to be discovered as part of the outlier detection process
◼ Collective outlier detection methods: two categories
◼ Reduce the problem to conventional outlier detection
◼ Identify structure units, treat each structure unit (e.g., subsequence, time series segment, local area, or subgraph) as a data object, and extract features
◼ Then apply outlier detection to the set of "structured objects" constructed in this way, using the extracted features 766
Mining Collective Outliers II: Direct Modeling of the Expected Behavior of Structure Units
◼ Model the expected behavior of structure units directly
◼ Ex. 1. Detect collective outliers in an online social network of customers
◼ Treat each possible subgraph of the network as a structure unit
◼ Collective outlier: an outlier subgraph in the social network
◼ Small subgraphs that are of very low frequency
◼ Large subgraphs that are surprisingly frequent
◼ Ex. 2. Detect collective outliers in temporal sequences
◼ Learn a Markov model from the sequences
◼ A subsequence can then be declared a collective outlier if it significantly deviates from the model
◼ Collective outlier detection is subtle due to the challenge of exploring the structures in the data
◼ The exploration typically uses heuristics and thus may be application dependent
◼ The computational cost is often high due to the sophisticated mining process 767
Chapter 12. Outlier Analysis
◼ Outlier and Outlier Analysis
◼ Outlier Detection Methods
◼ Statistical Approaches
◼ Proximity-Based Approaches
◼ Clustering-Based Approaches
◼ Classification Approaches
◼ Mining Contextual and Collective Outliers
◼ Outlier Detection in High Dimensional Data
◼ Summary 768
Challenges for Outlier Detection in High-Dimensional Data
◼ Interpretation of outliers
◼ Detecting outliers without saying why they are outliers is not very useful in high dimensions, because many features (dimensions) are involved in a high-dimensional data set
◼ E.g., identify the subspaces that manifest the outliers, or give an assessment of the "outlier-ness" of the objects
◼ Data sparsity
◼ Data in high-dimensional spaces are often sparse
◼ The distance between objects becomes heavily dominated by noise as the dimensionality increases
◼ Data subspaces
◼ Adaptive to the subspaces signifying the outliers
◼ Capturing the local behavior of the data
◼ Scalability with respect to dimensionality
◼ The # of subspaces increases exponentially 769
Approach I: Extending Conventional Outlier Detection
◼ Method 1: Detect outliers in the full space, e.g., the HilOut algorithm
◼ Find distance-based outliers, but use the ranks of distances instead of the absolute distances in outlier detection
◼ For each object o, find its k-nearest neighbors: nn1(o), ..., nnk(o)
◼ The weight of object o: w(o) = Σ i=1..k dist(o, nni(o))
◼ All objects are ranked in weight-descending order
◼ The top-l objects in weight are output as outliers (l: a user-specified parameter)
◼ Employ space-filling curves for approximation: scalable in both time and space w.r.t. data size and dimensionality
◼ Method 2: Dimensionality reduction
◼ Works only when, in the lower-dimensional space, normal instances can still be distinguished from outliers
◼ PCA: Heuristically, the principal components with low variance are preferred because, on such dimensions, normal objects are likely close to each other and outliers often deviate from the majority 770
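The weight-and-rank step of Method 1 above can be sketched as follows; this brute-force version computes exact k-NN distances and therefore omits the space-filling-curve approximation that makes HilOut scalable, and the function and parameter names are illustrative.

import numpy as np

def knn_weight_outliers(X, k=5, top_l=3):
    # w(o) = sum of distances from o to its k nearest neighbors; the top-l
    # objects by weight are reported as outliers.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    weights = np.sort(D, axis=1)[:, :k].sum(axis=1)
    return np.argsort(weights)[::-1][:top_l]   # indices ranked by descending weight

Because only the ranks of the weights matter in the end, the distances can be approximated rather than computed exactly, which is what the space-filling curves are used for.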
Approach II: Finding Outliers in Subspaces
◼ Extending conventional outlier detection: hard for outlier interpretation
◼ Find outliers in much lower-dimensional subspaces: easy to interpret why and to what extent the object is an outlier
◼ E.g., find outlier customers in a certain subspace: average transaction amount >> avg. and purchase frequency << avg.
◼ Ex. A grid-based subspace outlier detection method
◼ Project the data onto various subspaces to find an area whose density is much lower than average
◼ Discretize the data into a grid with φ equi-depth regions per dimension (equi-depth, so that each region holds a fraction f = 1/φ of the data)
◼ Search for regions that are significantly sparse
◼ Consider a k-d cube formed by k ranges on k dimensions, with n objects in total
◼ If the objects are independently distributed, the expected number of objects falling into a k-dimensional region is (1/φ)^k · n = f^k · n, and the standard deviation is sqrt(f^k · (1 − f^k) · n)
◼ The sparsity coefficient of a cube C containing n(C) objects: S(C) = (n(C) − f^k · n) / sqrt(f^k · (1 − f^k) · n)
◼ If S(C) < 0, C contains fewer objects than expected
◼ The more negative S(C) is, the sparser C is and the more likely the objects in C are outliers in the subspace 771
Approach III: Modeling High-Dimensional Outliers
◼ Develop new models for high-dimensional outliers directly
◼ Avoid proximity measures and adopt new heuristics that do not deteriorate in high-dimensional data
(Figure: a set of points forms a cluster, except for the outlier c)
◼ Ex. Angle-based outliers: Kriegel, Schubert, and Zimek [KSZ08]
◼ For each point o, examine the angle ∠xoy for every pair of points x, y
◼ For a point in the center (e.g., a), the angles formed differ widely
◼ For an outlier (e.g., c), the angles vary substantially less
◼ Use the variance of the angles at a point to determine whether it is an outlier
◼ Combine angles and distance to model outliers
◼ Use the distance-weighted angle variance as the outlier score
◼ Angle-based outlier factor (ABOF): ABOF(o) = VAR over all pairs x, y in D of ⟨x − o, y − o⟩ / (||x − o||² · ||y − o||²)
◼ An efficient approximate computation method has been developed
◼ The approach can be generalized to handle arbitrary types of data 772
Chapter 12. Outlier Analysis
◼ Outlier and Outlier Analysis
◼ Outlier Detection Methods
◼ Statistical Approaches
◼ Proximity-Based Approaches
◼ Clustering-Based Approaches
◼ Classification Approaches
◼ Mining Contextual and Collective Outliers
◼ Outlier Detection in High Dimensional Data
◼ Summary 773
Summary
◼ Types of outliers
◼ global, contextual & collective outliers
◼ Outlier detection
◼ supervised, semi-supervised, or unsupervised
◼ Statistical (or model-based) approaches
◼ Proximity-based approaches
◼ Clustering-based approaches
◼ Classification approaches
◼ Mining contextual and collective outliers
◼ Outlier detection in high-dimensional data 774
References (I)
◼ B. Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time series. Biometrika, 66:229–248, 1979.
◼ M. Agyemang, K. Barker, and R. Alhajj. A comprehensive survey of numeric and symbolic outlier mining techniques. Intell. Data Anal., 10:521–538, 2006.
◼ F. J. Anscombe and I. Guttman. Rejection of outliers. Technometrics, 2:123–147, 1960.
◼ D. Agarwal. Detecting anomalies in cross-classified streams: a bayesian approach. Knowl. Inf. Syst., 11:29–44, 2006.
◼ F. Angiulli and C. Pizzuti. Outlier mining in large high-dimensional data sets. TKDE, 2005.
◼ C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. SIGMOD'01
◼ R.J. Beckman and R.D. Cook. Outlier...s. Technometrics, 25:119–149, 1983.
◼ I. Ben-Gal. Outlier detection. In Maimon O. and Rockach L. (eds.) Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic, 2005.
◼ M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD'00
◼ D. Barbará, Y. Li, J. Couto, J.-L. Lin, and S. Jajodia. Bootstrapping a data mining intrusion detection system. SAC'03
◼ Z. A. Bakar, R. Mohemad, A. Ahmad, and M. M. Deris. A comparative study for outlier detection techniques in data mining. IEEE Conf. on Cybernetics and Intelligent Systems, 2006.
◼ S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. KDD'03
◼ D. Barbara, N. Wu, and S. Jajodia.
Detecting novel network intrusion using bayesian estimators. SDM’01 V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41:1–58, 2009. D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data using negative selection algorithm. In CEC’02 References (2) ◼ E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Proc. 2002 Int. Conf. of Data Mining for Security Applications, 2002. ◼ E. Eskin. Anomaly detection over noisy data using learned probability distributions. ICML’00 ◼ T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1:291–316, 1997. ◼ V. J. Hodge and J. Austin. A survey of outlier detection methdologies. Artif. Intell. Rev., 22:85–126, 2004. ◼ D. M. Hawkins. Identification of Outliers. Chapman and Hall, London, 1980. ◼ Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recogn. Lett., 24, June, 2003. ◼ W. Jin, K. H. Tung, and J. Han. Mining top-n local outliers in large databases. KDD’01 ◼ W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood relationship. PAKDD’06 ◼ E. Knorr and R. Ng. A unified notion of outliers: Properties and computation. KDD’97 ◼ E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98 ◼ E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. VLDB J., 8:237– 253, 2000. ◼ H.-P. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-dimensional data. KDD’08 ◼ M. Markou and S. Singh. Novelty detection: A review—part 1: Statistical approaches. Signal Process., 83:2481– 2497, 2003. ◼ M. Markou and S. Singh. Novelty detection: A review—part 2: Neural network based approaches. Signal Process., 83:2499–2521, 2003. ◼ C. C. Noble and D. J. Cook. Graph-based anomaly detection. KDD’03 References (3) ◼ S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using the local correlation integral. ICDE’03 ◼ A. Patcha and J.-M. Park. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Comput. Netw., 51, 2007. ◼ X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19, 2007. ◼ Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric space. KDD’06 ◼ N. Ye and Q. Chen. An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems. Quality and Reliability Engineering International, 17:105–112, 2001. ◼ B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for coevolving time sequences. ICDE’00 Un-Used Slides 778 Statistical Approaches Assume a model underlying distribution that generates data set (e.g. 
normal distribution)
◼ Use discordancy tests depending on
◼ data distribution
◼ distribution parameters (e.g., mean, variance)
◼ number of expected outliers
◼ Drawbacks
◼ most tests are for a single attribute
◼ in many cases, the data distribution may not be known 779
Outlier Discovery: Distance-Based Approach
◼ Introduced to counter the main limitations imposed by statistical methods
◼ We need multi-dimensional analysis without knowing the data distribution
◼ Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O
◼ Algorithms for mining distance-based outliers [Knorr & Ng, VLDB'98]
◼ Index-based algorithm
◼ Nested-loop algorithm
◼ Cell-based algorithm 780
Density-Based Local Outlier Detection
◼ M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD 2000.
◼ Distance-based outlier detection is based on the global distance distribution
◼ It has difficulty identifying outliers if the data is not uniformly distributed
◼ Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, plus 2 outlier points o1, o2
◼ A distance-based method cannot identify o2 as an outlier
◼ Need the concept of local outlier
◼ Local outlier factor (LOF)
◼ Assume outlier-ness is not crisp
◼ Each point has a LOF 781
Outlier Discovery: Deviation-Based Approach
◼ Identifies outliers by examining the main characteristics of objects in a group
◼ Objects that "deviate" from this description are considered outliers
◼ Sequential exception technique
◼ simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects
◼ OLAP data cube technique
◼ uses data cubes to identify regions of anomalies in large multidimensional data 782
References (1)
◼ B. Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time series. Biometrika, 1979.
◼ M. Agyemang, K. Barker, and R. Alhajj. A comprehensive survey of numeric and symbolic outlier mining techniques. Intell. Data Anal., 2006.
◼ D. Agarwal. Detecting anomalies in cross-classified streams: a bayesian approach. Knowl. Inf. Syst., 2006.
◼ C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. SIGMOD'01.
◼ M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. OPTICS-OF: Identifying local outliers. PKDD'99
◼ M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD'00.
◼ V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 2009.
◼ D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data using negative selection algorithm. Computational Intelligence, 2002.
◼ E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Proc. 2002 Int. Conf. of Data Mining for Security Applications, 2002.
◼ E. Eskin. Anomaly detection over noisy data using learned probability distributions. ICML'00.
◼ T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1997.
◼ R. Fujimaki, T. Yairi, and K. Machida. An approach to spacecraft anomaly detection problem using kernel feature space. KDD'05
◼ F. E. Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 1969. 783
References (2)
◼ V. Hodge and J. Austin. A survey of outlier detection methodologies. Artif. Intell. Rev., 2004.
◼ D. M. Hawkins. Identification of Outliers. Chapman and Hall, 1980.
◼ P. S. Horn, L. Feng, Y. Li, and A. J. Pesce. Effect of Outliers and Nonhealthy Individuals on Reference Interval Estimation. Clin Chem, 2001.
◼ W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood relationship. PAKDD'06
◼ E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98
◼ M. Markou and S. Singh. Novelty detection: A review—part 1: Statistical approaches. Signal Process., 83(12), 2003.
◼ M. Markou and S. Singh. Novelty detection: A review—part 2: Neural network based approaches. Signal Process., 83(12), 2003.
◼ S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral. ICDE'03.
◼ A. Patcha and J.-M. Park. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Comput. Netw., 51(12):3448–3470, 2007.
◼ W. Stefansky. Rejecting outliers in factorial designs. Technometrics, 14(2):469–479, 1972.
◼ X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19(5):631–645, 2007.
◼ Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric space. KDD'06
◼ N. Ye and Q. Chen. An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems. Quality and Reliability Engineering International, 2001. 784
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 13 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
Chapter 13: Data Mining Trends and Research Frontiers
◼ Mining Complex Types of Data
◼ Other Methodologies of Data Mining
◼ Data Mining Applications
◼ Data Mining and Society
◼ Data Mining Trends
◼ Summary 787
Mining Complex Types of Data
◼ Mining Sequence Data
◼ Mining Time Series
◼ Mining Symbolic Sequences
◼ Mining Biological Sequences
◼ Mining Graphs and Networks
◼ Mining Other Kinds of Data 788
Mining Sequence Data
◼ Similarity Search in Time Series Data
◼ Subsequence match, dimensionality reduction, query-based similarity search, motif-based similarity search
◼ Regression and Trend Analysis in Time-Series Data
◼ long term + cyclic + seasonal variation + random movements
◼ Sequential Pattern Mining in Symbolic Sequences
◼ GSP, PrefixSpan, constraint-based sequential pattern mining
◼ Sequence Classification
◼ Feature-based vs. sequence-distance-based vs. model-based
◼ Alignment of Biological Sequences
◼ Pair-wise vs. multi-sequence alignment, substitution matrices, BLAST
◼ Hidden Markov Model for Biological Sequence Analysis
◼ Markov chain vs. hidden Markov models; forward vs. Viterbi vs. Baum-Welch algorithms 789
Mining Graphs and Networks
◼ Graph Pattern Mining
◼ Statistical Modeling of Networks
◼ Small world phenomenon, power law (long-tail) distribution, densification
◼ Clustering and Classification of Graphs and Homogeneous Networks
◼ Clustering: Fast Modularity vs. SCAN
◼ Classification: model vs. pattern-based mining
◼ Clustering, Ranking and Classification of Heterogeneous Networks
◼ Frequent subgraph patterns, closed graph patterns, gSpan vs.
CloseGraph RankClus, RankClass, and meta path-based, user-guided methodology Role Discovery and Link Prediction in Information Networks ◼ PathPredict ◼ Similarity Search and OLAP in Information Networks: PathSim, GraphCube ◼ Evolution of Social and Information Networks: EvoNetClus 790 Mining Other Kinds of Data ◼ Mining Spatial Data ◼ ◼ Mining Spatiotemporal and Moving Object Data ◼ ◼ Topic modeling, i-topic model, integration with geo- and networked data Mining Web Data ◼ ◼ Social media data, geo-tagged spatial clustering, periodicity analysis, … Mining Text Data ◼ ◼ Applications: healthcare, air-traffic control, flood simulation Mining Multimedia Data ◼ ◼ Spatiotemporal data mining, trajectory mining, periodica, swarm, … Mining Cyber-Physical System Data ◼ ◼ Spatial frequent/co-located patterns, spatial clustering and classification Web content, web structure, and web usage mining Mining Data Streams ◼ Dynamics, one-pass, patterns, clustering, classification, outlier detection 791 Chapter 13: Data Mining Trends and Research Frontiers ◼ Mining Complex Types of Data ◼ Other Methodologies of Data Mining ◼ Data Mining Applications ◼ Data Mining and Society ◼ Data Mining Trends ◼ Summary 792 Other Methodologies of Data Mining ◼ Statistical Data Mining ◼ Views on Data Mining Foundations ◼ Visual and Audio Data Mining 793 Major Statistical Data Mining Methods ◼ Regression ◼ Generalized Linear Model ◼ Analysis of Variance ◼ Mixed-Effect Models ◼ Factor Analysis ◼ Discriminant Analysis ◼ Survival Analysis 794 Statistical Data Mining (1) ◼ ◼ There are many well-established statistical techniques for data analysis, particularly for numeric data ◼ applied extensively to data from scientific experiments and data from economics and the social sciences Regression predict the value of a response (dependent) variable from one or more predictor (independent) variables where the variables are numeric ◼ forms of regression: linear, multiple, weighted, polynomial, nonparametric, and robust ◼ 795 Scientific and Statistical Data Mining (2) ◼ ◼ Generalized linear models ◼ allow a categorical response variable (or some transformation of it) to be related to a set of predictor variables ◼ similar to the modeling of a numeric response variable using linear regression ◼ include logistic regression and Poisson regression Mixed-effect models For analyzing grouped data, i.e. 
data that can be classified according to one or more grouping variables ◼ Typically describe relationships between a response variable and some covariates in data grouped according to one or more factors ◼ 796 Scientific and Statistical Data Mining (3) ◼ Regression trees ◼ Binary trees used for classification and prediction ◼ ◼ ◼ Similar to decision trees:Tests are performed at the internal nodes In a regression tree the mean of the objective attribute is computed and used as the predicted value Analysis of variance ◼ Analyze experimental data for two or more populations described by a numeric response variable and one or more categorical variables (factors) 797 Statistical Data Mining (4) ◼ ◼ Factor analysis ◼ determine which variables are combined to generate a given factor ◼ e.g., for many psychiatric data, one can indirectly measure other quantities (such as test scores) that reflect the factor of interest Discriminant analysis ◼ predict a categorical response variable, commonly used in social science ◼ Attempts to determine several discriminant functions (linear combinations of the independent variables) that discriminate among the groups defined by the response variable www.spss.com/datamine/factor.htm 798 Statistical Data Mining (5) ◼ Time series: many methods such as autoregression, ARIMA (Autoregressive integrated moving-average modeling), long memory time-series modeling ◼ Quality control: displays group summary charts ◼ Survival analysis ❑ Predicts the probability that a patient undergoing a medical treatment would survive at least to time t (life span prediction) 799 Other Methodologies of Data Mining ◼ Statistical Data Mining ◼ Views on Data Mining Foundations ◼ Visual and Audio Data Mining 800 Views on Data Mining Foundations (I) ◼ ◼ Data reduction ◼ Basis of data mining: Reduce data representation ◼ Trades accuracy for speed in response Data compression ◼ ◼ Basis of data mining: Compress the given data by encoding in terms of bits, association rules, decision trees, clusters, etc. Probability and statistical theory ◼ Basis of data mining: Discover joint probability distributions of random variables 801 Views on Data Mining Foundations (II) ◼ Microeconomic view ◼ ◼ A view of utility: Finding patterns that are interesting only to the extent in that they can be used in the decision-making process of some enterprise Pattern Discovery and Inductive databases ◼ ◼ ◼ ◼ Basis of data mining: Discover patterns occurring in the database, such as associations, classification models, sequential patterns, etc. 
Data mining is the problem of performing inductive logic on databases
◼ The task is to query the data and the theory (i.e., patterns) of the database
◼ Popular among many researchers in database systems 802
Other Methodologies of Data Mining
◼ Statistical Data Mining
◼ Views on Data Mining Foundations
◼ Visual and Audio Data Mining 803
Visual Data Mining
◼ Visualization: Use of computer graphics to create visual images that aid in the understanding of complex, often massive representations of data
◼ Visual Data Mining: discovering implicit but useful knowledge from large data sets using visualization techniques
(Figure: visual data mining lies at the intersection of computer graphics, multimedia systems, human-computer interfaces, pattern recognition, and high-performance computing.) 804
Visualization
◼ Purpose of Visualization
◼ Gain insight into an information space by mapping data onto graphical primitives
◼ Provide a qualitative overview of large data sets
◼ Search for patterns, trends, structure, irregularities, and relationships among data
◼ Help find interesting regions and suitable parameters for further quantitative analysis
◼ Provide a visual proof of the computer representations derived 805
Visual Data Mining & Data Visualization
◼ Integration of visualization and data mining
◼ data visualization
◼ data mining result visualization
◼ data mining process visualization
◼ interactive visual data mining
◼ Data visualization
◼ Data in a database or data warehouse can be viewed
◼ at different levels of abstraction
◼ as different combinations of attributes or dimensions
◼ Data can be presented in various visual forms 806
Data Mining Result Visualization
◼ Presentation of the results or knowledge obtained from data mining in visual forms
◼ Examples
◼ Scatter plots and boxplots (obtained from descriptive data mining)
◼ Decision trees
◼ Association rules
◼ Clusters
◼ Outliers
◼ Generalized rules 807
Boxplots from StatSoft: Multiple Variable Combinations 808
Visualization of Data Mining Results in SAS Enterprise Miner: Scatter Plots 809
Visualization of Association Rules in SGI/MineSet 3.0 810
Visualization of a Decision Tree in SGI/MineSet 3.0 811
Visualization of Cluster Grouping in IBM Intelligent Miner 812
Data Mining Process Visualization
◼ Presentation of the various processes of data mining in visual forms so that users can see
◼ the data extraction process
◼ where the data is extracted
◼ how the data is cleaned, integrated, preprocessed, and mined
◼ the method selected for data mining
◼ where the results are stored
◼ how they may be viewed 813
Visualization of Data Mining Processes by Clementine: see your solution discovery process clearly; understand variations with visualized data 814
Interactive Visual Data Mining
◼ Using visualization tools in the data mining process to help users make smart data mining decisions
◼ Example
◼ Display the data distribution in a set of attributes using colored sectors or columns (depending on whether the whole space is represented by a circle or a set of columns)
◼ Use the display to decide which sector should first be selected for classification and where a good split point for this sector may be 815
Interactive Visual Mining by Perception-Based Classification (PBC) 816
Audio Data Mining
◼ Uses audio signals to indicate the patterns of data or the features of data mining results
◼ An interesting alternative to visual mining
◼ It is the inverse of the task of mining audio (such as music) databases, which is to find patterns from audio data
◼ Visual data mining may disclose interesting patterns using graphical displays, but
requires users to concentrate on watching patterns Instead, transform patterns into sound and music and listen to pitches, rhythms, tune, and melody in order to identify anything interesting or unusual 817 Chapter 13: Data Mining Trends and Research Frontiers ◼ Mining Complex Types of Data ◼ Other Methodologies of Data Mining ◼ Data Mining Applications ◼ Data Mining and Society ◼ Data Mining Trends ◼ Summary 818 Data Mining Applications ◼ ◼ Data mining: A young discipline with broad and diverse applications ◼ There still exists a nontrivial gap between generic data mining methods and effective and scalable data mining tools for domain-specific applications Some application domains (briefly discussed here) ◼ Data Mining for Financial data analysis ◼ Data Mining for Retail and Telecommunication Industries ◼ Data Mining in Science and Engineering ◼ Data Mining for Intrusion Detection and Prevention ◼ Data Mining and Recommender Systems 819 Data Mining for Financial Data Analysis (I) ◼ ◼ ◼ Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality Design and construction of data warehouses for multidimensional data analysis and data mining ◼ View the debt and revenue changes by month, by region, by sector, and by other factors ◼ Access statistical information such as max, min, total, average, trend, etc. Loan payment prediction/consumer credit policy analysis ◼ feature selection and attribute relevance ranking ◼ Loan payment performance ◼ Consumer credit rating 820 Data Mining for Financial Data Analysis (II) ◼ ◼ Classification and clustering of customers for targeted marketing ◼ multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer to an appropriate customer group Detection of money laundering and other financial crimes ◼ integration of from multiple DBs (e.g., bank transactions, federal/state crime history DBs) ◼ Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences) 821 Data Mining for Retail & Telcomm. Industries (I) ◼ ◼ Retail industry: huge amounts of data on sales, customer shopping history, e-commerce, etc. Applications of retail data mining ◼ Identify customer buying behaviors ◼ Discover customer shopping patterns and trends ◼ Improve the quality of customer service ◼ Achieve better customer retention and satisfaction ◼ Enhance goods consumption ratios ◼ ◼ Design more effective goods transportation and distribution policies Telcomm. 
and many other industries: Share many similar goals and expectations of retail data mining 822 Data Mining Practice for Retail Industry ◼ ◼ Design and construction of data warehouses Multidimensional analysis of sales, customers, products, time, and region ◼ Analysis of the effectiveness of sales campaigns ◼ Customer retention: Analysis of customer loyalty ◼ ◼ ◼ Use customer loyalty card information to register sequences of purchases of particular customers Use sequential pattern mining to investigate changes in customer consumption or loyalty Suggest adjustments on the pricing and variety of goods ◼ Product recommendation and cross-reference of items ◼ Fraudulent analysis and the identification of usual patterns ◼ Use of visualization tools in data analysis 823 Data Mining in Science and Engineering ◼ Data warehouses and data preprocessing ◼ ◼ Mining complex data types ◼ ◼ Resolving inconsistencies or incompatible data collected in diverse environments and different periods (e.g. eco-system studies) Spatiotemporal, biological, diverse semantics and relationships Graph-based and network-based mining ◼ Links, relationships, data flow, etc. ◼ Visualization tools and domain-specific knowledge ◼ Other issues ◼ ◼ Data mining in social sciences and social studies: text and social media Data mining in computer science: monitoring systems, software bugs, network intrusion 824 Data Mining for Intrusion Detection and Prevention ◼ Majority of intrusion detection and prevention systems use ◼ ◼ ◼ Signature-based detection: use signatures, attack patterns that are preconfigured and predetermined by domain experts Anomaly-based detection: build profiles (models of normal behavior) and detect those that are substantially deviate from the profiles What data mining can help ◼ ◼ ◼ New data mining algorithms for intrusion detection Association, correlation, and discriminative pattern analysis help select and build discriminative classifiers Analysis of stream data: outlier detection, clustering, model shifting ◼ Distributed data mining ◼ Visualization and querying tools 825 Data Mining and Recommender Systems ◼ ◼ ◼ Recommender systems: Personalization, making product recommendations that are likely to be of interest to a user Approaches: Content-based, collaborative, or their hybrid ◼ Content-based: Recommends items that are similar to items the user preferred or queried in the past ◼ Collaborative filtering: Consider a user's social environment, opinions of other customers who have similar tastes or preferences Data mining and recommender systems ◼ Users C × items S: extract from known to unknown ratings to predict user-item combinations ◼ Memory-based method often uses k-nearest neighbor approach ◼ Model-based method uses a collection of ratings to learn a model (e.g., probabilistic models, clustering, Bayesian networks, etc.) ◼ Hybrid approaches integrate both to improve performance (e.g., using ensemble) 826 Chapter 13: Data Mining Trends and Research Frontiers ◼ Mining Complex Types of Data ◼ Other Methodologies of Data Mining ◼ Data Mining Applications ◼ Data Mining and Society ◼ Data Mining Trends ◼ Summary 827 Ubiquitous and Invisible Data Mining ◼ ◼ Ubiquitous Data Mining ◼ Data mining is used everywhere, e.g., online shopping ◼ Ex. Customer relationship management (CRM) Invisible Data Mining ◼ ◼ ◼ ◼ ◼ Invisible: Data mining functions are built in daily life operations Ex. 
Google search: Users may be unaware that they are examining results returned by data Invisible data mining is highly desirable Invisible mining needs to consider efficiency and scalability, user interaction, incorporation of background knowledge and visualization techniques, finding interesting patterns, real-time, … Further work: Integration of data mining into existing business and scientific technologies to provide domain-specific data mining tools 828 Privacy, Security and Social Impacts of Data Mining ◼ Many data mining applications do not touch personal data ◼ ◼ ◼ E.g., meteorology, astronomy, geography, geology, biology, and other scientific and engineering data Many DM studies are on developing scalable algorithms to find general or statistically significant patterns, not touching individuals The real privacy concern: unconstrained access of individual records, especially privacy-sensitive information ◼ Method 1: Removing sensitive IDs associated with the data ◼ Method 2: Data security-enhancing methods ◼ ◼ ◼ Multi-level security model: permit to access to only authorized level Encryption: e.g., blind signatures, biometric encryption, and anonymous databases (personal information is encrypted and stored at different locations) Method 3: Privacy-preserving data mining methods 829 Privacy-Preserving Data Mining ◼ ◼ Privacy-preserving (privacy-enhanced or privacy-sensitive) mining: ◼ Obtaining valid mining results without disclosing the underlying sensitive data values ◼ Often needs trade-off between information loss and privacy Privacy-preserving data mining methods: ◼ Randomization (e.g., perturbation): Add noise to the data in order to mask some attribute values of records ◼ K-anonymity and l-diversity: Alter individual records so that they cannot be uniquely identified ◼ ◼ ◼ ◼ k-anonymity: Any given record maps onto at least k other records l-diversity: enforcing intra-group diversity of sensitive values Distributed privacy preservation: Data partitioned and distributed either horizontally, vertically, or a combination of both Downgrading the effectiveness of data mining: The output of data mining may violate privacy ◼ Modify data or mining results, e.g., hiding some association rules or slightly distorting some classification models 830 Chapter 13: Data Mining Trends and Research Frontiers ◼ Mining Complex Types of Data ◼ Other Methodologies of Data Mining ◼ Data Mining Applications ◼ Data Mining and Society ◼ Data Mining Trends ◼ Summary 831 Trends of Data Mining ◼ Application exploration: Dealing with application-specific problems ◼ Scalable and interactive data mining methods ◼ Integration of data mining with Web search engines, database systems, data warehouse systems and cloud computing systems ◼ Mining social and information networks ◼ Mining spatiotemporal, moving objects and cyber-physical systems ◼ Mining multimedia, text and web data ◼ Mining biological and biomedical data ◼ Data mining with software engineering and system engineering ◼ Visual and audio data mining ◼ Distributed data mining and real-time data stream mining ◼ Privacy protection and information security in data mining 832 Chapter 13: Data Mining Trends and Research Frontiers ◼ Mining Complex Types of Data ◼ Other Methodologies of Data Mining ◼ Data Mining Applications ◼ Data Mining and Society ◼ Data Mining Trends ◼ Summary 833 Summary ◼ ◼ We present a high-level overview of mining complex data types Statistical data mining methods, such as regression, generalized linear models, analysis of variance, 
etc., are popularly adopted ◼ Researchers also try to build theoretical foundations for data mining ◼ Visual/audio data mining has been popular and effective ◼ ◼ ◼ ◼ Application-based mining integrates domain-specific knowledge with data analysis techniques and provide mission-specific solutions Ubiquitous data mining and invisible data mining are penetrating our data lives Privacy and data security are importance issues in data mining, and privacy-preserving data mining has been developed recently Our discussion on trends in data mining shows that data mining is a promising, young field, with great, strategic importance 834 References and Further Reading ❖ The books lists a lot of references for further reading. Here we only list a few books ◼ E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011 ◼ ◼ ◼ ◼ ◼ ◼ S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002 R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed., Wiley-Interscience, 2000 D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press, 2010. U. Fayyad, G. Grinstein, and A. Wierse (eds.), Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001 J. Han, M. Kamber, J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed. 2011 T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009 ◼ D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009. ◼ B. Liu. Web Data Mining, Springer 2006. ◼ T. M. Mitchell. Machine Learning, McGraw Hill, 1997 ◼ M. Newman. Networks: An Introduction. Oxford University Press, 2010. ◼ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005 ◼ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005 835 836