
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 1 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
1
Chapter 1. Introduction
◼
Why Data Mining?
◼
What Is Data Mining?
◼
A Multi-Dimensional View of Data Mining
◼
What Kind of Data Can Be Mined?
◼
What Kinds of Patterns Can Be Mined?
◼
What Technologies Are Used?
◼
What Kind of Applications Are Targeted?
◼
Major Issues in Data Mining
◼
A Brief History of Data Mining and Data Mining Society
◼
Summary
2
Why Data Mining?
◼
The Explosive Growth of Data: from terabytes to petabytes
◼
Data collection and data availability
◼
Automated data collection tools, database systems, Web,
computerized society
◼
Major sources of abundant data
◼
Business: Web, e-commerce, transactions, stocks, …
◼
Science: Remote sensing, bioinformatics, scientific simulation, …
◼
Society and everyone: news, digital cameras, YouTube
◼
We are drowning in data, but starving for knowledge!
◼
“Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
3
Evolution of Sciences
◼ Before 1600: empirical science
◼ 1600-1950s: theoretical science
  ◼ Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
◼ 1950s-1990s: computational science
  ◼ Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
  ◼ Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
◼ 1990-now: data science
  ◼ The flood of data from new scientific instruments and simulations
  ◼ The ability to economically store and manage petabytes of data online
  ◼ The Internet and computing Grid that make all these archives universally accessible
  ◼ Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
◼ Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
4
Evolution of Database Technology
◼ 1960s: Data collection, database creation, IMS and network DBMS
◼ 1970s: Relational data model, relational DBMS implementation
◼ 1980s:
  ◼ RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  ◼ Application-oriented DBMS (spatial, scientific, engineering, etc.)
◼ 1990s: Data mining, data warehousing, multimedia databases, and Web databases
◼ 2000s:
  ◼ Stream data management and mining
  ◼ Data mining and its applications
  ◼ Web technology (XML, data integration) and global information systems
5
Chapter 1. Introduction
◼
Why Data Mining?
◼
What Is Data Mining?
◼
A Multi-Dimensional View of Data Mining
◼
What Kind of Data Can Be Mined?
◼
What Kinds of Patterns Can Be Mined?
◼
What Technologies Are Used?
◼
What Kind of Applications Are Targeted?
◼
Major Issues in Data Mining
◼
A Brief History of Data Mining and Data Mining Society
◼
Summary
6
What Is Data Mining?
◼ Data mining (knowledge discovery from data)
  ◼ Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  ◼ Data mining: a misnomer?
◼ Alternative names
  ◼ Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
◼ Watch out: Is everything "data mining"?
  ◼ Simple search and query processing
  ◼ (Deductive) expert systems
7
Knowledge Discovery (KDD) Process
◼ This is a view from the typical database systems and data warehousing communities
◼ Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
8
Example: A Web Mining Framework
◼
Web mining usually involves
◼
Data cleaning
◼
Data integration from multiple sources
◼
Warehousing the data
◼
Data cube construction
◼
Data selection for data mining
◼
Data mining
◼
Presentation of the mining results
◼
Patterns and knowledge to be used or stored
into knowledge-base
9
Data Mining in Business Intelligence
Increasing potential to support business decisions, from bottom to top (the role working at each layer is shown in parentheses):
◼ Data Sources: paper, files, Web documents, scientific experiments, database systems (DBA)
◼ Data Preprocessing/Integration, Data Warehouses (DBA)
◼ Data Exploration: statistical summary, querying, and reporting (Data Analyst)
◼ Data Mining: information discovery (Data Analyst)
◼ Data Presentation: visualization techniques (Business Analyst)
◼ Decision Making (End User)
10
Example: Mining vs. Data Exploration
◼ Business intelligence view
  ◼ Warehouse, data cube, reporting, but not much mining
◼ Business objects vs. data mining tools
◼ Supply chain example: tools
◼ Data presentation
◼ Exploration
11
KDD Process: A Typical View from ML and
Statistics
Input Data → Data Pre-Processing → Data Mining → Post-Processing
◼ Data pre-processing: data integration, normalization, feature selection, dimension reduction
◼ Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
◼ Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
◼ This is a view from the typical machine learning and statistics communities
12
Example: Medical Data Mining
◼ Health care and medical data mining have often adopted such a view from statistics and machine learning
◼ Preprocessing of the data (including feature extraction and dimension reduction)
◼ Classification and/or clustering processes
◼ Post-processing for presentation
13
Chapter 1. Introduction
◼
Why Data Mining?
◼
What Is Data Mining?
◼
A Multi-Dimensional View of Data Mining
◼
What Kind of Data Can Be Mined?
◼
What Kinds of Patterns Can Be Mined?
◼
What Technologies Are Used?
◼
What Kind of Applications Are Targeted?
◼
Major Issues in Data Mining
◼
A Brief History of Data Mining and Data Mining Society
◼
Summary
14
Multi-Dimensional View of Data Mining
◼ Data to be mined
  ◼ Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, streams, spatiotemporal, time-series, sequence, text and Web, multimedia, graphs & social and information networks
◼ Knowledge to be mined (or: data mining functions)
  ◼ Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  ◼ Descriptive vs. predictive data mining
  ◼ Multiple/integrated functions and mining at multiple levels
◼ Techniques utilized
  ◼ Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
◼ Applications adapted
  ◼ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
15
Chapter 1. Introduction
◼
Why Data Mining?
◼
What Is Data Mining?
◼
A Multi-Dimensional View of Data Mining
◼
What Kind of Data Can Be Mined?
◼
What Kinds of Patterns Can Be Mined?
◼
What Technologies Are Used?
◼
What Kind of Applications Are Targeted?
◼
Major Issues in Data Mining
◼
A Brief History of Data Mining and Data Mining Society
◼
Summary
16
Data Mining: On What Kinds of Data?
◼
Database-oriented data sets and applications
◼
◼
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
◼
Data streams and sensor data
◼
Time-series data, temporal data, sequence data (incl. bio-sequences)
◼
Structured data, graphs, social networks and multi-linked data
◼
Object-relational databases
◼
Heterogeneous databases and legacy databases
◼
Spatial data and spatiotemporal data
◼
Multimedia database
◼
Text databases
◼
The World-Wide Web
17
Chapter 1. Introduction
◼
Why Data Mining?
◼
What Is Data Mining?
◼
A Multi-Dimensional View of Data Mining
◼
What Kind of Data Can Be Mined?
◼
What Kinds of Patterns Can Be Mined?
◼
What Technologies Are Used?
◼
What Kind of Applications Are Targeted?
◼
Major Issues in Data Mining
◼
A Brief History of Data Mining and Data Mining Society
◼
Summary
18
Data Mining Function: (1) Generalization
◼ Information integration and data warehouse construction
  ◼ Data cleaning, transformation, integration, and multidimensional data model
◼ Data cube technology
  ◼ Scalable methods for computing (i.e., materializing) multidimensional aggregates
  ◼ OLAP (online analytical processing)
◼ Multidimensional concept description: characterization and discrimination
  ◼ Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
19
Data Mining Function: (2) Association and
Correlation Analysis
◼ Frequent patterns (or frequent itemsets)
  ◼ What items are frequently purchased together in your Walmart?
◼ Association, correlation vs. causality
  ◼ A typical association rule: Diaper → Beer [0.5%, 75%] (support, confidence)
  ◼ Are strongly associated items also strongly correlated?
◼ How to mine such patterns and rules efficiently in large datasets?
◼ How to use such patterns for classification, clustering, and other applications?
20
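The support/confidence notation above can be made concrete with a minimal sketch, not from the slides: the transactions below are made-up illustration data, and the rule Diaper → Beer is scored by plain counting.

```python
# Minimal sketch: support and confidence of an association rule such as Diaper -> Beer.
# The transactions are made up for illustration.
transactions = [
    {"Bread", "Diaper", "Beer", "Milk"},
    {"Bread", "Diaper", "Beer"},
    {"Diaper", "Coke"},
    {"Bread", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"Diaper", "Beer"}, transactions))       # 0.5
print(confidence({"Diaper"}, {"Beer"}, transactions))  # 0.666...
```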
Data Mining Function: (3) Classification
◼ Classification and label prediction
  ◼ Construct models (functions) based on some training examples
  ◼ Describe and distinguish classes or concepts for future prediction
    ◼ E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  ◼ Predict some unknown class labels
◼ Typical methods
  ◼ Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
◼ Typical applications
  ◼ Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …
21
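As a hedged illustration of the classification workflow above (construct a model from training examples, then predict unknown labels), the sketch below uses scikit-learn's decision tree; the library choice, feature names and toy data are assumptions, not part of the slides.

```python
# Sketch: train a decision tree on labeled examples and predict new labels.
# Features (gas mileage in mpg, horsepower) and labels are made up for illustration.
from sklearn.tree import DecisionTreeClassifier

X_train = [[40, 90], [35, 110], [18, 300], [15, 350]]
y_train = ["economy", "economy", "sports", "sports"]

model = DecisionTreeClassifier(max_depth=2)   # construct a model from training examples
model.fit(X_train, y_train)

print(model.predict([[32, 120], [17, 320]]))  # predict unknown class labels
```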
Data Mining Function: (4) Cluster Analysis
◼ Unsupervised learning (i.e., class label is unknown)
◼ Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
◼ Principle: maximize intra-class similarity & minimize inter-class similarity
◼ Many methods and applications
22
Data Mining Function: (5) Outlier Analysis
◼
Outlier analysis
◼
◼
Outlier: A data object that does not comply with the general
behavior of the data
Noise or exception? ― One person’s garbage could be another
person’s treasure
◼
Methods: by-product of clustering or regression analysis, …
◼
Useful in fraud detection, rare events analysis
23
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
◼ Sequence, trend and evolution analysis
  ◼ Trend, time-series, and deviation analysis: e.g., regression and value prediction
  ◼ Sequential pattern mining
    ◼ E.g., first buy digital camera, then buy large SD memory cards
  ◼ Periodicity analysis
  ◼ Motifs and biological sequence analysis
    ◼ Approximate and consecutive motifs
  ◼ Similarity-based analysis
◼ Mining data streams
24
Structure and Network Analysis
◼ Graph mining
  ◼ Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
◼ Information network analysis
  ◼ Social networks: actors (objects, nodes) and relationships (edges)
    ◼ e.g., author networks in CS, terrorist networks
  ◼ Multiple heterogeneous networks
    ◼ A person could be in multiple information networks: friends, family, classmates, …
  ◼ Links carry a lot of semantic information: link mining
◼ Web mining
  ◼ The Web is a big information network: from PageRank to Google
  ◼ Analysis of Web information networks
    ◼ Web community discovery, opinion mining, usage mining, …
25
Evaluation of Knowledge
◼ Is all mined knowledge interesting?
  ◼ One can mine a tremendous amount of "patterns" and knowledge
  ◼ Some may fit only a certain dimension space (time, location, …)
  ◼ Some may not be representative, may be transient, …
◼ Evaluation of mined knowledge → directly mine only interesting knowledge?
◼
Descriptive vs. predictive
◼
Coverage
◼
Typicality vs. novelty
◼
Accuracy
◼
Timeliness
◼
…
26
Chapter 1. Introduction
◼
Why Data Mining?
◼
What Is Data Mining?
◼
A Multi-Dimensional View of Data Mining
◼
What Kind of Data Can Be Mined?
◼
What Kinds of Patterns Can Be Mined?
◼
What Technologies Are Used?
◼
What Kind of Applications Are Targeted?
◼
Major Issues in Data Mining
◼
A Brief History of Data Mining and Data Mining Society
◼
Summary
27
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the confluence of multiple disciplines: Machine Learning, Pattern Recognition, Statistics, Visualization, High-Performance Computing, Algorithms, Database Technology, and Applications]
28
Why Confluence of Multiple Disciplines?
◼ Tremendous amount of data
  ◼ Algorithms must be highly scalable to handle data on the order of terabytes
◼ High-dimensionality of data
  ◼ Micro-arrays may have tens of thousands of dimensions
◼ High complexity of data
  ◼ Data streams and sensor data
  ◼ Time-series data, temporal data, sequence data
  ◼ Structured data, graphs, social networks and multi-linked data
  ◼ Heterogeneous databases and legacy databases
  ◼ Spatial, spatiotemporal, multimedia, text and Web data
  ◼ Software programs, scientific simulations
◼ New and sophisticated applications
29
Chapter 1. Introduction
◼
Why Data Mining?
◼
What Is Data Mining?
◼
A Multi-Dimensional View of Data Mining
◼
What Kind of Data Can Be Mined?
◼
What Kinds of Patterns Can Be Mined?
◼
What Technologies Are Used?
◼
What Kind of Applications Are Targeted?
◼
Major Issues in Data Mining
◼
A Brief History of Data Mining and Data Mining Society
◼
Summary
30
Applications of Data Mining
◼
Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
◼ Collaborative analysis & recommender systems
◼ Basket data analysis to targeted marketing
◼ Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
◼ Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
◼ From major dedicated data mining systems/tools (e.g., SAS, MS SQL Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
31
Chapter 1. Introduction
◼
Why Data Mining?
◼
What Is Data Mining?
◼
A Multi-Dimensional View of Data Mining
◼
What Kind of Data Can Be Mined?
◼
What Kinds of Patterns Can Be Mined?
◼
What Technologies Are Used?
◼
What Kind of Applications Are Targeted?
◼
Major Issues in Data Mining
◼
A Brief History of Data Mining and Data Mining Society
◼
Summary
32
Major Issues in Data Mining (1)
◼
◼
Mining Methodology
◼
Mining various and new kinds of knowledge
◼
Mining knowledge in multi-dimensional space
◼
Data mining: An interdisciplinary effort
◼
Boosting the power of discovery in a networked environment
◼
Handling noise, uncertainty, and incompleteness of data
◼
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
◼
Interactive mining
◼
Incorporation of background knowledge
◼
Presentation and visualization of data mining results
33
Major Issues in Data Mining (2)
◼
◼
◼
Efficiency and Scalability
◼
Efficiency and scalability of data mining algorithms
◼
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
◼
Handling complex types of data
◼
Mining dynamic, networked, and global data repositories
Data mining and society
◼
Social impacts of data mining
◼
Privacy-preserving data mining
◼
Invisible data mining
34
Chapter 1. Introduction
◼
Why Data Mining?
◼
What Is Data Mining?
◼
A Multi-Dimensional View of Data Mining
◼
What Kind of Data Can Be Mined?
◼
What Kinds of Patterns Can Be Mined?
◼
What Technologies Are Used?
◼
What Kind of Applications Are Targeted?
◼
Major Issues in Data Mining
◼
A Brief History of Data Mining and Data Mining Society
◼
Summary
35
A Brief History of Data Mining Society
◼ 1989 IJCAI Workshop on Knowledge Discovery in Databases
  ◼ Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
◼ 1991-1994 Workshops on Knowledge Discovery in Databases
  ◼ Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
◼ 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  ◼ Journal of Data Mining and Knowledge Discovery (1997)
◼ ACM SIGKDD conferences since 1998 and SIGKDD Explorations
◼ More conferences on data mining
  ◼ PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
◼ ACM Transactions on KDD starting in 2007
36
Conferences and Journals on Data Mining
◼ KDD conferences
  ◼ ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  ◼ SIAM Data Mining Conf. (SDM)
  ◼ (IEEE) Int. Conf. on Data Mining (ICDM)
  ◼ European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
  ◼ Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
  ◼ Int. Conf. on Web Search and Data Mining (WSDM)
◼ Other related conferences
  ◼ DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
  ◼ Web and IR conferences: WWW, SIGIR, WSDM
  ◼ ML conferences: ICML, NIPS
  ◼ PR conferences: CVPR, etc.
◼ Journals
  ◼ Data Mining and Knowledge Discovery (DAMI or DMKD)
  ◼ IEEE Trans. on Knowledge and Data Eng. (TKDE)
  ◼ KDD Explorations
  ◼ ACM Trans. on KDD
37
Where to Find References? DBLP, CiteSeer, Google
◼ Data mining and KDD (SIGKDD: CDROM)
  ◼ Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  ◼ Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
◼ Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
  ◼ Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  ◼ Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
◼ AI & Machine Learning
  ◼ Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  ◼ Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
◼ Web and IR
  ◼ Conferences: SIGIR, WWW, CIKM, etc.
  ◼ Journals: WWW: Internet and Web Information Systems, etc.
◼ Statistics
  ◼ Conferences: Joint Stat. Meeting, etc.
  ◼ Journals: Annals of Statistics, etc.
◼ Visualization
  ◼ Conference proceedings: CHI, ACM-SIGGraph, etc.
  ◼ Journals: IEEE Trans. Visualization and Computer Graphics, etc.
38
Chapter 1. Introduction
◼
Why Data Mining?
◼
What Is Data Mining?
◼
A Multi-Dimensional View of Data Mining
◼
What Kind of Data Can Be Mined?
◼
What Kinds of Patterns Can Be Mined?
◼
What Technologies Are Used?
◼
What Kind of Applications Are Targeted?
◼
Major Issues in Data Mining
◼
A Brief History of Data Mining and Data Mining Society
◼
Summary
39
Summary
◼
◼
◼
◼
◼
Data mining: Discovering interesting patterns and knowledge from
massive amount of data
A natural evolution of database technology, in great demand, with
wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
◼
Data mining technologies and applications
◼
Major issues in data mining
40
Recommended Reference Books
◼
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan
Kaufmann, 2002
◼
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000
◼
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
◼
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and
Data Mining. AAAI/MIT Press, 1996
◼
U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge
Discovery, Morgan Kaufmann, 2001
◼
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011
◼
D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
◼
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd ed., Springer-Verlag, 2009
◼
B. Liu, Web Data Mining, Springer 2006.
◼
T. M. Mitchell, Machine Learning, McGraw Hill, 1997
◼
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
◼
P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
◼
S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
◼
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd ed. 2005
41
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 2 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights reserved.
42
Chapter 2: Getting to Know Your
Data
◼
Data Objects and Attribute Types
◼
Basic Statistical Descriptions of Data
◼
Data Visualization
◼
Measuring Data Similarity and Dissimilarity
◼
Summary
43
44
Types of Data Sets
◼ Record
  ◼ Relational records
  ◼ Data matrix, e.g., numerical matrix, crosstabs
  ◼ Document data: text documents represented as term-frequency vectors
  ◼ Transaction data
◼ Graph and network
  ◼ World Wide Web
  ◼ Social or information networks
  ◼ Molecular structures
◼ Ordered
  ◼ Video data: sequence of images
  ◼ Temporal data: time-series
  ◼ Sequential data: transaction sequences
  ◼ Genetic sequence data
◼ Spatial, image and multimedia
  ◼ Spatial data: maps
  ◼ Image data
  ◼ Video data

Example document-term matrix (term frequencies; columns: play, ball, score, game, win, lost, timeout, season, coach, team):
  Document 1:  3  0  5  0  2  6  0  2  0  2
  Document 2:  0  7  0  2  1  0  0  3  0  0
  Document 3:  0  1  0  0  1  2  2  0  3  0

Example transaction data:
  TID | Items
  1   | Bread, Coke, Milk
  2   | Beer, Bread
  3   | Beer, Coke, Diaper, Milk
  4   | Beer, Bread, Diaper, Milk
  5   | Coke, Diaper, Milk
Important Characteristics of Structured
Data
◼ Dimensionality
  ◼ Curse of dimensionality
◼ Sparsity
  ◼ Only presence counts
◼ Resolution
  ◼ Patterns depend on the scale
◼ Distribution
  ◼ Centrality and dispersion
45
Data Objects
◼
Data sets are made up of data objects.
◼
A data object represents an entity.
◼
Examples:
◼
◼
sales database: customers, store items, sales
◼
medical database: patients, treatments
◼
university database: students, professors, courses
Also called samples , examples, instances, data points,
objects, tuples.
◼
Data objects are described by attributes.
◼
Database rows -> data objects; columns ->attributes.
46
Attributes
◼
Attribute (or dimensions, features, variables):
a data field, representing a characteristic or feature
of a data object.
◼
◼
E.g., customer _ID, name, address
Types:
◼ Nominal
◼ Binary
◼ Numeric: quantitative
◼ Interval-scaled
◼ Ratio-scaled
47
Attribute Types
◼
◼
◼
Nominal: categories, states, or “names of things”
◼
Hair_color = {auburn, black, blond, brown, grey, red, white}
◼
marital status, occupation, ID numbers, zip codes
Binary
◼
Nominal attribute with only 2 states (0 and 1)
◼
Symmetric binary: both outcomes equally important
◼
e.g., gender
◼
Asymmetric binary: outcomes not equally important.
◼
e.g., medical test (positive vs. negative)
◼
Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
◼
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
◼
Size = {small, medium, large}, grades, army rankings
48
Numeric Attribute Types
◼
◼
◼
Quantity (integer or real-valued)
Interval
◼
Measured on a scale of equal-sized units
◼
Values have order
◼
E.g., temperature in C˚or F˚, calendar dates
◼
No true zero-point
Ratio
◼
Inherent zero-point
◼
We can speak of values as being an order of
magnitude larger than the unit of measurement
(10 K˚ is twice as high as 5 K˚).
◼
e.g., temperature in Kelvin, length, counts,
monetary quantities
49
Discrete vs. Continuous Attributes
◼
◼
Discrete Attribute
◼ Has only a finite or countably infinite set of values
◼ E.g., zip codes, profession, or the set of words in a
collection of documents
◼ Sometimes, represented as integer variables
◼ Note: Binary attributes are a special case of discrete
attributes
Continuous Attribute
◼ Has real numbers as attribute values
◼ E.g., temperature, height, or weight
◼ Practically, real values can only be measured and
represented using a finite number of digits
◼ Continuous attributes are typically represented as
floating-point variables
50
Chapter 2: Getting to Know Your
Data
◼
Data Objects and Attribute Types
◼
Basic Statistical Descriptions of Data
◼
Data Visualization
◼
Measuring Data Similarity and Dissimilarity
◼
Summary
51
Basic Statistical Descriptions of Data
◼
◼
◼
◼
Motivation
◼ To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
◼ median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
◼ Data dispersion: analyzed with multiple granularities
of precision
◼ Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
◼ Folding measures into numerical dimensions
◼ Boxplot or quantile analysis on the transformed cube
52
53
Measuring the Central Tendency
◼ Mean (algebraic measure) (sample vs. population), where n is the sample size and N the population size:

      x̄ = (1/n) Σ_{i=1..n} x_i        μ = (Σ x) / N

  ◼ Weighted arithmetic mean:  x̄ = (Σ_{i=1..n} w_i x_i) / (Σ_{i=1..n} w_i)
  ◼ Trimmed mean: chopping extreme values
◼ Median
  ◼ Middle value if odd number of values, or average of the middle two values otherwise
  ◼ Estimated by interpolation (for grouped data):  median = L1 + ((n/2 − (Σ freq)_l) / freq_median) × width
◼ Mode
  ◼ Value that occurs most frequently in the data
  ◼ Unimodal, bimodal, trimodal
  ◼ Empirical formula:  mean − mode ≈ 3 × (mean − median)
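A minimal sketch of these central-tendency measures using Python's standard statistics module; the data values are illustrative, not from the slides.

```python
# Sketch: mean, median, mode(s) and a weighted mean for a small sample.
import statistics

x = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(x)          # arithmetic mean
median = statistics.median(x)      # average of the two middle values (even n)
modes = statistics.multimode(x)    # [52, 70] -> bimodal (Python 3.8+)

# Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)
w = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3]
weighted_mean = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

print(mean, median, modes, weighted_mean)
```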
54
Symmetric vs. Skewed Data
◼ Median, mean and mode of symmetric, positively skewed, and negatively skewed data
[Figure: three distributions labeled positively skewed, symmetric, and negatively skewed]
Measuring the Dispersion of
Data
55
◼ Quartiles, outliers and boxplots
  ◼ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  ◼ Inter-quartile range: IQR = Q3 − Q1
  ◼ Five-number summary: min, Q1, median, Q3, max
  ◼ Boxplot: ends of the box are the quartiles; the median is marked; add whiskers, and plot outliers individually
  ◼ Outlier: usually, a value more than 1.5 × IQR below Q1 or above Q3
◼ Variance and standard deviation (sample: s, population: σ)
  ◼ Variance (algebraic, scalable computation):

      s² = (1/(n−1)) Σ_{i=1..n} (x_i − x̄)²  =  (1/(n−1)) [ Σ_{i=1..n} x_i² − (1/n)(Σ_{i=1..n} x_i)² ]
      σ² = (1/N) Σ_{i=1..n} (x_i − μ)²  =  (1/N) Σ_{i=1..n} x_i²  −  μ²

  ◼ Standard deviation s (or σ) is the square root of the variance s² (or σ²)
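A short sketch of these dispersion measures; NumPy is assumed to be available and the data values are illustrative.

```python
# Sketch: five-number summary, IQR, 1.5*IQR outlier rule, and sample/population variance.
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110], dtype=float)

q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
five_number = (x.min(), q1, med, q3, x.max())

sample_var = x.var(ddof=1)        # s^2, divides by n-1
population_var = x.var(ddof=0)    # sigma^2, divides by N
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print(five_number, iqr, sample_var ** 0.5, population_var ** 0.5, outliers)
```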
56
Boxplot Analysis
◼
Five-number summary of a distribution
◼
◼
Minimum, Q1, Median, Q3, Maximum
Boxplot
◼
◼
◼
◼
◼
Data is represented with a box
The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
The median is marked by a line within the
box
Whiskers: two lines outside the box extended
to Minimum and Maximum
Outliers: points beyond a specified outlier
threshold, plotted individually
Visualization of Data Dispersion: 3-D
Boxplots
57
58
Properties of Normal Distribution Curve
◼
The normal (distribution) curve
◼ From μ–σ to μ+σ: contains about 68% of the
measurements (μ: mean, σ: standard deviation)
◼
From μ–2σ to μ+2σ: contains about 95% of it
◼ From μ–3σ to μ+3σ: contains about 99.7% of it
Graphic Displays of Basic Statistical
Descriptions
◼
Boxplot: graphic display of five-number summary
◼
Histogram: x-axis shows values, y-axis shows frequencies
◼
Quantile plot: each value x_i is paired with f_i, indicating that approximately 100·f_i% of the data are ≤ x_i
◼
Quantile-quantile (q-q) plot: graphs the quantiles of
one univariant distribution against the corresponding
quantiles of another
◼
Scatter plot: each pair of values is a pair of coordinates
and plotted as points in the plane
59
Histogram Analysis
◼ Histogram: graph display of tabulated frequencies, shown as bars
◼ It shows what proportion of cases fall into each of several categories
◼ Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts; a crucial distinction when the categories are not of uniform width
◼ The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: example histogram over intervals from 10000 to 90000]
60
Histograms Often Tell More than Boxplots
◼ The two histograms shown on the left may have the same boxplot representation
  ◼ The same values for: min, Q1, median, Q3, max
◼ But they have rather different data distributions
61
Quantile Plot
◼
◼
Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
Plots quantile information
  ◼ For data x_i sorted in increasing order, f_i indicates that approximately 100·f_i% of the data are below or equal to the value x_i
62
Quantile-Quantile (Q-Q) Plot
◼
◼
◼
Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
View: Is there a shift in going from one distribution to another?
Example shows unit price of items sold at Branch 1 vs. Branch 2 for
each quantile. Unit prices of items sold at Branch 1 tend to be lower
than those at Branch 2.
63
Scatter plot
◼
◼
Provides a first look at bivariate data to see clusters of
points, outliers, etc
Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
64
Positively and Negatively Correlated Data
◼
The left half fragment is positively
correlated
◼
The right half is negatively correlated
65
Uncorrelated Data
66
Chapter 2: Getting to Know Your
Data
◼
Data Objects and Attribute Types
◼
Basic Statistical Descriptions of Data
◼
Data Visualization
◼
Measuring Data Similarity and Dissimilarity
◼
Summary
67
Data Visualization
◼
Why data visualization?
◼
◼
◼
◼
◼
◼
Gain insight into an information space by mapping data onto graphical
primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure, irregularities, relationships among
data
Help find interesting regions and suitable parameters for further
quantitative analysis
Provide a visual proof of computer representations derived
Categorization of visualization methods:
◼
Pixel-oriented visualization techniques
◼
Geometric projection visualization techniques
◼
Icon-based visualization techniques
◼
Hierarchical visualization techniques
◼
Visualizing complex data and relations
68
Pixel-Oriented Visualization
Techniques
◼
◼
◼
For a data set of m dimensions, create m windows on the screen, one
for each dimension
The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
The colors of the pixels reflect the corresponding values
[Figure: pixel-oriented visualization windows for four dimensions: (a) income, (b) credit limit, (c) transaction volume, (d) age]
69
Laying Out Pixels in Circle Segments
◼
To save space and show the connections among multiple dimensions,
space filling is often done in a circle segment
(a) Representing a data record
in circle segment
(b) Laying out pixels in circle segment
70
Geometric Projection Visualization
Techniques
◼
◼
Visualization of geometric transformations and projections
of the data
Methods
◼
Direct visualization
◼
Scatterplot and scatterplot matrices
◼
Landscapes
◼
Projection pursuit technique: Help users find meaningful
projections of multidimensional data
◼
Prosection views
◼
Hyperslice
◼
Parallel coordinates
71
Direct Data Visualization
[Figure: "Ribbons with Twists Based on Vorticity" (used by permission of M. Ward, Worcester Polytechnic Institute)]
72
Scatterplot Matrices
Matrix of scatterplots (x-y diagrams) of the k-dim. data [total of (k² − k)/2 distinct pairwise scatterplots]
73
Used by permission of B. Wright, Visible Decisions Inc.
Landscapes
◼
◼
news articles
visualized as
a landscape
Visualization of the data as perspective landscape
The data needs to be transformed into a (possibly artificial) 2D
spatial representation which preserves the characteristics of the data
74
Parallel Coordinates
◼
◼
◼
n equidistant axes which are parallel to one of the screen axes and
correspond to the attributes
The axes are scaled to the [minimum, maximum]: range of the
corresponding attribute
Every data item corresponds to a polygonal line which intersects each
of the axes at the point which corresponds to the value for the
attribute
• • •
Attr. 1
Attr. 2
Attr. 3
Attr. k
75
Parallel Coordinates of a Data Set
76
Icon-Based Visualization Techniques
◼
Visualization of the data values as features of icons
◼
Typical visualization methods
◼
◼
Chernoff Faces
◼
Stick Figures
General techniques
◼
◼
◼
Shape coding: Use shape to represent certain
information encoding
Color icons: Use color icons to encode more information
Tile bars: Use small icons to represent the relevant
feature vectors in document retrieval
77
Chernoff Faces
◼
◼
◼
◼
78
A way to display variables on a two-dimensional surface, e.g., let x be
eyebrow slant, y be eye size, z be nose length, etc.
The figure shows faces produced using 10 characteristics--head
eccentricity, eye size, eye spacing, eye eccentricity, pupil size,
eyebrow slant, nose size, mouth shape, mouth size, and mouth
opening): Each assigned one of 10 possible values, generated using
Mathematica (S. Dickson)
REFERENCE: Gonick, L. and Smith, W. The
Cartoon Guide to Statistics. New York:
Harper Perennial, p. 212, 1993
Weisstein, Eric W. "Chernoff Face." From
MathWorld--A Wolfram Web Resource.
mathworld.wolfram.com/ChernoffFace.html
Stick Figure
◼ A census data figure showing age, income, gender, education, etc.
◼ A 5-piece stick figure (1 body and 4 limbs with different angle/length)
◼ Two attributes are mapped to the display axes and the remaining attributes are mapped to the angle or length of the limbs; look at the texture pattern
79
Hierarchical Visualization Techniques
◼
◼
Visualization of the data using a hierarchical
partitioning into subspaces
Methods
◼
Dimensional Stacking
◼
Worlds-within-Worlds
◼
Tree-Map
◼
Cone Trees
◼
InfoCube
80
Dimensional Stacking
[Figure: attribute 1 and attribute 2 on the outer axes, attribute 3 and attribute 4 nested inside]
◼ Partitioning of the n-dimensional attribute space into 2-D subspaces, which are 'stacked' into each other
◼ Partitioning of the attribute value ranges into classes; the important attributes should be used on the outer levels
◼ Adequate for data with ordinal attributes of low cardinality
◼ But difficult to display more than nine dimensions
◼ Important to map dimensions appropriately
81
Dimensional Stacking
Used by permission of M. Ward, Worcester Polytechnic Institute
Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
82
Worlds-within-Worlds
Assign the function and two most important parameters to innermost
world
◼
Fix all other parameters at constant values - draw other (1 or 2 or 3
dimensional worlds choosing these as the axes)
◼
Software that uses this paradigm
◼
◼
◼
83
N–vision: Dynamic
interaction through data
glove and stereo
displays, including
rotation, scaling (inner)
and translation
(inner/outer)
Auto Visual: Static
interaction by means of
queries
Tree-Map
◼
◼
Screen-filling method which uses a hierarchical partitioning
of the screen into regions depending on the attribute values
The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
MSR Netscan Image
Ack.: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg
84
Tree-Map of a File System (Shneiderman)
85
InfoCube
◼
◼
A 3-D visualization technique where hierarchical
information is displayed as nested semi-transparent
cubes
The outermost cubes correspond to the top level
data, while the subnodes or the lower level data
are represented as smaller cubes inside the
outermost cubes, and so on
86
Three-D Cone Trees
◼
3D cone tree visualization technique works
well for up to a thousand nodes or so
◼
◼
◼
◼
First build a 2D circle tree that arranges its
nodes in concentric circles centered on the
root node
Cannot avoid overlaps when projected to
2D
G. Robertson, J. Mackinlay, S. Card. “Cone
Trees: Animated 3D Visualizations of
Hierarchical Information”, ACM SIGCHI'91
Graph from Nadeau Software Consulting
website: Visualize a social network data set
that models the way an infection spreads
from one person to the next
Ack.: http://nadeausoftware.com/articles/visualization
87
Visualizing Complex Data and Relations
◼
◼
Visualizing non-numerical data: text and social networks
Tag cloud: visualizing user-generated tags
The importance of
tag is represented
by font size/color
Besides text data,
there are also
methods to visualize
relationships, such as
visualizing social
networks
◼
◼
Newsmap: Google News Stories in 2005
Chapter 2: Getting to Know Your
Data
◼
Data Objects and Attribute Types
◼
Basic Statistical Descriptions of Data
◼
Data Visualization
◼
Measuring Data Similarity and Dissimilarity
◼
Summary
89
Similarity and Dissimilarity
◼
◼
◼
Similarity
◼ Numerical measure of how alike two data objects are
◼ Value is higher when objects are more alike
◼ Often falls in the range [0,1]
Dissimilarity (e.g., distance)
◼ Numerical measure of how different two data objects
are
◼ Lower when objects are more alike
◼ Minimum dissimilarity is often 0
◼ Upper limit varies
Proximity refers to a similarity or dissimilarity
90
Data Matrix and Dissimilarity
Matrix
◼ Data matrix
  ◼ n data points with p dimensions
  ◼ Two modes

      [ x_11  ...  x_1f  ...  x_1p ]
      [  ...  ...   ...  ...   ... ]
      [ x_i1  ...  x_if  ...  x_ip ]
      [  ...  ...   ...  ...   ... ]
      [ x_n1  ...  x_nf  ...  x_np ]

◼ Dissimilarity matrix
  ◼ n data points, but registers only the distance
  ◼ A triangular matrix
  ◼ Single mode

      [    0                           ]
      [ d(2,1)     0                   ]
      [ d(3,1)  d(3,2)    0            ]
      [    :       :      :            ]
      [ d(n,1)  d(n,2)   ...      0    ]
91
Proximity Measure for Nominal Attributes
◼
◼
Can take 2 or more states, e.g., red, yellow, blue,
green (generalization of a binary attribute)
Method 1: Simple matching
  ◼ m: # of matches, p: total # of variables

      d(i, j) = (p − m) / p

◼ Method 2: Use a large number of binary attributes
  ◼ Creating a new binary attribute for each of the M nominal states
92
93
Proximity Measure for Binary Attributes
◼ A contingency table for binary data (object i vs. object j), where q counts the attributes that are 1 in both objects, r those that are 1 in i and 0 in j, s those that are 0 in i and 1 in j, and t those that are 0 in both:

                 object j
                  1    0
  object i   1    q    r
             0    s    t

◼ Distance measure for symmetric binary variables:   d(i, j) = (r + s) / (q + r + s + t)
◼ Distance measure for asymmetric binary variables:  d(i, j) = (r + s) / (q + r + s)
◼ Jaccard coefficient (similarity measure for asymmetric binary variables):  sim_Jaccard(i, j) = q / (q + r + s)
◼ Note: the Jaccard coefficient is the same as "coherence"
Dissimilarity between Binary
Variables
◼ Example

  Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack   M       Y      N      P       N       N       N
  Mary   F       Y      N      P       N       P       N
  Jim    M       Y      P      N       N       N       N

◼ Gender is a symmetric attribute
◼ The remaining attributes are asymmetric binary
◼ Let the values Y and P be 1, and the value N be 0

  d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
94
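The example above can be checked with a small sketch of the asymmetric binary dissimilarity d(i, j) = (r + s)/(q + r + s); the 0/1 encodings below follow the slide's Y/P → 1, N → 0 convention.

```python
# Sketch verifying the Jack/Mary/Jim example (asymmetric binary dissimilarity).
def asym_binary_dissim(a, b):
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # positive matches
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Asymmetric attributes Fever, Cough, Test-1..Test-4 with Y/P -> 1, N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```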
95
Standardizing Numeric Data
◼ Z-score:  z = (x − μ) / σ
  ◼ X: raw score to be standardized, μ: mean of the population, σ: standard deviation
  ◼ The distance between the raw score and the population mean, in units of the standard deviation
  ◼ Negative when the raw score is below the mean, "+" when above
◼ An alternative way: calculate the mean absolute deviation

      s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|),   where m_f = (1/n)(x_1f + x_2f + ... + x_nf)

  ◼ Standardized measure (z-score):  z_if = (x_if − m_f) / s_f
◼ Using the mean absolute deviation is more robust than using the standard deviation
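A minimal sketch of the two standardization variants above (standard-deviation z-score vs. the mean-absolute-deviation z-score); the data values are illustrative.

```python
# Sketch: z-score standardization with either the population std dev or the mean absolute deviation.
def standardize(values, robust=False):
    n = len(values)
    m = sum(values) / n                                          # m_f
    if robust:
        spread = sum(abs(v - m) for v in values) / n             # mean absolute deviation s_f
    else:
        spread = (sum((v - m) ** 2 for v in values) / n) ** 0.5  # population standard deviation
    return [(v - m) / spread for v in values]

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(standardize(data)[:3])
print(standardize(data, robust=True)[:3])
```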
Example: Data Matrix and Dissimilarity Matrix
◼ Data matrix

  point  attribute1  attribute2
  x1     1           2
  x2     3           5
  x3     2           0
  x4     4           5

◼ Dissimilarity matrix (with Euclidean distance)

        x1     x2     x3     x4
  x1    0
  x2    3.61   0
  x3    2.24   5.1    0
  x4    4.24   1      5.39   0
96
Distance on Numeric Data: Minkowski
Distance
◼ Minkowski distance: a popular distance measure

      d(i, j) = ( |x_i1 − x_j1|^h + |x_i2 − x_j2|^h + ... + |x_ip − x_jp|^h )^(1/h)

  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
◼ Properties
  ◼ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  ◼ d(i, j) = d(j, i) (symmetry)
  ◼ d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
◼ A distance that satisfies these properties is a metric
97
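A short sketch of the Minkowski distance and its limiting cases, using the points x1 = (1, 2) and x2 = (3, 5) from the running example.

```python
# Sketch: Minkowski distance of order h; h=1 is Manhattan, h=2 Euclidean, h->inf supremum.
def minkowski(p1, p2, h):
    diffs = [abs(a - b) for a, b in zip(p1, p2)]
    if h == float("inf"):
        return max(diffs)                        # L_max / supremum norm
    return sum(d ** h for d in diffs) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))             # 5    (Manhattan)
print(round(minkowski(x1, x2, 2), 2))   # 3.61 (Euclidean)
print(minkowski(x1, x2, float("inf")))  # 3    (supremum)
```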
98
Special Cases of Minkowski Distance
◼ h = 1: Manhattan (city block, L1 norm) distance
  ◼ E.g., the Hamming distance: the number of bits that are different between two binary vectors

      d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|

◼ h = 2: Euclidean (L2 norm) distance

      d(i, j) = sqrt( |x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|² )

◼ h → ∞: "supremum" (Lmax norm, L∞ norm) distance
  ◼ This is the maximum difference between any component (attribute) of the vectors
Example: Minkowski Distance
Dissimilarity matrices for the points x1 = (1, 2), x2 = (3, 5), x3 = (2, 0), x4 = (4, 5):

  point  attribute 1  attribute 2
  x1     1            2
  x2     3            5
  x3     2            0
  x4     4            5

Manhattan (L1)
        x1    x2    x3    x4
  x1    0
  x2    5     0
  x3    3     6     0
  x4    6     1     7     0

Euclidean (L2)
        x1     x2     x3     x4
  x1    0
  x2    3.61   0
  x3    2.24   5.1    0
  x4    4.24   1      5.39   0

Supremum (L∞)
        x1    x2    x3    x4
  x1    0
  x2    3     0
  x3    2     5     0
  x4    3     1     5     0
99
Ordinal Variables
◼
An ordinal variable can be discrete or continuous
◼
Order is important, e.g., rank
◼
Can be treated like interval-scaled
  ◼ Replace x_if by its rank r_if ∈ {1, …, M_f}
  ◼ Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

        z_if = (r_if − 1) / (M_f − 1)

  ◼ Compute the dissimilarity using methods for interval-scaled variables
100
101
Attributes of Mixed Type
◼
◼
A database may contain all attribute types
◼ Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
One may use a weighted formula to combine their effects:

      d(i, j) = ( Σ_{f=1..p} δ_ij^(f) d_ij^(f) ) / ( Σ_{f=1..p} δ_ij^(f) )

◼ f is binary or nominal:  d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
◼ f is numeric: use the normalized distance
◼ f is ordinal
  ◼ Compute ranks r_if and z_if = (r_if − 1) / (M_f − 1)
  ◼ Treat z_if as interval-scaled
Cosine Similarity
◼
◼
◼
◼
A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
Other vector objects: gene features in micro-arrays, …
Applications: information retrieval, biologic taxonomy, gene feature
mapping, ...
Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency
vectors), then
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d
102
Example: Cosine Similarity
◼
◼
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates the vector dot product and ||d|| is the length of vector d
Ex: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5
= 6.481
||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5
= 4.12
cos(d1, d2 ) = 0.94
103
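The cosine example above can be reproduced with a few lines of plain Python.

```python
# Sketch verifying the example: cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
import math

d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(a * b for a, b in zip(d1, d2))     # 25
norm1 = math.sqrt(sum(a * a for a in d1))    # sqrt(42) ~= 6.48
norm2 = math.sqrt(sum(b * b for b in d2))    # sqrt(17) ~= 4.12
print(round(dot / (norm1 * norm2), 2))       # 0.94
```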
Chapter 2: Getting to Know Your
Data
◼
Data Objects and Attribute Types
◼
Basic Statistical Descriptions of Data
◼
Data Visualization
◼
Measuring Data Similarity and Dissimilarity
◼
Summary
104
Summary
◼
Data attribute types: nominal, binary, ordinal, interval-scaled, ratioscaled
◼
Many types of data sets, e.g., numerical, text, graph, Web, image.
◼
Gain insight into the data by:
◼
Basic statistical data description: central tendency, dispersion,
graphical displays
◼
Data visualization: map data onto graphical primitives
◼
Measure data similarity
◼
Above steps are the beginning of data preprocessing.
◼
Many methods have been developed but still an active area of research.
105
References
◼
W. Cleveland, Visualizing Data, Hobart Press, 1993
◼
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
◼
U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and
Knowledge Discovery, Morgan Kaufmann, 2001
◼
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons, 1990.
◼
H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Tech.
Committee on Data Eng., 20(4), Dec. 1997
◼
D. A. Keim. Information visualization and visual data mining, IEEE trans. on
Visualization and Computer Graphics, 8(1), 2002
◼
D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
◼
S. Santini and R. Jain,” Similarity measures”, IEEE Trans. on Pattern Analysis and
Machine Intelligence, 21(9), 1999
◼
E. R. Tufte. The Visual Display of Quantitative Information, 2nd ed., Graphics Press,
2001
◼
C. Yu , et al., Visual data mining of multimedia data for social and behavioral studies,
Information Visualization, 8(1), 2009
106
Data Mining:
Concepts and
Techniques
(3rd ed.)
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
107
Chapter 3: Data Preprocessing
◼
Data Preprocessing: An Overview
◼
Data Quality
◼
Major Tasks in Data Preprocessing
◼
Data Cleaning
◼
Data Integration
◼
Data Reduction
◼
Data Transformation and Data Discretization
◼
Summary
108
Data Quality: Why Preprocess the Data?
◼
Measures for data quality: A multidimensional view
◼
Accuracy: correct or wrong, accurate or not
◼
Completeness: not recorded, unavailable, …
◼
Consistency: some modified but some not, dangling, …
◼
Timeliness: timely update?
◼
Believability: how much can the data be trusted to be correct?
◼
Interpretability: how easily the data can be
understood?
109
Major Tasks in Data Preprocessing
◼
Data cleaning
◼
◼
Data integration
◼
◼
◼
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Integration of multiple databases, data cubes, or files
Data reduction
◼
Dimensionality reduction
◼
Numerosity reduction
◼
Data compression
Data transformation and data discretization
◼
Normalization
◼
Concept hierarchy generation
110
Chapter 3: Data Preprocessing
◼
Data Preprocessing: An Overview
◼
Data Quality
◼
Major Tasks in Data Preprocessing
◼
Data Cleaning
◼
Data Integration
◼
Data Reduction
◼
Data Transformation and Data Discretization
◼
Summary
111
Data Cleaning
◼
Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
◼
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
◼
◼
noisy: containing noise, errors, or outliers
◼
◼
◼
e.g., Occupation=“ ” (missing data)
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
◼
Age=“42”, Birthday=“03/07/2010”
◼
Was rating “1, 2, 3”, now rating “A, B, C”
◼
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
◼
Jan. 1 as everyone’s birthday?
112
Incomplete (Missing) Data
◼
Data is not always available
◼
◼
Missing data may be due to
◼
equipment malfunction
◼
inconsistent with other recorded data and thus deleted
◼
data not entered due to misunderstanding
◼
◼
◼
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
certain data may not be considered important at the
time of entry
not register history or changes of the data
Missing data may need to be inferred
113
How to Handle Missing Data?
◼
Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
◼
Fill in the missing value manually: tedious + infeasible?
◼
Fill in it automatically with
◼
a global constant : e.g., “unknown”, a new class?!
◼
the attribute mean
◼
◼
the attribute mean for all samples belonging to the
same class: smarter
the most probable value: inference-based such as
Bayesian formula or decision tree
114
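A hedged sketch of the automatic filling strategies above using pandas (the library, column names, and class labels are assumptions for illustration, not part of the slides).

```python
# Sketch: fill missing values with a global constant, the attribute mean,
# or the attribute mean of samples in the same class.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [50.0, np.nan, 70.0, np.nan, 30.0, 40.0],
    "class":  ["yes", "yes", "no", "no", "yes", "no"],
})

df["income_const"] = df["income"].fillna(-1)                       # global constant
df["income_mean"] = df["income"].fillna(df["income"].mean())       # overall attribute mean
df["income_class_mean"] = df["income"].fillna(                     # per-class mean (smarter)
    df.groupby("class")["income"].transform("mean")
)
print(df)
```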
Noisy Data
◼
◼
◼
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
◼ faulty data collection instruments
◼ data entry problems
◼ data transmission problems
◼ technology limitation
◼ inconsistency in naming convention
Other data problems which require data cleaning
◼ duplicate records
◼ incomplete data
◼ inconsistent data
115
How to Handle Noisy Data?
◼
◼
◼
◼
Binning
◼ first sort data and partition into (equal-frequency) bins
◼ then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Regression
◼ smooth by fitting the data into regression functions
Clustering
◼ detect and remove outliers
Combined computer and human inspection
◼ detect suspicious values and check by human (e.g.,
deal with possible outliers)
116
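A minimal sketch of the binning idea above: equal-frequency bins, then smoothing by bin means or by bin boundaries. The sorted values are illustrative.

```python
# Sketch: equal-frequency binning and smoothing by bin means / bin boundaries.
sorted_data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
n_bins = 3
size = len(sorted_data) // n_bins
bins = [sorted_data[i * size:(i + 1) * size] for i in range(n_bins)]

smoothed_by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]
smoothed_by_boundaries = [
    [min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b] for b in bins
]
print(bins)                    # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed_by_means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smoothed_by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```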
Data Cleaning as a Process
◼
◼
◼
Data discrepancy detection
◼ Use metadata (e.g., domain, range, dependency, distribution)
◼ Check field overloading
◼ Check uniqueness rule, consecutive rule and null rule
◼ Use commercial tools
◼ Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
◼ Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and clustering
to find outliers)
Data migration and integration
◼ Data migration tools: allow transformations to be specified
◼ ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
Integration of the two processes
◼ Iterative and interactive (e.g., Potter's Wheel)
117
Chapter 3: Data Preprocessing
◼
Data Preprocessing: An Overview
◼
Data Quality
◼
Major Tasks in Data Preprocessing
◼
Data Cleaning
◼
Data Integration
◼
Data Reduction
◼
Data Transformation and Data Discretization
◼
Summary
118
Data Integration
◼
Data integration:
◼
◼
Schema integration: e.g., A.cust-id  B.cust-#
◼
◼
Combines data from multiple sources into a coherent store
Integrate metadata from different sources
Entity identification problem:
◼
Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
◼
Detecting and resolving data value conflicts
◼
For the same real world entity, attribute values from different
sources are different
◼
Possible reasons: different representations, different scales, e.g.,
metric vs. British units
119
Handling Redundancy in Data Integration
◼
Redundant data occur often when integration of multiple
databases
◼
Object identification: The same attribute or object
may have different names in different databases
◼
Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
◼
◼
Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
120
Correlation Analysis (Nominal Data)
◼
χ² (chi-square) test:

      χ² = Σ (Observed − Expected)² / Expected

◼ The larger the χ² value, the more likely the variables are related
◼ The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
◼ Correlation does not imply causality
◼
# of hospitals and # of car-theft in a city are correlated
◼
Both are causally linked to the third variable: population
121
Chi-Square Calculation: An Example
◼ Contingency table (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

                             Play chess   Not play chess   Sum (row)
  Like science fiction       250 (90)     200 (360)        450
  Not like science fiction   50 (210)     1000 (840)       1050
  Sum (col.)                 300          1200             1500

◼ χ² (chi-square) calculation:

      χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
◼
It shows that like_science_fiction and play_chess are
correlated in the group
122
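The calculation above can be verified with a few lines of plain Python that derive the expected counts from the row and column totals.

```python
# Sketch verifying the chi-square example (expected counts shown in parentheses above).
observed = [[250, 200],   # like science fiction:  play chess / not play chess
            [50, 1000]]   # not like science fiction

row_sums = [sum(r) for r in observed]         # 450, 1050
col_sums = [sum(c) for c in zip(*observed)]   # 300, 1200
total = sum(row_sums)                         # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_sums[i] * col_sums[j] / total   # 90, 360, 210, 840
        chi2 += (obs - expected) ** 2 / expected
print(round(chi2, 2))                                   # 507.93
```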
Correlation Analysis (Numeric Data)
◼ Correlation coefficient (also called Pearson's product-moment coefficient):

      r_A,B = Σ_{i=1..n} (a_i − Ā)(b_i − B̄) / ((n − 1) σ_A σ_B)  =  ( Σ_{i=1..n} (a_i b_i) − n Ā B̄ ) / ((n − 1) σ_A σ_B)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(a_i b_i) is the sum of the AB cross-product.
◼
◼
If rA,B > 0, A and B are positively correlated (A’s values
increase as B’s). The higher, the stronger correlation.
rA,B = 0: independent; rAB < 0: negatively correlated
123
Visually Evaluating Correlation
Scatter plots
showing the
similarity from
–1 to 1.
124
Correlation (viewed as linear
relationship)
◼
◼
Correlation measures the linear relationship
between objects
To compute correlation, we standardize data
objects, A and B, and then take their dot product
      a'_k = (a_k − mean(A)) / std(A)
      b'_k = (b_k − mean(B)) / std(B)
      correlation(A, B) = A' • B'
125
Covariance (Numeric Data)
◼ Covariance is similar to correlation:

      Cov(A, B) = E[(A − Ā)(B − B̄)] = (1/n) Σ_{i=1..n} (a_i − Ā)(b_i − B̄)

  Correlation coefficient:

      r_A,B = Cov(A, B) / (σ_A σ_B)

  where n is the number of tuples, Ā and B̄ are the respective mean or expected values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
◼ Positive covariance: if Cov_A,B > 0, then A and B both tend to be larger than their expected values.
◼ Negative covariance: if Cov_A,B < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
◼ Independence: if A and B are independent, Cov_A,B = 0, but the converse is not true:
  ◼ Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
126
Co-Variance: An Example
◼
It can be simplified in computation as  Cov(A, B) = E(A·B) − Ā·B̄
◼
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
◼
Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
◼
◼
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
◼
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
◼
Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
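The stock example can be checked directly with the simplified formula.

```python
# Sketch verifying the stock example: Cov(A, B) = E(A*B) - E(A)*E(B).
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)

mean_A = sum(A) / n                                    # 4
mean_B = sum(B) / n                                    # 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mean_A * mean_B
print(cov)                                             # 4.0 -> A and B tend to rise together
```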
Chapter 3: Data Preprocessing
◼
Data Preprocessing: An Overview
◼
Data Quality
◼
Major Tasks in Data Preprocessing
◼
Data Cleaning
◼
Data Integration
◼
Data Reduction
◼
Data Transformation and Data Discretization
◼
Summary
128
Data Reduction Strategies
◼
◼
◼
Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
Data reduction strategies
◼ Dimensionality reduction, e.g., remove unimportant attributes
◼ Wavelet transforms
◼ Principal Components Analysis (PCA)
◼ Feature subset selection, feature creation
◼ Numerosity reduction (some simply call it: Data Reduction)
◼ Regression and Log-Linear Models
◼ Histograms, clustering, sampling
◼ Data cube aggregation
◼ Data compression
129
Data Reduction 1: Dimensionality
Reduction
◼
Curse of dimensionality
◼
◼
◼
◼
◼
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
◼
Avoid the curse of dimensionality
◼
Help eliminate irrelevant features and reduce noise
◼
Reduce time and space required in data mining
◼
Allow easier visualization
Dimensionality reduction techniques
◼
Wavelet transforms
◼
Principal Component Analysis
◼
Supervised and nonlinear techniques (e.g., feature selection)
130
Mapping Data to a New Space
◼
◼
Fourier transform
Wavelet transform
[Figure: "Two Sine Waves" and "Two Sine Waves + Noise" signals with their frequency-domain representations]
131
What Is Wavelet Transform?
◼
Decomposes a signal into
different frequency subbands
◼
◼
◼
◼
Applicable to n-dimensional signals
Data are transformed to
preserve relative distance
between objects at different
levels of resolution
Allow natural clusters to
become more distinguishable
Used for image compression
132
Wavelet Transformation
Haar2
◼
◼
◼
◼
Discrete wavelet transform (DWT) for linear signal
processing, multi-resolution analysis
Daubechie4
Compressed approximation: store only a small fraction of
the strongest of the wavelet coefficients
Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
Method:
◼
Length, L, must be an integer power of 2 (padding with 0’s, when
necessary)
◼
Each transform has 2 functions: smoothing, difference
◼
Applies to pairs of data, resulting in two set of data of length L/2
◼
Applies two functions recursively, until reaches the desired length
133
Wavelet Decomposition
◼
◼
◼
Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [2¾, −1¼, ½, 0, 0, −1, −1, 0]
Compression: many small detail coefficients can be
replaced by 0’s, and only the significant coefficients are
retained
134
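The transform above can be reproduced with a short Haar-decomposition sketch (pairwise averages and differences, applied recursively).

```python
# Sketch: 1-D Haar decomposition reproducing the slide's example,
# S = [2, 2, 0, 2, 3, 5, 4, 4] -> S^ = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0].
def haar(signal):
    detail = []
    while len(signal) > 1:
        averages = [(a + b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
        details  = [(a - b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
        detail = details + detail       # finer-level coefficients go to the right
        signal = averages
    return signal + detail              # [overall average] + detail coefficients

print(haar([2, 2, 0, 2, 3, 5, 4, 4]))   # [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
```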
Haar Wavelet Coefficients
[Figure: hierarchical decomposition structure (a.k.a. "error tree") for the original frequency distribution [2, 2, 0, 2, 3, 5, 4, 4], with 2.75 at the root, −1.25 below it, then 0.5 and 0, and detail coefficients 0, −1, −1, 0 at the leaves; the right panel shows each coefficient's "supports" with + and − signs]
135
Why Wavelet Transform?
◼
◼
◼
◼
◼
Use hat-shape filters
◼ Emphasize region where points cluster
◼ Suppress weaker information in their boundaries
Effective removal of outliers
◼ Insensitive to noise, insensitive to input order
Multi-resolution
◼ Detect arbitrary shaped clusters at different scales
Efficient
◼ Complexity O(N)
Only applicable to low dimensional data
136
Principal Component Analysis (PCA)
◼
◼
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
[Figure: data points in the (x1, x2) plane with the first principal component axis e]
137
Principal Component Analysis (Steps)
◼
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
◼
Normalize input data: Each attribute falls within the same range
◼
Compute k orthonormal (unit) vectors, i.e., principal components
◼
◼
◼
◼
Each input data (vector) is a linear combination of the k principal
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
Works for numeric data only
138
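A minimal NumPy sketch (illustrative, not the book's notation) of the steps listed above: center the data, eigendecompose the covariance matrix, sort components by decreasing eigenvalue, and project:

```python
# Illustrative PCA sketch by eigen-decomposition of the covariance matrix.
import numpy as np

def pca(X, k):
    # 1. Normalize: center each attribute (z-scaling could also be used)
    Xc = X - X.mean(axis=0)
    # 2. Covariance matrix and its eigenvectors (the principal components)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 3. Sort components by decreasing eigenvalue ("significance")
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # 4. Project the data onto the k strongest components
    return Xc @ components

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca(X, k=1))             # each row reduced from 2 attributes to 1
```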
Attribute Subset Selection
◼
Another way to reduce dimensionality of data
◼
Redundant attributes
◼
◼
◼
Duplicate much or all of the information contained in
one or more other attributes
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant attributes
◼
◼
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
139
Heuristic Search in Attribute Selection
◼
◼
There are 2^d possible attribute combinations of d attributes
Typical heuristic attribute selection methods:
◼ Best single attribute under the attribute independence
assumption: choose by significance tests
◼ Best step-wise feature selection:
◼ The best single-attribute is picked first
◼ Then the next best attribute conditioned on the first, ...
◼ Step-wise attribute elimination:
◼ Repeatedly eliminate the worst attribute
◼ Best combined attribute selection and elimination
◼ Optimal branch and bound:
◼ Use attribute elimination and backtracking
140
Attribute Creation (Feature
Generation)
◼
◼
Create new attributes (features) that can capture the
important information in a data set more effectively than
the original ones
Three general methodologies
◼ Attribute extraction
◼ Domain-specific
◼ Mapping data to new space (see: data reduction)
◼ E.g., Fourier transformation, wavelet
transformation, manifold approaches (not covered)
◼ Attribute construction
◼ Combining features (see: discriminative frequent
patterns in Chapter 7)
◼ Data discretization
141
Data Reduction 2: Numerosity
Reduction
◼
◼
◼
Reduce data volume by choosing alternative, smaller
forms of data representation
Parametric methods (e.g., regression)
◼ Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
◼ Ex.: Log-linear models—obtain the value at a point in m-D space as the product of appropriate marginal subspaces
Non-parametric methods
◼ Do not assume models
◼ Major families: histograms, clustering, sampling, …
142
Parametric Data Reduction:
Regression and Log-Linear Models
◼
◼
◼
Linear regression
◼ Data modeled to fit a straight line
◼ Often uses the least-square method to fit the line
Multiple regression
◼ Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
Log-linear model
◼ Approximates discrete multidimensional probability
distributions
143
y
Regression Analysis
Y1
◼
Regression analysis: A collective name for
techniques for the modeling and analysis
Y1’
y=x+1
of numerical data consisting of values of a
dependent variable (also called
response variable or measurement) and
of one or more independent variables (aka.
explanatory variables or predictors)
◼
◼
The parameters are estimated so as to give
a "best fit" of the data
◼
Most commonly the best fit is evaluated by
using the least squares method, but
other criteria have also been used
X1
x
Used for prediction
(including forecasting of
time-series data), inference,
hypothesis testing, and
modeling of causal
relationships
144
Regress Analysis and Log-Linear
Models
◼
Linear regression: Y = w X + b
◼
◼
Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
Using the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
◼
Multiple regression: Y = b0 + b1 X1 + b2 X2
◼
◼
Many nonlinear functions can be transformed into the above
Log-linear models:
◼
◼
◼
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
Useful for dimensionality reduction and data smoothing
145
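A small Python sketch (illustrative; the data points are made up) of parametric reduction with least squares: only the two coefficients w and b need to be stored, and the data can be discarded:

```python
# Illustrative sketch: estimating w and b in Y = w X + b with the
# closed-form least-squares solution.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]
w, b = fit_line(xs, ys)
print(w, b)    # slope ~0.99, intercept ~1.05: store (w, b), not the data
```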
Histogram Analysis
◼
Divide data into buckets and store the average (or sum) for each bucket
◼
Partitioning rules:
◼ Equal-width: equal bucket range
◼ Equal-frequency (or equal-depth)
[Figure: equal-width histogram over the value range 10,000–90,000]
146
Clustering
◼
◼
◼
◼
◼
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Can be very effective if data is clustered but not if data
is “smeared”
Can have hierarchical clustering and be stored in multidimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
Cluster analysis will be studied in depth in Chapter 10
147
Sampling
◼
◼
◼
Sampling: obtaining a small sample s to represent the
whole data set N
Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Key principle: Choose a representative subset of the data
◼
◼
◼
Simple random sampling may have very poor
performance in the presence of skew
Develop adaptive sampling methods, e.g., stratified
sampling:
Note: Sampling may not reduce database I/Os (page at a
time)
148
Types of Sampling
◼
◼
◼
◼
Simple random sampling
◼ There is an equal probability of selecting any particular
item
Sampling without replacement
◼ Once an object is selected, it is removed from the
population
Sampling with replacement
◼ A selected object is not removed from the population
Stratified sampling:
◼ Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
◼ Used in conjunction with skewed data
149
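A short Python sketch (illustrative; the toy population is made up) of the three sampling variants above, using only the standard library:

```python
# Illustrative sketch of simple random sampling (with/without replacement)
# and stratified sampling.
import random
from collections import defaultdict

data = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]

# Simple random sampling without replacement (SRSWOR)
srswor = random.sample(data, 10)

# Simple random sampling with replacement (SRSWR): an object can be drawn twice
srswr = [random.choice(data) for _ in range(10)]

# Stratified sampling: draw proportionally from each partition (stratum),
# so the skewed "senior" group is still represented
strata = defaultdict(list)
for group, value in data:
    strata[group].append((group, value))
stratified = []
for group, members in strata.items():
    k = max(1, round(len(members) * 0.1))        # ~10% of each stratum
    stratified.extend(random.sample(members, k))
print(len(srswor), len(srswr), len(stratified))  # 10 10 10
```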
Sampling: With or without Replacement
Raw Data
150
Sampling: Cluster or Stratified
Sampling
Raw Data
Cluster/Stratified Sample
151
Data Cube Aggregation
◼
◼
The lowest level of a data cube (base cuboid)
◼
The aggregated data for an individual entity of interest
◼
E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
◼
◼
Reference appropriate levels
◼
◼
Further reduce the size of data to deal with
Use the smallest representation which is enough to
solve the task
Queries regarding aggregated information should be
answered using data cube, when possible
152
Data Reduction 3: Data Compression
◼
◼
◼
◼
String compression
◼ There are extensive theories and well-tuned algorithms
◼ Typically lossless, but only limited manipulation is
possible without expansion
Audio/video compression
◼ Typically lossy compression, with progressive refinement
◼ Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Time sequences are not audio
◼ Typically short and varying slowly with time
Dimensionality and numerosity reduction may also be
considered as forms of data compression
153
Data Compression
[Figure: Original Data is compressed into Compressed Data; lossless compression restores the Original Data exactly, while lossy compression yields only an approximation of the Original Data]
154
Chapter 3: Data Preprocessing
◼
Data Preprocessing: An Overview
◼
Data Quality
◼
Major Tasks in Data Preprocessing
◼
Data Cleaning
◼
Data Integration
◼
Data Reduction
◼
Data Transformation and Data Discretization
◼
Summary
155
Data Transformation
◼
◼
A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
Methods
◼
Smoothing: Remove noise from data
◼
Attribute/feature construction
◼
New attributes constructed from the given ones
◼
Aggregation: Summarization, data cube construction
◼
Normalization: Scaled to fall within a smaller, specified range
◼
◼
min-max normalization
◼
z-score normalization
◼
normalization by decimal scaling
Discretization: Concept hierarchy climbing
156
Normalization
◼
Min-max normalization: to [new_minA, new_maxA]

    v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
◼
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
◼
Z-score normalization (μA: mean, σA: standard deviation of attribute A):

    v' = (v − μA) / σA
◼
Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
◼
Normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
157
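A small Python sketch (illustrative, not from the slides) of the three normalization methods; the example values are the ones used above:

```python
# Illustrative sketch of min-max, z-score, and decimal-scaling normalization.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(values):
    # find the smallest j such that max(|v'|) < 1
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scaling([986, -345]))                # [0.986, -0.345]
```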
Discretization
◼
Three types of attributes
◼
◼
◼
◼
Nominal—values from an unordered set, e.g., color, profession
Ordinal—values from an ordered set, e.g., military or academic
rank
Numeric—real numbers, e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into intervals
◼
Interval labels can then be used to replace actual data values
◼
Reduce data size by discretization
◼
Supervised vs. unsupervised
◼
Split (top-down) vs. merge (bottom-up)
◼
Discretization can be performed recursively on an attribute
◼
Prepare for further analysis, e.g., classification
158
Data Discretization Methods
◼
Typical methods: All the methods can be applied recursively
◼
Binning (top-down split, unsupervised)
◼
Histogram analysis (top-down split, unsupervised)
◼
Clustering analysis (unsupervised, top-down split or bottom-up merge)
◼
Decision-tree analysis (supervised, top-down split)
◼
Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge)
159
Simple Discretization: Binning
◼
Equal-width (distance) partitioning
◼
Divides the range into N intervals of equal size: uniform grid
◼
if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
◼
◼
The most straightforward, but outliers may dominate presentation
◼
Skewed data is not handled well
Equal-depth (frequency) partitioning
◼
Divides the range into N intervals, each containing approximately
same number of samples
◼
Good data scaling
◼
Managing categorical attributes can be tricky
160
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
❑
161
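A small Python sketch (illustrative) that reproduces the three bin results above:

```python
# Illustrative sketch of equal-frequency binning with the two smoothing rules.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer of min/max of its bin
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

print(bins)            # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```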
Labels
(Binning vs. Clustering)
[Figure: the same data discretized three ways—equal frequency (binning), equal interval width (binning), and K-means clustering; K-means clustering leads to better results]
162
Discretization by Classification &
Correlation Analysis
◼
◼
Classification (e.g., decision tree analysis)
◼
Supervised: Given class labels, e.g., cancerous vs. benign
◼
Using entropy to determine split point (discretization point)
◼
Top-down, recursive split
◼
Details to be covered in Chapter 7
Correlation analysis (e.g., Chi-merge: χ2-based discretization)
◼
Supervised: use class information
◼
Bottom-up merge: find the best neighboring intervals (those
having similar distributions of classes, i.e., low χ2 values) to merge
◼
Merge performed recursively, until a predefined stopping condition
163
Concept Hierarchy Generation
◼
◼
◼
◼
◼
Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
Concept hierarchies facilitate drilling and rolling in data warehouses to
view data in multiple granularity
Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
164
Concept Hierarchy Generation
for Nominal Data
◼
Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
◼
◼
Specification of a hierarchy for a set of values by explicit
data grouping
◼
◼
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
◼
◼
street < city < state < country
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
◼
E.g., for a set of attributes: {street, city, state, country}
165
Automatic Concept Hierarchy Generation
◼
Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
◼ The attribute with the most distinct values is placed at
the lowest level of the hierarchy
◼ Exceptions, e.g., weekday, month, quarter, year
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
166
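A tiny Python sketch (illustrative) of this heuristic—order the attributes by their number of distinct values, most distinct at the bottom:

```python
# Illustrative sketch: automatic hierarchy generation by distinct-value counts.
def auto_hierarchy(distinct_counts):
    # ascending by count: fewest distinct values = highest level
    return sorted(distinct_counts, key=distinct_counts.get)

counts = {"country": 15, "province_or_state": 365, "city": 3567, "street": 674_339}
print(" < ".join(reversed(auto_hierarchy(counts))))
# street < city < province_or_state < country
```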
Chapter 3: Data Preprocessing
◼
Data Preprocessing: An Overview
◼
Data Quality
◼
Major Tasks in Data Preprocessing
◼
Data Cleaning
◼
Data Integration
◼
Data Reduction
◼
Data Transformation and Data Discretization
◼
Summary
167
Summary
◼
◼
◼
◼
◼
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
◼ Entity identification problem
◼ Remove redundancies
◼ Detect inconsistencies
Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
168
References
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of
ACM, 42:73-78, 1999
A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. IEEE Spectrum, Oct 1996
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997.
H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning:
Language, model, and algorithms. VLDB'01
M. Hua and J. Pei. Cleaning disguised missing data: A heuristic approach. KDD'07
H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical
Committee on Data Engineering, 20(4), Dec. 1997
H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining
Perspective. Kluwer Academic, 1998
J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003
D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
V. Raman and J. Hellerstein. Potters Wheel: An Interactive Framework for Data Cleaning and
Transformation, VLDB’2001
T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001
R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans.
Knowledge and Data Engineering, 7:623-640, 1995
169
Data Mining:
Concepts and
Techniques
(3rd ed.)
— Chapter 4 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
170
Chapter 4: Data Warehousing and On-line
Analytical Processing
◼
Data Warehouse: Basic Concepts
◼
Data Warehouse Modeling: Data Cube and OLAP
◼
Data Warehouse Design and Usage
◼
Data Warehouse Implementation
◼
Data Generalization by Attribute-Oriented
Induction
◼
Summary
171
What is a Data Warehouse?
◼
Defined in many different ways, but not rigorously.
◼
A decision support database that is maintained separately from
the organization’s operational database
◼
Support information processing by providing a solid platform of
consolidated, historical data for analysis.
◼
“A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
◼
Data warehousing:
◼
The process of constructing and using data warehouses
172
Data Warehouse—Subject-Oriented
◼
Organized around major subjects, such as customer,
product, sales
◼
Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing
◼
Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process
173
Data Warehouse—Integrated
◼
◼
Constructed by integrating multiple, heterogeneous data
sources
◼ relational databases, flat files, on-line transaction
records
Data cleaning and data integration techniques are
applied.
◼ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
◼
◼
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is
converted.
174
Data Warehouse—Time Variant
◼
The time horizon for the data warehouse is significantly
longer than that of operational systems
◼
◼
◼
Operational database: current value data
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
◼
◼
Contains an element of time, explicitly or implicitly
But the key of operational data may or may not
contain “time element”
175
Data Warehouse—Nonvolatile
◼
A physically separate store of data transformed from the
operational environment
◼
Operational update of data does not occur in the data
warehouse environment
◼
Does not require transaction processing, recovery,
and concurrency control mechanisms
◼
Requires only two operations in data accessing:
◼
initial loading of data and access of data
176
OLTP vs. OLAP
(OLTP | OLAP)
users: clerk, IT professional | knowledge worker
function: day-to-day operations | decision support
DB design: application-oriented | subject-oriented
data: current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage: repetitive | ad-hoc
access: read/write, index/hash on prim. key | lots of scans
unit of work: short, simple transaction | complex query
# records accessed: tens | millions
# users: thousands | hundreds
DB size: 100MB-GB | 100GB-TB
metric: transaction throughput | query throughput, response
177
Why a Separate Data Warehouse?
◼
High performance for both systems
◼
◼
◼
Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
Different functions and different data:
◼
◼
◼
◼
DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
missing data: Decision support requires historical data which
operational DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
Note: There are more and more systems which perform OLAP
analysis directly on relational databases
178
Data Warehouse: A Multi-Tiered Architecture
[Figure: multi-tiered architecture—Data Sources (operational DBs and other sources) are Extracted, Transformed, Loaded, and Refreshed by a Monitor & Integrator with a Metadata repository into the Data Storage tier (Data Warehouse and Data Marts); an OLAP Server (OLAP Engine tier) serves Front-End Tools for analysis, query, reports, and data mining]
179
Three Data Warehouse Models
◼
◼
Enterprise warehouse
◼ collects all of the information about subjects spanning
the entire organization
Data Mart
◼ a subset of corporate-wide data that is of value to a
specific groups of users. Its scope is confined to
specific, selected groups, such as marketing data mart
◼
◼
Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
◼ A set of views over operational databases
◼ Only some of the possible summary views may be
materialized
180
Extraction, Transformation, and Loading
(ETL)
◼
◼
◼
◼
◼
Data extraction
◼ get data from multiple, heterogeneous, and external
sources
Data cleaning
◼ detect errors in the data and rectify them when possible
Data transformation
◼ convert data from legacy or host format to warehouse
format
Load
◼ sort, summarize, consolidate, compute views, check
integrity, and build indicies and partitions
Refresh
◼ propagate the updates from the data sources to the
warehouse
181
Metadata Repository
◼
Meta data is the data defining warehouse objects. It stores:
◼
Description of the structure of the data warehouse
◼
◼
schema, view, dimensions, hierarchies, derived data defn, data
mart locations and contents
Operational meta-data
◼
data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
◼
The algorithms used for summarization
◼
The mapping from operational environment to the data warehouse
◼
◼
Data related to system performance
◼ warehouse schema, view and derived data definitions
Business data
◼
business terms and definitions, ownership of data, charging policies
182
Chapter 4: Data Warehousing and On-line
Analytical Processing
◼
Data Warehouse: Basic Concepts
◼
Data Warehouse Modeling: Data Cube and OLAP
◼
Data Warehouse Design and Usage
◼
Data Warehouse Implementation
◼
Data Generalization by Attribute-Oriented
Induction
◼
Summary
183
From Tables and Spreadsheets to
Data Cubes
◼
A data warehouse is based on a multidimensional data model
which views data in the form of a data cube
◼
A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
◼
Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
◼
Fact table contains measures (such as dollars_sold) and keys
to each of the related dimension tables
◼
In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
184
Cube: A Lattice of Cuboids
all
time
0-D (apex) cuboid
item
time,location
time,item
location
supplier
item,location
time,supplier
1-D cuboids
location,supplier
2-D cuboids
item,supplier
time,location,supplier
3-D cuboids
time,item,location
time,item,supplier
item,location,supplier
4-D (base) cuboid
time, item, location, supplier
185
Conceptual Modeling of Data
Warehouses
◼
Modeling data warehouses: dimensions & measures
◼
Star schema: A fact table in the middle connected to a
set of dimension tables
◼
Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
◼
Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
186
Example of Star Schema
[Schema figure, summarized:]
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, state_or_province, country
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
187
Example of Snowflake Schema
[Schema figure, summarized:]
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_key → supplier (supplier_key, supplier_type)
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city_key → city (city_key, city, state_or_province, country)
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
188
Example of Fact
Constellation
[Schema figure, summarized:]
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country
shipper dimension: shipper_key, shipper_name, location_key, shipper_type
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
189
A Concept Hierarchy:
Dimension (location)
[Figure: concept hierarchy for the location dimension—all > region (Europe, North_America, ...) > country (Germany, Spain, Canada, Mexico, ...) > city (Frankfurt, Vancouver, Toronto, ...) > office (L. Chan, M. Wind, ...)]
190
Data Cube Measures: Three Categories
◼
Distributive: if the result derived by applying the function
to n aggregate values is the same as that derived by
applying the function on all the data without partitioning
◼
◼
Algebraic: if it can be computed by an algebraic function
with M arguments (where M is a bounded integer), each of
which is obtained by applying a distributive aggregate
function
◼
◼
E.g., count(), sum(), min(), max()
E.g., avg(), min_N(), standard_deviation()
Holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
◼
E.g., median(), mode(), rank()
191
View of Warehouses and
Hierarchies
Specification of hierarchies
◼
Schema hierarchy
day < {month <
quarter; week} < year
◼
Set_grouping hierarchy
{1..10} < inexpensive
192
Multidimensional Data
Sales volume as a function of product, month,
and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month or Week > Day
[Figure: 3-D cube with axes Product, Month (Time), and City (Location)]
193
A Sample Data Cube
[Figure: 3-D sales cube with dimensions Date (1Qtr–4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum); the cell (sum, TV, U.S.A.) gives the total annual sales of TVs in the U.S.A.]
194
Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: (product), (date), (country)
2-D cuboids: (product, date), (product, country), (date, country)
3-D (base) cuboid: (product, date, country)
195
Typical OLAP Operations
◼
Roll up (drill-up): summarize data
◼
by climbing up hierarchy or by dimension reduction
◼
Drill down (roll down): reverse of roll-up
◼
from higher level summary to lower level summary or
detailed data, or introducing new dimensions
Slice and dice: project and select
◼
Pivot (rotate):
◼
◼
◼
reorient the cube, visualization, 3D to series of 2D planes
Other operations
◼
◼
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
196
Fig. 3.10 Typical OLAP
Operations
197
A Star-Net Query Model
[Figure: star-net query model with radial lines for Customer Orders (CUSTOMER, ORDER, CONTRACTS), Shipping Method (AIR-EXPRESS, TRUCK), Time (DAILY, QTRLY, ANNUALY), Product (PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE), Location (CITY, COUNTRY, REGION, DISTRICT), Organization (SALES PERSON, DIVISION), and Promotion; each circle on a line is called a footprint]
198
Browsing a Data Cube
◼
◼
◼
Visualization
OLAP capabilities
Interactive manipulation
199
Chapter 4: Data Warehousing and On-line
Analytical Processing
◼
Data Warehouse: Basic Concepts
◼
Data Warehouse Modeling: Data Cube and OLAP
◼
Data Warehouse Design and Usage
◼
Data Warehouse Implementation
◼
Data Generalization by Attribute-Oriented
Induction
◼
Summary
200
Design of Data Warehouse: A
Business Analysis Framework
◼
Four views regarding the design of a data warehouse
◼
Top-down view
◼
◼
Data source view
◼
◼
exposes the information being captured, stored, and
managed by operational systems
Data warehouse view
◼
◼
allows selection of the relevant information necessary for the
data warehouse
consists of fact tables and dimension tables
Business query view
◼
sees the perspectives of data in the warehouse from the view
of end-user
201
Data Warehouse Design
Process
◼
◼
Top-down, bottom-up approaches or a combination of both
◼
Top-down: Starts with overall design and planning (mature)
◼
Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view
◼
◼
◼
Waterfall: structured and systematic analysis at each step before
proceeding to the next
Spiral: rapid generation of increasingly functional systems, with short turnaround time
Typical data warehouse design process
◼
Choose a business process to model, e.g., orders, invoices, etc.
◼
Choose the grain (atomic level of data) of the business process
◼
Choose the dimensions that will apply to each fact table record
◼
Choose the measure that will populate each fact table record
202
Data Warehouse
Development: A
Recommended Approach
[Figure: recommended approach—define a high-level corporate data model; build distributed data marts; refine the model into an enterprise data warehouse; refine again toward a multi-tier data warehouse]
203
Data Warehouse Usage
◼
Three kinds of data warehouse applications
◼
Information processing
◼
◼
◼
supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
Analytical processing
◼
multidimensional analysis of data warehouse data
◼
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
◼
◼
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools
204
From On-Line Analytical Processing
(OLAP)
to On Line Analytical Mining (OLAM)
◼
Why online analytical mining?
◼ High quality of data in data warehouses
◼ DW contains integrated, consistent, cleaned data
◼ Available information processing structure surrounding
data warehouses
◼ ODBC, OLEDB, Web accessing, service facilities,
reporting and OLAP tools
◼ OLAP-based exploratory data analysis
◼ Mining with drilling, dicing, pivoting, etc.
◼ On-line selection of data mining functions
◼ Integration and swapping of multiple mining
functions, algorithms, and tasks
205
Chapter 4: Data Warehousing and On-line
Analytical Processing
◼
Data Warehouse: Basic Concepts
◼
Data Warehouse Modeling: Data Cube and OLAP
◼
Data Warehouse Design and Usage
◼
Data Warehouse Implementation
◼
Data Generalization by Attribute-Oriented
Induction
◼
Summary
206
Efficient Data Cube
Computation
◼
Data cube can be viewed as a lattice of cuboids
◼
The bottom-most cuboid is the base cuboid
◼
The top-most cuboid (apex) contains only one cell
◼
How many cuboids are there in an n-dimensional cube where dimension i has Li levels?

    T = ∏ (i = 1..n) (Li + 1)
◼
Materialization of data cube
◼
◼
Materialize every (cuboid) (full materialization),
none (no materialization), or some (partial
materialization)
Selection of which cuboids to materialize
◼
Based on size, sharing, access frequency, etc.
207
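A one-line Python sketch (illustrative; the level counts below are hypothetical) of the cuboid-count formula above:

```python
# Illustrative sketch: T = prod_i (L_i + 1) cuboids for an n-dimensional cube
# where dimension i has L_i concept-hierarchy levels.
from math import prod

def total_cuboids(levels):
    return prod(l + 1 for l in levels)

# hypothetical example: time has 4 levels, item 3, location 4, supplier 1
print(total_cuboids([4, 3, 4, 1]))   # 5 * 4 * 5 * 2 = 200 cuboids
```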
The “Compute Cube” Operator
◼
Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
◼
Transform it into a SQL-like language (with a new operator cube
by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
◼
Need to compute the following group-bys:
(date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), ()
[Figure: lattice of the eight group-bys generated for (city, item, year): (city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()]
208
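As a quick illustration of what CUBE BY expands into, the following Python sketch (illustrative only) enumerates the 2^3 group-bys for the three dimensions above:

```python
# Illustrative sketch: CUBE BY over n dimensions generates all 2^n group-bys.
from itertools import combinations

dims = ["item", "city", "year"]
group_bys = [combo for r in range(len(dims) + 1) for combo in combinations(dims, r)]
for g in group_bys:
    print(g)      # (), ('item',), ('city',), ..., ('item', 'city', 'year')
```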
Indexing OLAP Data: Bitmap Index
◼
◼
◼
◼
◼
◼
Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for
the indexed column
Not suitable for high-cardinality domains
A recent bit compression technique, Word-Aligned Hybrid (WAH),
makes it work for high cardinality domain as well [Wu, et al. TODS’06]
Base table:
Cust   Region    Type
C1     Asia      Retail
C2     Europe    Dealer
C3     Asia      Dealer
C4     America   Retail
C5     Europe    Dealer

Index on Region:
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type:
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
209
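A compact Python sketch (not from the text) of the idea: one bit vector per column value, so a conjunctive query becomes a single bitwise AND. The customer data matches the example above; Python integers stand in for the bit vectors.

```python
# Illustrative sketch: bitmap index on one column, answering a query with bit-ops.
def bitmap_index(column):
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)   # set the i-th bit
    return index

region = ["Asia", "Europe", "Asia", "America", "Europe"]
rtype  = ["Retail", "Dealer", "Dealer", "Retail", "Dealer"]
region_idx, type_idx = bitmap_index(region), bitmap_index(rtype)

# Customers in Europe that are Dealers: a single bitwise AND
hits = region_idx["Europe"] & type_idx["Dealer"]
print([i for i in range(len(region)) if hits & (1 << i)])   # [1, 4]  (C2, C5)
```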
Indexing OLAP Data: Join Indices
◼
◼
◼
Join index: JI(R-id, S-id) where R (R-id, …)  S
(S-id, …)
Traditional indices map the values to a list of
record ids
◼ It materializes relational join in JI file and
speeds up relational join
In data warehouses, join index relates the values
of the dimensions of a star schema to rows in
the fact table.
◼ E.g. fact table: Sales and two dimensions city
and product
◼ A join index on city maintains for each
distinct city a list of R-IDs of the tuples
recording the Sales in the city
◼ Join indices can span multiple dimensions
210
Efficient Processing OLAP Queries
◼
Determine which operations should be performed on the available cuboids
◼
Transform drill, roll, etc. into corresponding SQL and/or OLAP operations,
e.g., dice = selection + projection
◼
Determine which materialized cuboid(s) should be selected for OLAP op.
◼
Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
◼
Explore indexing structures and compressed vs. dense array structs in MOLAP
211
OLAP Server Architectures
◼
Relational OLAP (ROLAP)
◼
◼
◼
◼
◼
Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
Greater scalability
Multidimensional OLAP (MOLAP)
◼
Sparse array-based multidimensional storage engine
◼
Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
◼
◼
Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware
Flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers (e.g., Redbricks)
◼
Specialized support for SQL queries over star/snowflake schemas
212
Chapter 4: Data Warehousing and On-line
Analytical Processing
◼
Data Warehouse: Basic Concepts
◼
Data Warehouse Modeling: Data Cube and OLAP
◼
Data Warehouse Design and Usage
◼
Data Warehouse Implementation
◼
Data Generalization by Attribute-Oriented
Induction
◼
Summary
213
Attribute-Oriented
Induction
◼
Proposed in 1989 (KDD ‘89 workshop)
◼
Not confined to categorical data nor particular measures
◼
How it is done?
◼
◼
◼
◼
Collect the task-relevant data (initial relation) using a
relational database query
Perform generalization by attribute removal or
attribute generalization
Apply aggregation by merging identical, generalized
tuples and accumulating their respective counts
Interaction with users for knowledge presentation
214
Attribute-Oriented Induction: An
Example
Example: Describe general characteristics of graduate
students in the University database
◼
Step 1. Fetch relevant set of data using an SQL
statement, e.g.,
Select * (i.e., name, gender, major, birth_place,
birth_date, residence, phone#, gpa)
from student
where student_status in {“Msc”, “MBA”, “PhD” }
◼
Step 2. Perform attribute-oriented induction
◼
Step 3. Present results in generalized relation, cross-tab,
or rule forms
215
Class Characterization: An Example
Initial relation (Name and Phone # removed; Gender retained; Major generalized to Sci., Eng., Bus.; Birth-Place to Country; Birth_date to Age range; Residence to City; GPA to Excl., VG, ...):

Name            Gender  Major    Birth-Place             Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada   8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada   28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA        25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                       …           …                         …         …

Prime generalized relation:

Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …

Cross-tab of count by Gender and Birth_Region:

Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62
216
Basic Principles of Attribute-Oriented
Induction
◼
◼
◼
◼
◼
Data focusing: task-relevant data, including dimensions,
and the result is the initial relation
Attribute-removal: remove attribute A if there is a large set
of distinct values for A but (1) there is no generalization
operator on A, or (2) A’s higher level concepts are
expressed in terms of other attributes
Attribute-generalization: If there is a large set of distinct
values for A, and there exists a set of generalization
operators on A, then select an operator and generalize A
Attribute-threshold control: typical 2-8, specified/default
Generalized relation threshold control: control the final
relation/rule size
217
Attribute-Oriented Induction: Basic
Algorithm
◼
◼
◼
◼
InitialRel: Query processing of task-relevant data, deriving
the initial relation.
PreGen: Based on the analysis of the number of distinct
values in each attribute, determine generalization plan for
each attribute: removal? or how high to generalize?
PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
Presentation: User interaction: (1) adjust levels by drilling,
(2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
218
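A minimal Python sketch of the PreGen/PrimeGen idea—generalize each attribute with a concept-hierarchy mapping, then merge identical tuples and accumulate counts. The hierarchies and tuples below are hypothetical, illustrative values, not data from the text:

```python
# Illustrative sketch of attribute-oriented induction.
from collections import Counter

# Hypothetical concept hierarchies (value -> higher-level concept)
major_gen = {"CS": "Science", "Physics": "Science", "Marketing": "Business"}
place_gen = {"Canada": "Canada"}          # anything else generalizes to "Foreign"

tuples = [("M", "CS", "Canada"), ("M", "CS", "Canada"), ("F", "Physics", "USA")]

generalized = Counter(
    (gender, major_gen.get(major, major), place_gen.get(country, "Foreign"))
    for gender, major, country in tuples
)
for cell, count in generalized.items():
    print(cell, count)
# ('M', 'Science', 'Canada') 2
# ('F', 'Science', 'Foreign') 1
```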
Presentation of Generalized
Results
◼
Generalized relation:
◼
◼
Cross tabulation:
◼
◼
Relations where some or all attributes are generalized, with counts
or other aggregation values accumulated.
Mapping results into cross tabulation form (similar to contingency
tables).
◼
Visualization techniques:
◼
Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
Mapping generalized result into characteristic rules with quantitative
information associated with it, e.g.,
∀x, grad(x) ∧ male(x) ⇒
birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%].
◼
219
Mining Class Comparisons
◼
Comparison: Comparing two or more classes
◼
Method:
◼
Partition the set of relevant data into the target class and the
contrasting class(es)
◼
Generalize both classes to the same high level concepts
◼
Compare tuples with the same high level descriptions
◼
Present for every tuple its description and two measures
◼
◼
◼
support - distribution within single class
◼
comparison - distribution between classes
Highlight the tuples with strong discriminant features
Relevance Analysis:
◼
Find attributes (features) which best distinguish different classes
220
Concept Description vs. Cube-Based
OLAP
◼
◼
Similarity:
◼ Data generalization
◼ Presentation of data summarization at multiple levels of
abstraction
◼ Interactive drilling, pivoting, slicing and dicing
Differences:
◼ OLAP has systematic preprocessing, query independent,
and can drill down to rather low level
◼ AOI has automated desired level allocation, and may
perform dimension relevance analysis/ranking when
there are many relevant dimensions
◼ AOI works on the data which are not in relational forms
221
Chapter 4: Data Warehousing and On-line
Analytical Processing
◼
Data Warehouse: Basic Concepts
◼
Data Warehouse Modeling: Data Cube and OLAP
◼
Data Warehouse Design and Usage
◼
Data Warehouse Implementation
◼
Data Generalization by Attribute-Oriented
Induction
◼
Summary
222
Summary
◼
Data warehousing: A multi-dimensional model of a data warehouse
◼
◼
◼
◼
A data cube consists of dimensions & measures
Star schema, snowflake schema, fact constellations
OLAP operations: drilling, rolling, slicing, dicing and pivoting
Data Warehouse Architecture, Design, and Usage
◼
Multi-tiered architecture
◼
Business analysis design framework
Information processing, analytical processing, data mining, OLAM (Online
Analytical Mining)
Implementation: Efficient computation of data cubes
◼
Partial vs. full vs. no materialization
◼
Indexing OLAP data: Bitmap index and join index
◼
OLAP query processing
◼
OLAP servers: ROLAP, MOLAP, HOLAP
◼
◼
◼
Data generalization: Attribute-oriented induction
223
References (I)
◼
S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S.
Sarawagi. On the computation of multidimensional aggregates. VLDB’96
◼
D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data
warehouses. SIGMOD’97
◼
R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
◼
S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM
SIGMOD Record, 26:65-74, 1997
◼
E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July
1993.
J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab
and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
◼
◼
A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and
Applications. MIT Press, 1999.
◼
J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107,
1998.
◼
V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently.
SIGMOD’96
◼
J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97
224
References (II)
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and
Dimensional Techniques. John Wiley, 2003
W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. 2ed. John Wiley, 2002
P. O’Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record, 24:8–
11, Sept. 1995.
P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97
Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998
S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94
A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00.
D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with aggregation using
views. VLDB'96
P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.
J. Widom. Research problems in data warehousing. CIKM’95
K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans.
on Database Systems (TODS), 31(1): 1-38, 2006
225
Surplus Slides
226
Compression of Bitmap Indices
◼
Bitmap indexes must be compressed to reduce I/O costs
and minimize CPU usage—majority of the bits are 0’s
◼
◼
Two compression schemes:
◼
Byte-aligned Bitmap Code (BBC)
◼
Word-Aligned Hybrid (WAH) code
Time and space required to operate on compressed
bitmap is proportional to the total size of the bitmap
◼
Optimal on attributes of low cardinality as well as those of
high cardinality.
◼
WAH outperforms BBC by about a factor of two
227
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 5 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2010 Han, Kamber & Pei. All rights reserved.
228
Chapter 5: Data Cube Technology
◼
Data Cube Computation: Preliminary Concepts
◼
Data Cube Computation Methods
◼
Processing Advanced Queries by Exploring Data
Cube Technology
◼
Multidimensional Data Analysis in Cube Space
◼
Summary
229
Data Cube: A Lattice of Cuboids
all
time
item
time,location
time,item
0-D(apex) cuboid
location
supplier
item,location
time,supplier
1-D cuboids
location,supplier
2-D cuboids
item,supplier
time,location,supplier
3-D cuboids
time,item,location time,item,supplier
item,location,supplier
4-D(base) cuboid
time, item, location, supplier
230
Data Cube: A Lattice of Cuboids
all
time
item
0-D(apex) cuboid
location
supplier
1-D cuboids
time,item
time,location
item,location
location,supplier
item,supplier
time,supplier
2-D cuboids
time,location,supplier
time,item,location
time,item,supplier
item,location,supplier
time, item, location, supplier
◼
3-D cuboids
4-D(base) cuboid
Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land)
2. (9/15, milk, Urbana, *)
3. (*, milk, Urbana, *)
4. (*, milk, Chicago, *)
5. (*, milk, *, *)
231
Cube Materialization:
Full Cube vs. Iceberg Cube
◼
Full cube vs. iceberg cube
iceberg
condition
compute cube sales iceberg as
select month, city, customer group, count(*)
from salesInfo
cube by month, city, customer group
having count(*) >= min support
▪ Computing only the cuboid cells whose measure satisfies the
iceberg condition
▪ Only a small portion of cells may be “above the water’’ in a
sparse cube
◼ Avoid explosive growth: A cube with 100 dimensions
◼ 2 base cells: (a1, a2, …., a100), (b1, b2, …, b100)
◼ How many aggregate cells if “having count >= 1”?
◼ What about “having count >= 2”?
232
Iceberg Cube, Closed Cube & Cube Shell
◼
Is iceberg cube good enough?
◼
◼
◼
How many cells will the iceberg cube have if having count(*) >=
10? Hint: A huge but tricky number!
Close cube:
◼
◼
◼
◼
2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}
Closed cell c: if there exists no cell d, s.t. d is a descendant of c,
and d has the same measure value as c.
Closed cube: a cube consisting of only closed cells
What is the closed cube of the above base cuboid? Hint: only 3
cells
Cube Shell
◼
◼
Precompute only the cuboids involving a small # of dimensions,
e.g., 3
For (A1, A2, … A10), how many combinations to compute?
More dimension combinations will need to be computed on the fly
233
Roadmap for Efficient Computation
◼
General cube computation heuristics (Agarwal et al.’96)
◼
Computing full/iceberg cubes: 3 methodologies
◼
◼
◼
Bottom-Up: Multi-Way array aggregation (Zhao, Deshpande &
Naughton, SIGMOD’97)
Top-down:
◼
BUC (Beyer & Ramarkrishnan, SIGMOD’99)
◼
H-cubing technique (Han, Pei, Dong & Wang: SIGMOD’01)
Integrating Top-Down and Bottom-Up:
◼
Star-cubing algorithm (Xin, Han, Li & Wah: VLDB’03)
◼
High-dimensional OLAP: A Minimal Cubing Approach (Li, et al. VLDB’04)
◼
Computing alternative kinds of cubes:
◼
Partial cube, closed cube, approximate cube, etc.
234
General Heuristics (Agarwal et al. VLDB’96)
◼
◼
Sorting, hashing, and grouping operations are applied to the dimension
attributes in order to reorder and cluster related tuples
Aggregates may be computed from previously computed aggregates,
rather than from the base fact table
◼
◼
◼
◼
◼
Smallest-child: computing a cuboid from the smallest, previously
computed cuboid
Cache-results: caching results of a cuboid from which other
cuboids are computed to reduce disk I/Os
Amortize-scans: computing as many as possible cuboids at the
same time to amortize disk reads
Share-sorts: sharing sorting costs cross multiple cuboids when
sort-based method is used
Share-partitions: sharing the partitioning cost across multiple
cuboids when hash-based algorithms are used
235
Chapter 5: Data Cube Technology
◼
Data Cube Computation: Preliminary Concepts
◼
Data Cube Computation Methods
◼
Processing Advanced Queries by Exploring Data
Cube Technology
◼
Multidimensional Data Analysis in Cube Space
◼
Summary
236
Data Cube Computation Methods
◼
Multi-Way Array Aggregation
◼
BUC
◼
Star-Cubing
◼
High-Dimensional OLAP
237
Multi-Way Array Aggregation
◼
Array-based “bottom-up” algorithm
◼
Using multi-dimensional chunks
◼
No direct tuple comparisons
◼
◼
◼
Simultaneous aggregation on multiple
dimensions
Intermediate aggregate values are reused for computing ancestor cuboids
Cannot do Apriori pruning: No iceberg
optimization
238
Multi-way Array Aggregation for Cube
Computation (MOLAP)
◼
Partition arrays into chunks (a small subcube which fits in memory).
◼
Compressed sparse array addressing: (chunk_id, offset)
◼
Compute aggregates in “multiway” by visiting cube cells in the order
which minimizes the # of times to visit each cell, and reduces
memory access and storage cost.
[Figure: a 4×4×4 array over dimensions A (a0–a3), B (b0–b3), C (c0–c3) partitioned into 64 chunks numbered 1–64]
What is the best
traversing order
to do multi-way
aggregation?
239
Multi-way Array Aggregation for Cube
Computation (3-D to 2-D)
[Figure: lattice all; A, B, C; AB, AC, BC; ABC—the 2-D planes AB, AC, BC are aggregated simultaneously from the 3-D cuboid ABC]
◼
The best order is the one that minimizes the memory requirement and reduces I/Os
240
Multi-way Array Aggregation for Cube
Computation (2-D to 1-D)
241
Multi-Way Array Aggregation for Cube
Computation (Method Summary)
◼
Method: the planes should be sorted and computed
according to their size in ascending order
◼
◼
Idea: keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for the
largest plane
Limitation of the method: computing well only for a small
number of dimensions
◼
If there are a large number of dimensions, “top-down”
computation and iceberg cube computation methods
can be explored
242
Data Cube Computation Methods
◼
Multi-Way Array Aggregation
◼
BUC
◼
Star-Cubing
◼
High-Dimensional OLAP
243
Bottom-Up Computation (BUC)
◼
BUC (Beyer & Ramakrishnan, SIGMOD’99)
◼
Bottom-up cube computation (Note: top-down in our view!)
◼
Divides dimensions into partitions and facilitates iceberg pruning
◼ If a partition does not satisfy min_sup, its descendants can be pruned
◼ If minsup = 1, compute the full CUBE!
◼
No simultaneous aggregation
[Figure: cuboid lattice from all down to ABCD, annotated with BUC's processing order: 1 all, 2 A, 3 AB, 4 ABC, 5 ABCD, 6 ABD, 7 AC, 8 ACD, 9 AD, 10 B, 11 BC, 12 BCD, 13 BD, 14 C, 15 CD, 16 D]
5 ABCD
244
BUC: Partitioning
◼
◼
◼
◼
Usually, entire data set
can’t fit in main memory
Sort distinct values
◼ partition into blocks that fit
Continue processing
Optimizations
◼ Partitioning
◼ External Sorting, Hashing, Counting Sort
◼ Ordering dimensions to encourage pruning
◼ Cardinality, Skew, Correlation
◼ Collapsing duplicates
◼ Can’t do holistic aggregates anymore!
245
Data Cube Computation Methods
◼
Multi-Way Array Aggregation
◼
BUC
◼
Star-Cubing
◼
High-Dimensional OLAP
246
Star-Cubing: An Integrating
Method
◼
D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration, VLDB'03
◼
Explore shared dimensions
◼ E.g., dimension A is the shared dimension of ACD and AD
◼ ABD/AB means cuboid ABD has shared dimensions AB
◼
Allows for shared computations
◼ e.g., cuboid AB is computed simultaneously as ABD
◼
Aggregate in a top-down manner but with a bottom-up sub-layer underneath, which allows Apriori pruning
◼
Shared dimensions grow in bottom-up fashion
[Figure: cuboid tree annotated with shared dimensions—ABCD/all; ACD/A, ABD/AB, ABC/ABC, BCD; AC/AC, AD/A, BC/BC, BD/B, CD, C/C, D]
247
Iceberg Pruning in Shared Dimensions
◼
Anti-monotonic property of shared dimensions
◼
◼
◼
If the measure is anti-monotonic, and if the
aggregate value on a shared dimension does not
satisfy the iceberg condition, then all the cells
extended from this shared dimension cannot
satisfy the condition either
Intuition: if we can compute the shared dimensions
before the actual cuboid, we can use them to do
Apriori pruning
Problem: how to prune while still aggregating simultaneously on multiple dimensions?
248
Cell Trees
◼
Use a tree structure similar
to H-tree to represent
cuboids
◼
Collapses common prefixes
to save memory
◼
Keep count at node
◼
Traverse the tree to retrieve
a particular tuple
249
Star Attributes and Star Nodes
◼
Intuition: If a single-dimensional
aggregate on an attribute value p
does not satisfy the iceberg
condition, it is useless to distinguish
them during the iceberg
computation
◼
◼
E.g., b2, b3, b4, c1, c2, c4, d1, d2,
d3
A    B    C    D    Count
a1   b1   c1   d1   1
a1   b1   c4   d3   1
a1   b2   c2   d2   1
a2   b3   c3   d4   1
a2   b4   c3   d4   1
Solution: Replace such attributes by
a *. Such attributes are star
attributes, and the corresponding
nodes in the cell tree are star nodes
250
Example: Star Reduction
◼
◼
◼
◼
Suppose minsup = 2
Perform one-dimensional
aggregation. Replace attribute
values whose count < 2 with *. And
collapse all *’s together
Resulting table has all such
attributes replaced with the starattribute
With regards to the iceberg
computation, this new table is a
lossless compression of the original
table
After replacing values whose count < 2 with *:

A    B    C    D    Count
a1   b1   *    *    1
a1   b1   *    *    1
a1   *    *    *    1
a2   *    c3   d4   1
a2   *    c3   d4   1

After collapsing identical tuples:

A    B    C    D    Count
a1   b1   *    *    2
a1   *    *    *    1
a2   *    c3   d4   2
251
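A small Python sketch (illustrative) of the star reduction just described; minsup and the tuples are taken from the example, and it reproduces the compressed table above:

```python
# Illustrative sketch: replace values whose 1-D count is below minsup with '*',
# then collapse identical tuples.
from collections import Counter

minsup = 2
table = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c4", "d3"),
         ("a1", "b2", "c2", "d2"), ("a2", "b3", "c3", "d4"),
         ("a2", "b4", "c3", "d4")]

# 1-D aggregation per dimension
freq = [Counter(row[d] for row in table) for d in range(4)]

# Replace infrequent values by the star attribute and collapse duplicates
reduced = Counter(
    tuple(v if freq[d][v] >= minsup else "*" for d, v in enumerate(row))
    for row in table
)
for row, count in reduced.items():
    print(row, count)
# ('a1', 'b1', '*', '*') 2
# ('a1', '*', '*', '*') 1
# ('a2', '*', 'c3', 'd4') 2
```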
Star Tree
◼
Given the new compressed table (below), it is possible to construct the corresponding cell tree—called the star tree

A    B    C    D    Count
a1   b1   *    *    2
a1   *    *    *    1
a2   *    c3   d4   2
◼
Keep a star table at the side
for easy lookup of star
attributes
◼
The star tree is a lossless
compression of the original
cell tree
252
Star-Cubing Algorithm—DFS on Lattice
Tree
[Figure: the cuboid lattice (all; A/A, B/B, C/C, D/D; AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD; ABC/ABC, ABD/AB, ACD/A, BCD; ABCD) shown with the base star tree (root: 5, subtrees a1: 3 and a2: 2) and the side trees, e.g., BCD, built during the depth-first search]
253
Multi-Way Aggregation
[Figure: the BCD, ACD/A, ABD/AB, and ABC/ABC trees aggregated simultaneously while traversing the ABCD base tree]
254
Star-Cubing Algorithm—DFS on Star-Tree
[Figure: depth-first search over the ABCD base star tree with the BCD, ACD/A, ABD/AB, and ABC/ABC child trees]
255
Multi-Way Star-Tree Aggregation
[Figure: the BCD, ACD/A, ABD/AB, and ABC/ABC star trees maintained alongside the ABCD base tree]
◼
Start depth-first search at the root of the base star tree
◼
At each new node in the DFS, create corresponding star tree that are descendants of
the current tree according to the integrated traversal ordering
◼
E.g., in the base tree, when DFS reaches a1, the ACD/A tree is created
◼
When DFS reaches b*, the ABD/AD tree is created
◼
The counts in the base tree are carried over to the new trees
◼
When DFS reaches a leaf node (e.g., d*), start backtracking
◼
◼
On every backtracking branch, the counts in the corresponding trees are output, the
tree is destroyed, and the node in the base tree is destroyed
Example
◼
◼
◼
When traversing from d* back to c*, the a1b*c*/a1b*c* tree is output and
destroyed
When traversing from c* back to b*, the a1b*D/a1b* tree is output and
destroyed
When at b*, jump to b1 and repeat similar process
256
Data Cube Computation Methods
◼
Multi-Way Array Aggregation
◼
BUC
◼
Star-Cubing
◼
High-Dimensional OLAP
257
The Curse of Dimensionality
◼
◼
None of the previous cubing methods can handle high
dimensionality!
A database of 600k tuples. Each dimension has cardinality
of 100 and zipf of 2.
258
Motivation of High-D OLAP
◼
◼
◼
X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP:
A Minimal Cubing Approach, VLDB'04
Challenge to current cubing methods:
◼ The “curse of dimensionality’’ problem
◼ Iceberg cube and compressed cubes: only delay the
inevitable explosion
◼ Full materialization: still significant overhead in
accessing results on disk
High-D OLAP is needed in applications
◼ Science and engineering analysis
◼ Bio-data analysis: thousands of genes
◼ Statistical surveys: hundreds of variables
259
Fast High-D OLAP with Minimal Cubing
◼
Observation: OLAP occurs only on a small subset of
dimensions at a time
◼
Semi-Online Computational Model
1.
Partition the set of dimensions into shell fragments
2.
Compute data cubes for each shell fragment while
retaining inverted indices or value-list indices
3.
Given the pre-computed fragment cubes,
dynamically compute cube cells of the high-dimensional data cube online
260
Properties of Proposed Method
◼
Partitions the data vertically
◼
Reduces high-dimensional cube into a set of lower
dimensional cubes
◼
Online re-construction of original high-dimensional space
◼
Lossless reduction
◼
Offers tradeoffs between the amount of pre-processing
and the speed of online computation
261
Example Computation
◼
◼
Let the cube aggregation function be count
tid   A    B    C    D    E
1     a1   b1   c1   d1   e1
2     a1   b2   c1   d2   e1
3     a1   b2   c1   d1   e2
4     a2   b1   c1   d1   e2
5     a2   b1   c1   d1   e3
Divide the 5 dimensions into 2 shell fragments:
◼ (A, B, C) and (D, E)
262
1-D Inverted Indices
◼
Build a traditional inverted index or RID list

Attribute Value   TID List      List Size
a1                1 2 3         3
a2                4 5           2
b1                1 4 5         3
b2                2 3           2
c1                1 2 3 4 5     5
d1                1 3 4 5       4
d2                2             1
e1                1 2           2
e2                3 4           2
e3                5             1
263
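A minimal Python sketch (illustrative) that builds these TID lists from the example relation above:

```python
# Illustrative sketch: 1-D inverted (TID-list) indices.
from collections import defaultdict

rows = {1: ("a1", "b1", "c1", "d1", "e1"),
        2: ("a1", "b2", "c1", "d2", "e1"),
        3: ("a1", "b2", "c1", "d1", "e2"),
        4: ("a2", "b1", "c1", "d1", "e2"),
        5: ("a2", "b1", "c1", "d1", "e3")}

inverted = defaultdict(set)
for tid, values in rows.items():
    for value in values:
        inverted[value].add(tid)

print(sorted(inverted["b1"]))   # [1, 4, 5]
print(sorted(inverted["d1"]))   # [1, 3, 4, 5]
```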
Shell Fragment Cubes: Ideas
◼
◼
◼
◼
Generalize the 1-D inverted indices to multi-dimensional
ones in the data cube sense
Compute all cuboids for data cubes ABC and DE while
retaining the inverted indices
For example, shell fragment cube ABC contains 7 cuboids:
◼ A, B, C
◼ AB, AC, BC
◼ ABC
◼
This completes the offline computation stage

Cell     Intersection           TID List   List Size
a1 b1    {1 2 3} ∩ {1 4 5}      1          1
a1 b2    {1 2 3} ∩ {2 3}        2 3        2
a2 b1    {4 5} ∩ {1 4 5}        4 5        2
a2 b2    {4 5} ∩ {2 3}          ∅          0
264
Shell Fragment Cubes: Size and Design
◼
Given a database of T tuples, D dimensions, and F shell
fragment size, the fragment cubes’ space requirement is:
◼
O( T × ⌈D/F⌉ × (2^F − 1) )
◼
For F < 5, the growth is sub-linear
◼
Shell fragments do not have to be disjoint
◼
Fragment groupings can be arbitrary to allow for
maximum online performance
◼
◼
Known common combinations
(e.g.,<city, state>)

should be grouped together.
Shell fragment sizes can be adjusted for optimal balance
between offline and online computation
265
ID_Measure Table
◼
If measures other than count are present, store them in an ID_measure table separate from the shell fragments

tid   count   sum
1     5       70
2     3       10
3     8       20
4     5       40
5     2       30
266
The Frag-Shells Algorithm
1. Partition the set of dimensions (A1, …, An) into a set of k fragments (P1, …, Pk).
2. Scan the base table once and do the following:
3.    insert <tid, measure> into the ID_measure table.
4.    for each attribute value ai of each dimension Ai
5.       build inverted-index entry <ai, tidlist>
6. For each fragment partition Pi
7.    build the local fragment cube Si by intersecting tid-lists in a bottom-up fashion.
(A small Python sketch of these steps follows.)
267
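The steps above can be illustrated with a short Python sketch over the running 5-tuple example. This is a minimal illustration under stated assumptions (only count is kept as the measure, and the fragment grouping is the (A, B, C)/(D, E) split from the example); it is not the paper's implementation.

```python
from itertools import combinations

# Toy base table from the running example (tid -> attribute values).
base_table = {
    1: {"A": "a1", "B": "b1", "C": "c1", "D": "d1", "E": "e1"},
    2: {"A": "a1", "B": "b2", "C": "c1", "D": "d2", "E": "e1"},
    3: {"A": "a1", "B": "b2", "C": "c1", "D": "d1", "E": "e2"},
    4: {"A": "a2", "B": "b1", "C": "c1", "D": "d1", "E": "e2"},
    5: {"A": "a2", "B": "b1", "C": "c1", "D": "d1", "E": "e3"},
}
fragments = [("A", "B", "C"), ("D", "E")]   # the two shell fragments

# Steps 1-5: one scan builds the 1-D inverted indices (value -> TID set).
inverted = {}
for tid, row in base_table.items():
    for dim, val in row.items():
        inverted.setdefault((dim, val), set()).add(tid)

# Steps 6-7: for each fragment, compute every cuboid by intersecting tid-lists.
fragment_cubes = {}
for frag in fragments:
    cube = {}
    for k in range(1, len(frag) + 1):
        for dims in combinations(frag, k):
            for tid, row in base_table.items():
                cell = tuple((d, row[d]) for d in dims)
                if cell not in cube:
                    # Intersect the 1-D tid-lists of the cell's attribute values.
                    tids = set(inverted[cell[0]])
                    for dv in cell[1:]:
                        tids &= inverted[dv]
                    cube[cell] = tids
    fragment_cubes[frag] = cube

print(fragment_cubes[("A", "B", "C")][(("A", "a1"), ("B", "b2"))])  # {2, 3}
```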
Frag-Shells (2)
[Figure: the dimensions A B C D E F … are partitioned into fragments such as ABC and DEF; each fragment cube (e.g., the DEF cube with its D, DE, and EF cuboids) keeps a tuple-ID list per cell]

Cell     Tuple-ID List
d1 e1    {1, 3, 8, 9}
d1 e2    {2, 4, 6, 7}
d2 e1    {5, 10}
…        …
268
Online Query Computation: Query
◼ A query has the general form ⟨a1, a2, …, an⟩ : M
◼ Each ai has 3 possible values:
  1. Instantiated value
  2. Aggregate (*) function
  3. Inquire (?) function
◼ For example, ⟨3 ? ? * 1⟩ : count returns a 2-D data cube.
269
Online Query Computation: Method
Given the fragment cubes, process a query as follows (a minimal sketch is given below):
1. Divide the query into fragments, matching the shell partition
2. Fetch the corresponding TID list for each fragment from the fragment cube
3. Intersect the TID lists from the fragments to construct the instantiated base table
4. Compute the data cube from the instantiated base table with any cubing algorithm
270
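A hedged sketch of the online phase, reusing the base_table, fragments, and fragment_cubes structures from the previous sketch. The query encoding ('*' for aggregate, '?' for inquire) and the helper name answer_query are illustrative assumptions.

```python
def answer_query(query, fragments, fragment_cubes, base_table):
    """Answer a query such as {'A': 'a1', 'B': '?', 'C': '*', ...}.
    Instantiated values restrict the TID list; '?' dimensions are kept for the
    online cube; '*' dimensions are aggregated away.  A minimal sketch only."""
    tids = set(base_table)                      # start with all tuple ids
    for frag in fragments:
        cell = tuple((d, query[d]) for d in frag
                     if query.get(d, "*") not in ("*", "?"))
        if cell:                                # fetch this fragment's TID list
            tids &= fragment_cubes[frag][cell]
    # Build the instantiated base table, keeping only the inquire dimensions.
    inquire = [d for d, v in query.items() if v == "?"]
    return [{d: base_table[t][d] for d in inquire} for t in tids]

# Example: <a1, ?, *, ?, e1> : count -> a 2-D (B, D) cube over tuples 1 and 2.
rows = answer_query({"A": "a1", "B": "?", "C": "*", "D": "?", "E": "e1"},
                    fragments, fragment_cubes, base_table)
print(rows)   # [{'B': 'b1', 'D': 'd1'}, {'B': 'b2', 'D': 'd2'}] (order may vary)
```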
Online Query Computation: Sketch
[Figure: dimensions A B C D E F G H I J K L M N …; the fragment TID lists are intersected into an instantiated base table, from which the online cube is computed]
271
Experiment: Size vs. Dimensionality (50
and 100 cardinality)
◼
◼
(50-C): 10^6 tuples, 0 skew, cardinality 50, fragment size 3.
(100-C): 10^6 tuples, 2 skew, cardinality 100, fragment size 2.
272
Experiments on Real World Data
◼
UCI Forest CoverType data set
◼
◼
◼
◼
54 dimensions, 581K tuples
Shell fragments of size 2 took 33 seconds and 325MB
to compute
3-D subqueries with 1 instantiated dimension: 85 ms ~ 1.4 sec.
Longitudinal Study of Vocational Rehab. Data
◼
◼
◼
24 dimensions, 8818 tuples
Shell fragments of size 3 took 0.9 seconds and 60MB to
compute
5-D queries with 0 instantiated dimensions: 227 ms ~ 2.6 sec.
273
Chapter 5: Data Cube Technology
◼
Data Cube Computation: Preliminary Concepts
◼
Data Cube Computation Methods
◼
Processing Advanced Queries by Exploring Data Cube
Technology
◼
Sampling Cube
◼
Ranking Cube
◼
Multidimensional Data Analysis in Cube Space
◼
Summary
274
Processing Advanced Queries by
Exploring Data Cube Technology
◼
Sampling Cube
◼
◼
Ranking Cube
◼
◼
X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling
Cube: A Framework for Statistical OLAP over
Sampling Data”, SIGMOD’08
D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k
queries with multi-dimensional selections: The
ranking cube approach. VLDB’06
Other advanced cubes for processing data and queries
◼
Stream cube, spatial cube, multimedia cube, text
cube, RFID cube, etc. — to be studied in volume 2
275
Statistical Surveys and OLAP
▪
▪
▪
▪
▪
Statistical survey: A popular tool to collect information
about a population based on a sample
▪ Ex.: TV ratings, US Census, election polls
A common tool in politics, health, market research,
science, and many more
An efficient way of collecting information (Data collection
is expensive)
Many statistical tools available, to determine validity
▪ Confidence intervals
▪ Hypothesis tests
OLAP (multidimensional analysis) on survey data
▪ highly desirable but can it be done well?
276
Surveys: Sample vs. Whole Population
Data is only a sample of population
[Table: Age (18, 19, 20, …) × Education (High-school, College, Graduate)]
277
Problems for Drilling in Multidim. Space
Data is only a sample of population but samples could be small
when drilling to certain multidimensional space
[Table: Age (18, 19, 20, …) × Education (High-school, College, Graduate)]
278
OLAP on Survey (i.e., Sampling) Data
▪
▪
Semantics of query is unchanged
Input data has changed
[Table: Age (18, 19, 20, …) × Education (High-school, College, Graduate)]
279
Challenges for OLAP on Sampling Data
▪
▪
▪
Computing confidence intervals in OLAP context
No data?
▪
Not exactly. No data in subspaces in cube
▪
Sparse data
▪
Causes include sampling bias and query
selection bias
Curse of dimensionality
▪
Survey data can be high dimensional
▪
Over 600 dimensions in real world example
▪
Impossible to fully materialize
280
Example 1: Confidence Interval
What is the average income of 19-year-old high-school students?
Return not only query result but also confidence interval
[Table: Age (18, 19, 20, …) × Education (High-school, College, Graduate), with the 19/High-school cell queried]
281
Confidence Interval
▪ Confidence interval at confidence level (1 − α): x̄ ± t_c · σ_x̄, where
  ▪ x is a sample of the data set and x̄ is the mean of the sample
  ▪ t_c is the critical t-value, calculated by a look-up
  ▪ σ_x̄ is the estimated standard error of the mean
▪ Example: $50,000 ± $3,000 with 95% confidence
  ▪ Treat the points in a cube cell as a sample
  ▪ Compute the confidence interval as for a traditional sample set
▪ Return the answer in the form of a confidence interval
  ▪ Indicates the quality of the query answer
  ▪ User selects the desired confidence level
282
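A small sketch of the confidence-interval computation for one cube cell, assuming SciPy is available for the critical t-value look-up; the income figures are made up for illustration.

```python
import math
from scipy import stats   # t-distribution look-up

def cell_confidence_interval(samples, confidence=0.95):
    """Mean and half-width of the interval x_bar +/- t_c * s / sqrt(l)
    for the points stored in one cube cell.  Minimal sketch only."""
    l = len(samples)
    x_bar = sum(samples) / l
    s = math.sqrt(sum((x - x_bar) ** 2 for x in samples) / (l - 1))
    t_c = stats.t.ppf(1 - (1 - confidence) / 2, df=l - 1)   # two-sided critical value
    return x_bar, t_c * s / math.sqrt(l)

# Hypothetical incomes of 19-year-old high-school students in one cell:
mean, hw = cell_confidence_interval([48_000, 52_000, 47_500, 53_000, 49_500])
print(f"${mean:,.0f} +/- ${hw:,.0f} at 95% confidence")
```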
Efficient Computing Confidence Interval Measures
▪ Efficient computation in all cells of the data cube
▪ Both the mean and the confidence interval are algebraic
▪ Why is the confidence interval measure algebraic? σ_x̄ = s / √l is algebraic, where both s (the sample standard deviation) and l (the count) are algebraic
▪ Thus one can calculate cells efficiently at more general cuboids without having to start at the base cuboid each time (see the sketch below)
283
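A sketch of why the measure is algebraic: keeping (count, sum, sum of squares) per base cell is enough to roll cells up and recompute the mean and standard error at more general cuboids without touching the base data. The function names and example numbers are assumptions.

```python
import math

def cell_stats(samples):
    """Distributive building blocks kept per base cell: (l, sum, sum of squares)."""
    return (len(samples), sum(samples), sum(x * x for x in samples))

def merge(*stats_list):
    """Roll several cells up into a more general cuboid cell."""
    return tuple(sum(s[i] for s in stats_list) for i in range(3))

def mean_and_stderr(stats_tuple):
    """Mean and estimated standard error s / sqrt(l) from the algebraic pieces."""
    l, sx, sxx = stats_tuple
    mean = sx / l
    var = (sxx - l * mean * mean) / (l - 1)     # sample variance
    return mean, math.sqrt(var) / math.sqrt(l)

# Two sibling cells (hypothetical incomes in two states) rolled up to (*):
cell_a = cell_stats([41_000, 43_500, 39_000])
cell_b = cell_stats([45_000, 44_000])
print(mean_and_stderr(merge(cell_a, cell_b)))
```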
Example 2: Query Expansion
What is the average income of 19-year-old college students?
[Table: Age (18, 19, 20, …) × Education (High-school, College, Graduate), with the 19/College cell queried]
284
Boosting Confidence by Query Expansion
▪
▪
▪
From the example: The queried cell “19-year-old college
students” contains only 2 samples
Confidence interval is large (i.e., low confidence). Why?
▪ Small sample size
▪ High standard deviation within the samples
Small sample sizes can occur at relatively low dimensional
selections
▪
Collect more data?― expensive!
▪
Use data in other cells? Maybe, but have to be careful
285
Intra-Cuboid Expansion: Choice 1
Expand query to include 18 and 20 year olds?
[Table: Age × Education, expanding the 19/College cell toward the 18 and 20 rows]
286
Intra-Cuboid Expansion: Choice 2
Expand query to include high-school and graduate students?
[Table: Age × Education, expanding the 19/College cell toward the High-school and Graduate columns]
287
Query Expansion
288
Intra-Cuboid Expansion
Combine other cells' data into one's own to "boost" confidence
▪ Only if the cells share semantic and cube similarity
▪ Use only when necessary
▪ A bigger sample size will decrease the confidence interval
◼ Cell segment similarity
  ◼ Some dimensions are clear: Age
  ◼ Some are fuzzy: Occupation
  ◼ May need domain knowledge
◼ Cell value similarity
  ◼ How to determine whether two cells' samples come from the same population?
  ◼ Two-sample t-test (confidence-based); a small sketch follows
289
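A minimal sketch of the confidence-based test, assuming SciPy's two-sample t-test (the Welch variant); the sample values and the 0.05 significance level are illustrative assumptions, not part of the original method.

```python
from scipy import stats

def can_expand(target_samples, candidate_samples, alpha=0.05):
    """Pool a neighboring cell's samples with the queried cell's only if the
    two-sample t-test cannot reject that they come from the same population."""
    t_stat, p_value = stats.ttest_ind(target_samples, candidate_samples,
                                      equal_var=False)   # Welch's t-test
    return p_value > alpha

queried   = [41_000, 52_000]                    # 19-year-old college students (2 samples)
candidate = [43_000, 47_500, 50_000, 45_500]    # hypothetical neighboring cell
if can_expand(queried, candidate):
    pooled = queried + candidate                # boosts the sample size
    print("expanded cell size:", len(pooled))
```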
Inter-Cuboid Expansion
If a query dimension is
▪ not correlated with the cube value, but
▪ is causing a small sample size by drilling down too much,
then remove the dimension (i.e., generalize it to *) and move to a more general cuboid
▪ Can use the two-sample t-test to determine similarity between two cells across cuboids
▪ Can also use a different method, to be shown later
290
Query Expansion Experiments
▪
▪
Real world sample data: 600 dimensions and
750,000 tuples
0.05% to simulate “sample” (allows error checking)
291
Chapter 5: Data Cube Technology
◼
Data Cube Computation: Preliminary Concepts
◼
Data Cube Computation Methods
◼
Processing Advanced Queries by Exploring Data Cube
Technology
◼
Sampling Cube
◼
Ranking Cube
◼
Multidimensional Data Analysis in Cube Space
◼
Summary
292
Ranking Cubes – Efficient Computation of
Ranking queries
◼
◼
◼
Data cube helps not only OLAP but also ranked search
A (top-k) ranking query returns only the best k results according to a user-specified preference, consisting of (1) a selection condition and (2) a ranking function
Ex.: Search for apartments with expected price 1000 and
expected square feet 800
◼
◼
◼
◼
Select top 1 from Apartment
where City = “LA” and Num_Bedroom = 2
order by [price – 1000]^2 + [sq feet - 800]^2 asc
Efficiency question: Can we only search what we need?
◼ Build a ranking cube on both selection dimensions and
ranking dimensions
293
Ranking Cube: Partition Data on Both
Selection and Ranking Dimensions
One single data
partition as the template
Partition for
all data
Slice the data partition
by selection conditions
Sliced Partition
for city=“LA”
Sliced Partition
for BR=2
294
Materialize Ranking-Cube
Step 1: Partition data on ranking dimensions
Step 2: Group data by selection dimensions
Step 3: Compute measures for each group

tid   City   BR   Price   Sq feet   Block ID
t1    SEA    1    500     600       5
t2    CLE    2    700     800       5
t3    SEA    1    800     900       2
t4    CLE    3    1000    1000      6
t5    LA     1    1100    200       15
t6    LA     2    1200    500       11
t7    LA     2    1200    560       11
t8    CLE    3    1350    1120      4

[Figure: the (Price, Sq feet) space is partitioned into a 4×4 grid of blocks numbered 1–16; the selection dimensions are City and City & BR (BR = 1, 2, 3, 4)]

For the cell (LA):
  Block-level measure: {11, 15}
  Data-level measure: {11: t6, t7; 15: t5}
295
Search with Ranking-Cube:
Simultaneously Push Selection and Ranking
Select top 1 from Apartment
where city = “LA”
order by [price – 1000]^2 + [sq feet - 800]^2 asc
Bin boundaries for price: [500, 600, 800, 1100, 1350]
Bin boundaries for sq feet: [200, 400, 600, 800, 1120]
Given the bin boundaries, locate the block with the top score
[Figure: the query point (price = 1000, sq feet = 800) lies next to blocks 11 and 15; without the ranking cube the search starts from the whole partition, with the ranking cube it starts from block 11]
Measure for LA: {11, 15}; {11: t6, t7; 15: t5}
296
Processing Ranking Query: Execution Trace
Select top 1 from Apartment
where city = “LA”
order by [price – 1000]^2 + [sq feet - 800]^2 asc
Bin boundary for price
[500, 600, 800, 1100,1350]
Bin boundary for sq feet
[200, 400, 600, 800, 1120]
f = [price − 1000]^2 + [sq feet − 800]^2
Execution trace:
1. Retrieve the high-level measure for LA: {11, 15}
2. Estimate the lower-bound score for blocks 11 and 15: f(block 11) = 40,000, f(block 15) = 160,000
3. Retrieve block 11
4. Retrieve the low-level measure for block 11: {t6, t7}
5. f(t6) = 130,000, f(t7) = 97,600
Output t7, done!
(Measure for LA: {11, 15}; {11: t6, t7; 15: t5}. With the ranking cube, the search starts directly from block 11; a small sketch follows.)
297
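A sketch of the block-pruning idea behind the trace above. The block geometry is a hypothetical reading of the bin boundaries, so the computed lower bounds (50,000 and 170,000) differ from the slide's 40,000 and 160,000, but the visiting order and the final answer (t7 with score 97,600) match.

```python
def block_lower_bound(block, query):
    """Lower bound of f over a grid block: clamp the query point into the
    block's bin intervals and score the closest point."""
    bound = 0.0
    for dim, (lo, hi) in block.items():
        nearest = min(max(query[dim], lo), hi)
        bound += (query[dim] - nearest) ** 2
    return bound

def f(price, sqft):
    return (price - 1000) ** 2 + (sqft - 800) ** 2

# Hypothetical geometry of LA's two blocks, read off the bin boundaries.
blocks = {11: {"price": (1100, 1350), "sqft": (400, 600)},
          15: {"price": (1100, 1350), "sqft": (200, 400)}}
tuples_in_block = {11: {"t6": (1200, 500), "t7": (1200, 560)},
                   15: {"t5": (1100, 200)}}
query = {"price": 1000, "sqft": 800}

# Visit LA's blocks in order of increasing lower bound; block 11 comes first.
order = sorted(blocks, key=lambda b: block_lower_bound(blocks[b], query))
best_tid, best_score = None, float("inf")
for tid, (price, sqft) in tuples_in_block[order[0]].items():
    score = f(price, sqft)
    if score < best_score:
        best_tid, best_score = tid, score
# 97,600 is below the next block's lower bound, so the search can stop here.
print(best_tid, best_score)   # t7 97600, matching the trace's final answer
```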
Ranking Cube: Methodology and Extension
◼
◼
Ranking cube methodology
◼
Push selection and ranking simultaneously
◼
It works for many sophisticated ranking functions
How to support high-dimensional data?
◼
Materialize only those atomic cuboids that contain
single selection dimensions
◼
◼
Uses the idea similar to high-dimensional OLAP
Achieves low space overhead and high
performance in answering ranking queries with a
high number of selection dimensions
298
Chapter 5: Data Cube Technology
◼
Data Cube Computation: Preliminary Concepts
◼
Data Cube Computation Methods
◼
Processing Advanced Queries by Exploring Data
Cube Technology
◼
Multidimensional Data Analysis in Cube Space
◼
Summary
299
Multidimensional Data Analysis in
Cube Space
◼
Prediction Cubes: Data Mining in MultiDimensional Cube Space
◼
Multi-Feature Cubes: Complex Aggregation at
Multiple Granularities
◼
Discovery-Driven Exploration of Data Cubes
300
Data Mining in Cube Space
◼
◼
Data cube greatly increases the analysis bandwidth
Four ways to combine OLAP-style analysis and data mining
◼ Using cube space to define data space for mining
◼ Using OLAP queries to generate features and targets for
mining, e.g., multi-feature cube
◼ Using data-mining models as building blocks in a multistep mining process, e.g., prediction cube
◼ Using data-cube computation techniques to speed up
repeated model construction
◼ Cube-space data mining may require building a
model for each candidate data space
◼ Sharing computation across model-construction for
different candidates may lead to efficient mining
301
Prediction Cubes
◼
◼
Prediction cube: A cube structure that stores prediction
models in multidimensional data space and supports
prediction in OLAP manner
Prediction models are used as building blocks to define
the interestingness of subsets of data, i.e., to answer
which subsets of data indicate better prediction
302
How to Determine the Prediction Power
of an Attribute?
◼
◼
◼
Ex. A customer table D:
◼ Two dimensions Z: Time (Month, Year) and Location (State, Country)
◼ Two features X: Gender and Salary
◼ One class-label attribute Y: Valued Customer
Q: "Are there times and locations in which the value of a customer depended greatly on the customer's gender (i.e., Gender is the predictiveness attribute V)?"
Idea:
◼ Compute the difference between the model built using X to predict Y and the model built using X − V to predict Y (see the sketch below)
◼ If the difference is large, V must play an important role in predicting Y
303
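A sketch of the model-difference idea, assuming scikit-learn is available; the tiny customer table, the numeric encoding of gender and salary, and the choice of a decision tree with 2-fold cross-validation are all illustrative assumptions rather than the prediction-cube authors' setup.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical rows of customer table D restricted to one (time, location) cell.
# Features X = [gender, salary_band]; class label Y = valued customer.
X = [[0, 1], [0, 2], [1, 1], [1, 3], [0, 3], [1, 2], [0, 2], [1, 1]]
y = [0, 0, 1, 1, 0, 1, 0, 1]

def predictiveness(X, y, drop_col):
    """Accuracy with all features X minus accuracy with X - V (column dropped)."""
    full = cross_val_score(DecisionTreeClassifier(), X, y, cv=2).mean()
    X_minus_v = [[v for i, v in enumerate(row) if i != drop_col] for row in X]
    reduced = cross_val_score(DecisionTreeClassifier(), X_minus_v, y, cv=2).mean()
    return full - reduced

# A large positive difference suggests gender (column 0) matters in this cell.
print(predictiveness(X, y, drop_col=0))
```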
Efficient Computation of Prediction Cubes
◼
◼
Naïve method: Fully materialize the prediction
cube, i.e., exhaustively build models and evaluate
them for each cell and for each granularity
Better approach: Explore score function
decomposition that reduces prediction cube
computation to data cube computation
304
Multidimensional Data Analysis in
Cube Space
◼
Prediction Cubes: Data Mining in MultiDimensional Cube Space
◼
Multi-Feature Cubes: Complex Aggregation at
Multiple Granularities
◼
Discovery-Driven Exploration of Data Cubes
305
Complex Aggregation at Multiple
Granularities: Multi-Feature Cubes
◼
◼
◼
Multi-feature cubes (Ross, et al. 1998): Compute complex queries
involving multiple dependent aggregates at multiple granularities
Ex. Grouping by all subsets of {item, region, month}, find the
maximum price in 2010 for each group, and the total sales among
all maximum price tuples
select item, region, month, max(price), sum(R.sales)
from purchases
where year = 2010
cube by item, region, month: R
such that R.price = max(price)
Continuing the last example: among the max-price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have the min shelf life within the set of all max-price tuples
306
Multidimensional Data Analysis in
Cube Space
◼
Prediction Cubes: Data Mining in MultiDimensional Cube Space
◼
Multi-Feature Cubes: Complex Aggregation at
Multiple Granularities
◼
Discovery-Driven Exploration of Data Cubes
307
Discovery-Driven Exploration of Data Cubes
◼
Hypothesis-driven
◼
◼
exploration by user, huge search space
Discovery-driven (Sarawagi, et al.’98)
◼
◼
◼
◼
Effective navigation of large OLAP data cubes
pre-compute measures indicating exceptions, guide
user in the data analysis, at all levels of aggregation
Exception: significantly different from the value
anticipated, based on a statistical model
Visual cues such as background color are used to
reflect the degree of exception of each cell
308
Kinds of Exceptions and their Computation
◼
Parameters
◼
◼
◼
◼
◼
SelfExp: surprise of cell relative to other cells at same
level of aggregation
InExp: surprise beneath the cell
PathExp: surprise beneath cell for each drill-down
path
Computation of the exception indicators (model fitting and computing the SelfExp, InExp, and PathExp values) can be overlapped with cube construction
Exceptions themselves can be stored, indexed, and retrieved like precomputed aggregates
309
Examples: Discovery-Driven Data Cubes
310
Chapter 5: Data Cube Technology
◼
Data Cube Computation: Preliminary Concepts
◼
Data Cube Computation Methods
◼
Processing Advanced Queries by Exploring Data
Cube Technology
◼
Multidimensional Data Analysis in Cube Space
◼
Summary
311
Data Cube Technology: Summary
◼
Data Cube Computation: Preliminary Concepts
◼
Data Cube Computation Methods
◼
◼
◼
MultiWay Array Aggregation
◼
BUC
◼
Star-Cubing
◼
High-Dimensional OLAP with Shell-Fragments
Processing Advanced Queries by Exploring Data Cube Technology
◼
Sampling Cubes
◼
Ranking Cubes
Multidimensional Data Analysis in Cube Space
◼
Discovery-Driven Exploration of Data Cubes
◼
Multi-feature Cubes
◼
Prediction Cubes
312
Ref.(I) Data Cube Computation Methods
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the
computation of multidimensional aggregates. VLDB’96
D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97
K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs.. SIGMOD’99
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently.
VLDB’98
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube:
A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge
Discovery, 1:29–54, 1997.
J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures. SIGMOD’01
L. V. S. Lakshmanan, J. Pei, and J. Han, Quotient Cube: How to Summarize the Semantics of a Data Cube,
VLDB'02
X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP: A Minimal Cubing Approach, VLDB'04
Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional
aggregates. SIGMOD’97
K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB’97
D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration,
VLDB'03
D. Xin, J. Han, Z. Shao, H. Liu, C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking,
ICDE'06
313
Ref. (II) Advanced Applications with Data Cubes
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
D. Burdick, P. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over
uncertain and imprecise data. VLDB’05
X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling Cube: A Framework for Statistical OLAP over
Sampling Data”, SIGMOD’08
C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. Text Cube: Computing IR measures for
multidimensional text database analysis. ICDM’08
D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP operations in spatial data
warehouses. SSTD’01
N. Stefanovic, J. Han, and K. Koperski. Object-based selective materialization for efficient
implementation of spatial data cubes. IEEE Trans. Knowledge and Data Engineering, 12:938–
958, 2000.
T. Wu, D. Xin, Q. Mei, and J. Han. Promotion analysis in multidimensional space. VLDB’09
T. Wu, D. Xin, and J. Han. ARCube: Supporting ranking aggregate queries in partially materialized
data cubes. SIGMOD’08
D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k queries with multi-dimensional selections:
The ranking cube approach. VLDB’06
J. S. Vitter, M. Wang, and B. R. Iyer. Data cube approximation and histograms via wavelets.
CIKM’98
D. Zhang, C. Zhai, and J. Han. Topic cube: Topic modeling for OLAP on multi-dimensional text
databases. SDM’09
314
Ref. (III) Knowledge Discovery with Data Cubes
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
B.-C. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. Prediction cubes. VLDB’05
B.-C. Chen, R. Ramakrishnan, J.W. Shavlik, and P. Tamma. Bellwether analysis: Predicting global
aggregates from local regions. VLDB’06
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-Dimensional Regression Analysis of
Time-Series Data Streams, VLDB'02
G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained Gradients in Data
Cubes. VLDB’ 01
R. Fagin, R. V. Guha, R. Kumar, J. Novak, D. Sivakumar, and A. Tomkins. Multi-structural
databases. PODS’05
J. Han. Towards on-line analytical mining in large databases. SIGMOD Record, 27:97–107, 1998
T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Data
Mining & Knowledge Discovery, 6:219–258, 2002.
R. Ramakrishnan and B.-C. Chen. Exploratory mining in cube space. Data Mining and Knowledge
Discovery, 15:29–54, 2007.
K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities.
EDBT'98
S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes.
EDBT'98
G. Sathe and S. Sarawagi. Intelligent Rollups in Multidimensional OLAP Data. VLDB'01
315
Surplus Slides
316
Chapter 5: Data Cube Technology
◼
◼
◼
◼
Efficient Methods for Data Cube Computation
◼
Preliminary Concepts and General Strategies for Cube Computation
◼
Multiway Array Aggregation for Full Cube Computation
◼
BUC: Computing Iceberg Cubes from the Apex Cuboid Downward
◼
H-Cubing: Exploring an H-Tree Structure
◼
Star-cubing: Computing Iceberg Cubes Using a Dynamic Star-tree
Structure
◼
Precomputing Shell Fragments for Fast High-Dimensional OLAP
Data Cubes for Advanced Applications
◼
Sampling Cubes: OLAP on Sampling Data
◼
Ranking Cubes: Efficient Computation of Ranking Queries
Knowledge Discovery with Data Cubes
◼
Discovery-Driven Exploration of Data Cubes
◼
Complex Aggregation at Multiple Granularity: Multi-feature Cubes
◼
Prediction Cubes: Data Mining in Multi-Dimensional Cube Space
Summary
317
H-Cubing: Using H-Tree Structure
all
◼
◼
◼
◼
Bottom-up computation
Exploring an H-tree
structure
If the current
computation of an H-tree
cannot pass min_sup, do
not proceed further
(pruning)
[Figure: cuboid lattice from all down to ABCD: A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD]
No simultaneous
aggregation
318
H-tree: A Prefix Hyper-tree
Header table: attribute values (Edu, Hhd, Bus, …; Jan, Feb, …; Tor, Van, Mon, …), each with its quant-info (e.g., Sum: 2285 …) and a side-link into the tree

Base table:
Month   City   Cust_grp   Prod      Cost   Price
Jan     Tor    Edu        Printer   500    485
Jan     Tor    Hhd        TV        800    1200
Jan     Tor    Edu        Camera    1160   1280
Feb     Mon    Bus        Laptop    1500   2500
Mar     Van    Edu        HD        540    520

[Figure: H-tree rooted at root with branches edu → Jan → Tor, edu → Mar → Van, hhd → Jan → Tor, bus → Feb → Mon; each node stores quant-info (e.g., Sum: 1765, Cnt: 2) and bins]
319
Computing Cells Involving “City”
[Figure: the global header table and a local header table H_Tor built by following the Tor side-links; computing cells involving City traverses the side-links, e.g., rolling up from (*, *, Tor) to (*, Jan, Tor)]
320
Computing Cells Involving Month But No City
1. Roll up quant-info
2. Compute cells involving month but no city
[Figure: header table with side-links; the quant-info of city nodes is rolled up into their month parents]
Top-k OK mark: if the Q.I. in a child passes the top-k avg threshold, so does its parent. No binning is needed!
321
Computing Cells Involving Only Cust_grp
Check the header table directly
[Figure: the H-tree with header table and side-links; the rolled-up quant-info in the header-table entries (edu, hhd, bus) directly gives the cells involving only Cust_grp]
322
Data Mining:
Concepts and
Techniques
(3rd ed.)
— Chapter 6 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
323
Chapter 5: Mining Frequent Patterns,
Association and Correlations: Basic
Concepts and Methods
◼ Basic Concepts
◼ Frequent Itemset Mining Methods
◼ Which Patterns Are Interesting?—Pattern
Evaluation Methods
◼ Summary
324
What Is Frequent Pattern
Analysis?
◼
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
◼
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
◼
◼
Motivation: Finding inherent regularities in data
◼
What products were often purchased together?— Beer and diapers?!
◼
What are the subsequent purchases after buying a PC?
◼
What kinds of DNA are sensitive to this new drug?
◼
Can we automatically classify web documents?
Applications
◼
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
325
Why Is Freq. Pattern Mining
Important?
◼
◼
Freq. pattern: An intrinsic and important property of
datasets
Foundation for many essential data mining tasks
◼ Association, correlation, and causality analysis
◼ Sequential, structural (e.g., sub-graph) patterns
◼ Pattern analysis in spatiotemporal, multimedia, timeseries, and stream data
◼ Classification: discriminative, frequent pattern analysis
◼ Cluster analysis: frequent pattern-based clustering
◼ Data warehousing: iceberg cube and cube-gradient
◼ Semantic data compression: fascicles
◼ Broad applications
326
Basic Concepts: Frequent
Patterns
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

◼ itemset: a set of one or more items
◼ k-itemset: X = {x1, …, xk}
◼ (absolute) support, or support count, of X: frequency or number of occurrences of itemset X
◼ (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
◼ An itemset X is frequent if X's support is no less than a minsup threshold
327
Basic Concepts: Association Rules
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

◼ Find all the rules X → Y with minimum support and confidence
  ◼ support, s: probability that a transaction contains X ∪ Y
  ◼ confidence, c: conditional probability that a transaction having X also contains Y
◼ Let minsup = 50%, minconf = 50%
  Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
◼ Association rules (many more!), as computed in the sketch below:
  ◼ Beer → Diaper (60%, 100%)
  ◼ Diaper → Beer (60%, 75%)
328
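A few lines of Python reproducing the support and confidence numbers quoted above for the 5-transaction example.

```python
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support(itemset):
    """Relative support: fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of lhs -> rhs = support(lhs union rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))        # 0.6  (60%)
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  (100%)
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 (75%)
```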
Closed Patterns and MaxPatterns
◼ A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
◼ Solution: mine closed patterns and max-patterns instead
◼ An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
◼ An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
◼ Closed patterns are a lossless compression of frequent patterns
  ◼ Reduces the # of patterns and rules
329
Closed Patterns and MaxPatterns
◼ Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, Min_sup = 1.
◼ What is the set of closed itemsets?
  ◼ <a1, …, a100>: 1
  ◼ <a1, …, a50>: 2
◼ What is the set of max-patterns?
  ◼ <a1, …, a100>: 1
◼ What is the set of all patterns?
  ◼ !!
330
Computational Complexity of Frequent
Itemset Mining
◼ How many itemsets are potentially generated in the worst case?
  ◼ The number of frequent itemsets to be generated is sensitive to the minsup threshold
  ◼ When minsup is low, there exist potentially an exponential number of frequent itemsets
  ◼ The worst case: M^N, where M is the # of distinct items and N is the max length of transactions
◼ The worst-case complexity vs. the expected probability
  ◼ Ex. Suppose Walmart has 10^4 kinds of products
    ◼ The chance of picking up one particular product: 10^-4
    ◼ The chance of picking up a particular set of 10 products: ~10^-40
    ◼ What is the chance that this particular set of 10 products is frequent 10^3 times in 10^9 transactions?
331
Chapter 5: Mining Frequent Patterns,
Association and Correlations: Basic
Concepts and Methods
◼ Basic Concepts
◼ Frequent Itemset Mining Methods
◼ Which Patterns Are Interesting?—Pattern
Evaluation Methods
◼ Summary
332
Scalable Frequent Itemset Mining
Methods
◼
Apriori: A Candidate Generation-and-Test
Approach
◼
Improving the Efficiency of Apriori
◼
FPGrowth: A Frequent Pattern-Growth Approach
◼
ECLAT: Frequent Pattern Mining with Vertical Data
Format
333
The Downward Closure Property and
Scalable Mining Methods
◼
◼
The downward closure property of frequent patterns
◼ Any subset of a frequent itemset must be frequent
◼ If {beer, diaper, nuts} is frequent, so is {beer,
diaper}
◼ i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
Scalable mining methods: Three major approaches
◼ Apriori (Agrawal & Srikant@VLDB’94)
◼ Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00)
◼ Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
334
Apriori: A Candidate Generation & Test
Approach
◼
◼
Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
Method:
◼
◼
◼
◼
Initially, scan DB once to get frequent 1-itemset
Generate length (k+1) candidate itemsets from length k
frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be
generated
335
The Apriori Algorithm—An Example
Database TDB (Supmin = 2)
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1 (after 1st scan): {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3
C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
C2 counts (after 2nd scan): {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3: {B,C,E}
L3 (after 3rd scan): {B,C,E}:2
336
The Apriori Algorithm (PseudoCode)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
337
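A compact, runnable rendering of the pseudo-code above, using Python frozensets; the self-join and downward-closure pruning follow the candidate-generation scheme described on the next slide. It is a sketch, not an optimized implementation (no hash tree for support counting).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: generate Ck+1 from Lk, count, keep the frequent ones."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: s for c, s in counts.items() if s / n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    L, all_frequent, k = frequent(items), {}, 1        # L1 = frequent 1-itemsets
    while L:
        all_frequent.update(L)
        # Self-join Lk, then prune candidates with an infrequent k-subset.
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        L = frequent(candidates)
        k += 1
    return all_frequent

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(db, 0.5).items(),
                             key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), count)        # ends with {'B', 'C', 'E'} 2
```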
Implementation of Apriori
◼ How to generate candidates?
  ◼ Step 1: self-joining Lk
  ◼ Step 2: pruning
◼ Example of candidate generation
  ◼ L3 = {abc, abd, acd, ace, bcd}
  ◼ Self-joining: L3 * L3
    ◼ abcd from abc and abd
    ◼ acde from acd and ace
  ◼ Pruning:
    ◼ acde is removed because ade is not in L3
  ◼ C4 = {abcd}
338
How to Count Supports of Candidates?
◼
Why counting supports of candidates a problem?
◼
◼
◼
The total number of candidates can be very huge
One transaction may contain many candidates
Method:
◼
Candidate itemsets are stored in a hash-tree
◼
Leaf node of hash-tree contains a list of itemsets and
counts
◼
◼
Interior node contains a hash table
Subset function: finds all the candidates contained in
a transaction
339
Counting Supports of Candidates Using Hash
Tree
[Figure: a hash tree over the candidate 3-itemsets (e.g., 234, 567, 145, 136, 345, 356, 357, 367, 368, 689, 124, 125, 457, 458, 159), hashing items by 1,4,7 / 2,5,8 / 3,6,9; the subset function recursively splits transaction 1 2 3 5 6 (1+2356, 12+356, 13+56, …) to reach only the leaves whose candidates it may contain]
340
Candidate Generation: An SQL
Implementation
◼
SQL Implementation of candidate generation
◼
Suppose the items in Lk-1 are listed in an order
◼
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 <
q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
Use object-relational extensions like UDFs, BLOBs, and Table functions for
efficient implementation [See: S. Sarawagi, S. Thomas, and R. Agrawal.
Integrating association rule mining with relational database systems:
Alternatives and implications. SIGMOD’98]
◼
◼
341
Scalable Frequent Itemset Mining
Methods
◼
Apriori: A Candidate Generation-and-Test Approach
◼
Improving the Efficiency of Apriori
◼
FPGrowth: A Frequent Pattern-Growth Approach
◼
ECLAT: Frequent Pattern Mining with Vertical Data
Format
◼
Mining Close Frequent Patterns and Maxpatterns
342
Further Improvement of the Apriori Method
◼
◼
Major computational challenges
◼
Multiple scans of transaction database
◼
Huge number of candidates
◼
Tedious workload of support counting for candidates
Improving Apriori: general ideas
◼
Reduce passes of transaction database scans
◼
Shrink number of candidates
◼
Facilitate support counting of candidates
343
Partition: Scan Database Only
Twice
◼
◼
Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
◼ Scan 1: partition database and find local frequent
patterns
◼ Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski and S. Navathe, VLDB’95
DB1 (sup1(i) < σ|DB1|)  +  DB2 (sup2(i) < σ|DB2|)  +  …  +  DBk (supk(i) < σ|DBk|)  ⇒  DB (sup(i) < σ|DB|)
(If an itemset is locally infrequent in every partition, it is globally infrequent in DB.)
DHP: Reduce the Number of Candidates
A k-itemset whose corresponding hashing-bucket count is below the threshold cannot be frequent
◼ Candidates: a, b, c, d, e
◼ Hash entries (buckets of 2-itemsets with their counts, e.g., 35, 88, 102, …):
  ◼ {ab, ad, ae}
  ◼ {bd, be, de}
  ◼ …
  ◼ {yz, qs, wt}
◼ Frequent 1-itemsets: a, b, d, e
◼ ab is not a candidate 2-itemset if the count of the bucket containing {ab, ad, ae} is below the support threshold
◼ J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95
345
Sampling for Frequent Patterns
◼
Select a sample of original database, mine frequent
patterns within sample using Apriori
◼
Scan database once to verify frequent itemsets found in
sample, only borders of closure of frequent patterns are
checked
◼
Example: check abcd instead of ab, ac, …, etc.
◼
Scan database again to find missed frequent patterns
◼
H. Toivonen. Sampling large databases for association
rules. In VLDB’96
346
DIC: Reduce Number of Scans
[Figure: itemset lattice from {} up through A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD]
◼ Once both A and D are determined frequent, the counting of AD begins
◼ Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: over the stream of transactions, Apriori counts 1-itemsets, then 2-itemsets, then …, while DIC starts counting 2-itemsets and 3-itemsets earlier during the same scans]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97
347
Scalable Frequent Itemset Mining
Methods
◼
Apriori: A Candidate Generation-and-Test Approach
◼
Improving the Efficiency of Apriori
◼
FPGrowth: A Frequent Pattern-Growth Approach
◼
ECLAT: Frequent Pattern Mining with Vertical Data
Format
◼
Mining Close Frequent Patterns and Maxpatterns
348
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
◼
Bottlenecks of the Apriori approach
◼
Breadth-first (i.e., level-wise) search
◼
Candidate generation and test
◼
◼
◼
Often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
◼
Depth-first search
◼
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using local
frequent items only
◼
“abc” is a frequent pattern
◼
Get all transactions having “abc”, i.e., project DB on abc: DB|abc
◼
“d” is a local frequent item in DB|abc → abcd is a frequent pattern
349
Construct FP-tree from a Transaction
Database
TID   Items bought                 (ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o, w}           {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}
min_support = 3

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order → F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree

Header table (item : frequency, each with a head of node-links): f:4, c:4, a:3, b:3, m:3, p:3
[Figure: FP-tree rooted at {} with branches f:4–c:3–a:3–m:2–p:2, a:3–b:1–m:1, f:4–b:1, and c:1–b:1–p:1]
(A small construction sketch in Python follows.)
350
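A minimal FP-tree construction sketch in Python for the example above. The header table is kept as simple lists of node-links; tie-breaking among equally frequent items may order the F-list slightly differently than the slide's f-c-a-b-m-p, but the tree-building logic is the same two-scan procedure.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    """Two scans: count items, then insert each transaction in F-list order."""
    counts = defaultdict(int)
    for t in transactions:                                  # scan 1
        for item in t:
            counts[item] += 1
    flist = [i for i, c in sorted(counts.items(), key=lambda x: -x[1])
             if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}

    root, header = FPNode(None, None), defaultdict(list)
    for t in transactions:                                  # scan 2
        ordered = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])    # node-link
            node = node.children[item]
            node.count += 1
    return root, header, flist

db = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
root, header, flist = build_fp_tree([set(t) for t in db], min_support=3)
print(flist)                                    # f and c first; ties may reorder
print([(n.item, n.count) for n in header["c"]]) # e.g., [('c', 3), ('c', 1)]
```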
Partition Patterns and Databases
◼
◼
Frequent patterns can be partitioned into subsets
according to f-list
◼ F-list = f-c-a-b-m-p
◼ Patterns containing p
◼ Patterns having m but no p
◼ …
◼ Patterns having c but no a nor b, m, p
◼ Pattern f
Completeness and non-redundency
351
Find Patterns Having P From P-conditional
Database
◼
◼
◼
Starting at the frequent item header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item p
Accumulate all of transformed prefix paths of item p to form p’s
conditional pattern base
[Figure: the FP-tree with its header table (f:4, c:4, a:3, b:3, m:3, p:3) and node-links]

Conditional pattern bases:
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
352
From Conditional Pattern-bases to Conditional
FP-trees
◼
For each pattern-base
◼ Accumulate the count for each item in the base
◼ Construct the FP-tree for the frequent items of the
pattern base
[Figure: the FP-tree with its header table]

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
353
Recursion: Mining Each Conditional FPtree
Starting from the m-conditional FP-tree ({} → f:3 → c:3 → a:3):
◼ Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
◼ Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
◼ Cond. pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3
354
A Special Case: Single Prefix Path in FPtree
◼ Suppose a (conditional) FP-tree T has a shared single prefix path P
◼ Mining can be decomposed into two parts
  ◼ Reduction of the single prefix path into one node
  ◼ Concatenation of the mining results of the two parts
[Figure: an FP-tree whose top is the single path {} → a1:n1 → a2:n2 → a3:n3 and whose lower part branches into b1:m1, C1:k1, C2:k2, C3:k3; it is decomposed into the single-path part and the branching part rooted at r1]
355
Benefits of the FP-tree Structure
◼
Completeness
◼
◼
◼
Preserve complete information for frequent pattern
mining
Never break a long pattern of any transaction
Compactness
◼
◼
◼
Reduce irrelevant info—infrequent items are gone
Items in frequency descending order: the more
frequently occurring, the more likely to be shared
Never be larger than the original database (not count
node-links and the count field)
356
The Frequent Pattern Growth Mining
Method
◼
◼
Idea: Frequent pattern growth
◼ Recursively grow frequent patterns by pattern and
database partition
Method
◼ For each frequent item, construct its conditional
pattern-base, and then its conditional FP-tree
◼ Repeat the process on each newly created conditional
FP-tree
◼ Until the resulting FP-tree is empty, or it contains only
one path—single path will generate all the
combinations of its sub-paths, each of which is a
frequent pattern
357
Scaling FP-growth by Database
Projection
◼
What about if FP-tree cannot fit in memory?
◼
DB projection
◼
First partition a database into a set of projected DBs
◼
Then construct and mine FP-tree for each projected DB
◼
Parallel projection vs. partition projection techniques
◼
◼
Parallel projection
◼
Project the DB in parallel for each frequent item
◼
Parallel projection is space costly
◼
All the partitions can be processed in parallel
Partition projection
◼
Partition the DB based on the ordered frequent items
◼
Passing the unprocessed parts to the subsequent partitions
358
Partition-Based Projection
◼
◼ Parallel projection needs a lot of disk space
◼ Partition projection saves it

Tran. DB: fcamp, fcabm, fb, cbp, fcamp
p-proj DB: fcam, cb, fcam       m-proj DB: fcab, fca, fca       b-proj DB: f, cb, …
a-proj DB: fc, …                c-proj DB: f, …                 f-proj DB: …
am-proj DB: fc, fc, fc          cm-proj DB: f, f, f             …
359
Performance of FPGrowth in Large
Datasets
[Figure (left): runtime (sec.) vs. support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime against D1 Apriori runtime: FP-Growth vs. Apriori]
[Figure (right): runtime (sec.) vs. support threshold (%) on data set T25I20D100K, comparing D2 FP-growth against D2 TreeProjection: FP-Growth vs. Tree-Projection]
360
Advantages of the Pattern Growth
Approach
◼
Divide-and-conquer:
◼
◼
◼
Lead to focused search of smaller databases
Other factors
◼
No candidate generation, no candidate test
◼
Compressed database: FP-tree structure
◼
No repeated scan of entire database
◼
◼
Decompose both the mining task and DB according to the
frequent patterns obtained so far
Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
A good open-source implementation and refinement of FPGrowth
◼
FPGrowth+ (Grahne and J. Zhu, FIMI'03)
361
Further Improvements of Mining
Methods
◼
AFOPT (Liu, et al. @ KDD’03)
◼
A “push-right” method for mining condensed frequent pattern
(CFP) tree
◼
◼
Carpenter (Pan, et al. @ KDD’03)
◼
Mine data sets with small rows but numerous columns
◼
Construct a row-enumeration tree for efficient mining
FPgrowth+ (Grahne and Zhu, FIMI’03)
◼
Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc.
ICDM'03 Int. Workshop on Frequent Itemset Mining
Implementations (FIMI'03), Melbourne, FL, Nov. 2003
◼
TD-Close (Liu, et al, SDM’06)
362
Extension of Pattern Growth Mining
Methodology
◼
◼
◼
◼
◼
◼
◼
Mining closed frequent itemsets and max-patterns
◼ CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, Fimi’03)
Mining sequential patterns
◼ PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)
Mining graph patterns
◼ gSpan (ICDM’02), CloseGraph (KDD’03)
Constraint-based mining of frequent patterns
◼ Convertible constraints (ICDE’01), gPrune (PAKDD’03)
Computing iceberg data cubes with complex measures
◼ H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)
Pattern-growth-based Clustering
◼ MaPle (Pei, et al., ICDM’03)
Pattern-Growth-Based Classification
◼ Mining frequent and discriminative patterns (Cheng, et al, ICDE’07)
363
Scalable Frequent Itemset Mining
Methods
◼
Apriori: A Candidate Generation-and-Test Approach
◼
Improving the Efficiency of Apriori
◼
FPGrowth: A Frequent Pattern-Growth Approach
◼
ECLAT: Frequent Pattern Mining with Vertical Data Format
◼
Mining Close Frequent Patterns and Maxpatterns
364
ECLAT: Mining by Exploring Vertical Data
Format
◼
Vertical format: t(AB) = {T11, T25, …}
◼
◼
◼
◼
◼
tid-list: list of trans.-ids containing an itemset
Deriving frequent patterns based on vertical intersections
◼
t(X) = t(Y): X and Y always happen together
◼
t(X) ⊆ t(Y): a transaction having X always has Y
Using diffset to accelerate mining
◼
Only keep track of differences of tids
◼
t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
◼
Diffset (XY, X) = {T2}
Eclat (Zaki et al. @KDD’97)
Mining Closed patterns using vertical format: CHARM (Zaki &
Hsiao@SDM’02)
365
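A small sketch of the vertical data format on a toy tid-list table; the item names and tids are illustrative, chosen so the diffset output matches the {T2} case above.

```python
# item -> set of transaction ids (toy data)
tidlists = {
    "A": {1, 2, 3},
    "B": {1, 3, 4},
    "C": {2, 3, 4},
}

def t(itemset):
    """tid-list of an itemset = intersection of its items' tid-lists."""
    items = iter(itemset)
    result = set(tidlists[next(items)])
    for item in items:
        result &= tidlists[item]
    return result

print(t("AB"))           # {1, 3}: support of AB is |{1, 3}| = 2
# Diffset of XY relative to X: the tids lost when Y is added to X.
print(t("A") - t("AB"))  # {2} -- corresponds to Diffset(XY, X) = {T2} above
```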
Scalable Frequent Itemset Mining
Methods
◼
Apriori: A Candidate Generation-and-Test Approach
◼
Improving the Efficiency of Apriori
◼
FPGrowth: A Frequent Pattern-Growth Approach
◼
ECLAT: Frequent Pattern Mining with Vertical Data
Format
◼
Mining Close Frequent Patterns and Maxpatterns
366
Mining Frequent Closed Patterns:
CLOSET
◼
Flist: list of all frequent items in support ascending order
◼
◼
◼
Divide search space
◼
Patterns having d
◼
Patterns having d but no a, etc.
Find frequent closed pattern recursively
◼
◼
Flist: d-a-f-e-c
Min_sup=2
TID
10
20
30
40
50
Items
a, c, d, e, f
a, b, e
c, e, f
a, c, d, f
c, e, f
Every transaction having d also has cfa → cfad is a
frequent closed pattern
J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for
Mining Frequent Closed Itemsets", DMKD'00.
CLOSET+: Mining Closed Itemsets by PatternGrowth
◼
◼
◼
◼
◼
Itemset merging: if Y appears in every occurrence of X, then Y
is merged with X
Sub-itemset pruning: if Y ‫ כ‬X, and sup(X) = sup(Y), X and all of
X’s descendants in the set enumeration tree can be pruned
Hybrid tree projection
◼
Bottom-up physical tree-projection
◼
Top-down pseudo tree-projection
Item skipping: if a local frequent item has the same support in
several header tables at different levels, one can prune it from
the header table at higher levels
Efficient subset checking
MaxMiner: Mining Max-Patterns
◼
1st scan: find frequent items
◼
◼
◼
◼
A, B, C, D, E
2nd scan: find support for
◼
AB, AC, AD, AE, ABCDE
◼
BC, BD, BE, BCDE
◼
CD, CE, CDE, DE
Tid
Items
10
A, B, C, D, E
20
B, C, D, E,
30
A, C, D, F
Potential
max-patterns
Since BCDE is a max-pattern, no need to check BCD, BDE,
CDE in later scan
R. Bayardo. Efficiently mining long patterns from
databases. SIGMOD’98
CHARM: Mining by Exploring Vertical Data
Format
◼
Vertical format: t(AB) = {T11, T25, …}
◼
◼
◼
◼
tid-list: list of trans.-ids containing an itemset
Deriving closed patterns based on vertical intersections
◼
t(X) = t(Y): X and Y always happen together
◼
t(X) ⊆ t(Y): a transaction having X always has Y
Using diffset to accelerate mining
◼
Only keep track of differences of tids
◼
t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
◼
Diffset (XY, X) = {T2}
Eclat/MaxEclat (Zaki et al. @KDD’97), VIPER(P. Shenoy et
al.@SIGMOD’00), CHARM (Zaki & Hsiao@SDM’02)
Visualization of Association Rules: Plane Graph
371
Visualization of Association Rules: Rule Graph
372
Visualization of Association
Rules
(SGI/MineSet 3.0)
373
Chapter 5: Mining Frequent Patterns,
Association and Correlations: Basic
Concepts and Methods
◼ Basic Concepts
◼ Frequent Itemset Mining Methods
◼ Which Patterns Are Interesting?—Pattern
Evaluation Methods
◼ Summary
374
Interestingness Measure:
Correlations (Lift)
◼
play basketball  eat cereal [40%, 66.7%] is misleading
◼
◼
The overall % of students eating cereal is 75% > 66.7%.
play basketball  not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence
◼ Measure of dependent/correlated events: lift
    lift = P(A ∪ B) / ( P(A) P(B) )

              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000

lift(B, C)  = (2000/5000) / ( (3000/5000) × (3750/5000) ) = 0.89
lift(B, ¬C) = (1000/5000) / ( (3000/5000) × (1250/5000) ) = 1.33
375
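A few lines reproducing the lift values from the contingency table above; the dictionary encoding of the table is an illustrative assumption.

```python
n = 5000
counts = {("B", "C"): 2000, ("B", "~C"): 1000,
          ("~B", "C"): 1750, ("~B", "~C"): 250}

def p(position, value):
    """Marginal probability of basketball (position 0) or cereal (position 1)."""
    return sum(c for key, c in counts.items() if key[position] == value) / n

def lift(b, c):
    """lift(B, C) = P(B and C) / (P(B) * P(C))."""
    return (counts[(b, c)] / n) / (p(0, b) * p(1, c))

print(round(lift("B", "C"), 2))    # 0.89
print(round(lift("B", "~C"), 2))   # 1.33
```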
Are lift and 2 Good Measures of Correlation?
◼
“Buy walnuts  buy
milk [1%, 80%]” is
misleading if 85% of
customers buy milk
◼
Support and confidence
are not good to indicate
correlations
◼
Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD'02)
◼
Which are good ones?
376
Null-Invariant Measures
377
Comparison of Interestingness Measures
◼ Null-(transaction) invariance is crucial for correlation analysis
◼ Lift and χ² are not null-invariant
◼ 5 null-invariant measures

              Milk     No Milk    Sum (row)
Coffee        m, c     ¬m, c      c
No Coffee     m, ¬c    ¬m, ¬c     ¬c
Sum (col.)    m        ¬m         Σ

Null-transactions w.r.t. m and c: the transactions containing neither m nor c
[Table: the five null-invariant measures, including the Kulczynski measure (1927); they are subtle and can disagree]
378
Analysis of DBLP Coauthor Relationships
Recent DB conferences, removing balanced associations, low sup, etc.
Advisor-advisee relation: Kulc: high,
coherence: low, cosine: middle
◼
Tianyi Wu, Yuguo Chen and Jiawei Han, “Association Mining in Large
Databases: A Re-Examination of Its Measures”, Proc. 2007 Int. Conf.
Principles and Practice of Knowledge Discovery in Databases
(PKDD'07), Sept. 2007
379
Which Null-Invariant Measure Is Better?
◼
◼
IR (Imbalance Ratio): measures the imbalance of the two itemsets A and B in rule implications
Kulczynski and the Imbalance Ratio (IR) together present a clear picture for all three datasets D4 through D6 (a small sketch of both measures follows):
◼ D4 is balanced & neutral
◼ D5 is imbalanced & neutral
◼ D6 is very imbalanced & neutral
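A small sketch of the two measures. The Kulczynski formula is the one used earlier, (P(X|Y) + P(Y|X))/2, and IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B)); the example counts are illustrative, in the spirit of a balanced D4-like case and an imbalanced D5-like case.

```python
def kulczynski(s_xy, s_x, s_y):
    """Kulc(X, Y) = (P(X|Y) + P(Y|X)) / 2, a null-invariant measure."""
    return 0.5 * (s_xy / s_x + s_xy / s_y)

def imbalance_ratio(s_xy, s_x, s_y):
    """IR(X, Y) = |s(X) - s(Y)| / (s(X) + s(Y) - s(XY))."""
    return abs(s_x - s_y) / (s_x + s_y - s_xy)

# Balanced, neutral case: Kulc = 0.5, IR = 0 (illustrative counts).
print(kulczynski(1000, 2000, 2000), imbalance_ratio(1000, 2000, 2000))
# Imbalanced but still neutral case: Kulc = 0.5, IR close to 0.9.
print(kulczynski(1000, 11000, 1100), imbalance_ratio(1000, 11000, 1100))
```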
Chapter 5: Mining Frequent Patterns,
Association and Correlations: Basic
Concepts and Methods
◼ Basic Concepts
◼ Frequent Itemset Mining Methods
◼ Which Patterns Are Interesting?—Pattern
Evaluation Methods
◼ Summary
381
Summary
◼
◼
Basic concepts: association rules, supportconfident framework, closed and max-patterns
Scalable frequent pattern mining methods
◼
Apriori (Candidate generation & test)
◼
Projection-based (FPgrowth, CLOSET+, ...)
◼
Vertical format approach (ECLAT, CHARM, ...)
▪ Which patterns are interesting?
▪ Pattern evaluation methods
382
Ref: Basic Concepts of Frequent Pattern Mining
◼
(Association Rules) R. Agrawal, T. Imielinski, and A. Swami. Mining
association rules between sets of items in large databases. SIGMOD'93
◼
(Max-pattern) R. J. Bayardo. Efficiently mining long patterns from
databases. SIGMOD'98
◼
(Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal.
Discovering frequent closed itemsets for association rules. ICDT'99
◼
(Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns.
ICDE'95
383
Ref: Apriori and Its Improvements
◼
R. Agrawal and R. Srikant. Fast algorithms for mining association rules.
VLDB'94
◼
H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering
association rules. KDD'94
◼
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining
association rules in large databases. VLDB'95
◼
J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD'95
◼
H. Toivonen. Sampling large databases for association rules. VLDB'96
◼
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and
implication rules for market basket analysis. SIGMOD'97
◼
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining
with relational database systems: Alternatives and implications. SIGMOD'98
384
Ref: Depth-First, Projection-Based FP Mining
◼
R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation
of frequent itemsets. J. Parallel and Distributed Computing, 2002.
◼
G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc.
FIMI'03
◼
B. Goethals and M. Zaki. An introduction to workshop on frequent itemset mining
implementations. Proc. ICDM’03 Int. Workshop on Frequent Itemset Mining
Implementations (FIMI’03), Melbourne, FL, Nov. 2003
◼
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
SIGMOD’ 00
◼
J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic
Projection. KDD'02
◼
J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without
Minimum Support. ICDM'02
◼
J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining
Frequent Closed Itemsets. KDD'03
385
Ref: Vertical Format and Row Enumeration Methods
◼
M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for
discovery of association rules. DAMI:97.
◼
M. J. Zaki and C. J. Hsiao. CHARM: An Efficient Algorithm for Closed Itemset
Mining, SDM'02.
◼
C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning
Algorithm for Itemsets with Constraints. KDD’02.
◼
F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki , CARPENTER: Finding
Closed Patterns in Long Biological Datasets. KDD'03.
◼
H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns from Very High
Dimensional Data: A Top-Down Row Enumeration Approach, SDM'06.
386
Ref: Mining Correlations and Interesting Rules
◼
S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing
association rules to correlations. SIGMOD'97.
◼
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding
interesting rules from large sets of discovered association rules. CIKM'94.
◼
R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of Interest.
Kluwer Academic, 2001.
◼
C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining
causal structures. VLDB'98.
◼
P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure
for Association Patterns. KDD'02.
◼
E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03.
◼
T. Wu, Y. Chen, and J. Han, “Re-Examination of Interestingness Measures in Pattern Mining: A Unified Framework", Data Mining and Knowledge Discovery, 21(3):371–397, 2010
387
Data Mining:
Concepts and
Techniques
(3rd ed.)
— Chapter 7 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2010 Han, Kamber & Pei. All rights reserved.
388
Chapter 7 : Advanced Frequent Pattern
Mining
◼ Pattern Mining: A Road Map
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
◼ Constraint-Based Frequent Pattern Mining
◼ Mining High-Dimensional Data and Colossal Patterns
◼ Mining Compressed or Approximate Patterns
◼ Pattern Exploration and Application
◼ Summary
390
Research on Pattern Mining: A Road Map
391
Chapter 7 : Advanced Frequent Pattern
Mining
◼ Pattern Mining: A Road Map
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
Mining Multi-Level Association
◼
Mining Multi-Dimensional Association
◼
Mining Quantitative Association Rules
◼
Mining Rare Patterns and Negative Patterns
◼ Constraint-Based Frequent Pattern Mining
◼
◼ Mining High-Dimensional Data and Colossal Patterns
◼ Mining Compressed or Approximate Patterns
◼ Pattern Exploration and Application
◼ Summary
392
Mining Multiple-Level Association
Rules
◼
◼
◼
Items often form hierarchies
Flexible support settings
◼ Items at the lower level are expected to have lower
support
Exploration of shared multi-level mining (Agrawal & Srikant@VLDB'95, Han & Fu@VLDB'95)

Uniform support:   Level 1 min_sup = 5%,  Level 2 min_sup = 5%
Reduced support:   Level 1 min_sup = 5%,  Level 2 min_sup = 3%
Item hierarchy: Milk [support = 10%] → 2% Milk [support = 6%], Skim Milk [support = 4%]
393
Multi-level Association: Flexible Support
and Redundancy filtering
◼
Flexible min-support thresholds: Some items are more valuable but
less frequent
◼
◼
Use non-uniform, group-based min-support
◼
E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; …
Redundancy Filtering: Some rules may be redundant due to
“ancestor” relationships between items
◼
milk  wheat bread [support = 8%, confidence = 70%]
◼
2% milk  wheat bread [support = 2%, confidence = 72%]
The first rule is an ancestor of the second rule
◼
A rule is redundant if its support is close to the “expected” value,
based on the rule’s ancestor
394
Chapter 7 : Advanced Frequent Pattern
Mining
◼ Pattern Mining: A Road Map
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
Mining Multi-Level Association
◼
Mining Multi-Dimensional Association
◼
Mining Quantitative Association Rules
◼
Mining Rare Patterns and Negative Patterns
◼ Constraint-Based Frequent Pattern Mining
◼
◼ Mining High-Dimensional Data and Colossal Patterns
◼ Mining Compressed or Approximate Patterns
◼ Pattern Exploration and Application
◼ Summary
395
Mining Multi-Dimensional
Association
◼
Single-dimensional rules:
buys(X, “milk”)  buys(X, “bread”)
◼
Multi-dimensional rules:  2 dimensions or predicates
◼
Inter-dimension assoc. rules (no repeated predicates)
age(X,”19-25”)  occupation(X,“student”)  buys(X, “coke”)
◼
hybrid-dimension assoc. rules (repeated predicates)
age(X,”19-25”)  buys(X, “popcorn”)  buys(X, “coke”)
◼
◼
Categorical Attributes: finite number of possible values, no
ordering among values—data cube approach
Quantitative Attributes: Numeric, implicit ordering among
values—discretization, clustering, and gradient approaches
396
Chapter 7 : Advanced Frequent Pattern
Mining
◼ Pattern Mining: A Road Map
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
Mining Multi-Level Association
◼
Mining Multi-Dimensional Association
◼
Mining Quantitative Association Rules
◼
Mining Rare Patterns and Negative Patterns
◼ Constraint-Based Frequent Pattern Mining
◼
◼ Mining High-Dimensional Data and Colossal Patterns
◼ Mining Compressed or Approximate Patterns
◼ Pattern Exploration and Application
◼ Summary
397
Mining Quantitative Associations
Techniques can be categorized by how numerical attributes,
such as age or salary are treated
1. Static discretization based on predefined concept
hierarchies (data cube methods)
2. Dynamic discretization based on data distribution
(quantitative rules, e.g., Agrawal & Srikant@SIGMOD96)
3. Clustering: Distance-based association (e.g., Yang &
Miller@SIGMOD97)
◼
One dimensional clustering then association
4. Deviation: (such as Aumann and Lindell@KDD99)
Sex = female => Wage: mean=$7/hr (overall mean = $9)
398
Static Discretization of Quantitative
Attributes
◼
Discretized prior to mining using concept hierarchy.
◼
Numeric values are replaced by ranges
◼
In relational database, finding all frequent k-predicate sets
will require k or k+1 table scans
◼
Data cube is well suited for mining
◼
The cells of an n-dimensional cuboid correspond to the predicate sets
◼ Mining from data cubes can be much faster
[Figure: lattice of cuboids: (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys)]
399
Quantitative Association Rules Based on Statistical
Inference Theory [Aumann and Lindell@DMKD’03]
◼
Finding extraordinary and therefore interesting phenomena, e.g.,
(Sex = female) => Wage: mean=$7/hr (overall mean = $9)
◼
◼
◼
LHS: a subset of the population
◼
RHS: an extraordinary behavior of this subset
The rule is accepted only if a statistical test (e.g., Z-test) confirms the
inference with high confidence
Subrule: highlights the extraordinary behavior of a subset of the pop.
of the super rule
◼
◼
◼
E.g., (Sex = female) ^ (South = yes) => mean wage = $6.3/hr
Two forms of rules
◼
Categorical => quantitative rules, or Quantitative => quantitative rules
◼
E.g., Education in [14-18] (yrs) => mean wage = $11.64/hr
Open problem: Efficient methods for LHS containing two or more
quantitative attributes
400
Chapter 7 : Advanced Frequent Pattern
Mining
◼ Pattern Mining: A Road Map
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
Mining Multi-Level Association
◼
Mining Multi-Dimensional Association
◼
Mining Quantitative Association Rules
◼
Mining Rare Patterns and Negative Patterns
◼ Constraint-Based Frequent Pattern Mining
◼
◼ Mining High-Dimensional Data and Colossal Patterns
◼ Mining Compressed or Approximate Patterns
◼ Pattern Exploration and Application
◼ Summary
401
Negative and Rare Patterns
◼
Rare patterns: Very low support but interesting
◼
◼
◼
Mining: Setting individual-based or special group-based
support threshold for valuable items
Negative patterns
◼
◼
E.g., buying Rolex watches
Since it is unlikely that one buys Ford Expedition (an
SUV car) and Toyota Prius (a hybrid car) together, Ford
Expedition and Toyota Prius are likely negatively
correlated patterns
Negatively correlated patterns that are infrequent tend to
be more interesting than those that are frequent
402
Defining Negatively Correlated Patterns (I)
◼ Definition 1 (support-based)
  ◼ If itemsets X and Y are both frequent but rarely occur together, i.e.,
      sup(X U Y) < sup(X) × sup(Y)
    then X and Y are negatively correlated
◼ Problem: A store sold two needle packages A and B, 100 times each; only one transaction contains both A and B.
  ◼ When there are in total 200 transactions, we have
      s(A U B) = 0.005, s(A) × s(B) = 0.25, so s(A U B) < s(A) × s(B)
  ◼ When there are 10^5 transactions, we have
      s(A U B) = 1/10^5, s(A) × s(B) = 1/10^3 × 1/10^3 = 1/10^6, so s(A U B) > s(A) × s(B)
◼ Where is the problem? — Null transactions: the support-based definition is not null-invariant!
403
Defining Negatively Correlated Patterns (II)
◼ Definition 2 (negative itemset-based)
  ◼ X is a negative itemset if (1) X = Ā U B, where B is a set of positive items and Ā is a set of negative items, |Ā| ≥ 1, and (2) s(X) ≥ μ
  ◼ Itemset X is negatively correlated if …
  ◼ This definition suffers from a similar null-invariance problem
◼ Definition 3 (Kulczynski measure-based): If itemsets X and Y are frequent but (P(X|Y) + P(Y|X))/2 < є, where є is a negative pattern threshold, then X and Y are negatively correlated.
  ◼ Ex. For the same needle package problem, no matter whether there are 200 or 10^5 transactions, with є = 0.01 we have
      (P(A|B) + P(B|A))/2 = (0.01 + 0.01)/2 = 0.01, which does not exceed є, so A and B are judged negatively correlated regardless of the number of null transactions
404
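As a concrete illustration of Definition 3, here is a minimal Python sketch (not from the slides) that computes the Kulczynski measure from raw support counts and flags negatively correlated pairs. The needle-package counts (100, 100, and 1 co-occurrence) are taken from the example above; the function names are my own.

```python
def kulczynski(sup_x, sup_y, sup_xy):
    """Average of the two conditional probabilities P(X|Y) and P(Y|X)."""
    return 0.5 * (sup_xy / sup_y + sup_xy / sup_x)

def negatively_correlated(sup_x, sup_y, sup_xy, epsilon=0.01):
    # Both itemsets are assumed frequent; the measure is null-invariant
    # because the total number of transactions never appears in it.
    return kulczynski(sup_x, sup_y, sup_xy) <= epsilon

# Needle-package example: A and B each sold 100 times, 1 joint sale.
# The verdict is the same whether the database has 200 or 10**5 transactions.
print(negatively_correlated(100, 100, 1))   # True
```

Because only the three support counts enter the computation, adding null transactions cannot change the outcome, which is exactly the null-invariance argued for on this slide.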
Chapter 7 : Advanced Frequent Pattern
Mining
◼ Pattern Mining: A Road Map
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
◼ Constraint-Based Frequent Pattern Mining
◼ Mining High-Dimensional Data and Colossal Patterns
◼ Mining Compressed or Approximate Patterns
◼ Pattern Exploration and Application
◼ Summary
405
Constraint-based (Query-Directed)
Mining
◼ Finding all the patterns in a database autonomously? — unrealistic!
  ◼ The patterns could be too many and not focused!
◼ Data mining should be an interactive process
  ◼ The user directs what is to be mined using a data mining query language (or a graphical user interface)
◼ Constraint-based mining
  ◼ User flexibility: provides constraints on what is to be mined
  ◼ Optimization: explores such constraints for efficient mining — constraint pushing, similar to pushing selections first in DB query processing
  ◼ Note: we still find all the answers satisfying the constraints, rather than some answers as in “heuristic search”
406
Constraints in Data Mining
◼ Knowledge type constraint:
  ◼ classification, association, etc.
◼ Data constraint — using SQL-like queries
  ◼ find product pairs sold together in stores in Chicago this year
◼ Dimension/level constraint
  ◼ in relevance to region, price, brand, customer category
◼ Rule (or pattern) constraint
  ◼ small sales (price < $10) triggers big sales (sum > $200)
◼ Interestingness constraint
  ◼ strong rules: min_support ≥ 3%, min_confidence ≥ 60%
407
Meta-Rule Guided Mining
◼
A meta-rule can be in rule form with partially instantiated predicates and constants:
    P1(X, Y) ^ P2(X, W) => buys(X, “iPad”)
◼ The resulting rule derived can be
    age(X, “15-25”) ^ profession(X, “student”) => buys(X, “iPad”)
◼ In general, it can be in the form of
    P1 ^ P2 ^ … ^ Pl => Q1 ^ Q2 ^ … ^ Qr
◼ Method to find meta-rules
  ◼ Find frequent (l+r)-predicate sets (based on a min-support threshold)
  ◼ Push constants deeply into the mining process when possible (see the remaining discussions on constraint-pushing techniques)
  ◼ Use confidence, correlation, and other measures when possible
408
Constraint-Based Frequent Pattern Mining
◼ Pattern space pruning constraints
  ◼ Anti-monotonic: if constraint c is violated, further mining along that branch can be terminated
  ◼ Monotonic: if c is satisfied, no need to check c again
  ◼ Succinct: c must be satisfied, so one can start with only the data sets satisfying c
  ◼ Convertible: c is neither monotonic nor anti-monotonic, but it can be converted into one of them if the items in a transaction are properly ordered
◼ Data space pruning constraints
  ◼ Data succinct: the data space can be pruned at the start of the pattern mining process
  ◼ Data anti-monotonic: if a transaction t does not satisfy c, t can be pruned from further mining
409
Pattern Space Pruning with Anti-Monotonicity
Constraints
◼ A constraint C is anti-monotone if, whenever a super-pattern satisfies C, all of its sub-patterns do so too
◼ In other words, anti-monotonicity: if an itemset S violates the constraint, so does any of its supersets
◼ Ex. 1. sum(S.price) ≤ v is anti-monotone
◼ Ex. 2. range(S.profit) ≤ 15 is anti-monotone
  ◼ Itemset ab violates C
  ◼ So does every superset of ab
◼ Ex. 3. sum(S.price) ≥ v is not anti-monotone
◼ Ex. 4. support count is anti-monotone: the core property used in Apriori

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item    a   b   c    d   e    f   g   h
Profit  40  0  -20  10  -30  30  20  -10
410
Pattern Space Pruning with Monotonicity
Constraints
◼ A constraint C is monotone if, once a pattern satisfies C, we do not need to check C again in subsequent mining of its supersets
◼ Alternatively, monotonicity: if an itemset S satisfies the constraint, so does any of its supersets
◼ Ex. 1. sum(S.price) ≥ v is monotone
◼ Ex. 2. min(S.price) ≤ v is monotone
◼ Ex. 3. C: range(S.profit) ≥ 15
  ◼ Itemset ab satisfies C
  ◼ So does every superset of ab

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item    a   b   c    d   e    f   g   h
Profit  40  0  -20  10  -30  30  20  -10
411
Data Space Pruning with Data Antimonotonicity
◼ A constraint c is data anti-monotone if, when a pattern p cannot satisfy a transaction t under c, no superset of p can satisfy t under c either
◼ The key to data anti-monotonicity is recursive data reduction
◼ Ex. 1. sum(S.price) ≥ v is data anti-monotone
◼ Ex. 2. min(S.price) ≤ v is data anti-monotone
◼ Ex. 3. C: range(S.profit) ≥ 25 is data anti-monotone
  ◼ Itemset {b, c}’s projected DB:
      T10’: {d, f, h}, T20’: {d, f, g, h}, T30’: {d, f, g}
  ◼ Since C cannot be satisfied within T10’, T10’ can be pruned

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f, h
20   b, c, d, f, g, h
30   b, c, d, f, g
40   c, e, f, g

Item    a   b   c    d    e    f    g   h
Profit  40  0  -20  -15  -30  -10  20  -5
412
Pattern Space Pruning with
Succinctness
◼ Succinctness:
  ◼ Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
  ◼ Idea: whether an itemset S satisfies constraint C can be determined based on the selection of items alone, without looking at the transaction database
  ◼ min(S.price) ≤ v is succinct
  ◼ sum(S.price) ≥ v is not succinct
◼ Optimization: if C is succinct, C is pre-counting pushable
413
Naïve Algorithm: Apriori + Constraint
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3  →  L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2  →  Scan D  →  L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}  →  Scan D  →  L3: {2 3 5}:2

Constraint: Sum{S.price} < 5
414
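A small sketch (my own, not from the slides) of pushing an anti-monotone constraint such as Sum{S.price} < 5 into Apriori's candidate step instead of filtering after mining. The item prices used here (item i priced at i) are an assumption made only for this illustration.

```python
from itertools import combinations

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]   # the 4-transaction toy DB
price = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}             # assumed prices for illustration
MIN_SUP, V = 2, 5

def support(itemset):
    return sum(1 for t in D if itemset <= t)

def satisfies(itemset):
    # Anti-monotone constraint: sum of prices < V.
    return sum(price[i] for i in itemset) < V

# Level 1: frequent single items that already satisfy the constraint.
level = [frozenset([i]) for i in price
         if support(frozenset([i])) >= MIN_SUP and satisfies(frozenset([i]))]
frequent = list(level)

while level:
    # Join step, then prune candidates that are infrequent OR violate the
    # constraint -- safe because a violating itemset's supersets also violate it.
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= MIN_SUP and satisfies(c)]
    frequent.extend(level)

print(sorted(map(sorted, frequent)))   # constrained frequent itemsets
```

The constraint check sits next to the support check, so violating candidates are dropped before the next join, which is the "constraint pushing" idea rather than the naïve post-filtering shown on the slide above.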
Constrained Apriori : Push a Succinct
Constraint Deep
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3  →  L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}  →  Scan D  →  L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}  →  Scan D  →  L3: {2 3 5}:2

(In the original figure, some candidate itemsets are marked “not immediately to be used”.)

Constraint: min{S.price} <= 1
415
Constrained FP-Growth: Push a Succinct
Constraint Deep
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Remove infrequent length-1 items →

TID   Items
100   1 3
200   2 3 5
300   1 2 3 5
400   2 5

→ FP-Tree

1-Projected DB:
TID   Items
100   3 4
300   2 3 5

No need to project on 2, 3, or 5

Constraint: min{S.price} <= 1
416
Constrained FP-Growth: Push a Data Anti-monotonic Constraint Deep

Remove from data:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5
→
TID   Items
100   1 3
300   1 3
→ FP-Tree: single branch, we are done

Constraint: min{S.price} <= 1
417
Constrained FP-Growth: Push a
Data Anti-monotonic Constraint
Deep
TDB:
TID  Transaction
10   a, b, c, d, f, h
20   b, c, d, f, g, h
30   b, c, d, f, g
40   a, c, e, f, g

Item    a   b   c    d    e    f    g   h
Profit  40  0  -20  -15  -30  -10  20  -5

Constraint: range{S.price} > 25, min_sup >= 2

b-Projected DB:
TID  Transaction
10   a, c, d, f, h
20   c, d, f, g, h
30   c, d, f, g

Recursive data pruning → FP-Tree → single branch: bcdfg: 2
418
Convertible Constraints: Ordering Data in
Transactions
◼ Convert tough constraints into anti-monotone or monotone ones by properly ordering items
◼ Examine C: avg(S.profit) ≥ 25
  ◼ Order items in value-descending order: <a, f, g, d, b, h, c, e>
  ◼ If an itemset afb violates C
    ◼ So does afbh, afb*
    ◼ It becomes anti-monotone!

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item    a   b   c    d   e    f   g   h
Profit  40  0  -20  10  -30  30  20  -10
419
Strongly Convertible Constraints
◼ avg(X) ≥ 25 is convertible anti-monotone w.r.t. item-value-descending order R: <a, f, g, d, b, h, c, e>
  ◼ If an itemset af violates constraint C, so does every itemset with af as a prefix, such as afd
◼ avg(X) ≥ 25 is convertible monotone w.r.t. item-value-ascending order R⁻¹: <e, c, h, b, d, g, f, a>
  ◼ If an itemset d satisfies constraint C, so do itemsets df and dfa, which have d as a prefix
◼ Thus, avg(X) ≥ 25 is strongly convertible

Item    a   b   c    d   e    f   g   h
Profit  40  0  -20  10  -30  30  20  -10
420
Can Apriori Handle Convertible
Constraints?
◼ A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
  ◼ Within the level-wise framework, no direct pruning based on the constraint can be made
  ◼ Itemset df violates constraint C: avg(X) >= 25
  ◼ Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
◼ But it can be pushed into the frequent-pattern growth framework!

Item    a   b   c    d   e    f   g   h
Value   40  0  -20  10  -30  30  20  -10
421
Pattern Space Pruning w. Convertible
Constraints
◼ C: avg(X) >= 25, min_sup = 2
◼ List items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
  ◼ C is convertible anti-monotone w.r.t. R
◼ Scan TDB once
  ◼ remove infrequent items
    ◼ Item h is dropped
  ◼ Itemsets a and f are good, …
◼ Projection-based mining
  ◼ Impose an appropriate order on item projection
  ◼ Many tough constraints can be converted into (anti-)monotone ones

Item    a   f   g   d   b  h    c    e
Value   40  30  20  10  0  -10  -20  -30

TDB (min_sup = 2), items listed in order R:
TID  Transaction
10   a, f, d, b, c
20   f, g, d, b, c
30   a, f, d, c, e
40   f, g, h, c, e
422
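The pruning enabled by the value-descending order can be sketched in a few lines of Python (my own illustration; the profit table is the one on this slide, and support counting is ignored so only the constraint side is shown). Because items are appended in order R, a prefix that violates avg ≥ 25 can never be repaired by later, smaller items, so its whole subtree is skipped.

```python
profit = {'a': 40, 'f': 30, 'g': 20, 'd': 10, 'b': 0,
          'h': -10, 'c': -20, 'e': -30}
R = sorted(profit, key=profit.get, reverse=True)   # <a, f, g, d, b, h, c, e>

def avg(items):
    return sum(profit[i] for i in items) / len(items)

def grow(prefix, remaining, threshold=25, out=None):
    """Enumerate itemsets in order R; prune once avg() falls below threshold.

    Extending a violating prefix only appends items of equal or smaller value,
    so the average can only drop further (convertible anti-monotone)."""
    out = [] if out is None else out
    for k, item in enumerate(remaining):
        candidate = prefix + [item]
        if avg(candidate) < threshold:
            continue          # prune: no R-ordered extension of candidate can satisfy C
        out.append(candidate)
        grow(candidate, remaining[k + 1:], threshold, out)
    return out

print(grow([], R))   # e.g. ['a'], ['a','f'], ['a','f','g'], ['a','f','g','d'], ['f'], ...
```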
Handling Multiple Constraints
◼ Different constraints may require different or even conflicting item orderings
◼ If there exists an order R such that both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints
◼ If there is a conflict in the item ordering
  ◼ Try to satisfy one constraint first
  ◼ Then use the order for the other constraint to mine frequent itemsets in the corresponding projected database
423
What Constraints Are Convertible?
Constraint                                          Convertible anti-monotone   Convertible monotone   Strongly convertible
avg(S) ≤ v, ≥ v                                     Yes                         Yes                    Yes
median(S) ≤ v, ≥ v                                  Yes                         Yes                    Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)     Yes                         No                     No
sum(S) ≤ v (items could be of any value, v ≤ 0)     No                          Yes                    No
sum(S) ≥ v (items could be of any value, v ≥ 0)     No                          Yes                    No
sum(S) ≥ v (items could be of any value, v ≤ 0)     Yes                         No                     No
……
424
Constraint-Based Mining — A General
Picture
Constraint                        Anti-monotone   Monotone      Succinct
v ∈ S                             no              yes           yes
S ⊇ V                             no              yes           yes
S ⊆ V                             yes             no            yes
min(S) ≤ v                        no              yes           yes
min(S) ≥ v                        yes             no            yes
max(S) ≤ v                        yes             no            yes
max(S) ≥ v                        no              yes           yes
count(S) ≤ v                      yes             no            weakly
count(S) ≥ v                      no              yes           weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        yes             no            no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        no              yes           no
range(S) ≤ v                      yes             no            no
range(S) ≥ v                      no              yes           no
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible     convertible   no
support(S) ≥ ξ                    yes             no            no
support(S) ≤ ξ                    no              yes           no
425
Chapter 7 : Advanced Frequent Pattern
Mining
◼ Pattern Mining: A Road Map
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
◼ Constraint-Based Frequent Pattern Mining
◼ Mining High-Dimensional Data and Colossal Patterns
◼ Mining Compressed or Approximate Patterns
◼ Pattern Exploration and Application
◼ Summary
426
Mining Colossal Frequent Patterns
◼ F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, “Mining Colossal Frequent Patterns by Core Pattern Fusion”, ICDE'07.
◼ We have many algorithms, but can we mine large (i.e., colossal) patterns, such as patterns of size around 50 to 100? Unfortunately, not!
◼ Why not? — the curse of “downward closure” of frequent patterns
  ◼ The “downward closure” property
    ◼ Any sub-pattern of a frequent pattern is frequent.
  ◼ Example. If (a1, a2, …, a100) is frequent, then a1, a2, …, a100, (a1, a2), (a1, a3), …, (a1, a100), (a1, a2, a3), … are all frequent! There are about 2^100 such frequent itemsets!
  ◼ No matter whether we use breadth-first search (e.g., Apriori) or depth-first search (FPgrowth), we have to examine this many patterns
  ◼ Thus the downward closure property leads to an explosion!
427
Colossal Patterns: A Motivating Example
Let’s make a set of 40 transactions
T1 = 1 2 3 4 ….. 39 40
T2 = 1 2 3 4 ….. 39 40
 :
T40 = 1 2 3 4 ….. 39 40

Then delete the items on the diagonal
T1 = 2 3 4 ….. 39 40
T2 = 1 3 4 ….. 39 40
 :
T40 = 1 2 3 4 …… 39

◼ Closed/maximal patterns may partially alleviate the problem but do not really solve it: we often need to mine scattered large patterns!
◼ Let the minimum support threshold σ = 20
◼ There are (40 choose 20) frequent patterns of size 20
◼ Each is closed and maximal
◼ # patterns = (n choose n/2) ≈ 2^n / √(πn/2)
◼ The size of the answer set is exponential in n
428
Colossal Pattern Set: Small but Interesting
◼ It is often the case that only a small number of patterns are colossal, i.e., of large size
◼ Colossal patterns are usually attached with greater importance than those of small pattern sizes
429
Mining Colossal Patterns: Motivation and
Philosophy
◼ Motivation: many real-world tasks need mining colossal patterns
  ◼ Micro-array analysis in bioinformatics (when support is low)
  ◼ Biological sequence patterns
  ◼ Biological/sociological/information graph pattern mining
◼ No hope for completeness
  ◼ If the mining of mid-sized patterns is explosive in size, there is no hope of finding colossal patterns efficiently by insisting on the “complete set” mining philosophy
◼ Jumping out of the swamp of mid-sized results
  ◼ What we may develop is a philosophy that jumps out of the swamp of mid-sized results that are explosive in size and reaches colossal patterns
◼ Striving for mining almost-complete colossal patterns
  ◼ The key is to develop a mechanism that may quickly reach colossal patterns and discover most of them
430
Alas, A Show of Colossal Pattern Mining!
T1 = 2 3 4 ….. 39 40
T2 = 1 3 4 ….. 39 40
 :
T40 = 1 2 3 4 …… 39
T41 = 41 42 43 ….. 79
T42 = 41 42 43 ….. 79
 :
T60 = 41 42 43 … 79

◼ Let the min-support threshold σ = 20
◼ Then there are (40 choose 20) closed/maximal frequent patterns of size 20
◼ However, there is only one pattern with size greater than 20 (i.e., colossal): α = {41, 42, …, 79} of size 39
◼ The existing fastest mining algorithms (e.g., FPClose, LCM) fail to complete running
◼ Our algorithm outputs this colossal pattern in seconds
431
Methodology of Pattern-Fusion Strategy
◼ Pattern-Fusion traverses the tree in a bounded-breadth way
  ◼ It always pushes down a frontier of a bounded-size candidate pool
  ◼ Only a fixed number of patterns in the current candidate pool are used as the starting nodes to go down the pattern tree — thus avoiding the exponential search space
◼ Pattern-Fusion identifies “shortcuts” whenever possible
  ◼ Pattern growth is not performed by single-item addition but by leaps and bounds: agglomeration of multiple patterns in the pool
  ◼ These shortcuts direct the search down the tree much more rapidly towards the colossal patterns
432
Observation: Colossal Patterns and Core Patterns
[Figure: a transaction database D, a colossal pattern α with support set Dα, and subpatterns α1, α2, …, αk with support sets Dα1, Dα2, …, Dαk]
Subpatterns α1 to αk cluster tightly around the colossal pattern α by sharing a similar support. We call such subpatterns core patterns of α
433
Robustness of Colossal Patterns
◼ Core patterns
  Intuitively, for a frequent pattern α, a subpattern β is a τ-core pattern of α if β shares a similar support set with α, i.e.,
      |Dα| / |Dβ| ≥ τ,   0 < τ ≤ 1
  where τ is called the core ratio
◼ Robustness of colossal patterns
  A colossal pattern is robust in the sense that it tends to have many more core patterns than small patterns do
434
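A minimal check of the τ-core-pattern condition above (my own illustration, with a hypothetical toy database; the helper names are not from the paper):

```python
def support_set(pattern, db):
    """IDs of the transactions that contain the pattern."""
    return {i for i, t in enumerate(db) if set(pattern) <= t}

def is_tau_core(beta, alpha, db, tau=0.5):
    """beta is a tau-core pattern of alpha: beta is a subpattern of alpha
    and |D_alpha| / |D_beta| >= tau, with core ratio 0 < tau <= 1."""
    if not set(beta) <= set(alpha):
        return False
    return len(support_set(alpha, db)) / len(support_set(beta, db)) >= tau

# Hypothetical transactions: 90 copies of abcde, 10 of ab, 100 of cd.
db = [set('abcde')] * 90 + [set('ab')] * 10 + [set('cd')] * 100
print(is_tau_core('ab', 'abcde', db, tau=0.5))   # 90/100 = 0.90 -> True
print(is_tau_core('cd', 'abcde', db, tau=0.5))   # 90/190 ~ 0.47 -> False
```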
Example: Core Patterns
◼ A colossal pattern has far more core patterns than a small-sized pattern
◼ A colossal pattern has far more core descendants of a smaller size c
◼ A random draw from a complete set of patterns of size c is more likely to pick a core descendant of a colossal pattern
◼ A colossal pattern can be generated by merging a set of core patterns

Transaction (# of Ts)   Core Patterns (τ = 0.5)
(abe) (100)             (abe), (ab), (be), (ae), (e)
(bcf) (100)             (bcf), (bc), (bf)
(acf) (100)             (acf), (ac), (af)
(abcef) (100)           (ab), (ac), (af), (ae), (bc), (bf), (be), (ce), (fe), (e),
                        (abc), (abf), (abe), (ace), (acf), (afe), (bcf), (bce),
                        (bfe), (cfe), (abcf), (abce), (bcfe), (acfe), (abfe), (abcef)
435
Colossal Patterns Correspond to Dense Balls
◼ Due to their robustness, colossal patterns correspond to dense balls
  ◼ Ω(2^d) in population
◼ A random draw in the pattern space will hit somewhere in the ball with high probability
437
Idea of Pattern-Fusion Algorithm
◼ Generate a complete set of frequent patterns up to a small size
◼ Randomly pick a pattern β; β has a high probability of being a core-descendant of some colossal pattern α
◼ Identify all of α's descendants in this complete set, and merge all of them — this generates a much larger core-descendant of α
◼ In the same fashion, we select K patterns. This set of larger core-descendants will be the candidate pool for the next iteration
438
Pattern-Fusion: The Algorithm
◼ Initialization (initial pool): use an existing algorithm to mine all frequent patterns up to a small size, e.g., 3
◼ Iteration (iterative pattern fusion):
  ◼ At each iteration, k seed patterns are randomly picked from the current pattern pool
  ◼ For each seed pattern thus picked, we find all the patterns within a bounding ball centered at the seed pattern
  ◼ All these patterns found are fused together to generate a set of super-patterns. All the super-patterns thus generated form a new pool for the next iteration
◼ Termination: when the current pool contains no more than K patterns at the beginning of an iteration
439
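The iterative loop above can be outlined as a rough Python sketch. This is only an approximation under simplifying assumptions and not the published Pattern-Fusion implementation: the bounding ball is approximated by a core-pattern-style support-overlap test, the initial pool is assumed to be given, and all parameter names are my own.

```python
import random

def support_set(pattern, db):
    """IDs of the transactions containing the pattern."""
    return frozenset(i for i, t in enumerate(db) if pattern <= t)

def pattern_fusion(db, init_pool, K=10, k=3, tau=0.5, min_sup=2, rounds=20):
    """Rough sketch of iterative pattern fusion (assumed parameters)."""
    pool = [frozenset(p) for p in init_pool]
    for _ in range(rounds):
        if len(pool) <= K:                       # termination condition
            break
        seeds = random.sample(pool, min(k, len(pool)))
        new_pool = []
        for seed in seeds:
            ds = support_set(seed, db)
            # "Ball" around the seed: pool patterns whose support set largely
            # overlaps the seed's support set (a core-pattern-like criterion).
            ball = [p for p in pool
                    if len(ds & support_set(p, db)) >= tau * len(support_set(p, db))]
            fused = frozenset().union(*ball) if ball else seed
            if len(support_set(fused, db)) >= min_sup:   # keep only frequent fusions
                new_pool.append(fused)
        pool = new_pool or pool
    return pool
```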
Why Is Pattern-Fusion Efficient?
◼ A bounded-breadth pattern tree traversal
  ◼ It avoids the explosion in mining mid-sized patterns
  ◼ Randomness comes to help to stay on the right path
◼ Ability to identify “short-cuts” and take “leaps”
  ◼ Fuse small patterns together in one step to generate new patterns of significant sizes
  ◼ Efficiency
440
Pattern-Fusion Leads to Good
Approximation
◼ Gearing toward colossal patterns
  ◼ The larger the pattern, the greater the chance it will be generated
◼ Catching outliers
  ◼ The more distinct the pattern, the greater the chance it will be generated
441
Experimental Setting
◼ Synthetic data set
  ◼ Diag_n: an n × (n−1) table where the i-th row has the integers from 1 to n except i. Each row is taken as an itemset. min_support is n/2.
◼ Real data sets
  ◼ Replace: a program trace data set collected from the “replace” program, widely used in software engineering research
  ◼ ALL: a popular gene expression data set, clinical data on ALL-AML leukemia (www.broad.mit.edu/tools/data.html).
    ◼ Each item is a column, representing the activity level of a gene/protein in the same sample
    ◼ Frequent patterns would reveal important correlations between gene expression patterns and disease outcomes
442
Experiment Results on Diagn
◼ LCM run time increases exponentially with pattern size n
◼ Pattern-Fusion finishes efficiently
◼ The approximation error of Pattern-Fusion (with min-sup 20), in comparison with the complete set, is rather close to that of uniform sampling (which randomly picks K patterns from the complete answer set)
443
Experimental Results on ALL
◼
ALL: A popular gene expression data set with 38
transactions, each with 866 columns
◼ There are 1736 items in total
◼ The table shows a high frequency threshold of 30
444
Experimental Results on REPLACE
◼
REPLACE
◼ A program trace data set, recording 4395 calls
and transitions
◼ The data set contains 4395 transactions with
57 items in total
◼ With support threshold of 0.03, the largest
patterns are of size 44
◼ They are all discovered by Pattern-Fusion with
different settings of K and τ, when started with
an initial pool of 20948 patterns of size <=3
445
Experimental Results on REPLACE
◼ Approximation error when compared with the complete mining result
◼ Example. Out of the total 98 patterns of size >= 42, when K = 100, Pattern-Fusion returns 80 of them
◼ A good approximation to the colossal patterns, in the sense that any pattern in the complete set is on average at most 0.17 items away from one of these 80 patterns
446
Chapter 7 : Advanced Frequent Pattern
Mining
◼ Pattern Mining: A Road Map
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
◼ Constraint-Based Frequent Pattern Mining
◼ Mining High-Dimensional Data and Colossal Patterns
◼ Mining Compressed or Approximate Patterns
◼ Pattern Exploration and Application
◼ Summary
447
Mining Compressed Patterns: δclustering
◼ Why compressed patterns?
  ◼ Too many patterns, but less meaningful
◼ Pattern distance measure
◼ δ-clustering: for each pattern P, find all patterns that can be expressed by P and whose distance to P is within δ (δ-cover)
◼ All patterns in the cluster can be represented by P
◼ Xin et al., “Mining Compressed Frequent-Pattern Sets”, VLDB'05

ID   Item-Sets              Support
P1   {38,16,18,12}          205227
P2   {38,16,18,12,17}       205211
P3   {39,38,16,18,12,17}    101758
P4   {39,16,18,12,17}       161563
P5   {39,16,18,12}          161576

◼ Closed frequent patterns
  ◼ Report P1, P2, P3, P4, P5
  ◼ Emphasize support too much; no compression
◼ Max-pattern: report P3 only; information loss
◼ A desirable output: P2, P3, P4
448
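The pattern distance used for δ-clustering can be illustrated with a small sketch (my own, not the paper's code): the distance is 1 minus the Jaccard similarity of the two patterns' supporting transaction sets, which is how the compressed-pattern work defines it; the transaction-ID sets below are hypothetical stand-ins whose sizes mirror P1 and P2 above.

```python
def pattern_distance(t1, t2):
    """1 - |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)| over supporting transaction sets."""
    return 1.0 - len(t1 & t2) / len(t1 | t2)

def delta_covers(p_rep, p_other, delta, tdb_index):
    """p_rep delta-covers p_other if p_other is a sub-itemset of p_rep
    and their pattern distance is within delta."""
    return (p_other <= p_rep and
            pattern_distance(tdb_index[p_rep], tdb_index[p_other]) <= delta)

# Hypothetical supporting-transaction sets shaped like P1 and P2 above.
tdb_index = {
    frozenset({38, 16, 18, 12, 17}): set(range(205211)),   # P2
    frozenset({38, 16, 18, 12}):     set(range(205227)),   # P1
}
p1, p2 = frozenset({38, 16, 18, 12}), frozenset({38, 16, 18, 12, 17})
print(pattern_distance(tdb_index[p1], tdb_index[p2]))          # ~0.000078
print(delta_covers(p2, p1, delta=0.01, tdb_index=tdb_index))   # True: P2 can represent P1
```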
Redundancy-Aware Top-k Patterns
◼ Why redundancy-aware top-k patterns?
  ◼ Desired patterns: high significance & low redundancy
◼ Propose the MMS (Maximal Marginal Significance) for measuring the combined significance of a pattern set
◼ Xin et al., Extracting Redundancy-Aware Top-K Patterns, KDD'06
449
Chapter 7 : Advanced Frequent Pattern
Mining
◼ Pattern Mining: A Road Map
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
◼ Constraint-Based Frequent Pattern Mining
◼ Mining High-Dimensional Data and Colossal Patterns
◼ Mining Compressed or Approximate Patterns
◼ Pattern Exploration and Application
◼ Summary
450
How to Understand and Interpret Patterns?
◼ Example patterns: diaper → beer; the gene “female sterile (2) tekele”
◼ Do they all make sense? What do they mean? How are they useful?
◼ [Figure: patterns annotated with morphological info. and simple statistics vs. semantic information]
Not all frequent patterns are useful, only meaningful ones …
Annotate patterns with semantic information
A Dictionary Analogy
Word: “pattern” – from Merriam-Webster
Non-semantic info.
Definitions indicating
semantics
Synonyms
Related Words
Examples of Usage
Semantic Analysis with Context
Models
◼ Task 1: Model the context of a frequent pattern
Based on the context model…
◼ Task 2: Extract strongest context indicators
◼ Task 3: Extract representative transactions
◼ Task 4: Extract semantically similar patterns
Annotating DBLP Co-authorship & Title
Pattern
Database:
Authors                 Title
X. Yan, P. Yu, J. Han   Substructure Similarity Search in Graph Databases
…                       …

Frequent patterns:
P1: { x_yan, j_han }   (frequent itemset)
P2: “substructure search”

Pattern = {xifeng_yan, jiawei_han}
Context units: < { p_yu, j_han }, { d_xin }, …, “graph pattern”, …, “substructure similarity”, … >

Semantic annotations for pattern { x_yan, j_han }:
Non-semantic info.:  Sup = …
Context indicator (CI): {p_yu}, graph pattern, …
Representative transactions (Trans): gSpan: graph-base……
Semantically similar patterns (SSPs): { j_wang }, {j_han, p_yu}, …

Annotation results:
Context Indicator (CI): graph; {philip_yu}; mine close; graph pattern; sequential pattern; …
Representative Transactions (Trans): > gSpan: graph-base substructure pattern mining; > mining close relational graph connect constraint; …
Semantically Similar Patterns (SSP): {jiawei_han, philip_yu}; {jian_pei, jiawei_han}; {jiong_yang, philip_yu, wei_wang}; …
Chapter 7 : Advanced Frequent Pattern
Mining
◼ Pattern Mining: A Road Map
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
◼ Constraint-Based Frequent Pattern Mining
◼ Mining High-Dimensional Data and Colossal Patterns
◼ Mining Compressed or Approximate Patterns
◼ Pattern Exploration and Application
◼ Summary
455
Summary
◼
Roadmap: Many aspects & extensions on pattern mining
◼
Mining patterns in multi-level, multi-dimensional space
◼
Mining rare and negative patterns
◼
Constraint-based pattern mining
◼
Specialized methods for mining high-dimensional data
and colossal patterns
◼
Mining compressed or approximate patterns
◼
Pattern exploration and understanding: Semantic
annotation of frequent patterns
456
Ref: Mining Multi-Level and Quantitative
Rules
◼
◼
◼
◼
◼
◼
◼
◼
Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association
Rules, KDD'99
T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using
two-dimensional optimized association rules: Scheme, algorithms, and
visualization. SIGMOD'96.
J. Han and Y. Fu. Discovery of multiple-level association rules from large
databases. VLDB'95.
R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.
R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
R. Srikant and R. Agrawal. Mining quantitative association rules in large
relational tables. SIGMOD'96.
K. Wang, Y. He, and J. Han. Mining frequent itemsets using support
constraints. VLDB'00
K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing
optimized rectilinear regions for association rules. KDD'97.
457
Ref: Mining Other Kinds of Rules
◼
◼
◼
◼
◼
◼
◼
F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new
paradigm for fast, quantifiable data mining. VLDB'98
Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen. Efficient Discovery of
Functional and Approximate Dependencies Using Partitions. ICDE’98.
H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern
Extraction with Fascicles. VLDB'99
B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97.
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining
association rules. VLDB'96.
A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative
associations in a large database of customer transactions. ICDE'98.
D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov.
Query flocks: A generalization of association-rule mining. SIGMOD'98.
458
Ref: Constraint-Based Pattern Mining
◼
R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item
constraints. KDD'97
◼
R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. Exploratory mining and pruning
optimizations of constrained association rules. SIGMOD’98
◼
G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained
correlated sets. ICDE'00
◼
J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with
Convertible Constraints. ICDE'01
◼
J. Pei, J. Han, and W. Wang, Mining Sequential Patterns with Constraints in
Large Databases, CIKM'02
◼
F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. ExAnte: Anticipated
Data Reduction in Constrained Pattern Mining, PKDD'03
◼
F. Zhu, X. Yan, J. Han, and P. S. Yu, “gPrune: A Constraint Pushing
Framework for Graph Pattern Mining”, PAKDD'07
459
Ref: Mining Sequential Patterns
◼
X. Ji, J. Bailey, and G. Dong. Mining minimal distinguishing subsequence patterns with
gap constraints. ICDM'05
◼
H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event
sequences. DAMI:97.
◼
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential
Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01.
◼
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and
performance improvements. EDBT’96.
◼
X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large
Datasets. SDM'03.
◼
M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine
Learning:01.
460
Mining Graph and Structured Patterns
◼
◼
◼
◼
◼
◼
A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for
mining frequent substructures from graph data. PKDD'00
M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01.
X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. ICDM'02
X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns.
KDD'03
X. Yan, P. S. Yu, and J. Han. Graph indexing based on discriminative frequent
structure analysis. ACM TODS, 30:960–993, 2005
X. Yan, F. Zhu, P. S. Yu, and J. Han. Feature-based substructure similarity
search. ACM Trans. Database Systems, 31:1418–1453, 2006
461
Ref: Mining Spatial, Spatiotemporal, Multimedia
Data
◼
H. Cao, N. Mamoulis, and D. W. Cheung. Mining frequent spatiotemporal
sequential patterns. ICDM'05
◼
D. Gunopulos and I. Tsoukatos. Efficient Mining of Spatiotemporal Patterns.
SSTD'01
◼
K. Koperski and J. Han, Discovery of Spatial Association Rules in Geographic
Information Databases, SSD’95
◼
H. Xiong, S. Shekhar, Y. Huang, V. Kumar, X. Ma, and J. S. Yoo. A framework
for discovering co-location patterns in data sets with extended spatial
objects. SDM'04
◼
J. Yuan, Y. Wu, and M. Yang. Discovery of collocation patterns: From visual
words to visual phrases. CVPR'07
O. R. Zaiane, J. Han, and H. Zhu, Mining Recurrent Items in Multimedia with
Progressive Resolution Refinement. ICDE'00
◼
462
Ref: Mining Frequent Patterns in Time-Series
Data
◼
B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.
◼
J. Han, G. Dong and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series
Database, ICDE'99.
◼
J. Shieh and E. Keogh. iSAX: Indexing and mining terabyte sized time series. KDD'08
◼
B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online
Data Mining for Co-Evolving Time Sequences. ICDE'00.
◼
W. Wang, J. Yang, R. Muntz. TAR: Temporal Association Rules on Evolving Numerical
Attributes. ICDE’01.
◼
J. Yang, W. Wang, P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data.
TKDE’03
◼
L. Ye and E. Keogh. Time series shapelets: A new primitive for data mining. KDD'09
463
Ref: FP for Classification and
Clustering
◼
G. Dong and J. Li. Efficient mining of emerging patterns: Discovering
trends and differences. KDD'99.
◼
B. Liu, W. Hsu, Y. Ma. Integrating Classification and Association Rule
Mining. KDD’98.
◼
W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based
on Multiple Class-Association Rules. ICDM'01.
◼
H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in
large data sets. SIGMOD’ 02.
◼
J. Yang and W. Wang. CLUSEQ: efficient and effective sequence
clustering. ICDE’03.
◼
X. Yin and J. Han. CPAR: Classification based on Predictive Association
Rules. SDM'03.
◼
H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern
Analysis for Effective Classification”, ICDE'07
464
Ref: Privacy-Preserving FP Mining
◼
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Privacy Preserving Mining
of Association Rules. KDD’02.
◼
A. Evfimievski, J. Gehrke, and R. Srikant. Limiting Privacy Breaches in
Privacy Preserving Data Mining. PODS’03
◼
J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in
Vertically Partitioned Data. KDD’02
465
Mining Compressed Patterns
◼
◼
◼
D. Xin, H. Cheng, X. Yan, and J. Han. Extracting redundancy-aware top-k patterns. KDD'06
D. Xin, J. Han, X. Yan, and H. Cheng. Mining compressed
frequent-pattern sets. VLDB'05
X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset
patterns: A profile-based approach. KDD'05
466
Mining Colossal Patterns
◼
◼
F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng. Mining colossal
frequent patterns by core pattern fusion. ICDE'07
F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han. P. S. Yu, Mining Top-K Large
Structural Patterns in a Massive Network. VLDB’11
467
Ref: FP Mining from Data Streams
◼
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional
Regression Analysis of Time-Series Data Streams. VLDB'02.
◼
R. M. Karp, C. H. Papadimitriou, and S. Shenker. A simple algorithm for
finding frequent elements in streams and bags. TODS 2003.
◼
G. Manku and R. Motwani. Approximate Frequency Counts over Data
Streams. VLDB’02.
◼
A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of
frequent and top-k elements in data streams. ICDT'05
468
Ref: Freq. Pattern Mining Applications
◼
◼
◼
◼
◼
◼
◼
T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or
How to Build a Data Quality Browser. SIGMOD'02
M. Khan, H. Le, H. Ahmadi, T. Abdelzaher, and J. Han. DustMiner: Troubleshooting
interactive complexity bugs in sensor networks., SenSys'08
Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: A tool for finding copy-paste and related
bugs in operating system code. In Proc. 2004 Symp. Operating Systems Design and
Implementation (OSDI'04)
Z. Li and Y. Zhou. PR-Miner: Automatically extracting implicit programming rules and
detecting violations in large software code. FSE'05
D. Lo, H. Cheng, J. Han, S. Khoo, and C. Sun. Classification of software behaviors for failure
detection: A discriminative pattern mining approach. KDD'09
Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai. Semantic annotation of frequent patterns.
ACM TKDD, 2007.
K. Wang, S. Zhou, J. Han. Profit Mining: From Patterns to Actions. EDBT’02.
469
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 8 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
470
Chapter 8. Classification: Basic
Concepts
◼
Classification: Basic Concepts
◼
Decision Tree Induction
◼
Bayes Classification Methods
◼
Rule-Based Classification
◼
Model Evaluation and Selection
◼
Techniques to Improve Classification Accuracy:
Ensemble Methods
◼
Summary
472
Supervised vs. Unsupervised
Learning
◼
Supervised learning (classification)
◼
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
◼
◼
New data is classified based on the training set
Unsupervised learning (clustering)
◼
The class labels of training data are unknown
◼
Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
473
Prediction Problems: Classification vs.
Numeric Prediction
◼
◼
◼
Classification
◼ predicts categorical class labels (discrete or nominal)
◼ classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
Numeric Prediction
◼ models continuous-valued functions, i.e., predicts
unknown or missing values
Typical applications
◼ Credit/loan approval:
◼ Medical diagnosis: if a tumor is cancerous or benign
◼ Fraud detection: if a transaction is fraudulent
◼ Web page categorization: which category it is
474
Classification—A Two-Step
Process
◼
◼
◼
Model construction: describing a set of predetermined classes
◼ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
◼ The set of tuples used for model construction is training set
◼ The model is represented as classification rules, decision trees, or
mathematical formulae
Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model
◼ The known label of test sample is compared with the classified
result from the model
◼ Accuracy rate is the percentage of test set samples that are
correctly classified by the model
◼ Test set is independent of training set (otherwise overfitting)
◼ If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called validation
(test) set
475
Process (1): Model Construction
Training Data:
NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classification Algorithms → Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
476
Process (2): Using the Model in
Prediction
Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Classifier → Tenured?
477
Chapter 8. Classification: Basic
Concepts
◼
Classification: Basic Concepts
◼
Decision Tree Induction
◼
Bayes Classification Methods
◼
Rule-Based Classification
◼
Model Evaluation and Selection
◼
Techniques to Improve Classification Accuracy:
Ensemble Methods
◼
Summary
478
Decision Tree Induction: An Example
❑ Training data set: Buys_computer
❑ The data set follows an example of Quinlan’s ID3 (Playing Tennis)
❑ Resulting tree:

age?
├─ <=30   → student?
│            ├─ no  → no
│            └─ yes → yes
├─ 31..40 → yes
└─ >40    → credit rating?
             ├─ excellent → no
             └─ fair      → yes

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
479
Algorithm for Decision Tree Induction
◼
◼
Basic algorithm (a greedy algorithm)
◼ Tree is constructed in a top-down recursive divide-and-conquer manner
◼ At start, all the training examples are at the root
◼ Attributes are categorical (if continuous-valued, they are
discretized in advance)
◼ Examples are partitioned recursively based on selected
attributes
◼ Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
◼ All samples for a given node belong to the same class
◼ There are no remaining attributes for further partitioning
– majority voting is employed for classifying the leaf
◼ There are no samples left
480
Brief Review of Entropy
◼ Entropy of a discrete variable with m outcomes: H = − Σ_{i=1}^{m} pi log2(pi)
◼ [Figure: entropy of a binary variable (m = 2) as a function of p, maximal at p = 0.5]
481
Attribute Selection Measure:
Information Gain (ID3/C4.5)
◼ Select the attribute with the highest information gain
◼ Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
◼ Expected information (entropy) needed to classify a tuple in D:
    Info(D) = − Σ_{i=1}^{m} pi log2(pi)
◼ Information needed (after using A to split D into v partitions) to classify D:
    InfoA(D) = Σ_{j=1}^{v} (|Dj|/|D|) × Info(Dj)
◼ Information gained by branching on attribute A:
    Gain(A) = Info(D) − InfoA(D)
482
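A short from-scratch computation (my own sketch, not from the textbook) that reproduces the numbers worked out on the next slide for the buys_computer data: Info(D) ≈ 0.940, Info_age(D) ≈ 0.694, and Gain(age) ≈ 0.246.

```python
from collections import Counter
from math import log2

# (age, buys_computer) columns of the 14-tuple buys_computer training set
rows = [("<=30", "no"), ("<=30", "no"), ("31..40", "yes"), (">40", "yes"), (">40", "yes"),
        (">40", "no"), ("31..40", "yes"), ("<=30", "no"), ("<=30", "yes"), (">40", "yes"),
        ("<=30", "yes"), ("31..40", "yes"), ("31..40", "yes"), (">40", "no")]

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_after_split(rows):
    """Weighted entropy after partitioning on the attribute (first column)."""
    n = len(rows)
    by_value = {}
    for v, label in rows:
        by_value.setdefault(v, []).append(label)
    return sum(len(part) / n * info(part) for part in by_value.values())

labels = [label for _, label in rows]
print(round(info(labels), 3))                            # 0.940 = Info(D)
print(round(info_after_split(rows), 3))                  # 0.694 = Info_age(D)
print(round(info(labels) - info_after_split(rows), 3))   # 0.246 = Gain(age)
```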
Attribute Selection: Information
Gain
Class P: buys_computer = “yes” (9 tuples);  Class N: buys_computer = “no” (5 tuples)

    Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

    Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means “age <= 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence

    Gain(age) = Info(D) − Infoage(D) = 0.246

Similarly,
    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048

(The training data is the buys_computer table shown earlier.)
483
Computing Information-Gain for
Continuous-Valued Attributes
◼ Let attribute A be a continuous-valued attribute
◼ Must determine the best split point for A
  ◼ Sort the values of A in increasing order
  ◼ Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    ◼ (ai + ai+1)/2 is the midpoint between the values of ai and ai+1
  ◼ The point with the minimum expected information requirement for A is selected as the split point for A
◼ Split:
  ◼ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
484
485
Gain Ratio for Attribute Selection (C4.5)
◼ The information gain measure is biased towards attributes with a large number of values
◼ C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization of information gain):
    SplitInfoA(D) = − Σ_{j=1}^{v} (|Dj|/|D|) × log2(|Dj|/|D|)
    GainRatio(A) = Gain(A) / SplitInfoA(D)
◼ Ex. gain_ratio(income) = 0.029 / 1.557 = 0.019
◼ The attribute with the maximum gain ratio is selected as the splitting attribute
486
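Continuing the same toy computation (my own sketch): income partitions the 14 tuples into 4 “high”, 6 “medium”, and 4 “low”, which reproduces SplitInfo ≈ 1.557 and gain_ratio(income) = 0.029 / 1.557 ≈ 0.019 as on the slide.

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|))."""
    n = sum(partition_sizes)
    return -sum(s / n * log2(s / n) for s in partition_sizes)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

income_partition = [4, 6, 4]                          # high, medium, low tuple counts
print(round(split_info(income_partition), 3))         # 1.557
print(round(gain_ratio(0.029, income_partition), 3))  # 0.019
```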
Gini Index (CART, IBM IntelligentMiner)
◼ If a data set D contains examples from n classes, the gini index gini(D) is defined as
    gini(D) = 1 − Σ_{j=1}^{n} pj²
  where pj is the relative frequency of class j in D
◼ If a data set D is split on A into two subsets D1 and D2, the gini index giniA(D) is defined as
    giniA(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
◼ Reduction in impurity:
    Δgini(A) = gini(D) − giniA(D)
◼ The attribute providing the smallest ginisplit(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all possible splitting points for each attribute)
487
Computation of Gini Index
◼ Ex. D has 9 tuples with buys_computer = “yes” and 5 with “no”
    gini(D) = 1 − (9/14)² − (5/14)² = 0.459
◼ Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}
    gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
◼ Gini{low,high} is 0.458; Gini{medium,high} is 0.450. Thus, split on {low, medium} (and {high}) since it has the lowest Gini index
◼ All attributes are assumed continuous-valued
◼ May need other tools, e.g., clustering, to get the possible split values
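A small sketch (my own) reproducing the numbers above: gini(D) ≈ 0.459 for the 9/5 class split, and the income ∈ {low, medium} vs. {high} split works out to ≈ 0.443 when the per-partition class counts (7 yes / 3 no and 2 yes / 2 no) are read off the buys_computer table.

```python
def gini(class_counts):
    """gini(D) = 1 - sum(pj^2)."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(split_counts):
    """Weighted gini of a binary split; split_counts = [(yes1, no1), (yes2, no2)]."""
    n = sum(sum(part) for part in split_counts)
    return sum(sum(part) / n * gini(part) for part in split_counts)

print(round(gini([9, 5]), 3))                    # 0.459
# income in {low, medium}: 7 yes / 3 no;  income in {high}: 2 yes / 2 no
print(round(gini_split([(7, 3), (2, 2)]), 3))    # 0.443
```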
Comparing Attribute Selection Measures
◼ The three measures, in general, return good results, but
  ◼ Information gain:
    ◼ biased towards multivalued attributes
  ◼ Gain ratio:
    ◼ tends to prefer unbalanced splits in which one partition is much smaller than the others
  ◼ Gini index:
    ◼ biased towards multivalued attributes
    ◼ has difficulty when the # of classes is large
    ◼ tends to favor tests that result in equal-sized partitions and purity in both partitions
488
Other Attribute Selection Measures
◼
CHAID: a popular decision tree algorithm, measure based on χ2 test
for independence
◼
C-SEP: performs better than info. gain and gini index in certain cases
◼
G-statistic: has a close approximation to χ2 distribution
◼
MDL (Minimal Description Length) principle (i.e., the simplest solution
is preferred):
◼
The best tree as the one that requires the fewest # of bits to both
(1) encode the tree, and (2) encode the exceptions to the tree
◼
Multivariate splits (partition based on multiple variable combinations)
◼
◼
CART: finds multivariate splits based on a linear comb. of attrs.
Which attribute selection measure is the best?
◼
Most give good results, none is significantly superior than others
489
Overfitting and Tree Pruning
◼
◼
Overfitting: An induced tree may overfit the training data
◼ Too many branches, some may reflect anomalies due
to noise or outliers
◼ Poor accuracy for unseen samples
Two approaches to avoid overfitting
◼ Prepruning: Halt tree construction early – do not split a
node if this would result in the goodness measure
falling below a threshold
◼ Difficult to choose an appropriate threshold
◼ Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
◼ Use a set of data different from the training data to
decide which is the “best pruned tree”
490
Enhancements to Basic Decision Tree
Induction
◼
Allow for continuous-valued attributes
◼
◼
◼
Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
Handle missing attribute values
◼
Assign the most common value of the attribute
◼
Assign probability to each of the possible values
Attribute construction
◼
◼
Create new attributes based on existing ones that are
sparsely represented
This reduces fragmentation, repetition, and replication
491
Classification in Large Databases
◼
◼
◼
◼
Classification—a classical problem extensively studied by
statisticians and machine learning researchers
Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
Why is decision tree induction popular?
◼ relatively faster learning speed (than other classification
methods)
◼ convertible to simple and easy to understand
classification rules
◼ can use SQL queries for accessing databases
◼ comparable classification accuracy with other methods
RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
◼ Builds an AVC-list (attribute, value, class label)
492
Scalability Framework for
RainForest
◼
Separates the scalability aspects from the criteria that
determine the quality of the tree
◼
Builds an AVC-list: AVC (Attribute, Value, Class_label)
◼
AVC-set (of an attribute X )
◼
Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated
◼
AVC-group (of a node n )
◼
Set of AVC-sets of all predictor attributes at the node n
493
Rainforest: Training Set and Its AVC
Sets
Training Examples: the buys_computer data set (age, income, student, credit_rating, buys_computer) shown earlier

AVC-set on Age
Age     Buy_Computer: yes  no
<=30    2                  3
31..40  4                  0
>40     3                  2

AVC-set on income
income  Buy_Computer: yes  no
high    2                  2
medium  4                  2
low     3                  1

AVC-set on Student
student  Buy_Computer: yes  no
yes      6                  1
no       3                  4

AVC-set on credit_rating
Credit rating  Buy_Computer: yes  no
fair           6                  2
excellent      3                  3
494
BOAT (Bootstrapped Optimistic
Algorithm for Tree Construction)
◼
Use a statistical technique called bootstrapping to create
several smaller samples (subsets), each fits in memory
◼
Each subset is used to create a tree, resulting in several
trees
◼
These trees are examined and used to construct a new
tree T’
◼
It turns out that T’ is very close to the tree that would
be generated using the whole data set together
◼
Adv: requires only two scans of DB, an incremental alg.
495
Presentation of Classification Results
496
SGI/MineSet 3.0
497
Perception-Based Classification
(PBC)
498
Chapter 8. Classification: Basic
Concepts
◼
Classification: Basic Concepts
◼
Decision Tree Induction
◼
Bayes Classification Methods
◼
Rule-Based Classification
◼
Model Evaluation and Selection
◼
Techniques to Improve Classification Accuracy:
Ensemble Methods
◼
Summary
499
Bayesian Classification: Why?
◼ A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
◼ Foundation: based on Bayes’ Theorem
◼ Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
◼ Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data
◼ Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
500
Bayes’ Theorem: Basics
◼ Total probability theorem:
    P(B) = Σ_{i=1}^{M} P(B | Ai) P(Ai)
◼ Bayes’ theorem:
    P(H | X) = P(X | H) P(H) / P(X)
◼ Let X be a data sample (“evidence”): class label is unknown
◼ Let H be a hypothesis that X belongs to class C
◼ Classification is to determine P(H|X) (i.e., the posteriori probability): the probability that the hypothesis holds given the observed data sample X
◼ P(H) (prior probability): the initial probability
  ◼ E.g., X will buy a computer, regardless of age, income, …
◼ P(X): the probability that the sample data is observed
◼ P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  ◼ E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
501
Prediction Based on Bayes’ Theorem
◼ Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
    P(H | X) = P(X | H) P(H) / P(X)
◼ Informally, this can be viewed as
    posteriori = likelihood × prior / evidence
◼ Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes
◼ Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost
502
Classification Is to Derive the Maximum Posteriori
503
◼ Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
◼ Suppose there are m classes C1, C2, …, Cm
◼ Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
◼ This can be derived from Bayes’ theorem:
    P(Ci | X) = P(X | Ci) P(Ci) / P(X)
◼ Since P(X) is constant for all classes, only
    P(X | Ci) P(Ci)
  needs to be maximized
504
Naïve Bayes Classifier
◼ A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
    P(X | Ci) = Π_{k=1}^{n} P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)
◼ This greatly reduces the computation cost: only count the class distribution
◼ If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D| (# of tuples of Ci in D)
◼ If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
    g(x, μ, σ) = (1 / (√(2π) σ)) · exp(−(x − μ)² / (2σ²))
  and P(xk|Ci) = g(xk, μCi, σCi)
Naïve Bayes Classifier: Training
Dataset
Class:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’

Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Training data: the buys_computer table (age, income, student, credit_rating, buys_computer) shown earlier
505
Naïve Bayes Classifier: An Example
◼ P(Ci):
    P(buys_computer = “yes”) = 9/14 = 0.643
    P(buys_computer = “no”) = 5/14 = 0.357
◼ Compute P(X|Ci) for each class:
    P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
    P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
    P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
    P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
    P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
    P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
    P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
    P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
◼ X = (age <= 30, income = medium, student = yes, credit_rating = fair)
    P(X|Ci):
      P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
      P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
    P(X|Ci) × P(Ci):
      P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
      P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.007
◼ Therefore, X belongs to class (“buys_computer = yes”)
506
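The whole calculation above can be reproduced with a few lines of Python (my own sketch, not library code); the priors and conditional probabilities are exactly the hand-computed ones.

```python
# Per-class priors and conditional probabilities from the buys_computer data
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {   # P(attribute = value | class)
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

X = ["age<=30", "income=medium", "student=yes", "credit=fair"]

scores = {}
for c in priors:
    score = priors[c]
    for feature in X:                  # naive conditional-independence assumption
        score *= cond[c][feature]
    scores[c] = score

print(scores)                          # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))     # 'yes' -> X is predicted to buy a computer
```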
Avoiding the Zero-Probability
Problem
507
◼ Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:
    P(X | Ci) = Π_{k=1}^{n} P(xk | Ci)
◼ Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
◼ Use the Laplacian correction (or Laplacian estimator)
  ◼ Adding 1 to each case:
      Prob(income = low) = 1/1003
      Prob(income = medium) = 991/1003
      Prob(income = high) = 11/1003
  ◼ The “corrected” probability estimates are close to their “uncorrected” counterparts
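A quick sketch (mine) of the Laplacian correction on the income counts above, adding one virtual tuple per attribute value:

```python
def laplace_probs(counts):
    """Add-1 smoothing: (count + 1) / (total + number_of_values)."""
    total = sum(counts.values()) + len(counts)
    return {value: (c + 1) / total for value, c in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}
for value, p in laplace_probs(income_counts).items():
    print(value, round(p, 6))
# low 1/1003, medium 991/1003, high 11/1003 -- no probability is zero anymore
```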
Naïve Bayes Classifier: Comments
◼
◼
◼
Advantages
◼ Easy to implement
◼ Good results obtained in most of the cases
Disadvantages
◼ Assumption: class conditional independence, therefore
loss of accuracy
◼ Practically, dependencies exist among variables
◼ E.g., hospitals: patients: Profile: age, family history,
etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
◼ Dependencies among these cannot be modeled by
Naïve Bayes Classifier
How to deal with these dependencies? Bayesian Belief Networks
508
Chapter 8. Classification: Basic
Concepts
◼
Classification: Basic Concepts
◼
Decision Tree Induction
◼
Bayes Classification Methods
◼
Rule-Based Classification
◼
Model Evaluation and Selection
◼
Techniques to Improve Classification Accuracy:
Ensemble Methods
◼
Summary
509
Using IF-THEN Rules for
Classification
◼
◼
◼
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
◼ Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
◼ ncovers = # of tuples covered by R
◼ ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
If more than one rule is triggered, we need conflict resolution
  ◼ Size ordering: assign the highest priority to the triggering rule that has the “toughest” requirement (i.e., with the most attribute tests)
  ◼ Class-based ordering: decreasing order of prevalence or misclassification cost per class
  ◼ Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
510
Rule Extraction from a Decision Tree
◼ Rules are easier to understand than large trees
◼ One rule is created for each path from the root to a leaf
◼ Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
◼ Rules are mutually exclusive and exhaustive
[Decision tree: age? → <=30: student? (no → no, yes → yes); 31..40: yes; >40: credit rating? (excellent → no, fair → yes)]
◼ Example: rule extraction from our buys_computer decision tree
    IF age = young AND student = no    THEN buys_computer = no
    IF age = young AND student = yes   THEN buys_computer = yes
    IF age = mid-age                   THEN buys_computer = yes
511
Rule Induction: Sequential Covering
Method
◼
◼
◼
◼
Sequential covering algorithm: Extracts rules directly from
training data
Typical sequential covering algorithms: FOIL, AQ, CN2,
RIPPER
Rules are learned sequentially, each for a given class Ci will
cover many tuples of Ci but none (or few) of the tuples of
other classes
Steps:
◼ Rules are learned one at a time
◼ Each time a rule is learned, the tuples covered by the rules
are removed
◼ Repeat the process on the remaining tuples until
termination condition, e.g., when no more training
examples or when the quality of a rule returned is below a
user-specified threshold
512
Sequential Covering Algorithm
while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

[Figure: positive examples progressively covered by Rule 1, Rule 2, and Rule 3]
513
Rule Generation
◼ To generate a rule
    while (true)
        find the best predicate p
        if foil-gain(p) > threshold then add p to current rule
        else break

[Figure: the rule is grown from A3=1 to A3=1 && A1=2 to A3=1 && A1=2 && A8=5, progressively separating positive from negative examples]
514
515
How to Learn-One-Rule?
◼ Start with the most general rule possible: condition = empty
◼ Add new attributes by adopting a greedy depth-first strategy
  ◼ Pick the one that most improves the rule quality
◼ Rule-quality measures: consider both coverage and accuracy
  ◼ Foil-gain (in FOIL & RIPPER): assesses the info_gain obtained by extending the condition:
      FOIL_Gain = pos' × (log2(pos' / (pos' + neg')) − log2(pos / (pos + neg)))
  ◼ It favors rules that have high accuracy and cover many positive tuples
◼ Rule pruning based on an independent set of test tuples:
      FOIL_Prune(R) = (pos − neg) / (pos + neg)
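A direct transcription of the two measures above into Python (my own illustration; the counts used in the example call are hypothetical):

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """Information gained by extending a rule's condition (FOIL / RIPPER style)."""
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """Pruning measure evaluated on an independent test set; higher is better."""
    return (pos - neg) / (pos + neg)

# Hypothetical counts: the rule covers 100 pos / 80 neg tuples; after adding a
# predicate it covers 60 pos / 10 neg.
print(round(foil_gain(100, 80, 60, 10), 3))   # positive gain -> the predicate helps
print(round(foil_prune(60, 10), 3))           # 0.714
```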
Chapter 8. Classification: Basic
Concepts
◼
Classification: Basic Concepts
◼
Decision Tree Induction
◼
Bayes Classification Methods
◼
Rule-Based Classification
◼
Model Evaluation and Selection
◼
Techniques to Improve Classification Accuracy:
Ensemble Methods
◼
Summary
516
Model Evaluation and Selection
◼
◼
◼
◼
Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
Use validation test set of class-labeled tuples instead of
training set when assessing accuracy
Methods for estimating a classifier’s accuracy:
◼
Holdout method, random subsampling
◼
Cross-validation
◼
Bootstrap
Comparing classifiers:
◼
Confidence intervals
◼
Cost-benefit analysis and ROC Curves
517
Classifier Evaluation Metrics:
Confusion Matrix
◼ Confusion Matrix:

  Actual class \ Predicted class |        C1            |        ¬C1
  C1                             | True Positives (TP)  | False Negatives (FN)
  ¬C1                            | False Positives (FP) | True Negatives (TN)

◼ Example of Confusion Matrix:

  Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
  buy_computer = yes             |        6954        |         46        |  7000
  buy_computer = no              |         412        |       2588        |  3000
  Total                          |        7366        |       2634        | 10000

◼ Given m classes, an entry CMi,j in a confusion matrix indicates the # of tuples
  in class i that were labeled by the classifier as class j
◼ May have extra rows/columns to provide totals
518
Accuracy, Error Rate, Sensitivity
and Specificity
  A \ P |  C  | ¬C |
  C     | TP  | FN | P
  ¬C    | FP  | TN | N
        | P’  | N’ | All

◼ Class Imbalance Problem:
  ◼ One class may be rare, e.g., fraud or HIV-positive
  ◼ Significant majority of the negative class and minority of the positive class
◼ Classifier Accuracy, or recognition rate: percentage of test set tuples that are
  correctly classified
  Accuracy = (TP + TN)/All
◼ Error rate: 1 – accuracy, or Error rate = (FP + FN)/All
◼ Sensitivity: True Positive recognition rate
  ◼ Sensitivity = TP/P
◼ Specificity: True Negative recognition rate
  ◼ Specificity = TN/N
519
Precision and Recall, and F-measures
◼ Precision: exactness – what % of tuples that the classifier labeled as positive
  are actually positive?
  Precision = TP / (TP + FP)
◼ Recall: completeness – what % of positive tuples did the classifier label as
  positive?
  Recall = TP / (TP + FN)
◼ Perfect score is 1.0
◼ Inverse relationship between precision & recall
◼ F measure (F1 or F-score): harmonic mean of precision and recall
  F1 = 2 × precision × recall / (precision + recall)
◼ Fß: weighted measure of precision and recall
  ◼ assigns ß times as much weight to recall as to precision
  Fß = (1 + ß²) × precision × recall / (ß² × precision + recall)
520
Classifier Evaluation Metrics: Example
◼ Example (cancer classification):

  Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
  cancer = yes                   |      90      |     210     |   300 | 30.00 (sensitivity)
  cancer = no                    |     140      |    9560     |  9700 | 98.56 (specificity)
  Total                          |     230      |    9770     | 10000 | 96.50 (accuracy)

◼ Precision = 90/230 = 39.13%
◼ Recall = 90/300 = 30.00%
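As a quick check of the numbers above, a small Python sketch recomputes the metrics
directly from the confusion-matrix entries; the F1 line goes beyond the slide but
just applies the harmonic-mean definition given earlier.

TP, FN = 90, 210      # actual cancer = yes
FP, TN = 140, 9560    # actual cancer = no
P, N = TP + FN, FP + TN          # 300 positives, 9700 negatives
All = P + N

accuracy    = (TP + TN) / All    # 0.9650
error_rate  = (FP + FN) / All    # 0.0350
sensitivity = TP / P             # 0.3000  (true positive recognition rate)
specificity = TN / N             # 0.9856  (true negative recognition rate)
precision   = TP / (TP + FP)     # 0.3913
recall      = TP / P             # 0.3000  (same as sensitivity)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean, about 0.34

print(f"acc={accuracy:.4f} sens={sensitivity:.4f} spec={specificity:.4f} "
      f"prec={precision:.4f} rec={recall:.4f} F1={f1:.4f}")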
521
Holdout & Cross-Validation
Methods
◼ Holdout method
  ◼ Given data is randomly partitioned into two independent sets
    ◼ Training set (e.g., 2/3) for model construction
    ◼ Test set (e.g., 1/3) for accuracy estimation
  ◼ Random subsampling: a variation of holdout
    ◼ Repeat holdout k times; accuracy = avg. of the accuracies obtained
◼ Cross-validation (k-fold, where k = 10 is most popular)
  ◼ Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of
    approximately equal size
  ◼ At the i-th iteration, use Di as the test set and the others as the training set
  ◼ Leave-one-out: k folds where k = # of tuples, for small-sized data
  ◼ Stratified cross-validation: folds are stratified so that the class distribution
    in each fold is approximately the same as that in the initial data
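A usage sketch of stratified 10-fold cross-validation, assuming scikit-learn is
available; the data set and classifier are illustrative choices, not part of the
slides.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Folds are stratified so each fold preserves the overall class distribution
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())   # average accuracy over the 10 folds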
522
Evaluating Classifier Accuracy:
Bootstrap
◼ Bootstrap
  ◼ Works well with small data sets
  ◼ Samples the given training tuples uniformly with replacement
    ◼ i.e., each time a tuple is selected, it is equally likely to be selected
      again and re-added to the training set
◼ Several bootstrap methods exist; a common one is the .632 bootstrap
  ◼ A data set with d tuples is sampled d times, with replacement, resulting in a
    training set of d samples. The data tuples that did not make it into the
    training set end up forming the test set. About 63.2% of the original data end
    up in the bootstrap sample, and the remaining 36.8% form the test set
    (since (1 – 1/d)^d ≈ e^–1 = 0.368)
  ◼ Repeat the sampling procedure k times; the overall accuracy of the model is
    Acc(M) = (1/k) Σi [ 0.632 × Acc(Mi)test_set + 0.368 × Acc(Mi)train_set ]
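A small numpy sketch of one .632-bootstrap split, illustrating why roughly 63.2% of
the tuples land in the training sample; the data size is arbitrary.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_split(n, rng):
    """Sample n indices with replacement for training; unsampled ones form the test set."""
    train = rng.integers(0, n, size=n)          # ~63.2% distinct tuples on average
    test = np.setdiff1d(np.arange(n), train)    # the remaining ~36.8%
    return train, test

train_idx, test_idx = bootstrap_split(1000, rng)
print(len(np.unique(train_idx)) / 1000)   # close to 0.632
print(len(test_idx) / 1000)               # close to 0.368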
523
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
◼
Suppose we have 2 classifiers, M1 and M2, which one is
better?
◼
Use 10-fold cross-validation to obtain the mean error rates err(M1) and err(M2)
◼
These mean error rates are just estimates of error on the
true population of future data cases
◼
What if the difference between the 2 error rates is just
attributed to chance?
◼
Use a test of statistical significance
◼
Obtain confidence limits for our error estimates
524
Estimating Confidence Intervals:
Null Hypothesis
◼
Perform 10-fold cross-validation
◼
Assume samples follow a t distribution with k–1
degrees of freedom (here, k=10)
◼
Use t-test (or Student’s t-test)
◼
Null Hypothesis: M1 & M2 are the same
◼
If we can reject null hypothesis, then
◼
we conclude that the difference between M1 & M2 is
statistically significant
◼
Chose model with lower error rate
525
Estimating Confidence Intervals: t-test
◼ If only 1 test set is available: pairwise comparison
  ◼ For the i-th round of 10-fold cross-validation, the same cross partitioning is
    used to obtain err(M1)i and err(M2)i
  ◼ Average over the 10 rounds to get the mean error rates err(M1) and err(M2)
  ◼ The t-test computes the t-statistic with k–1 degrees of freedom:
      t = ( err(M1) – err(M2) ) / sqrt( var(M1 – M2) / k )
    where
      var(M1 – M2) = (1/k) Σi [ err(M1)i – err(M2)i – ( err(M1) – err(M2) ) ]²
◼ If two test sets are available: use the non-paired t-test, with
      var(M1 – M2) = var(M1)/k1 + var(M2)/k2
    where k1 & k2 are the # of cross-validation samples used for M1 & M2
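A sketch of the paired comparison using scipy's ttest_rel (which implements the
paired t-test with k–1 degrees of freedom); the per-fold error rates below are
hypothetical.

import numpy as np
from scipy import stats

# Hypothetical per-fold error rates from the same 10-fold partitioning
err_m1 = np.array([0.12, 0.10, 0.15, 0.11, 0.13, 0.09, 0.14, 0.12, 0.10, 0.13])
err_m2 = np.array([0.15, 0.14, 0.16, 0.13, 0.15, 0.12, 0.17, 0.14, 0.13, 0.16])

t_stat, p_value = stats.ttest_rel(err_m1, err_m2)
if p_value < 0.05:          # sig = 5%
    print("reject H0: the difference between M1 and M2 is statistically significant")
else:
    print("cannot reject H0: the difference may be due to chance")
print(t_stat, p_value)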
526
Estimating Confidence Intervals:
Table for t-distribution
◼
◼ Symmetric
◼ Significance level, e.g., sig = 0.05 or 5%, means M1 & M2 are significantly
  different for 95% of the population
◼ Confidence limit, z = sig/2
527
Estimating Confidence Intervals:
Statistical Significance
◼
Are M1 & M2 significantly different?
◼ Compute t. Select significance level (e.g. sig = 5%)
◼ Consult table for t-distribution: Find t value
corresponding to k-1 degrees of freedom (here, 9)
◼ t-distribution is symmetric: typically upper % points of
distribution shown → look up value for confidence
limit z=sig/2 (here, 0.025)
◼ If t > z or t < -z, then t value lies in rejection region:
◼ Reject null hypothesis that mean error rates of
M1 & M2 are same
◼ Conclude: statistically significant difference between
M1 & M2
◼ Otherwise, conclude that any difference is chance
528
Model Selection: ROC Curves
◼
◼
◼
◼
◼
ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
Originated from signal detection
theory
Shows the trade-off between the
true positive rate and the false
positive rate
The area under the ROC curve is a
measure of the accuracy of the
model
Rank the test tuples in decreasing
order: the one that is most likely to
belong to the positive class appears
at the top of the list
◼
◼
◼
◼
Vertical axis
represents the true
positive rate
Horizontal axis rep.
the false positive rate
The plot also shows a
diagonal line
A model with perfect
accuracy will have an
area of 1.0
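A usage sketch of ROC computation with scikit-learn, assuming that library; the data
set and classifier are illustrative, and roc_curve ranks the test tuples by the
classifier's probability of the positive class exactly as described above.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # false/true positive rates per threshold
print("AUC =", roc_auc_score(y_te, scores))      # 1.0 = perfect, 0.5 = the diagonal line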
529
Issues Affecting Model Selection
◼
Accuracy
◼
◼
classifier accuracy: predicting class label
Speed
◼
time to construct the model (training time)
◼
time to use the model (classification/prediction time)
◼
Robustness: handling noise and missing values
◼
Scalability: efficiency in disk-resident databases
◼
Interpretability
◼
◼
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision
tree size or compactness of classification rules
530
Chapter 8. Classification: Basic Concepts
◼
Classification: Basic Concepts
◼
Decision Tree Induction
◼
Bayes Classification Methods
◼
Rule-Based Classification
◼
Model Evaluation and Selection
◼
Techniques to Improve Classification Accuracy:
Ensemble Methods
◼
Summary
531
Ensemble Methods: Increasing the
Accuracy
◼
◼
Ensemble methods
◼ Use a combination of models to increase accuracy
◼ Combine a series of k learned models, M1, M2, …, Mk,
with the aim of creating an improved model M*
Popular ensemble methods
◼ Bagging: averaging the prediction over a collection of
classifiers
◼ Boosting: weighted vote with a collection of classifiers
◼ Ensemble: combining a set of heterogeneous classifiers
532
Bagging: Bootstrap Aggregation
◼
◼
◼
◼
◼
Analogy: Diagnosis based on multiple doctors’ majority vote
Training
◼ Given a set D of d tuples, at each iteration i, a training set Di of d
tuples is sampled with replacement from D (i.e., bootstrap)
◼ A classifier model Mi is learned for each training set Di
Classification: classify an unknown sample X
◼ Each classifier Mi returns its class prediction
◼ The bagged classifier M* counts the votes and assigns the class
with the most votes to X
Prediction: can be applied to the prediction of continuous values by
taking the average value of each prediction for a given test tuple
Accuracy
◼ Often significantly better than a single classifier derived from D
◼ For noisy data: not considerably worse, more robust
◼ Proven to improve accuracy in prediction
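A minimal bagging sketch, assuming numpy and scikit-learn; it follows the
bootstrap-then-majority-vote recipe above rather than any particular library
implementation (scikit-learn's BaggingClassifier packages the same idea).

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rng = np.random.default_rng(0)

k = 25
preds = []
for i in range(k):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))    # bootstrap sample D_i
    m_i = DecisionTreeClassifier(random_state=i).fit(X_tr[idx], y_tr[idx])
    preds.append(m_i.predict(X_te))

# M* assigns each test tuple the class with the most votes (labels here are 0/1)
votes = np.array(preds)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("bagged accuracy:", (majority == y_te).mean())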
533
Boosting
◼
◼
◼
◼
Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
How does boosting work?
◼
Weights are assigned to each training tuple
◼
A series of k classifiers is iteratively learned
◼
After a classifier Mi is learned, the weights are updated
to allow the subsequent classifier, Mi+1, to pay more
attention to the training tuples that were
misclassified by Mi
◼
The final M* combines the votes of each individual
classifier, where the weight of each classifier's vote is a
function of its accuracy
Boosting algorithm can be extended for numeric prediction
Compared with bagging: boosting tends to achieve greater accuracy, but it also
risks overfitting the model to misclassified data
534
535
Adaboost (Freund and Schapire, 1997)
◼
◼
◼
◼
Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
Initially, all the weights of tuples are set the same (1/d)
Generate k classifiers in k rounds. At round i,
◼
Tuples from D are sampled (with replacement) to form a
training set Di of the same size
◼
Each tuple’s chance of being selected is based on its weight
◼
A classification model Mi is derived from Di
◼
Its error rate is calculated using Di as a test set
◼
If a tuple is misclassified, its weight is increased, o.w. it is
decreased
◼ Error rate: err(Xj) is the misclassification error of tuple Xj (1 if
  misclassified, 0 otherwise). Classifier Mi's error rate is the weighted sum over
  the misclassified tuples:
      error(Mi) = Σj wj × err(Xj)
◼ The weight of classifier Mi's vote is
      log( (1 – error(Mi)) / error(Mi) )
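A compact sketch of the AdaBoost bookkeeping above, assuming scikit-learn decision
stumps; instead of resampling by weight it passes the tuple weights directly to the
learner, which is a common equivalent shortcut, and the number of rounds is arbitrary.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
d = len(X)
w = np.full(d, 1.0 / d)            # initially all tuple weights are 1/d
alphas, models = [], []

for i in range(10):                # k = 10 rounds
    m_i = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = (m_i.predict(X) != y)
    error = np.sum(w * miss)                        # error(M_i) = sum_j w_j * err(X_j)
    if error == 0 or error >= 0.5:                  # the full algorithm would resample and retry
        break
    alpha = np.log((1 - error) / error)             # weight of classifier M_i's vote
    w = w * np.where(miss, 1.0, error / (1 - error))  # shrink weights of correctly classified tuples
    w = w / w.sum()                                 # renormalize so the weights sum to 1
    alphas.append(alpha)
    models.append(m_i)

print(len(models), "weak classifiers, vote weights:", np.round(alphas, 2))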
Random Forest (Breiman 2001)
◼
◼
◼
◼
Random Forest:
◼ Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
◼ During classification, each tree votes and the most popular class is
returned
Two Methods to construct Random Forest:
◼ Forest-RI (random input selection): Randomly select, at each
node, F attributes as candidates for the split at the node. The
CART methodology is used to grow the trees to maximum size
◼ Forest-RC (random linear combinations): Creates new attributes
(or features) that are a linear combination of the existing
attributes (reduces the correlation between individual classifiers)
Comparable in accuracy to Adaboost, but more robust to errors and
outliers
Insensitive to the number of attributes selected for consideration at
each split, and faster than bagging or boosting
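A usage sketch, assuming scikit-learn: RandomForestClassifier's max_features plays
the role of F in Forest-RI (the number of randomly chosen attributes considered at
each split); the data set is illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=10).mean())   # each tree votes; the majority class is returned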
536
Classification of Class-Imbalanced Data Sets
◼
◼
◼
Class-imbalance problem: rare positive examples but numerous negative ones, e.g.,
medical diagnosis, fraud, oil-spill, fault detection, etc.
Traditional methods assume a balanced distribution of classes and equal error
costs: not suitable for class-imbalanced data
Typical methods for imbalanced data in 2-class classification:
◼ Oversampling: re-sampling of data from positive class
◼ Under-sampling: randomly eliminate tuples from
negative class
◼ Threshold-moving: moves the decision threshold, t, so
that the rare class tuples are easier to classify, and
hence, less chance of costly false negative errors
◼ Ensemble techniques: Ensemble multiple classifiers
introduced above
537
Chapter 8. Classification: Basic Concepts
◼
Classification: Basic Concepts
◼
Decision Tree Induction
◼
Bayes Classification Methods
◼
Rule-Based Classification
◼
Model Evaluation and Selection
◼
Techniques to Improve Classification Accuracy:
Ensemble Methods
◼
Summary
538
Summary (I)
◼
◼
◼
◼
Classification is a form of data analysis that extracts models
describing important data classes.
Effective and scalable methods have been developed for
decision tree induction, naïve Bayesian classification, rule-based classification, and many other classification methods.
Evaluation metrics include: accuracy, sensitivity, specificity,
precision, recall, F measure, and Fß measure.
Stratified k-fold cross-validation is recommended for
accuracy estimation. Bagging and boosting can be used to
increase overall accuracy by learning and combining a series
of individual models.
539
Summary (II)
◼
Significance tests and ROC curves are useful for model
selection.
◼
There have been numerous comparisons of the different
classification methods; the matter remains a research topic
◼
No single method has been found to be superior over all
others for all data sets
◼
Issues such as accuracy, training time, robustness,
scalability, and interpretability must be considered and can
involve trade-offs, further complicating the quest for an
overall superior method
540
References (1)
◼
◼
◼
◼
◼
◼
◼
◼
◼
C. Apte and S. Weiss. Data mining with decision trees and decision
rules. Future Generation Computer Systems, 13, 1997
C. M. Bishop, Neural Networks for Pattern Recognition. Oxford
University Press, 1995
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and
Regression Trees. Wadsworth International Group, 1984
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern
Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from
partitioned data for scaling machine learning. KDD'95
H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern
Analysis for Effective Classification, ICDE'07
H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern
Mining for Effective Classification, ICDE'08
W. Cohen. Fast effective rule induction. ICML'95
G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule
groups for gene expression data. SIGMOD'05
541
References (2)
◼
◼
◼
◼
◼
◼
◼
◼
◼
A. J. Dobson. An Introduction to Generalized Linear Models. Chapman &
Hall, 1990.
G. Dong and J. Li. Efficient mining of emerging patterns: Discovering
trends and differences. KDD'99.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John
Wiley, 2001
U. M. Fayyad. Branching on attribute values in decision tree generation.
AAAI’94.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. J. Computer and System Sciences,
1997.
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast
decision tree construction of large datasets. VLDB’98.
J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic
Decision Tree Construction. SIGMOD'99.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
542
References (3)
◼
◼
◼
◼
◼
◼
◼
◼
T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy,
complexity, and training time of thirty-three old and new
classification algorithms. Machine Learning, 2000.
J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor,
Advanced Methods of Marketing Research, Blackwell Business, 1994.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for
data mining. EDBT'96.
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
S. K. Murthy, Automatic Construction of Decision Trees from Data: A
Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106,
1986.
J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann,
1993.
543
References (4)
◼
◼
◼
◼
◼
◼
◼
◼
◼
R. Rastogi and K. Shim. Public: A decision tree classifier that integrates
building and pruning. VLDB’98.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel
classifier for data mining. VLDB’96.
J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan
Kaufmann, 1990.
P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison
Wesley, 2005.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn:
Classification and Prediction Methods from Statistics, Neural Nets,
Machine Learning, and Expert Systems. Morgan Kaufman, 1991.
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann,
1997.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools
and Techniques, 2ed. Morgan Kaufmann, 2005.
X. Yin and J. Han. CPAR: Classification based on predictive association
rules. SDM'03
H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03.
544
CS412 Midterm Exam Statistics
◼ Opinion Question Answering:
  ◼ Like the style: 70.83%; dislike: 29.16%
  ◼ Exam is hard: 55.75%; easy: 0.6%; just right: 43.63%
  ◼ Time: plenty: 3.03%; enough: 36.96%; not enough: 60%
◼ Score distribution: # of students (Total: 180)
  ◼ >=90: 24   ◼ 80-89: 54   ◼ 70-79: 46   ◼ 60-69: 37
  ◼ 50-59: 15  ◼ 40-49: 2    ◼ <40: 2
◼ Final grading is based on overall score
546
Issues: Evaluating Classification Methods
◼
◼
◼
◼
◼
◼
Accuracy
◼ classifier accuracy: predicting class label
◼ predictor accuracy: guessing value of predicted
attributes
Speed
◼ time to construct the model (training time)
◼ time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
◼ understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision
tree size or compactness of classification rules
547
Predictor Error Measures
◼
◼
◼
Measure predictor accuracy: measure how far off the predicted value is
from the actual known value
Loss function: measures the error betw. yi and the predicted value yi’
◼
Absolute error: | yi – yi’|
◼
Squared error: (yi – yi’)2
Test error (generalization error):
the average loss over the test
set
◼ Mean absolute error:      (1/d) Σi |yi – yi’|
◼ Mean squared error:       (1/d) Σi (yi – yi’)²
◼ Relative absolute error:  Σi |yi – yi’|  /  Σi |yi – ȳ|
◼ Relative squared error:   Σi (yi – yi’)² /  Σi (yi – ȳ)²
  (all sums run over the d test tuples; ȳ is the mean of the actual values yi)
◼ The mean squared error exaggerates the presence of outliers
◼ Popularly used: the (square) root mean squared error and, similarly, the root
  relative squared error
548
Scalable Decision Tree Induction
Methods
◼
◼
◼
◼
◼
SLIQ (EDBT’96 — Mehta et al.)
◼ Builds an index for each attribute and only class list and
the current attribute list reside in memory
SPRINT (VLDB’96 — J. Shafer et al.)
◼ Constructs an attribute list data structure
PUBLIC (VLDB’98 — Rastogi & Shim)
◼ Integrates tree splitting and tree pruning: stop growing
the tree earlier
RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
◼ Builds an AVC-list (attribute, value, class label)
BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
◼ Uses bootstrapping to create several small samples
549
Data Cube-Based Decision-Tree
Induction
◼
◼
Integration of generalization with decision-tree induction
(Kamber et al.’97)
Classification at primitive concept levels
◼
◼
◼
◼
E.g., precise temperature, humidity, outlook, etc.
Low-level concepts, scattered classes, bushy
classification-trees
Semantic interpretation problems
Cube-based multi-level classification
◼
Relevance analysis at multi-levels
◼
Information-gain analysis with dimension + level
550
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
◼
Cluster Analysis: Basic Concepts
◼
Partitioning Methods
◼
Hierarchical Methods
◼
Density-Based Methods
◼
Grid-Based Methods
◼
Evaluation of Clustering
◼
Summary
551
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 9 —
Classification: Advanced Methods
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
552
Chapter 9. Classification: Advanced Methods
◼
Bayesian Belief Networks
◼
Classification by Backpropagation
◼
Support Vector Machines
◼
Classification by Using Frequent Patterns
◼
Lazy Learners (or Learning from Your Neighbors)
◼
Other Classification Methods
◼
Additional Topics Regarding Classification
◼
Summary
553
Bayesian Belief Networks
◼
Bayesian belief networks (also known as Bayesian
networks, probabilistic networks): allow class
conditional independencies between subsets of variables
◼
A (directed acyclic) graphical model of causal relationships
◼
◼
Represents dependency among the variables
Gives a specification of joint probability distribution
❑ Nodes: random variables
❑ Links: dependency
[figure: a small directed acyclic graph with nodes X, Y, Z, and P]
❑ X and Y are the parents of Z, and Y is the parent of P
❑ No dependency between Z and P
❑ Has no loops/cycles
554
Bayesian Belief Network: An Example
[figure: network with nodes Family History (FH), Smoker (S), LungCancer (LC),
 Emphysema, PositiveXRay, and Dyspnea; FH and S are the parents of LC]

◼ CPT: Conditional Probability Table for variable LungCancer:

        | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S)
   LC   |   0.8   |   0.5    |   0.7    |   0.1
   ~LC  |   0.2   |   0.5    |   0.3    |   0.9

  The CPT shows the conditional probability for each possible combination of the
  values of its parents
◼ Derivation of the probability of a particular combination of values of
  X = (x1, …, xn) from the CPTs:

      P(x1, …, xn) = Πi=1..n P(xi | Parents(xi))
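A small sketch of this factored joint probability for a fragment of the network; the
LungCancer CPT values come from the table above, while the priors P(FH) and P(S) are
made-up numbers for illustration only.

p_fh = {True: 0.1, False: 0.9}            # assumed prior, not from the slide
p_s  = {True: 0.3, False: 0.7}            # assumed prior, not from the slide
p_lc = {                                  # P(LC = yes | FH, S) from the CPT above
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh, s, lc):
    """P(FH, S, LC) under the network factorization P(FH) * P(S) * P(LC | FH, S)."""
    p_lc_given = p_lc[(fh, s)] if lc else 1.0 - p_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc_given

print(joint(True, True, True))    # 0.1 * 0.3 * 0.8 = 0.024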
555
Training Bayesian Networks: Several
Scenarios
◼
◼
◼
◼
◼
Scenario 1: Given both the network structure and all variables
observable: compute only the CPT entries
Scenario 2: Network structure known, some variables hidden: gradient
descent (greedy hill-climbing) method, i.e., search for a solution along
the steepest descent of a criterion function
◼ Weights are initialized to random probability values
◼ At each iteration, it moves towards what appears to be the best
solution at the moment, w.o. backtracking
◼ Weights are updated at each iteration & converge to local optimum
Scenario 3: Network structure unknown, all variables observable:
search through the model space to reconstruct network topology
Scenario 4: Unknown structure, all hidden variables: No good
algorithms known for this purpose
D. Heckerman. A Tutorial on Learning with Bayesian Networks. In
Learning in Graphical Models, M. Jordan, ed.. MIT Press, 1999.
556
Chapter 9. Classification: Advanced Methods
◼
Bayesian Belief Networks
◼
Classification by Backpropagation
◼
Support Vector Machines
◼
Classification by Using Frequent Patterns
◼
Lazy Learners (or Learning from Your Neighbors)
◼
Other Classification Methods
◼
Additional Topics Regarding Classification
◼
Summary
557
Classification by Backpropagation
◼
◼
◼
◼
◼
Backpropagation: A neural network learning algorithm
Started by psychologists and neurobiologists to develop
and test computational analogues of neurons
A neural network: A set of connected input/output units
where each connection has a weight associated with it
During the learning phase, the network learns by
adjusting the weights so as to be able to predict the
correct class label of the input tuples
Also referred to as connectionist learning due to the
connections between units
558
Neural Network as a Classifier
◼
Weakness
◼
◼
◼
◼
Long training time
Require a number of parameters typically best determined
empirically, e.g., the network topology or “structure.”
Poor interpretability: Difficult to interpret the symbolic meaning
behind the learned weights and of “hidden units” in the network
Strength
◼
High tolerance to noisy data
◼
Ability to classify untrained patterns
◼
Well-suited for continuous-valued inputs and outputs
◼
Successful on an array of real-world data, e.g., hand-written letters
◼
Algorithms are inherently parallel
◼
Techniques have recently been developed for the extraction of
rules from trained neural networks
559
A Multi-Layer Feed-Forward Neural Network
Output vector
wj(k+1) = wj(k) + λ ( yi – ŷi(k) ) xij     (λ: learning rate)
Output layer
Hidden layer
wij
Input layer
Input vector: X
560
How A Multi-Layer Neural Network Works
◼
◼
The inputs to the network correspond to the attributes measured
for each training tuple
Inputs are fed simultaneously into the units making up the input
layer
◼
They are then weighted and fed simultaneously to a hidden layer
◼
The number of hidden layers is arbitrary, although usually only one
◼
◼
◼
The weighted outputs of the last hidden layer are input to units
making up the output layer, which emits the network's prediction
The network is feed-forward: None of the weights cycles back to
an input unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear
regression: Given enough hidden units and enough training
samples, they can closely approximate any function
561
Defining a Network Topology
◼
◼
◼
◼
◼
Decide the network topology: Specify # of units in the
input layer, # of hidden layers (if > 1), # of units in each
hidden layer, and # of units in the output layer
Normalize the input values for each attribute measured in the training tuples to
[0.0, 1.0]
For a discrete-valued attribute, one input unit per domain value may be used, each
initialized to 0
Output: for classification with more than two classes, one output unit per class
is used
If a trained network's accuracy is unacceptable, repeat the training process with
a different network topology or a different set of initial weights
562
Backpropagation
◼
Iteratively process a set of training tuples & compare the network's
prediction with the actual known target value
◼
For each training tuple, the weights are modified to minimize the
mean squared error between the network's prediction and the actual
target value
◼
Modifications are made in the “backwards” direction: from the output
layer, through each hidden layer down to the first hidden layer, hence
“backpropagation”
◼
Steps
◼
Initialize weights to small random numbers, associated with biases
◼
Propagate the inputs forward (by applying activation function)
◼
Backpropagate the error (by updating weights and biases)
◼
Terminating condition (when error is very small, etc.)
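A minimal numpy sketch of these steps for a one-hidden-layer network with sigmoid
units; the toy data, layer sizes, learning rate, and epoch count are arbitrary
choices made for illustration.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# toy data: 4 tuples, 2 attributes, binary target (logical OR)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [1]], dtype=float)

# initialize weights to small random numbers, associated with biases
W1, b1 = rng.normal(0, 0.5, (2, 3)), np.zeros(3)     # input -> hidden (3 units)
W2, b2 = rng.normal(0, 0.5, (3, 1)), np.zeros(1)     # hidden -> output
lr = 0.5

for epoch in range(5000):
    # propagate the inputs forward
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backpropagate the error (deltas use the sigmoid derivative out * (1 - out))
    err_out = (y - out) * out * (1 - out)
    err_hid = (err_out @ W2.T) * h * (1 - h)
    # update weights and biases in the "backwards" direction
    W2 += lr * h.T @ err_out;  b2 += lr * err_out.sum(axis=0)
    W1 += lr * X.T @ err_hid;  b1 += lr * err_hid.sum(axis=0)

print(np.round(out.ravel(), 2))   # moves toward [0, 1, 1, 1] as training proceeds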
563
Neuron: A Hidden/Output Layer Unit
[figure: inputs x0 … xn with weights w0 … wn feed a weighted sum Σ; a bias θk is
 subtracted and an activation function f produces the output y]

◼ For example:  y = sign( Σi=0..n wi xi – θk )
◼ An n-dimensional input vector x is mapped into variable y by means of the scalar
  product and a nonlinear function mapping
◼ The inputs to the unit are outputs from the previous layer. They are multiplied
  by their corresponding weights to form a weighted sum, which is added to the bias
  associated with the unit. Then a nonlinear activation function is applied to it.
564
Efficiency and Interpretability
◼
◼
Efficiency of backpropagation: Each epoch (one iteration through the
training set) takes O(|D| * w), with |D| tuples and w weights, but # of
epochs can be exponential to n, the number of inputs, in worst case
For easier comprehension: Rule extraction by network pruning
◼
◼
◼
◼
Simplify the network structure by removing weighted links that
have the least effect on the trained network
Then perform link, unit, or activation value clustering
The set of input and activation values are studied to derive rules
describing the relationship between the input and hidden unit
layers
Sensitivity analysis: assess the impact that a given input variable
has on a network output. The knowledge gained from this analysis
can be represented in rules
565
Chapter 9. Classification: Advanced Methods
◼
Bayesian Belief Networks
◼
Classification by Backpropagation
◼
Support Vector Machines
◼
Classification by Using Frequent Patterns
◼
Lazy Learners (or Learning from Your Neighbors)
◼
Other Classification Methods
◼
Additional Topics Regarding Classification
◼
Summary
566
Classification: A Mathematical Mapping
◼ Classification: predicts categorical class labels
  ◼ E.g., personal homepage classification
    ◼ xi = (x1, x2, x3, …), yi = +1 or –1
    ◼ x1: # of word “homepage”
    ◼ x2: # of word “welcome”
◼ Mathematically, x ∈ X = ℝn, y ∈ Y = {+1, –1}
  ◼ We want to derive a function f: X → Y
◼ Linear Classification
  ◼ Binary classification problem
  [figure: points labeled ‘x’ and ‘o’ separated by a line]
  ◼ Data above the line belongs to class ‘x’; data below the line belongs to class ‘o’
  ◼ Examples: SVM, Perceptron, Probabilistic Classifiers
567
Discriminative Classifiers
◼
◼
Advantages
◼ Prediction accuracy is generally high
◼ As compared to Bayesian methods – in general
◼ Robust, works when training examples contain errors
◼ Fast evaluation of the learned target function
◼ Bayesian networks are normally slow
Criticism
◼ Long training time
◼ Difficult to understand the learned function (weights)
◼ Bayesian networks can be used easily for pattern
discovery
◼ Not easy to incorporate domain knowledge
◼ Easy in the form of priors on the data or distributions
568
SVM—Support Vector Machines
◼
◼
◼
◼
◼
A relatively new classification method for both linear and
nonlinear data
It uses a nonlinear mapping to transform the original
training data into a higher dimension
With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
With an appropriate nonlinear mapping to a sufficiently
high dimension, data from two classes can always be
separated by a hyperplane
SVM finds this hyperplane using support vectors
(“essential” training tuples) and margins (defined by the
support vectors)
569
SVM—History and Applications
◼
Vapnik and colleagues (1992)—groundwork from Vapnik
& Chervonenkis’ statistical learning theory in 1960s
◼
Features: training can be slow but accuracy is high owing
to their ability to model complex nonlinear decision
boundaries (margin maximization)
◼
Used for: classification and numeric prediction
◼
Applications:
◼
handwritten digit recognition, object recognition,
speaker identification, benchmarking time-series
prediction tests
570
SVM—General Philosophy
[figure: two separating hyperplanes, one with a small margin and one with a large
 margin; the support vectors lie on the margin boundaries]
571
SVM—Margins and Support Vectors
572
SVM—When Data Is Linearly Separable
[figure: two classes of points separated by hyperplanes with different margins m]
◼ Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples
  with associated class labels yi
◼ There are infinitely many lines (hyperplanes) separating the two classes, but we
  want to find the best one (the one that minimizes classification error on unseen
  data)
◼ SVM searches for the hyperplane with the largest margin, i.e., the maximum
  marginal hyperplane (MMH)
573
SVM—Linearly Separable
◼
A separating hyperplane can be written as
W●X+b=0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
◼
For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
◼
The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1
for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
◼
◼
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
This becomes a constrained (convex) quadratic optimization
problem: Quadratic objective function and linear constraints →
Quadratic Programming (QP) → Lagrangian multipliers
574
Why Is SVM Effective on High Dimensional Data?
◼
The complexity of trained classifier is characterized by the # of
support vectors rather than the dimensionality of the data
◼
The support vectors are the essential or critical training examples —
they lie closest to the decision boundary (MMH)
◼
If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
◼
The number of support vectors found can be used to compute an
(upper) bound on the expected error rate of the SVM classifier, which
is independent of the data dimensionality
◼
Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high
575
SVM—Linearly Inseparable
◼ Transform the original input data into a higher dimensional space
  [figure: data that is not linearly separable in the original (A1, A2) space
   becomes linearly separable after the transformation]
◼ Search for a linear separating hyperplane in the new space
576
SVM: Different Kernel functions
◼ Instead of computing the dot product on the transformed data, it is
  mathematically equivalent to apply a kernel function K(Xi, Xj) to the original
  data, i.e., K(Xi, Xj) = Φ(Xi)·Φ(Xj)
◼ Typical kernel functions:
  ◼ Polynomial kernel of degree h:  K(Xi, Xj) = (Xi·Xj + 1)^h
  ◼ Gaussian radial basis function: K(Xi, Xj) = exp(–‖Xi – Xj‖² / 2σ²)
  ◼ Sigmoid kernel:                 K(Xi, Xj) = tanh(κ Xi·Xj – δ)
◼ SVM can also be used for classifying multiple (> 2) classes and for regression
  analysis (with additional parameters)
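A usage sketch comparing kernels with scikit-learn's SVC, assuming that library; the
data set and the feature-scaling step are illustrative choices.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    print(kernel, cross_val_score(clf, X, y, cv=5).mean())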
577
Scaling SVM by Hierarchical Micro-Clustering
◼
SVM is not scalable to the number of data objects in terms of training
time and memory usage
◼
H. Yu, J. Yang, and J. Han, “Classifying Large Data Sets Using SVM
with Hierarchical Clusters”, KDD'03)
◼
CB-SVM (Clustering-Based SVM)
◼
Given limited amount of system resources (e.g., memory),
maximize the SVM performance in terms of accuracy and the
training speed
◼
◼
Use micro-clustering to effectively reduce the number of points to
be considered
At deriving support vectors, de-cluster micro-clusters near
“candidate vector” to ensure high classification accuracy
578
CF-Tree: Hierarchical Micro-cluster
◼
◼
Read the data set once, construct a statistical summary of the data
(i.e., hierarchical clusters) given a limited amount of memory
Micro-clustering: Hierarchical indexing structure
◼
provide finer samples closer to the boundary and coarser
samples farther from the boundary
579
Selective Declustering: Ensure High Accuracy
◼
CF tree is a suitable base structure for selective declustering
◼
De-cluster only the cluster Ei such that
◼
◼
Di – Ri < Ds, where Di is the distance from the boundary to the
center point of Ei and Ri is the radius of Ei
Decluster only the cluster whose subclusters have possibilities to be
the support cluster of the boundary
◼
“Support cluster”: The cluster whose centroid is a support vector
580
CB-SVM Algorithm: Outline
◼
◼
◼
◼
◼
Construct two CF-trees from positive and negative data
sets independently
◼ Need one scan of the data set
Train an SVM from the centroids of the root entries
De-cluster the entries near the boundary into the next
level
◼ The children entries de-clustered from the parent
entries are accumulated into the training set with the
non-declustered parent entries
Train an SVM again from the centroids of the entries in
the training set
Repeat until nothing is accumulated
581
Accuracy and Scalability on Synthetic Dataset
◼
Experiments on large synthetic data sets show better accuracy than random sampling
approaches and far better scalability than the original SVM algorithm
582
SVM vs. Neural Network
◼
SVM
◼
◼
◼
◼
Deterministic algorithm
Nice generalization
properties
Hard to learn – learned
in batch mode using
quadratic programming
techniques
Using kernels can learn
very complex functions
◼
Neural Network
◼
◼
◼
◼
Nondeterministic
algorithm
Generalizes well but
doesn’t have strong
mathematical foundation
Can easily be learned in
incremental fashion
To learn complex
functions—use multilayer
perceptron (nontrivial)
583
SVM Related Links
◼
SVM Website: http://www.kernel-machines.org/
◼
Representative implementations
◼
LIBSVM: an efficient implementation of SVM, multi-
class classifications, nu-SVM, one-class SVM, including
also various interfaces with java, python, etc.
◼
SVM-light: simpler, but its performance is not better than LIBSVM; supports only
binary classification and is implemented only in C
◼
SVM-torch: another recent implementation also
written in C
584
Chapter 9. Classification: Advanced Methods
◼
Bayesian Belief Networks
◼
Classification by Backpropagation
◼
Support Vector Machines
◼
Classification by Using Frequent Patterns
◼
Lazy Learners (or Learning from Your Neighbors)
◼
Other Classification Methods
◼
Additional Topics Regarding Classification
◼
Summary
585
Associative Classification
◼
Associative classification: Major steps
◼
Mine data to find strong associations between frequent patterns
(conjunctions of attribute-value pairs) and class labels
◼
Association rules are generated in the form of
p1 ∧ p2 ∧ … ∧ pl → “Aclass = C” (conf, sup)
◼
◼
Organize the rules to form a rule-based classifier
Why effective?
◼
It explores highly confident associations among multiple attributes
and may overcome some constraints introduced by decision-tree
induction, which considers only one attribute at a time
◼
Associative classification has been found to be often more accurate
than some traditional classification methods, such as C4.5
586
Typical Associative Classification Methods
◼
CBA (Classification Based on Associations: Liu, Hsu & Ma, KDD’98)
◼
Mine possible association rules in the form of
◼
◼
◼
Build classifier: Organize rules according to decreasing precedence
based on confidence and then support
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei,
ICDM’01)
◼
◼
Cond-set (a set of attribute-value pairs) → class label
Classification: Statistical analysis on multiple rules
CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM’03)
◼
Generation of predictive rules (FOIL-like analysis), but allows covered rules to be
retained with reduced weight
◼
Prediction using best k rules
◼
High efficiency, accuracy similar to CMAR
587
Frequent Pattern-Based Classification
◼
◼
◼
H. Cheng, X. Yan, J. Han, and C.-W. Hsu, “Discriminative
Frequent Pattern Analysis for Effective Classification”,
ICDE'07
Accuracy issue
◼ Increase the discriminative power
◼ Increase the expressive power of the feature space
Scalability issue
◼ It is computationally infeasible to generate all feature
combinations and filter them with an information gain
threshold
◼ Efficient method (DDPMine: FPtree pruning): H. Cheng,
X. Yan, J. Han, and P. S. Yu, "Direct Discriminative
Pattern Mining for Effective Classification", ICDE'08
588
Frequent Pattern vs. Single Feature
The discriminative power of some frequent patterns is
higher than that of single features.
(a) Austral
(b) Cleve
(c) Sonar
Fig. 1. Information Gain vs. Pattern Length
589
Empirical Results
[figure: information gain of frequent patterns (InfoGain) and the IG_UpperBnd bound
 plotted against pattern support for the (a) Austral, (b) Breast, and (c) Sonar
 data sets]
Fig. 2. Information Gain vs. Pattern Frequency
590
Feature Selection
◼
◼
◼
Given a set of frequent patterns, both non-discriminative
and redundant patterns exist, which can cause overfitting
We want to single out the discriminative patterns and
remove redundant ones
The notion of Maximal Marginal Relevance (MMR) is
borrowed
◼
A document has high marginal relevance if it is both
relevant to the query and contains minimal marginal
similarity to previously selected documents
591
Experimental Results
592
Scalability Tests
593
DDPMine: Branch-and-Bound Search
sup(child) ≤ sup(parent)
sup(b) ≤ sup(a)

[figure: a branch-and-bound search tree in which a is a constant (a parent node)
 and b is a variable (a descendant)]

◼ Association between information gain and frequency
594
DDPMine Efficiency: Runtime
[figure: runtime comparison of PatClass (the ICDE'07 pattern classification
 algorithm), Harmony, and DDPMine]
595
Chapter 9. Classification: Advanced Methods
◼
Bayesian Belief Networks
◼
Classification by Backpropagation
◼
Support Vector Machines
◼
Classification by Using Frequent Patterns
◼
Lazy Learners (or Learning from Your Neighbors)
◼
Other Classification Methods
◼
Additional Topics Regarding Classification
◼
Summary
596
Lazy vs. Eager Learning
◼
◼
◼
Lazy vs. eager learning
◼ Lazy learning (e.g., instance-based learning): Simply
stores training data (or only minor processing) and
waits until it is given a test tuple
◼ Eager learning (the above discussed methods): Given
a set of training tuples, constructs a classification model
before receiving new (e.g., test) data to classify
Lazy: less time in training but more time in predicting
Accuracy
◼ Lazy method effectively uses a richer hypothesis space
since it uses many local linear functions to form an
implicit global approximation to the target function
◼ Eager: must commit to a single hypothesis that covers
the entire instance space
597
Lazy Learner: Instance-Based Methods
◼
◼
Instance-based learning:
◼ Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified
Typical approaches
◼ k-nearest neighbor approach
◼ Instances represented as points in a Euclidean
space.
◼ Locally weighted regression
◼ Constructs local approximation
◼ Case-based reasoning
◼ Uses symbolic representations and knowledge-based inference
598
The k-Nearest Neighbor Algorithm
◼
◼
◼
◼
◼
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
The target function could be discrete- or real-valued
For discrete-valued targets, k-NN returns the most common value among the k
training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training
examples
[figure: query point xq among ‘+’ and ‘–’ training examples, with the Voronoi cells
 around each example]
599
Discussion on the k-NN Algorithm
◼
◼
◼ k-NN for real-valued prediction for a given unknown tuple
  ◼ Returns the mean value of the k nearest neighbors
◼ Distance-weighted nearest neighbor algorithm
  ◼ Weight the contribution of each of the k neighbors according to their distance
    to the query xq
  ◼ Give greater weight to closer neighbors, e.g.,  w ≡ 1 / d(xq, xi)²
◼ Robust to noisy data by averaging over the k nearest neighbors
◼ Curse of dimensionality: the distance between neighbors could be dominated by
  irrelevant attributes
  ◼ To overcome it, stretch the axes or eliminate the least relevant attributes
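A small numpy sketch of distance-weighted k-NN using the 1/d(xq, xi)² weighting
above; the toy data and the helper name are mine.

import numpy as np

def knn_predict(X_train, y_train, xq, k=5, eps=1e-12):
    d2 = ((X_train - xq) ** 2).sum(axis=1)          # squared Euclidean distances
    nn = np.argsort(d2)[:k]                         # indices of the k nearest neighbors
    w = 1.0 / (d2[nn] + eps)                        # closer neighbors get greater weight
    classes = np.unique(y_train[nn])
    votes = [(w[y_train[nn] == c].sum(), c) for c in classes]
    return max(votes)[1]                            # class with the largest weighted vote

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [6, 5], [5, 6]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.2])))   # -> 0
print(knn_predict(X_train, y_train, np.array([5.2, 5.1])))   # -> 1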
600
Case-Based Reasoning (CBR)
◼
◼
CBR: Uses a database of problem solutions to solve new problems
Store symbolic description (tuples or cases)—not points in a Euclidean
space
◼
Applications: Customer-service (product-related diagnosis), legal ruling
◼
Methodology
◼
◼
◼
◼
Instances represented by rich symbolic descriptions (e.g., function
graphs)
Search for similar cases, multiple retrieved cases may be combined
Tight coupling between case retrieval, knowledge-based reasoning,
and problem solving
Challenges
◼
◼
Find a good similarity metric
Indexing based on syntactic similarity measure, and when failure,
backtracking, and adapting to additional cases
601
Chapter 9. Classification: Advanced Methods
◼
Bayesian Belief Networks
◼
Classification by Backpropagation
◼
Support Vector Machines
◼
Classification by Using Frequent Patterns
◼
Lazy Learners (or Learning from Your Neighbors)
◼
Other Classification Methods
◼
Additional Topics Regarding Classification
◼
Summary
602
Genetic Algorithms (GA)
◼
Genetic Algorithm: based on an analogy to biological evolution
◼
An initial population is created consisting of randomly generated rules
◼
◼
◼
Each rule is represented by a string of bits
◼
E.g., if A1 and ¬A2 then C2 can be encoded as 100
◼
If an attribute has k > 2 values, k bits can be used
Based on the notion of survival of the fittest, a new population is
formed to consist of the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a
set of training examples
◼
Offspring are generated by crossover and mutation
◼
The process continues until a population P evolves when each rule in P
satisfies a prespecified threshold
◼
Slow but easily parallelizable
603
Rough Set Approach
◼
◼
◼
Rough sets are used to approximately or “roughly” define
equivalent classes
A rough set for a given class C is approximated by two sets: a lower
approximation (certain to be in C) and an upper approximation
(cannot be described as not belonging to C)
Finding the minimal subsets (reducts) of attributes for feature
reduction is NP-hard but a discernibility matrix (which stores the
differences between attribute values for each pair of data tuples) is
used to reduce the computation intensity
604
Fuzzy Set
Approaches
◼
◼
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the
degree of membership (such as in a fuzzy membership graph)
Attribute values are converted to fuzzy values. Ex.:
◼
◼
◼
◼
Income, x, is assigned a fuzzy membership value to each of the
discrete categories {low, medium, high}, e.g. $49K belongs to
“medium income” with fuzzy value 0.15 but belongs to “high
income” with fuzzy value 0.96
Fuzzy membership values do not have to sum to 1.
Each applicable rule contributes a vote for membership in the
categories
Typically, the truth values for each predicted category are summed,
and these sums are combined
605
Chapter 9. Classification: Advanced Methods
◼
Bayesian Belief Networks
◼
Classification by Backpropagation
◼
Support Vector Machines
◼
Classification by Using Frequent Patterns
◼
Lazy Learners (or Learning from Your Neighbors)
◼
Other Classification Methods
◼
Additional Topics Regarding Classification
◼
Summary
606
Multiclass Classification
◼
Classification involving more than two classes (i.e., > 2 Classes)
◼
Method 1. One-vs.-all (OVA): Learn a classifier one at a time
◼
◼
Given m classes, train m classifiers: one for each class
◼
Classifier j: treat tuples in class j as positive & all others as negative
◼
To classify a tuple X, the set of classifiers vote as an ensemble
Method 2. All-vs.-all (AVA): Learn a classifier for each pair of classes
◼
Given m classes, construct m(m-1)/2 binary classifiers
◼
A classifier is trained using tuples of the two classes
◼
◼
To classify a tuple X, each classifier votes. X is assigned to the
class with maximal vote
Comparison
◼
◼
All-vs.-all tends to be superior to one-vs.-all
Problem: Binary classifier is sensitive to errors, and errors affect
vote count
607
Error-Correcting Codes for Multiclass Classification
◼
◼
Originally designed to correct errors during data
transmission for communication tasks by exploring
data redundancy
◼ Example
  ◼ A 7-bit codeword associated with classes 1–4

    Class | Error-Correcting Codeword
    C1    | 1 1 1 1 1 1 1
    C2    | 0 0 0 0 1 1 1
    C3    | 0 0 1 1 0 0 1
    C4    | 0 1 0 1 0 1 0

  ◼ Given an unknown tuple X, the 7 trained classifiers output: 0001010
  ◼ Hamming distance: # of different bits between two codewords
  ◼ H(X, C1) = 5, by checking the # of differing bits between [1111111] & [0001010]
  ◼ H(X, C2) = 3, H(X, C3) = 3, H(X, C4) = 1, thus C4 is chosen as the label for X
◼ Error-correcting codes can correct up to ⌊(h–1)/2⌋ 1-bit errors, where h is the
  minimum Hamming distance between any two codewords
◼ If we use 1 bit per class, it is equivalent to the one-vs.-all approach, and the
  codes are insufficient to self-correct
◼ When selecting error-correcting codes, there should be good row-wise and
  column-wise separation between the codewords
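A short sketch decoding the example above by minimum Hamming distance; the codewords
and the classifier outputs are exactly those on the slide.

import numpy as np

codewords = {            # class -> error-correcting codeword
    "C1": [1, 1, 1, 1, 1, 1, 1],
    "C2": [0, 0, 0, 0, 1, 1, 1],
    "C3": [0, 0, 1, 1, 0, 0, 1],
    "C4": [0, 1, 0, 1, 0, 1, 0],
}

output = np.array([0, 0, 0, 1, 0, 1, 0])   # the 7 binary classifiers' outputs for tuple X
dist = {c: int(np.sum(output != np.array(w))) for c, w in codewords.items()}
print(dist)                                 # {'C1': 5, 'C2': 3, 'C3': 3, 'C4': 1}
print(min(dist, key=dist.get))              # C4: the closest codeword wins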
608
Semi-Supervised Classification
◼
◼
◼
◼
Semi-supervised: Uses labeled and unlabeled data to build a classifier
Self-training:
◼ Build a classifier using the labeled data
◼ Use it to label the unlabeled data, and those with the most confident
label prediction are added to the set of labeled data
◼ Repeat the above process
◼ Adv: easy to understand; disadv: may reinforce errors
Co-training: Use two or more classifiers to teach each other
◼ Each learner uses a mutually independent set of features of each
tuple to train a good classifier, say f1
◼ Then f1 and f2 are used to predict the class label for unlabeled data
X
◼ Teach each other: The tuple having the most confident prediction
from f1 is added to the set of labeled data for f2, & vice versa
Other methods, e.g., joint probability distribution of features and labels
609
Active Learning
◼
◼
◼
◼
◼
Class labels are expensive to obtain
Active learner: query human (oracle) for labels
Pool-based approach: Uses a pool of unlabeled data
◼ L: a small subset of D is labeled, U: a pool of unlabeled data in D
◼ Use a query function to carefully select one or more tuples from U
and request labels from an oracle (a human annotator)
◼ The newly labeled samples are added to L, and learn a model
◼ Goal: Achieve high accuracy using as few labeled data as possible
Evaluated using learning curves: Accuracy as a function of the number
of instances queried (# of tuples to be queried should be small)
Research issue: How to choose the data tuples to be queried?
◼ Uncertainty sampling: choose the least certain ones
◼ Reduce version space, the subset of hypotheses consistent w. the
training data
◼ Reduce expected entropy over U: Find the greatest reduction in
the total number of incorrect predictions
610
Transfer Learning: Conceptual Framework
◼
◼
◼
Transfer learning: Extract knowledge from one or more source tasks
and apply the knowledge to a target task
Traditional learning: Build a new classifier for each new task
Transfer learning: Build new classifier by applying existing knowledge
learned from source tasks
Traditional Learning
Transfer Learning
611
Transfer Learning: Methods and Applications
◼
◼
◼
◼
Applications: Especially useful when data is outdated or distribution
changes, e.g., Web document classification, e-mail spam filtering
Instance-based transfer learning: Reweight some of the data from
source tasks and use it to learn the target task
TrAdaBoost (Transfer AdaBoost)
◼ Assume source and target data each described by the same set of
attributes (features) & class labels, but rather diff. distributions
◼ Require only labeling a small amount of target data
◼ Use source data in training: When a source tuple is misclassified,
reduce the weight of such tuples so that they will have less effect on
the subsequent classifier
Research issues
◼ Negative transfer: When it performs worse than no transfer at all
◼ Heterogeneous transfer learning: Transfer knowledge from different
feature space or multiple source domains
◼ Large-scale transfer learning
612
Chapter 9. Classification: Advanced Methods
◼
Bayesian Belief Networks
◼
Classification by Backpropagation
◼
Support Vector Machines
◼
Classification by Using Frequent Patterns
◼
Lazy Learners (or Learning from Your Neighbors)
◼
Other Classification Methods
◼
Additional Topics Regarding Classification
◼
Summary
613
Summary
◼
Effective and advanced classification methods
◼
Bayesian belief network (probabilistic networks)
◼
Backpropagation (Neural networks)
◼
Support Vector Machine (SVM)
◼
Pattern-based classification
◼
◼
Other classification methods: lazy learners (KNN, case-based
reasoning), genetic algorithms, rough set and fuzzy set approaches
Additional Topics on Classification
◼
Multiclass classification
◼
Semi-supervised classification
◼
Active learning
◼
Transfer learning
614
References
◼
Please see the references of Chapter 8
615
Surplus Slides
What Is Prediction?
◼
◼
◼
◼
(Numerical) prediction is similar to classification
◼ construct a model
◼ use model to predict continuous or ordered value for a given input
Prediction is different from classification
◼ Classification refers to predict categorical class label
◼ Prediction models continuous-valued functions
Major method for prediction: regression
◼ model the relationship between one or more independent or
predictor variables and a dependent or response variable
Regression analysis
◼ Linear and multiple regression
◼ Non-linear regression
◼ Other regression methods: generalized linear model, Poisson
regression, log-linear models, regression trees
617
Linear Regression
◼
Linear regression: involves a response variable y and a single
predictor variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
◼
Method of least squares: estimates the best-fitting straight line
      w1 = Σi=1..|D| (xi – x̄)(yi – ȳ)  /  Σi=1..|D| (xi – x̄)²
      w0 = ȳ – w1 x̄
Multiple linear regression: involves more than one predictor variable
◼
Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)
◼
Ex. For 2-D data, we may have: y = w0 + w1 x1+ w2 x2
◼
Solvable by extension of least square method or using SAS, S-Plus
◼
Many nonlinear functions can be transformed into the above
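A numpy sketch of the two least-squares estimates above; the (x, y) values are
illustrative, and np.polyfit is used only as a cross-check.

import numpy as np

x = np.array([3.0, 8.0, 9.0, 13.0, 3.0, 6.0, 11.0, 21.0, 1.0, 16.0])   # hypothetical predictor
y = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20, 83], dtype=float)    # hypothetical response

w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print(f"y = {w0:.2f} + {w1:.2f} x")                  # fitted regression line
print(np.allclose(np.polyfit(x, y, 1), [w1, w0]))    # True: matches numpy's own fit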
618
Nonlinear Regression
◼
◼
◼
◼
Some nonlinear models can be modeled by a polynomial
function
A polynomial regression model can be transformed into a linear regression model.
For example,
      y = w0 + w1 x + w2 x² + w3 x³
is convertible to linear form with the new variables x2 = x², x3 = x³:
      y = w0 + w1 x + w2 x2 + w3 x3
Other functions, such as power function, can also be
transformed to linear model
Some models are intractable nonlinear (e.g., sum of
exponential terms)
◼ possible to obtain least square estimates through
extensive calculation on more complex formulae
619
Other Regression-Based Models
◼
Generalized linear model:
◼
◼
◼
◼
◼
◼
Foundation on which linear regression can be applied to modeling
categorical response variables
Variance of y is a function of the mean value of y, not a constant
Logistic regression: models the prob. of some event occurring as a
linear function of a set of predictor variables
Poisson regression: models the data that exhibit a Poisson
distribution
Log-linear models: (for categorical data)
◼
Approximate discrete multidimensional prob. distributions
◼
Also useful for data compression and smoothing
Regression trees and model trees
◼
Trees to predict continuous values rather than class labels
620
Regression Trees and Model Trees
◼
Regression tree: proposed in CART system (Breiman et al. 1984)
◼
CART: Classification And Regression Trees
◼
Each leaf stores a continuous-valued prediction
◼
It is the average value of the predicted attribute for the training
tuples that reach the leaf
◼
Model tree: proposed by Quinlan (1992)
◼
Each leaf holds a regression model—a multivariate linear equation
for the predicted attribute
◼
◼
A more general case than regression tree
Regression and model trees tend to be more accurate than linear
regression when the data are not represented well by a simple linear
model
621
Predictive Modeling in Multidimensional Databases
◼
◼
◼
◼
◼
Predictive modeling: Predict data values or construct
generalized linear models based on the database data
One can only predict value ranges or category distributions
Method outline:
◼
Minimal generalization
◼
Attribute relevance analysis
◼
Generalized linear model construction
◼
Prediction
Determine the major factors which influence the prediction
◼ Data relevance analysis: uncertainty measurement,
entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis
622
Prediction: Numerical Data
623
Prediction: Categorical Data
624
Notes about SVM—
Introductory Literature
◼
“Statistical Learning Theory” by Vapnik: difficult to understand,
containing many errors.
◼
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern
Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
◼
Easier than Vapnik’s book, but still not introductory level; the
examples are not so intuitive
◼
The book An Introduction to Support Vector Machines by Cristianini
and Shawe-Taylor
◼
Not introductory level, but the explanation about Mercer’s
Theorem is better than above literatures
◼
Neural Networks and Learning Machines by Haykin
◼
Contains a nice chapter on SVM introduction
626
Associative Classification Can Achieve High
Accuracy and Efficiency (Cong et al. SIGMOD05)
627
A Closer Look at CMAR
◼
◼
◼
◼
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
Efficiency: Uses an enhanced FP-tree that maintains the distribution of
class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree
◼ Given two rules, R1 and R2, if the antecedent of R1 is more general
than that of R2 and conf(R1) ≥ conf(R2), then prune R2
◼ Prunes rules for which the rule antecedent and class are not
positively correlated, based on a χ2 test of statistical significance
Classification based on generated/pruned rules
◼ If only one rule satisfies tuple X, assign the class label of the rule
◼ If a rule set S satisfies X, CMAR
◼ divides S into groups according to class labels
◼ uses a weighted χ² measure to find the strongest group of rules,
based on the statistical correlation of rules within a group
◼ assigns X the class label of the strongest group
628
Perceptron & Winnow
• Vector: x, w;  Scalar: x, y, w
• Input: {(x1, y1), …}
• Output: a classification function f(x) such that
    f(xi) > 0 for yi = +1
    f(xi) < 0 for yi = –1
  i.e., a separating line w·x + b = 0 (for 2-D: w1x1 + w2x2 + b = 0)
[figure: points in the (x1, x2) plane separated by the line w·x + b = 0]
• Perceptron: update W additively
• Winnow: update W multiplicatively
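A minimal perceptron sketch with the additive mistake-driven update; the toy data
and learning rate are arbitrary, and Winnow would differ only in updating the
weights multiplicatively.

import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.0], [3.0, 0.5], [-1.0, -1.5], [-2.0, -0.5], [-0.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w, b, lr = np.zeros(2), 0.0, 1.0
for epoch in range(20):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:      # misclassified: f(x) has the wrong sign
            w += lr * yi * xi           # additive update (Perceptron)
            b += lr * yi
print(w, b)
print(np.sign(X @ w + b))               # matches y on this separable toy set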
629
What is Cluster Analysis?
◼
◼
◼
◼
Cluster: A collection of data objects
◼ similar (or related) to one another within the same group
◼ dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
◼ Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
Typical applications
◼ As a stand-alone tool to get insight into data distribution
◼ As a preprocessing step for other algorithms
630
Clustering for Data Understanding
and Applications
◼
◼
◼
◼
◼
◼
◼
◼
Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth
observation database
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
Climate: understanding Earth’s climate, finding patterns in atmospheric
and ocean data
Economic science: market research
631
Clustering as a Preprocessing Tool
(Utility)
◼
Summarization:
◼
◼
Compression:
◼
◼
Image processing: vector quantization
Finding K-nearest Neighbors
◼
◼
Preprocessing for regression, PCA, classification, and
association analysis
Localizing search to one or a small number of clusters
Outlier detection
◼
Outliers are often viewed as those “far away” from any
cluster
632
Quality: What Is Good
Clustering?
◼
A good clustering method will produce high quality
clusters
◼
◼
high intra-class similarity: cohesive within clusters
◼
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
◼
the similarity measure used by the method
◼
its implementation, and
◼
its ability to discover some or all of the hidden
patterns
633
Measure the Quality of
Clustering
◼
◼
Dissimilarity/Similarity metric
◼
Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
◼
The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables
◼
Weights should be associated with different variables
based on applications and data semantics
Quality of clustering:
◼
There is usually a separate “quality” function that
measures the “goodness” of a cluster.
◼
It is hard to define “similar enough” or “good enough”
◼
The answer is typically highly subjective
634
Considerations for Cluster Analysis
◼
Partitioning criteria
◼
◼
Separation of clusters
◼
◼
Exclusive (e.g., one customer belongs to only one region) vs. nonexclusive (e.g., one document may belong to more than one class)
Similarity measure
◼
◼
Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
Distance-based (e.g., Euclidian, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
Clustering space
◼
Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
635
Requirements and Challenges
◼
◼
◼
◼
◼
Scalability
◼ Clustering all the data instead of only on samples
Ability to deal with different types of attributes
◼ Numerical, binary, categorical, ordinal, linked, and mixture of
these
Constraint-based clustering
◼
User may give inputs on constraints
◼
Use domain knowledge to determine input parameters
Interpretability and usability
Others
◼ Discovery of clusters with arbitrary shape
◼ Ability to deal with noisy data
◼ Incremental clustering and insensitivity to input order
◼ High dimensionality
636
Major Clustering Approaches
(I)
◼
◼
◼
◼
Partitioning approach:
◼ Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
◼ Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
◼ Create a hierarchical decomposition of the set of data (or objects)
using some criterion
◼ Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Density-based approach:
◼ Based on connectivity and density functions
◼ Typical methods: DBSCAN, OPTICS, DenClue
Grid-based approach:
◼ based on a multiple-level granularity structure
◼ Typical methods: STING, WaveCluster, CLIQUE
637
Major Clustering Approaches
(II)
◼
◼
◼
◼
Model-based:
◼ A model is hypothesized for each of the clusters, and the method tries to find
the best fit of the data to that model
◼ Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
◼ Based on the analysis of frequent patterns
◼ Typical methods: p-Cluster
User-guided or constraint-based:
◼ Clustering by considering user-specified or application-specific
constraints
◼ Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
◼ Objects are often linked together in various ways
◼ Massive links can be used to cluster objects: SimRank, LinkClus
638
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
◼
Cluster Analysis: Basic Concepts
◼
Partitioning Methods
◼
Hierarchical Methods
◼
Density-Based Methods
◼
Grid-Based Methods
◼
Evaluation of Clustering
◼
Summary
639
Partitioning Algorithms: Basic
Concept
◼
Partitioning method: Partitioning a database D of n objects into a set
of k clusters, such that the sum of squared distances is minimized
(where ci is the centroid or medoid of cluster Ci)
E = ik=1 pCi ( p − ci ) 2
◼
Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
◼
Global optimal: exhaustively enumerate all partitions
◼
Heuristic methods: k-means and k-medoids algorithms
◼
k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
◼
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
640
The K-Means Clustering Method
◼
Given k, the k-means algorithm is implemented in
four steps:
◼
◼
◼
◼
Partition objects into k nonempty subsets
Compute seed points as the centroids of the
clusters of the current partitioning (the centroid is
the center, i.e., mean point, of the cluster)
Assign each object to the cluster with the nearest
seed point
Go back to Step 2, stop when the assignment does
not change
641
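As a concrete illustration of these four steps, here is a minimal k-means sketch in Python using only NumPy; the function name kmeans and all variable names are illustrative, not from the textbook.

```python
# Minimal k-means sketch (NumPy only); kmeans() and its variable names are illustrative.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: start from an arbitrary partition by picking k random seed points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute seed points as the centroids (mean points) of the partitions
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: stop when the centroids (hence the assignment) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```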
An Example of K-Means Clustering
K = 2
(Figure: starting from an arbitrary partition of the objects into k groups, the
cluster centroids are updated and objects reassigned, looping until no change.)
◼ Partition objects into k nonempty subsets
◼ Repeat
  ◼ Compute the centroid (i.e., mean point) of each partition
  ◼ Assign each object to the cluster of its nearest centroid
◼ Until no change
642
Comments on the K-Means Method
◼
Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t
is # iterations. Normally, k, t << n.
◼
Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
◼
Comment: Often terminates at a local optimum.
◼
Weakness
◼
Applicable only to objects in a continuous n-dimensional space
◼
◼
◼
◼
Using the k-modes method for categorical data
In comparison, k-medoids can be applied to a wide range of
data
Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k; see Hastie et al.,
2009)
Sensitive to noisy data and outliers
643
Variations of the K-Means Method
◼
◼
Most variants of k-means differ in
◼
Selection of the initial k means
◼
Dissimilarity calculations
◼
Strategies to calculate cluster means
Handling categorical data: k-modes
◼
Replacing means of clusters with modes
◼
Using new dissimilarity measures to deal with categorical objects
◼
Using a frequency-based method to update modes of clusters
◼
A mixture of categorical and numerical data: k-prototype method
644
What Is the Problem of the K-Means
Method?
◼
The k-means algorithm is sensitive to outliers !
◼
Since an object with an extremely large value may substantially
distort the distribution of the data
◼
K-Medoids: Instead of taking the mean value of the object in a
cluster as a reference point, medoids can be used, which is the most
centrally located object in a cluster
645
PAM: A Typical K-Medoids Algorithm
K = 2; Total Cost = 20
(Figure: PAM on a toy 2-D data set — the candidate swap shown has Total Cost = 26.)
◼ Arbitrarily choose k objects as the initial medoids
◼ Assign each remaining object to the nearest medoid
◼ Do loop, until no change:
  ◼ Randomly select a nonmedoid object, Orandom
  ◼ Compute the total cost of swapping a medoid O with Orandom
  ◼ Swap O and Orandom if the quality is improved
646
The K-Medoid Clustering Method
◼
K-Medoids Clustering: Find representative objects (medoids) in clusters
◼
PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
◼
Starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
◼
PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
◼
Efficiency improvement on PAM
◼
CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
◼
CLARANS (Ng & Han, 1994): Randomized re-sampling
647
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
◼
Cluster Analysis: Basic Concepts
◼
Partitioning Methods
◼
Hierarchical Methods
◼
Density-Based Methods
◼
Grid-Based Methods
◼
Evaluation of Clustering
◼
Summary
648
Hierarchical Clustering
◼
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
(Figure: objects a, b, c, d, e — agglomerative clustering (AGNES) merges them step by
step, a,b → ab, d,e → de, c + de → cde, ab + cde → abcde (Step 0 to Step 4);
divisive clustering (DIANA) proceeds in the reverse order, Step 4 back to Step 0.)
649
AGNES (Agglomerative Nesting)
◼
Introduced in Kaufmann and Rousseeuw (1990)
◼
Implemented in statistical packages, e.g., Splus
◼
Use the single-link method and the dissimilarity matrix
◼
Merge nodes that have the least dissimilarity
◼
Go on in a non-descending fashion
◼
Eventually all nodes belong to the same cluster
(Figure: three scatter plots of the same data showing clusters being progressively merged by AGNES.)
650
Dendrogram: Shows How Clusters are Merged
Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram
A clustering of the data objects is obtained by cutting
the dendrogram at the desired level; each
connected component then forms a cluster
651
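A hedged sketch of the idea above: single-link agglomerative clustering (as in AGNES) and cutting the dendrogram at a desired level, assuming SciPy's hierarchical-clustering utilities are available.

```python
# Hedged sketch: single-link agglomerative clustering plus cutting the dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # toy data
Z = linkage(X, method='single')                  # merge the least-dissimilar nodes first
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree (requires matplotlib).
```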
DIANA (Divisive Analysis)
◼
Introduced in Kaufmann and Rousseeuw (1990)
◼
Implemented in statistical analysis packages, e.g., Splus
◼
Inverse order of AGNES
◼
Eventually each node forms a cluster on its own
(Figure: three scatter plots of the same data showing the single cluster being progressively split by DIANA.)
652
Distance between
Clusters
◼
Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
◼
Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
◼
Average: avg distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
◼
Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj)
◼
Medoid: distance between the medoids of two clusters, i.e., dist(Ki,
Kj) = dist(Mi, Mj)
◼
Medoid: a chosen, centrally located object in the cluster
653
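A small NumPy sketch of the inter-cluster distance measures above; the function name is illustrative, and Ki, Kj are arrays of points (one row per object).

```python
import numpy as np

def cluster_distances(Ki, Kj):
    d = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)  # all pairwise distances
    return {
        'single':   d.min(),                                     # smallest pairwise distance
        'complete': d.max(),                                     # largest pairwise distance
        'average':  d.mean(),                                    # average pairwise distance
        'centroid': np.linalg.norm(Ki.mean(0) - Kj.mean(0)),     # distance between centroids
    }
```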
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
◼ Centroid: the “middle” of a cluster
    Cm = ( Σ_{i=1}^{N} t_i ) / N
◼ Radius: square root of the average distance from any point
  of the cluster to its centroid
    Rm = sqrt( Σ_{i=1}^{N} (t_i − Cm)² / N )
◼ Diameter: square root of the average mean squared
  distance between all pairs of points in the cluster
    Dm = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (t_i − t_j)² / (N (N − 1)) )
654
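A NumPy sketch computing the centroid, radius, and diameter as defined above; T is an N-by-d array holding the cluster's points, and all names are illustrative.

```python
import numpy as np

def centroid_radius_diameter(T):
    N = len(T)
    Cm = T.mean(axis=0)                                       # centroid
    Rm = np.sqrt(((T - Cm) ** 2).sum(axis=1).mean())          # radius
    d2 = ((T[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)   # squared pairwise distances
    Dm = np.sqrt(d2.sum() / (N * (N - 1)))                    # diameter
    return Cm, Rm, Dm
```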
Extensions to Hierarchical Clustering
◼
Major weakness of agglomerative clustering methods
◼
Can never undo what was done previously
◼
Do not scale well: time complexity of at least O(n²),
where n is the number of total objects
◼
Integration of hierarchical & distance-based clustering
◼
BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
◼
CHAMELEON (1999): hierarchical clustering using
dynamic modeling
655
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
◼
◼
Zhang, Ramakrishnan & Livny, SIGMOD’96
Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
◼
◼
◼
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent
clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
◼
Weakness: handles only numeric data, and sensitive to the order of the
data record
656
Clustering Feature Vector in BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
    N: number of data points
    LS: linear sum of the N points:  Σ_{i=1}^{N} X_i
    SS: square sum of the N points:  Σ_{i=1}^{N} X_i²
Example (figure): the 2-D points (3,4), (2,6), (4,5), (4,7), (3,8)
give CF = (5, (16,30), (54,190))
657
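A hedged sketch of a clustering feature and its additivity, reproducing the example CF above; the function names are illustrative.

```python
import numpy as np

def make_cf(points):
    P = np.asarray(points, dtype=float)
    return (len(P), P.sum(axis=0), (P ** 2).sum(axis=0))   # N, linear sum, square sum

def merge_cf(cf1, cf2):
    # CFs are additive: merging two sub-clusters just adds their summaries
    return (cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2])

cf = make_cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
# cf == (5, array([16., 30.]), array([ 54., 190.])), matching CF = (5, (16,30), (54,190))
centroid = cf[1] / cf[0]                                   # LS / N
```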
CF-Tree in BIRCH
◼
◼
Clustering feature:
◼ Summary of the statistics for a given subcluster: the 0-th, 1st,
and 2nd moments of the subcluster from the statistical point
of view
◼ Registers crucial measurements for computing cluster and
utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
◼ A nonleaf node in a tree has descendants or “children”
◼ The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
◼ Branching factor: max # of children
◼ Threshold: max diameter of sub-clusters stored at the leaf
nodes
658
The CF Tree Structure
(Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6 — the root and
non-leaf nodes hold entries CF1, CF2, … each pointing to a child, and the leaf nodes
hold CF entries chained by prev/next pointers.)
659
The Birch Algorithm
◼
◼
◼
◼
Cluster diameter:  D = sqrt( (1 / (n(n − 1))) Σ (x_i − x_j)² )
For each point in the input
◼ Find closest leaf entry
◼ Add point to leaf entry and update CF
◼ If entry diameter > max_diameter, then split leaf, and possibly
parents
Algorithm is O(n)
Concerns
◼ Sensitive to insertion order of data points
◼ Since the size of leaf nodes is fixed, the clusters may not be so
natural
◼ Clusters tend to be spherical given the radius and diameter
measures
660
CHAMELEON: Hierarchical Clustering
Using Dynamic Modeling (1999)
◼
CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
◼
Measures the similarity based on a dynamic model
◼
◼
Two clusters are merged only if the interconnectivity
and closeness (proximity) between two clusters are
high relative to the internal interconnectivity of the
clusters and closeness of items within the clusters
Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into
a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining
these sub-clusters
661
Overall Framework of CHAMELEON
(Figure: Data Set → construct a sparse k-NN graph → partition the graph → merge
partitions → final clusters.)
◼ k-NN graph: p and q are connected if q is among the top k closest neighbors of p
◼ Relative interconnectivity: connectivity of c1 and c2 over their internal connectivity
◼ Relative closeness: closeness of c1 and c2 over their internal closeness
662
CHAMELEON (Clustering Complex
Objects)
663
Probabilistic Hierarchical Clustering
◼
◼
Algorithmic hierarchical clustering
◼
Nontrivial to choose a good distance measure
◼
Hard to handle missing attribute values
◼
Optimization goal not clear: heuristic, local search
Probabilistic hierarchical clustering
◼
◼
◼
◼
Use probabilistic models to measure distances between clusters
Generative model: Regard the set of data objects to be clustered
as a sample of the underlying data generation mechanism to be
analyzed
Easy to understand, same efficiency as algorithmic agglomerative
clustering method, can handle partially observed data
In practice, the generative models are assumed to adopt common distribution
functions, e.g., the Gaussian or Bernoulli distribution, governed
by parameters
664
Generative Model
◼
◼
◼
◼
Given a set of 1-D points X = {x1, …, xn} for clustering
analysis & assuming they are generated by a Gaussian
distribution:
The probability that a point xi ∈ X is generated by the
model
The likelihood that X is generated by the model:
The task of learning the generative model: find the
parameters μ and σ² that maximize the likelihood
665
A Probabilistic Hierarchical Clustering
Algorithm
◼
◼
◼
For a set of objects partitioned into m clusters C1, . . . ,Cm, the quality
can be measured by,
where P() is the maximum likelihood
Distance between clusters C1 and C2:
Algorithm: Progressively merge points and clusters
Input: D = {o1, ..., on}: a data set containing n objects
Output: A hierarchy of clusters
Method
Create a cluster for each object Ci = {oi}, 1 ≤ i ≤ n;
For i = 1 to n {
Find pair of clusters Ci and Cj such that
Ci,Cj = argmaxi ≠ j {log (P(Ci∪Cj )/(P(Ci)P(Cj ))};
If log (P(Ci∪Cj )/(P(Ci)P(Cj )) > 0 then merge Ci and Cj }
666
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
◼
Cluster Analysis: Basic Concepts
◼
Partitioning Methods
◼
Hierarchical Methods
◼
Density-Based Methods
◼
Grid-Based Methods
◼
Evaluation of Clustering
◼
Summary
667
Density-Based Clustering Methods
◼
◼
◼
Clustering based on density (local cluster criterion), such
as density-connected points
Major features:
◼ Discover clusters of arbitrary shape
◼ Handle noise
◼ One scan
◼ Need density parameters as termination condition
Several interesting studies:
◼ DBSCAN: Ester, et al. (KDD’96)
◼ OPTICS: Ankerst, et al (SIGMOD’99).
◼ DENCLUE: Hinneburg & D. Keim (KDD’98)
◼ CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
668
Density-Based Clustering: Basic
Concepts
◼
◼
◼
Two parameters:
◼
Eps: Maximum radius of the neighbourhood
◼
MinPts: Minimum number of points in an Eps-neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
Directly density-reachable: A point p is directly
density-reachable from a point q w.r.t. Eps, MinPts if
◼
p belongs to NEps(q)
◼
core point condition:
|NEps (q)| ≥ MinPts
p
q
MinPts = 5
Eps = 1 cm
669
Density-Reachable and Density-Connected
◼
Density-reachable:
◼
◼
A point p is density-reachable from
a point q w.r.t. Eps, MinPts if there
is a chain of points p1, …, pn, p1 =
q, pn = p such that pi+1 is directly
density-reachable from pi
p
p1
q
Density-connected
◼
A point p is density-connected to a
point q w.r.t. Eps, MinPts if there
is a point o such that both, p and
q are density-reachable from o
w.r.t. Eps and MinPts
p
q
o
670
DBSCAN: Density-Based Spatial
Clustering of Applications with Noise
◼
◼
Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases
with noise
Outlier
Border
Eps = 1cm
Core
MinPts = 5
671
DBSCAN: The Algorithm
◼
◼
◼
◼
◼
Arbitrary select a point p
Retrieve all points density-reachable from p w.r.t. Eps
and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
Continue the process until all of the points have been
processed
672
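A hedged usage sketch, assuming scikit-learn is available; eps and min_samples play the roles of Eps and MinPts in the algorithm above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)     # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```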
DBSCAN: Sensitive to
Parameters
673
OPTICS: A Cluster-Ordering Method
(1999)
◼
OPTICS: Ordering Points To Identify the Clustering
Structure
◼ Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
◼ Produces a special order of the database wrt its
density-based clustering structure
◼ This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of
parameter settings
◼ Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
◼ Can be represented graphically or using visualization
techniques
674
OPTICS: Some Extension from
DBSCAN
◼
Index-based: example with k = number of dimensions, N = 20, p = 75%, M = N(1 − p) = 5
◼ Complexity: O(N log N)
◼ Core distance of an object o: the minimum eps such that o is a core object
◼ Reachability distance of p w.r.t. o: max(core-distance(o), d(o, p))
◼ Example (figure, MinPts = 5, ε = 3 cm): r(p1, o) = 2.8 cm, r(p2, o) = 4 cm
675
(Figure: reachability plot — the reachability-distance, undefined for the first object,
plotted against the cluster order of the objects for parameter settings ε and ε′.)
676
Density-Based Clustering: OPTICS & Its
Applications
677
DENCLUE: Using Statistical Density
Functions
◼
DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
◼
Using statistical density functions:
    Influence of y on x:       f_Gaussian(x, y) = e^( −d(x, y)² / (2σ²) )
    Total influence on x:      f^D_Gaussian(x) = Σ_{i=1}^{N} e^( −d(x, x_i)² / (2σ²) )
    Gradient of x toward x_i:  ∇f^D_Gaussian(x, x_i) = Σ_{i=1}^{N} (x_i − x) · e^( −d(x, x_i)² / (2σ²) )
◼
Major features
◼
Solid mathematical foundation
◼
Good for data sets with large amounts of noise
◼
Allows a compact mathematical description of arbitrarily shaped
clusters in high-dimensional data sets
◼
Significantly faster than existing algorithms (e.g., DBSCAN)
◼
But needs a large number of parameters
678
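A NumPy sketch of the Gaussian influence, density, and gradient functions above; sigma and the function names are illustrative.

```python
import numpy as np

def gaussian_density(x, data, sigma=1.0):
    d2 = ((data - x) ** 2).sum(axis=1)             # squared distances d(x, xi)^2
    return np.exp(-d2 / (2 * sigma ** 2)).sum()    # total influence on x

def gaussian_gradient(x, data, sigma=1.0):
    d2 = ((data - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return ((data - x) * w[:, None]).sum(axis=0)   # used for hill-climbing to a density attractor
```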
Denclue: Technical Essence
◼
◼
◼
◼
◼
◼
◼
Uses grid cells but only keeps information about grid cells that do
actually contain data points and manages these cells in a tree-based
access structure
Influence function: describes the impact of a data point within its
neighborhood
Overall density of the data space can be calculated as the sum of the
influence function of all data points
Clusters can be determined mathematically by identifying density
attractors
Density attractors are local maxima of the overall density function
Center defined clusters: assign to each density attractor the points
density attracted to it
Arbitrary shaped cluster: merge density attractors that are connected
through paths of high density (> threshold)
679
Density Attractor
680
Center-Defined and Arbitrary
681
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
◼
Cluster Analysis: Basic Concepts
◼
Partitioning Methods
◼
Hierarchical Methods
◼
Density-Based Methods
◼
Grid-Based Methods
◼
Evaluation of Clustering
◼
Summary
682
Grid-Based Clustering Method
◼
◼
Using multi-resolution grid data structure
Several interesting methods
◼ STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
◼
WaveCluster by Sheikholeslami, Chatterjee, and
Zhang (VLDB’98)
◼
◼
A multi-resolution clustering approach using
wavelet method
CLIQUE: Agrawal, et al. (SIGMOD’98)
◼
Both grid-based and subspace clustering
683
STING: A Statistical Information Grid
Approach
◼
◼
◼
Wang, Yang and Muntz (VLDB’97)
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different
levels of resolution
684
The STING Clustering Method
◼
◼
◼
◼
◼
◼
Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
Parameters of higher level cells can be easily calculated
from parameters of lower level cell
◼ count, mean, s, min, max
◼ type of distribution—normal, uniform, etc.
Use a top-down approach to answer spatial data queries
Start from a pre-selected layer—typically with a small
number of cells
For each cell in the current level compute the confidence
interval
685
STING Algorithm and Its Analysis
◼
◼
◼
◼
◼
Remove the irrelevant cells from further consideration
When finish examining the current layer, proceed to the
next lower level
Repeat this process until the bottom layer is reached
Advantages:
◼ Query-independent, easy to parallelize, incremental
update
◼ O(K), where K is the number of grid cells at the lowest
level
Disadvantages:
◼ All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
686
CLIQUE (Clustering In QUEst)
◼
◼
◼
Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
Automatically identifying subspaces of a high dimensional data space
that allow better clustering than original space
CLIQUE can be considered as both density-based and grid-based
◼
◼
◼
◼
It partitions each dimension into the same number of equal-length
intervals
It partitions an m-dimensional data space into non-overlapping
rectangular units
A unit is dense if the fraction of total data points contained in the
unit exceeds the input model parameter
A cluster is a maximal set of connected dense units within a
subspace
687
CLIQUE: The Major Steps
◼
◼
◼
Partition the data space and find the number of points that
lie inside each cell of the partition.
Identify the subspaces that contain clusters using the
Apriori principle
Identify clusters
◼
◼
◼
Determine dense units in all subspaces of interest
Determine connected dense units in all subspaces of
interest.
Generate minimal description for the clusters
◼ Determine maximal regions that cover a cluster of
connected dense units for each cluster
◼ Determination of minimal cover for each cluster
688
=3
30
40
Vacation
20
50
Salary
(10,000)
0 1 2 3 4 5 6 7
30
Vacation
(week)
0 1 2 3 4 5 6 7
age
60
20
30
40
50
age
60
50
age
689
Strength and Weakness of
CLIQUE
◼
Strength
◼
◼
automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
◼ insensitive to the order of records in input and does not
presume some canonical data distribution
◼ scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
Weakness
◼ The accuracy of the clustering result may be degraded
at the expense of simplicity of the method
690
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
◼
Cluster Analysis: Basic Concepts
◼
Partitioning Methods
◼
Hierarchical Methods
◼
Density-Based Methods
◼
Grid-Based Methods
◼
Evaluation of Clustering
◼
Summary
691
Assessing Clustering Tendency
◼
◼
Assess if non-random structure exists in the data by measuring the
probability that the data is generated by a uniform data distribution
Test spatial randomness by a statistical test: the Hopkins statistic
◼ Given a dataset D regarded as a sample of a random variable o,
determine how far away o is from being uniformly distributed in
the data space
◼ Sample n points, p1, …, pn, uniformly from D. For each pi, find its
nearest neighbor in D: xi = min{dist (pi, v)} where v in D
◼ Sample n points, q1, …, qn, uniformly from D. For each qi, find its
nearest neighbor in D – {qi}: yi = min{dist (qi, v)} where v in D
and v ≠ qi
◼ Calculate the Hopkins Statistic:
◼
If D is uniformly distributed, ∑ xi and ∑ yi will be close to each
other and H is close to 0.5. If D is highly skewed, H is close to 0
692
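A hedged sketch of the Hopkins statistic: here the p_i are drawn uniformly from the data's bounding box (the usual formulation) and the q_i from the data set itself, so the details may differ slightly from the slide's wording.

```python
import numpy as np

def hopkins(D, n=50, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = D.min(axis=0), D.max(axis=0)
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))       # uniform points in the data space
    Q = D[rng.choice(len(D), size=n, replace=False)]    # sample points from D
    x = np.linalg.norm(P[:, None, :] - D[None, :, :], axis=2).min(axis=1)
    dQ = np.linalg.norm(Q[:, None, :] - D[None, :, :], axis=2)
    dQ[dQ == 0] = np.inf                                # exclude the point itself
    y = dQ.min(axis=1)
    return y.sum() / (x.sum() + y.sum())                # ~0.5 for uniform data, ~0 if skewed
```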
Determine the Number of Clusters
◼
◼
◼
Empirical method
◼ # of clusters ≈ √(n/2) for a data set of n points
Elbow method
◼ Use the turning point in the curve of sum of within cluster variance
w.r.t the # of clusters
Cross validation method
◼ Divide a given data set into m parts
◼ Use m – 1 parts to obtain a clustering model
◼ Use the remaining part to test the quality of the clustering
◼ E.g., For each point in the test set, find the closest centroid, and
use the sum of squared distance between all points in the test
set and the closest centroids to measure how well the model fits
the test set
◼ For any k > 0, repeat it m times, compare the overall quality
measure w.r.t. different k’s, and find the # of clusters that fits
the data the best
693
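A hedged sketch of the elbow method listed above, assuming scikit-learn's KMeans is available; the within-cluster sum of squares is read from the fitted model's inertia_ attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 11)]
# Plot sse against k and pick the k at the turning point ("elbow") where the
# decrease in within-cluster variance levels off.
```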
Measuring Clustering Quality
◼
Two methods: extrinsic vs. intrinsic
◼
Extrinsic: supervised, i.e., the ground truth is available
◼
◼
◼
Compare a clustering against the ground truth using
certain clustering quality measure
Ex. BCubed precision and recall metrics
Intrinsic: unsupervised, i.e., the ground truth is unavailable
◼
◼
Evaluate the goodness of a clustering by considering
how well the clusters are separated, and how compact
the clusters are
Ex. Silhouette coefficient
694
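A hedged usage sketch of an intrinsic measure, the silhouette coefficient, assuming scikit-learn is available.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)   # close to 1: compact, well-separated clusters
```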
Measuring Clustering Quality: Extrinsic
Methods
◼
◼
Clustering quality measure: Q(C, Cg), for a clustering C
given the ground truth Cg.
Q is good if it satisfies the following 4 essential criteria
◼ Cluster homogeneity: the purer, the better
◼ Cluster completeness: objects belonging to the same category
in the ground truth should be assigned to the same
cluster
◼ Rag bag: putting a heterogeneous object into a pure
cluster should be penalized more than putting it into a
rag bag (i.e., “miscellaneous” or “other” category)
◼ Small cluster preservation: splitting a small category
into pieces is more harmful than splitting a large
category into pieces
695
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
◼
Cluster Analysis: Basic Concepts
◼
Partitioning Methods
◼
Hierarchical Methods
◼
Density-Based Methods
◼
Grid-Based Methods
◼
Evaluation of Clustering
◼
Summary
696
Summary
◼
◼
◼
◼
◼
◼
◼
◼
Cluster analysis groups objects based on their similarity and has
wide applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a
subspace clustering algorithm
Quality of clustering results can be evaluated in various ways
697
CS512-Spring 2011: An
Introduction
◼
Coverage
◼
Cluster Analysis: Chapter 11
◼
Outlier Detection: Chapter 12
◼
Mining Sequence Data: BK2: Chapter 8
◼
Mining Graphs Data: BK2: Chapter 9
◼
Social and Information Network Analysis
◼
◼
◼
◼
◼
◼
BK2: Chapter 9
Partial coverage: Mark Newman: “Networks: An Introduction”, Oxford U.,
2010
Scattered coverage: Easley and Kleinberg, “Networks, Crowds, and Markets:
Reasoning About a Highly Connected World”, Cambridge U., 2010
Recent research papers
Mining Data Streams: BK2: Chapter 8
Requirements
◼
One research project
◼
One class presentation (15 minutes)
◼
Two homeworks (no programming assignment)
◼
Two midterm exams (no final exam)
698
References (1)
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points
to identify the clustering structure, SIGMOD’99.
Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02
M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based
Local Outliers. SIGMOD 2000.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
V. Ganti, J. Gehrke, R. Ramakrishan. CACTUS Clustering Categorical Data
Using Summaries. KDD'99.
699
References (2)
◼
◼
◼
◼
◼
◼
◼
◼
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. In Proc. VLDB’98.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March
1999.
A. Hinneburg and D. A. Keim: An Efficient Approach to Clustering in Large
Multimedia Databases with Noise. KDD’98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering
Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.
700
References (3)
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A
Review, SIGKDD Explorations, 6(1), June 2004
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering
in Large Databases, ICDT'01.
A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles,
ICDE'01
H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data
sets, SIGMOD’02
W. Wang, J. Yang, and R. Muntz, STING: A Statistical Information Grid Approach to Spatial
Data Mining, VLDB’97
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering method
for very large databases. SIGMOD'96
X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic
Links”, VLDB'06
701
Slides unused in class
702
A Typical K-Medoids Algorithm (PAM)
K = 2; Total Cost = 20
(Figure: PAM on a toy 2-D data set — the candidate swap shown has Total Cost = 26.)
◼ Arbitrarily choose k objects as the initial medoids
◼ Assign each remaining object to the nearest medoid
◼ Do loop, until no change:
  ◼ Randomly select a nonmedoid object, Orandom
  ◼ Compute the total cost of swapping a medoid O with Orandom
  ◼ Swap O and Orandom if the quality is improved
703
PAM (Partitioning Around Medoids)
(1987)
◼
PAM (Kaufman and Rousseeuw, 1987), built in Splus
◼
Use real object to represent the cluster
◼
◼
◼
Select k representative objects arbitrarily
For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
For each pair of i and h,
◼
◼
◼
If TCih < 0, i is replaced by h
Then assign each non-selected object to the most
similar representative object
repeat steps 2-3 until there is no change
704
PAM Clustering: Finding the Best Cluster
Center
◼
Case 1: p currently belongs to oj. If oj is replaced by orandom as a
representative object and p is the closest to one of the other
representative object oi, then p is reassigned to oi
705
What Is the Problem with PAM?
◼
◼
Pam is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced by
outliers or other extreme values than a mean
Pam works efficiently for small data sets but does not
scale well for large data sets.
◼
O(k(n−k)²) for each iteration, where n is the # of data points and k is the # of clusters
➔ Sampling-based method: CLARA (Clustering LARge Applications)
706
CLARA (Clustering Large Applications)
(1990)
◼
CLARA (Kaufmann and Rousseeuw in 1990)
◼
◼
Built in statistical analysis packages, such as SPlus
It draws multiple samples of the data set, applies
PAM on each sample, and gives the best clustering as
the output
◼
Strength: deals with larger data sets than PAM
◼
Weakness:
◼
◼
Efficiency depends on the sample size
A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
707
CLARANS (“Randomized” CLARA)
(1994)
◼
◼
◼
CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
◼ Draws sample of neighbors dynamically
◼ The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
◼ If the local optimum is found, it starts with new
randomly selected node in search for a new local
optimum
Advantages: More efficient and scalable than both PAM
and CLARA
Further improvement: Focusing techniques and spatial
access structures (Ester et al.’95)
708
ROCK: Clustering Categorical Data
◼
◼
◼
◼
ROCK: RObust Clustering using linKs
◼ S. Guha, R. Rastogi & K. Shim, ICDE’99
Major ideas
◼ Use links to measure similarity/proximity
◼ Not distance-based
Algorithm: sampling-based clustering
◼ Draw random sample
◼ Cluster with links
◼ Label data in disk
Experiments
◼ Congressional voting, mushroom data
709
Similarity Measure in ROCK
◼
◼
◼
◼
Traditional measures for categorical data may not work well, e.g.,
Jaccard coefficient
Example: Two groups (clusters) of transactions
◼
C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c,
e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
◼
C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Jaccard coefficient may lead to wrong clustering results
◼
Within C1: ranges from 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})
◼
Across C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
◼ Jaccard coefficient–based similarity function:
    Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
◼ Ex. Let T1 = {a, b, c}, T2 = {c, d, e}
    Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
710
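The Jaccard-coefficient computation above, as a tiny Python sketch.

```python
def jaccard(T1, T2):
    T1, T2 = set(T1), set(T2)
    return len(T1 & T2) / len(T1 | T2)

jaccard({'a', 'b', 'c'}, {'c', 'd', 'e'})   # 1/5 = 0.2, as in the example
jaccard({'a', 'b', 'c'}, {'a', 'b', 'f'})   # 2/4 = 0.5, the cross-cluster case above
```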
Link Measure in ROCK
◼
◼
Clusters
◼
C1:<a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a,
d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
◼
C2: <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Neighbors
◼
Two transactions are neighbors if sim(T1,T2) > threshold
Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
◼ T1 connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e},
{a,b,f}, {a,b,g}
◼ T2 connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d}
◼ T3 connected to: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,g}, {a,f,g}, {b,f,g}
Link Similarity
◼
Link similarity between two transactions is the # of common neighbors
◼
◼
◼
link(T1, T2) = 4, since they have 4 common neighbors
◼
◼
{a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
link(T1, T3) = 3, since they have 3 common neighbors
◼
{a, b, d}, {a, b, e}, {a, b, g}
711
Aggregation-Based Similarity Computation
(Figure: two SimTrees ST1 and ST2 — leaf nodes n10, n11, n12 link to n4 in ST1 with
weights 0.9, 1.0, 0.8, leaf nodes n13, n14 link to n5 in ST2 with weights 0.9, 1.0,
and s(n4, n5) = 0.2; node a links to n10–n12 and node b to n13–n14.)
For each node nk ∈ {n10, n11, n12} and nl ∈ {n13, n14}, their path-based similarity is
simp(nk, nl) = s(nk, n4) · s(n4, n5) · s(n5, nl).
    sim(na, nb) = [ Σ_{k=10..12} s(nk, n4) / 3 ] · s(n4, n5) · [ Σ_{l=13..14} s(nl, n5) / 2 ] = 0.171
which takes O(3 + 2) time.
After aggregation, we reduce the quadratic-time computation to linear time.
713
Computing Similarity with Aggregation
(Figure: aggregated similarities stored as (average similarity, total weight) —
a: (0.9, 3) w.r.t. n4 and b: (0.95, 2) w.r.t. n5, with s(n4, n5) = 0.2.)
sim(na, nb) can be computed from the aggregated similarities:
    sim(na, nb) = avg_sim(na, n4) × s(n4, n5) × avg_sim(nb, n5) = 0.9 × 0.2 × 0.95 = 0.171
To compute sim(na,nb):
◼
◼
◼
Find all pairs of sibling nodes ni and nj, so that na linked with ni and nb
with nj.
Calculate similarity (and weight) between na and nb w.r.t. ni and nj.
Calculate weighted average similarity between na and nb w.r.t. all such
pairs.
714
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
◼
Cluster Analysis: Basic Concepts
◼
Overview of Clustering Methods
◼
Partitioning Methods
◼
Hierarchical Methods
◼
Density-Based Methods
◼
Grid-Based Methods
◼
Summary
715
Link-Based Clustering: Calculate
Similarities Based On Links
(Figure: a linkage graph among Authors {Tom, Mike, Cathy, John, Mary},
Proceedings {sigmod03–05, vldb03–05, aaai04–05}, and Conferences {sigmod, vldb, aaai}.)
◼ Jeh & Widom, KDD’2002: SimRank — two objects are similar if they are linked
  with the same or similar objects
◼ The similarity between two objects a and b is defined as the average similarity
  between the objects linked with a and those linked with b:
    sim(a, b) = C / (|I(a)| |I(b)|) · Σ_{i=1}^{|I(a)|} Σ_{j=1}^{|I(b)|} sim(I_i(a), I_j(b))
◼ Issue: expensive to compute — for a dataset of N objects and M links, it takes
  O(N²) space and O(M²) time to compute all similarities
716
Observation 1: Hierarchical Structures
◼
Hierarchical structures often exist naturally among objects
(e.g., taxonomy of animals)
(Figures: a hierarchical structure of products in Walmart — All → {grocery, electronics,
apparel}, with electronics → {TV, DVD, camera} — and the relationships between articles
and words (Chakrabarti, Papadimitriou, Modha, Faloutsos, 2004).)
717
Observation 2: Distribution of Similarity
(Figure: histogram of the portion of entries vs. similarity value — the distribution of
SimRank similarities among DBLP authors is concentrated at very small values.)
◼
Power law distribution exists in similarities
◼ 56% of similarity entries are in [0.005, 0.015]
◼ 1.4% of similarity entries are larger than 0.1
◼ Can we design a data structure that stores the significant
similarities and compresses insignificant ones?
718
A Novel Data Structure: SimTree
◼ Each non-leaf node represents a group of similar lower-level nodes
◼ Each leaf node represents an object
◼ Similarities between siblings are stored
(Figure: an example SimTree over consumer electronics — leaves such as “Canon A40
digital camera” and “Sony V3 digital camera” grouped under “Digital Cameras”, which
sits under “Consumer electronics” alongside “TVs” and “Apparels”.)
719
Similarity Defined by SimTree
(Figure: a SimTree with internal nodes n1, n2, n3, children n4, n5, n6, and leaves
n7, n8, n9; stored sibling similarities include s(n1, n2) = 0.2 and s(n4, n5) = 0.3,
and values such as 0.8, 0.9, 1.0 on the edges are adjustment ratios.)
◼ Similarity between two sibling nodes n1 and n2 is stored directly
◼ Path-based node similarity: simp(n7, n8) = s(n7, n4) × s(n4, n5) × s(n5, n8)
◼ Similarity between two nodes is the average similarity between the objects
  linked with them in other SimTrees
◼ Adjustment ratio for node x = (average similarity between x and all other nodes) /
  (average similarity between x’s parent and all other nodes)
720
LinkClus: Efficient Clustering via
Heterogeneous Semantic Links
Method
◼ Initialize a SimTree for objects of each type
◼ Repeat until stable
◼ For each SimTree, update the similarities between its
nodes using similarities in other SimTrees
◼ Similarity between two nodes x and y is the average
similarity between objects linked with them
◼ Adjust the structure of each SimTree
◼ Assign each node to the parent node that it is most
similar to
For details: X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient
Clustering via Heterogeneous Semantic Links”, VLDB'06
721
Initialization of SimTrees
◼
◼
Initializing a SimTree
◼ Repeatedly find groups of tightly related nodes, which
are merged into a higher-level node
Tightness of a group of nodes
◼ For a group of nodes {n1, …, nk}, its tightness is
defined as the number of leaf nodes in other SimTrees
that are connected to all of {n1, …, nk}
(Figure: nodes n1 and n2 and leaf nodes 1–5 in another SimTree; three of the leaves
are connected to both n1 and n2, so the tightness of {n1, n2} is 3.)
722
Finding Tight Groups by Freq. Pattern
Mining
◼ Finding tight groups is reduced to frequent pattern mining: the tightness of a
  group of nodes is the support of a frequent pattern
(Figure: leaf nodes 1–9 in another SimTree link to nodes n1–n4, giving the
transactions {n1}, {n1, n2}, {n2}, {n1, n2}, {n1, n2}, {n2, n3, n4}, {n4}, {n3, n4},
{n3, n4}; the groups g1 = {n1, n2} and g2 = {n3, n4} emerge as frequent patterns.)
◼
Procedure of initializing a tree
◼ Start from leaf nodes (level-0)
◼ At each level l, find non-overlapping groups of similar
nodes with frequent pattern mining
723
Adjusting SimTree Structures
(Figure: a node (n7) that is more similar to its parent’s sibling is moved to become a
child of that sibling.)
◼
After similarity changes, the tree structure also needs to be
changed
◼ If a node is more similar to its parent’s sibling, then move
it to be a child of that sibling
◼ Try to move each node to its parent’s sibling that it is
most similar to, under the constraint that each parent
node can have at most c children
724
Complexity
For two types of objects, N in each, and M linkages between them:
                                Time              Space
    Updating similarities       O(M (log N)²)     O(M + N)
    Adjusting tree structures   O(N)              O(N)
    LinkClus (overall)          O(M (log N)²)     O(M + N)
    SimRank                     O(M²)             O(N²)
725
Experiment: Email Dataset
◼
F. Nielsen. Email dataset: www.imm.dtu.dk/~rem/data/Email-1431.zip
370 emails on conferences, 272 on jobs, and 789 spam emails
◼
Accuracy of clustering: measured against manually labeled data, as the % of
pairs of objects in the same cluster that share a common label
◼
Results (accuracy and running time in seconds):
    LinkClus    0.8026    1579.6
    SimRank     0.7965    39160
    ReCom       0.5711    74.6
    F-SimRank   0.3688    479.7
    CLARANS     0.4768    8.55
Approaches compared:
◼
SimRank (Jeh & Widom, KDD 2002): Computing pair-wise similarities
◼
SimRank with FingerPrints (F-SimRank): Fogaras & R´acz, WWW 2005
◼
◼
pre-computes a large sample of random paths from each object and uses
samples of two objects to estimate SimRank similarity
ReCom (Wang et al. SIGIR 2003)
◼
Iteratively clustering objects using cluster labels of linked objects
726
WaveCluster: Clustering by Wavelet Analysis
(1998)
◼
◼
◼
Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
A multi-resolution clustering approach which applies wavelet transform
to the feature space; both grid-based and density-based
Wavelet transform: A signal processing technique that decomposes a
signal into different frequency sub-bands
◼ Data are transformed to preserve relative distance between objects
at different levels of resolution
◼ Allows natural clusters to become more distinguishable
727
The WaveCluster Algorithm
◼
◼
How to apply wavelet transform to find clusters
◼ Summarizes the data by imposing a multidimensional grid
structure onto data space
These multidimensional spatial data objects are represented in an
n-dimensional feature space
◼
Apply wavelet transform on the feature space to find the dense
regions in the feature space
◼ Apply wavelet transform multiple times, which results in clusters at
Major features:
◼ Complexity O(N)
◼ Detect arbitrary shaped clusters at different scales
◼ Not sensitive to noise, not sensitive to input order
◼ Only applicable to low dimensional data
728
Quantization
& Transformation
◼
Quantize data into m-D grid structure,
then wavelet transform
a) scale 1: high resolution
b) scale 2: medium resolution
c) scale 3: low resolution
729
Data Mining:
Concepts and
Techniques
(3rd ed.)
— Chapter 12 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
730
Chapter 12. Outlier Analysis
◼
Outlier and Outlier Analysis
◼
Outlier Detection Methods
◼
Statistical Approaches
◼
Proximity-Base Approaches
◼
Clustering-Base Approaches
◼
Classification Approaches
◼
Mining Contextual and Collective Outliers
◼
Outlier Detection in High Dimensional Data
◼
Summary
731
What Are Outliers?
◼
◼
◼
◼
◼
Outlier: A data object that deviates significantly from the normal
objects as if it were generated by a different mechanism
◼ Ex.: Unusual credit card purchases; sports: Michael Jordan, Wayne
Gretzky, ...
Outliers are different from the noise data
◼ Noise is random error or variance in a measured variable
◼ Noise should be removed before outlier detection
Outliers are interesting: they violate the mechanism that generates the
normal data
Outlier detection vs. novelty detection: early stage, outlier; but later
merged into the model
Applications:
◼ Credit card fraud detection
◼ Telecom fraud detection
◼ Customer segmentation
732
Types of Outliers (I)
◼
◼
◼
Three kinds: global, contextual and collective outliers
Global Outlier
Global outlier (or point anomaly)
◼ Object is Og if it significantly deviates from the rest of the data set
◼ Ex. Intrusion detection in computer networks
◼ Issue: Find an appropriate measurement of deviation
Contextual outlier (or conditional outlier)
◼ Object is Oc if it deviates significantly based on a selected context
◼ Ex. 80 °F in Urbana: an outlier? (depends on whether it is summer or winter)
◼ Attributes of data objects should be divided into two groups
◼ Contextual attributes: defines the context, e.g., time & location
◼ Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
◼ Can be viewed as a generalization of local outliers—whose density
significantly deviates from its local area
◼ Issue: How to define or formulate meaningful context?
733
Types of Outliers (II)
◼
Collective Outliers
◼
◼
A subset of data objects collectively deviate
significantly from the whole data set, even if the
individual data objects may not be outliers
Applications: E.g., intrusion detection:
◼
Collective Outlier
When a number of computers keep sending
denial-of-service packages to each other
Detection of collective outliers
◼ Consider not only behavior of individual objects, but also that of
groups of objects
◼ Need to have the background knowledge on the relationship
among data objects, such as a distance or similarity measure
on objects.
A data set may have multiple types of outlier
One object may belong to more than one type of outlier
◼
◼
◼
734
Challenges of Outlier Detection
◼
◼
◼
◼
Modeling normal objects and outliers properly
◼ Hard to enumerate all possible normal behaviors in an application
◼ The border between normal and outlier objects is often a gray area
Application-specific outlier detection
◼ Choice of distance measure among objects and the model of
relationship among objects are often application-dependent
◼ E.g., clinic data: a small deviation could be an outlier; while in
marketing analysis, larger fluctuations
Handling noise in outlier detection
◼ Noise may distort the normal objects and blur the distinction
between normal objects and outliers. It may help hide outliers and
reduce the effectiveness of outlier detection
Understandability
◼ Understand why these are outliers: Justification of the detection
◼ Specify the degree of an outlier: the unlikelihood of the object being
generated by a normal mechanism
735
Chapter 12. Outlier Analysis
◼
Outlier and Outlier Analysis
◼
Outlier Detection Methods
◼
Statistical Approaches
◼
Proximity-Base Approaches
◼
Clustering-Base Approaches
◼
Classification Approaches
◼
Mining Contextual and Collective Outliers
◼
Outlier Detection in High Dimensional Data
◼
Summary
736
Outlier Detection I: Supervised Methods
◼
◼
Two ways to categorize outlier detection methods:
◼ Based on whether user-labeled examples of outliers can be obtained:
◼ Supervised, semi-supervised vs. unsupervised methods
◼ Based on assumptions about normal data and outliers:
◼ Statistical, proximity-based, and clustering-based methods
Outlier Detection I: Supervised Methods
◼ Modeling outlier detection as a classification problem
◼ Samples examined by domain experts used for training & testing
◼ Methods for Learning a classifier for outlier detection effectively:
◼ Model normal objects & report those not matching the model as
outliers, or
◼ Model outliers and treat those not matching the model as normal
◼ Challenges
◼ Imbalanced classes, i.e., outliers are rare: Boost the outlier class
and make up some artificial outliers
◼ Catch as many outliers as possible, i.e., recall is more important
than accuracy (i.e., not mislabeling normal objects as outliers)
737
Outlier Detection II: Unsupervised Methods
◼
◼
◼
◼
◼
Assume the normal objects are somewhat “clustered” into multiple
groups, each having some distinct features
An outlier is expected to be far away from any groups of normal objects
Weakness: Cannot detect collective outlier effectively
◼ Normal objects may not share any strong patterns, but the collective
outliers may share high similarity in a small area
Ex. In some intrusion or virus detection, normal activities are diverse
◼ Unsupervised methods may have a high false positive rate but still
miss many real outliers.
◼ Supervised methods can be more effective, e.g., identify attacking
some key resources
Many clustering methods can be adapted for unsupervised methods
◼ Find clusters, then outliers: not belonging to any cluster
◼ Problem 1: Hard to distinguish noise from outliers
◼ Problem 2: Costly since first clustering: but far less outliers than
normal objects
◼ Newer methods: tackle outliers directly
738
Outlier Detection III: Semi-Supervised
Methods
◼
Situation: In many applications, the number of labeled data is often
small: Labels could be on outliers only, normal objects only, or both
◼
Semi-supervised outlier detection: Regarded as applications of semi-supervised learning
◼
If some labeled normal objects are available
◼
Use the labeled examples and the proximate unlabeled objects to
train a model for normal objects
◼
Those not fitting the model of normal objects are detected as
outliers
◼
If only some labeled outliers are available, a small number of labeled
outliers may not cover the possible outliers well
◼
To improve the quality of outlier detection, one can get help from
models for normal objects learned from unsupervised methods
739
Outlier Detection (1): Statistical Methods
◼
Statistical methods (also known as model-based methods) assume that
the normal data follow some statistical model (a stochastic model)
◼
◼
The data not following the model are outliers.
Example (right figure): First use Gaussian distribution
to model the normal data
◼ For each object y in region R, estimate gD(y), the
probability that y fits the Gaussian distribution
◼ If gD(y) is very low, y is unlikely generated by the
Gaussian model, thus an outlier
◼
Effectiveness of statistical methods: highly depends on whether the
assumption of statistical model holds in the real data
◼
There are rich alternatives to use various statistical models
◼
E.g., parametric vs. non-parametric
740
Outlier Detection (2): Proximity-Based
Methods
◼
◼
◼
◼
◼
◼
An object is an outlier if the nearest neighbors of the object are far
away, i.e., the proximity of the object deviates significantly
from the proximity of most of the other objects in the same data set
Example (right figure): Model the proximity of an
object using its 3 nearest neighbors
◼
Objects in region R are substantially different
from other objects in the data set.
◼
Thus the objects in R are outliers
The effectiveness of proximity-based methods highly relies on the
proximity measure.
In some applications, proximity or distance measures cannot be
obtained easily.
Often have a difficulty in finding a group of outliers which stay close to
each other
Two major types of proximity-based outlier detection
◼ Distance-based vs. density-based
741
Outlier Detection (3): Clustering-Based
Methods
Normal data belong to large and dense clusters, whereas
outliers belong to small or sparse clusters, or do not belong
to any clusters
◼
◼
◼
◼
Example (right figure): two clusters
◼
All points not in R form a large cluster
◼
The two points in R form a tiny cluster,
thus are outliers
Since there are many clustering methods, there are many
clustering-based outlier detection methods as well
Clustering is expensive: straightforward adaption of a
clustering method for outlier detection can be costly and
does not scale up well for large data sets
742
Chapter 12. Outlier Analysis
◼
Outlier and Outlier Analysis
◼
Outlier Detection Methods
◼
Statistical Approaches
◼
Proximity-Base Approaches
◼
Clustering-Base Approaches
◼
Classification Approaches
◼
Mining Contextual and Collective Outliers
◼
Outlier Detection in High Dimensional Data
◼
Summary
743
Statistical Approaches
◼
◼
◼
◼
◼
Statistical approaches assume that the objects in a data set are
generated by a stochastic process (a generative model)
Idea: learn a generative model fitting the given data set, and then
identify the objects in low probability regions of the model as outliers
Methods are divided into two categories: parametric vs. non-
parametric
Parametric method
◼ Assumes that the normal data is generated by a parametric
distribution with parameter θ
◼ The probability density function of the parametric distribution f(x,
θ) gives the probability that object x is generated by the
distribution
◼ The smaller this value, the more likely x is an outlier
Non-parametric method
◼ Does not assume an a priori statistical model; the model is determined
from the input data
◼ Not completely parameter-free, but the number and nature of the
parameters are flexible and not fixed in advance
744
Univariate Outliers Based on Normal
Distribution
◼
◼
◼
Univariate data: A data set involving only one attribute or variable
Often assume that data are generated from a normal distribution, learn
the parameters from the input data, and identify the points with low
probability as outliers
Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3,
29.4}
◼
◼
◼
◼
Use the maximum likelihood method to estimate μ and σ
Taking derivatives with respect to μ and σ2, we derive the following
maximum likelihood estimates
For the above data with n = 10, we have
Then (24 – 28.61) /1.51 = – 3.04 < –3, 24 is an outlier since
745
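A sketch of the computation above on the temperature data; exact figures depend on rounding and may differ slightly from the slide's numbers.

```python
import numpy as np

temps = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])
mu = temps.mean()            # maximum likelihood estimate of the mean (~28.61)
sigma = temps.std()          # MLE of the standard deviation (divides by n)
z = (temps - mu) / sigma
# The 24.0 reading has |z| of roughly 3, far larger than any other point,
# so it is the one flagged as an outlier under the 3-sigma rule used above.
```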
Parametric Methods I: Grubbs’ Test
◼
Univariate outlier detection: Grubbs’ test (maximum normed
residual test) ─ another statistical method under the normal distribution
◼
For each object x in a data set, compute its z-score: x is an outlier if
where
is the value taken by a t-distribution at a
significance level of α/(2N), and N is the # of objects in the data
set
746
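A minimal sketch of Grubbs' test using SciPy's t-distribution; the critical value follows the maximum-normed-residual formula stated above, and the data set reused here is the temperature example from the previous slide:

```python
import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    """One round of Grubbs' test: return the most extreme value if it is an
    outlier at level alpha, else None. Assumes approximately normal data."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = np.abs(x - x.mean()) / x.std(ddof=1)   # z-scores using the sample std
    g = z.max()                                # maximum normed residual
    # Critical value from a t-distribution at significance level alpha/(2N)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return x[z.argmax()] if g > g_crit else None

print(grubbs_outlier([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]))
# -> 24.0 is flagged at alpha = 0.05
```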
Parametric Methods II: Detection of
Multivariate Outliers
◼
Multivariate data: A data set involving two or more attributes or
variables
◼
Transform the multivariate outlier detection task into a univariate
outlier detection problem
◼
Method 1. Compute the Mahalanobis distance
◼
Let ō be the mean vector for a multivariate data set. The Mahalanobis
distance from an object o to ō is MDist(o, ō) = (o – ō)T S–1 (o – ō),
where S is the covariance matrix
◼
◼
Use Grubbs’ test on this one-dimensional measure to detect outliers
Method 2. Use the χ2-statistic: χ2 = Σi (oi – Ei)2 / Ei, summed over the n dimensions
◼
where Ei is the mean of the i-th dimension among all objects, oi is the value of
object o on the i-th dimension, and n is the dimensionality
◼
If the χ2-statistic is large, then object o is an outlier
747
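A sketch of Method 1 in NumPy: compute the Mahalanobis distance of each object to the mean vector, producing a one-dimensional measure to which Grubbs' test (or a simple cutoff) can then be applied. The function name is illustrative:

```python
import numpy as np

def mahalanobis_dist(X):
    """MDist(o, o_bar) = (o - o_bar)^T S^-1 (o - o_bar) for each row o of X,
    where o_bar is the mean vector and S is the sample covariance matrix."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # per-row quadratic form diff_i^T S^-1 diff_i
    return np.einsum('ij,jk,ik->i', diff, S_inv, diff)

# Grubbs' test (or a simple threshold) can then be applied to this 1-D measure
# to decide which objects are multivariate outliers.
```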
Parametric Methods III: Using Mixture of
Parametric Distributions
◼
Assuming the data are generated by a single normal distribution
can sometimes be overly simplistic
◼
Example (right figure): The objects between the
two clusters cannot be captured as outliers since
they are close to the estimated mean
◼
To overcome this problem, assume the normal data is generated by two
normal distributions. For any object o in the data set, the probability that
o is generated by the mixture of the two distributions is given by
Pr(o | θ1, θ2) = fθ1(o) + fθ2(o),
where fθ1 and fθ2 are the probability density functions of θ1 and θ2
◼
Then use EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from data
◼
An object o is an outlier if it does not belong to any cluster
748
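A hedged sketch of the mixture idea using scikit-learn's GaussianMixture (which runs EM internally to learn the component parameters); the synthetic data set and the 1% density cutoff are purely illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated normal clusters plus one point between them
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(10, 1, 200), [5.0]]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_density = gmm.score_samples(X)          # log p(x) under the fitted mixture

threshold = np.quantile(log_density, 0.01)  # illustrative cutoff: lowest 1% of densities
outliers = X[log_density < threshold]
print(outliers.ravel())                     # the point at 5.0 (between the clusters) is flagged
```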
Non-Parametric Methods: Detection Using
Histogram
◼
◼
◼
◼
◼
The model of normal data is learned from the
input data without any a priori structure.
Often makes fewer assumptions about the data,
and thus can be applicable in more scenarios
Outlier detection using histogram:
◼
Figure shows the histogram of purchase amounts in transactions
◼
A transaction in the amount of $7,500 is an outlier, since only 0.2% of
transactions have an amount higher than $5,000
Problem: It is hard to choose an appropriate bin size for the histogram
◼
Bin size too small → normal objects fall into empty or rare bins: false positives
◼
Bin size too big → outliers fall into some frequent bins: false negatives
Solution: Adopt kernel density estimation to estimate the probability
density distribution of the data. If the estimated density function is high,
the object is likely normal. Otherwise, it is likely an outlier.
749
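A sketch of the two non-parametric scores discussed above, on synthetic purchase amounts (the data, bin count, and variable names are illustrative); the histogram score is the fraction of transactions in the object's bin, and the kernel density estimate avoids choosing a bin size explicitly:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
amounts = np.concatenate([rng.gamma(shape=2.0, scale=50.0, size=10_000), [7500.0]])

# Histogram-based score: fraction of transactions in each object's bin
counts, edges = np.histogram(amounts, bins=50)
bin_idx = np.clip(np.digitize(amounts, edges) - 1, 0, len(counts) - 1)
hist_score = counts[bin_idx] / len(amounts)

# Kernel density estimation as the smoother alternative
kde = gaussian_kde(amounts)
density = kde(amounts)

print(hist_score[-1], density[-1])   # both scores are near zero for the $7,500 transaction
```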
Chapter 12. Outlier Analysis
◼
Outlier and Outlier Analysis
◼
Outlier Detection Methods
◼
Statistical Approaches
◼
Proximity-Based Approaches
◼
Clustering-Based Approaches
◼
Classification Approaches
◼
Mining Contextual and Collective Outliers
◼
Outlier Detection in High Dimensional Data
◼
Summary
750
Proximity-Based Approaches: Distance-Based
vs. Density-Based Outlier Detection
◼
◼
◼
Intuition: Objects that are far away from the others are
outliers
Assumption of proximity-based approach: The proximity of
an outlier deviates significantly from that of most of the
others in the data set
Two types of proximity-based outlier detection methods
◼
◼
Distance-based outlier detection: An object o is an
outlier if its neighborhood does not have enough other
points
Density-based outlier detection: An object o is an
outlier if its density is relatively much lower than that of
its neighbors
751
Distance-Based Outlier Detection
◼
◼
◼
◼
◼
For each object o, examine the # of other objects in the r-neighborhood of o, where r is a user-specified distance threshold
An object o is an outlier if most (taking π as a fraction threshold) of
the objects in D are far away from o, i.e., not in the r-neighborhood of
o
An object o is a DB(r, π) outlier if ||{o’ | dist(o, o’) ≤ r}|| / ||D|| ≤ π
Equivalently, one can check the distance between o and its k-th
nearest neighbor ok, where k = ⌈π·||D||⌉. o is an outlier if dist(o,
ok) > r
Efficient computation: Nested loop algorithm
◼
◼
For any object oi, calculate its distance from other objects, and
count the # of other objects in the r-neighborhood.
If π∙n other objects are within r distance, terminate the inner loop
752
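A minimal sketch of the nested-loop DB(r, π) detector with the early-termination test described above; parameter names mirror the slide:

```python
import numpy as np

def db_outliers(X, r, pi):
    """Nested-loop detection of DB(r, pi) outliers: an object is an outlier
    if fewer than ceil(pi * n) other objects lie within distance r of it."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    needed = int(np.ceil(pi * n))          # early-termination threshold
    outliers = []
    for i in range(n):
        count = 0
        for j in range(n):                 # inner loop over the other objects
            if i != j and np.linalg.norm(X[i] - X[j]) <= r:
                count += 1
                if count >= needed:        # enough neighbors: o_i cannot be an outlier
                    break
        else:                              # inner loop never terminated early
            outliers.append(i)
    return outliers
```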
Distance-Based Outlier Detection: A Grid-Based
Method
◼
◼
◼
◼
◼
Why is efficiency still a concern? When the complete set of objects
cannot be held in main memory, I/O swapping is costly
The major cost: (1) each object is tested against the whole data set, rather than
only against its close neighbors; (2) objects are checked one by one, rather than
group by group
Grid-based method (CELL): The data space is partitioned into a multi-D
grid. Each cell is a hypercube with diagonal length r/2
Pruning using the level-1 & level 2 cell properties:
◼
For any possible point x in cell C and any
possible point y in a level-1 cell, dist(x,y) ≤ r
◼
For any possible point x in cell C and any point y
such that dist(x,y) ≥ r, y is in a level-2 cell
Thus we only need to check the objects that cannot be pruned, and
even for such an object o, only need to compute the distance between
o and the objects in the level-2 cells (since beyond level-2, the
distance from o is more than r)
753
Density-Based Outlier Detection
◼
◼
Local outliers: Outliers relative to their local
neighborhoods, instead of the global data
distribution
In the figure, o1 and o2 are local outliers relative to C1, o3 is a
global outlier, but o4 is not an outlier. However, a purely
distance-based method cannot identify o1 and o2
as outliers (e.g., when compared with o4).
◼
Intuition (density-based outlier detection): The density around an outlier
object is significantly different from the density around its neighbors
◼
Method: Use the relative density of an object against its neighbors as
the indicator of the degree of the object being outliers
◼
k-distance of an object o, distk(o): distance between o and its k-th NN
◼
k-distance neighborhood of o, Nk(o) = {o’| o’ in D, dist(o, o’) ≤ distk(o)}
◼
Nk(o) could be bigger than k since multiple objects may have
identical distance to o
754
Local Outlier Factor: LOF
◼
Reachability distance from o’ to o: reachdistk(o ← o’) = max{distk(o), dist(o, o’)},
where k is a user-specified parameter
◼
Local reachability density of o:
lrdk(o) = ||Nk(o)|| / Σo’∈Nk(o) reachdistk(o’ ← o)
◼
LOF (Local outlier factor) of an object o is the average of the ratio of the
local reachability density of o’s k-nearest neighbors to that of o:
LOFk(o) = ( Σo’∈Nk(o) lrdk(o’) / lrdk(o) ) / ||Nk(o)||
◼
The lower the local reachability density of o, and the higher the local
reachability density of the kNN of o, the higher LOF
◼
This captures a local outlier whose local density is relatively low
comparing to the local densities of its kNN
755
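LOF is available in scikit-learn; the sketch below builds an illustrative data set with one dense cluster, one loose cluster, and a local outlier near the dense cluster, and flags it (the cluster locations and n_neighbors are assumptions, not from the slides):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
C1 = rng.normal(loc=[0, 0], scale=0.2, size=(100, 2))    # dense cluster
C2 = rng.normal(loc=[5, 5], scale=1.5, size=(100, 2))    # loose cluster
o = np.array([[0.8, 0.8]])                               # local outlier near the dense cluster
X = np.vstack([C1, C2, o])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # -1 for outliers, 1 for inliers
scores = -lof.negative_outlier_factor_   # larger score = more outlying
print(labels[-1], scores[-1])            # the last point receives a high LOF score
```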
Chapter 12. Outlier Analysis
◼
Outlier and Outlier Analysis
◼
Outlier Detection Methods
◼
Statistical Approaches
◼
Proximity-Based Approaches
◼
Clustering-Based Approaches
◼
Classification Approaches
◼
Mining Contextual and Collective Outliers
◼
Outlier Detection in High Dimensional Data
◼
Summary
756
Clustering-Based Outlier Detection (1 & 2):
Not belong to any cluster, or far from the closest one
◼
◼
◼
◼
An object is an outlier if (1) it does not belong to any cluster, (2) there
is a large distance between the object and its closest cluster, or (3) it
belongs to a small or sparse cluster
Case 1: Does not belong to any cluster
◼ Identify animals not part of a flock: use a density-based clustering method such as DBSCAN
Case 2: Far from its closest cluster
◼ Using k-means, partition the data points into clusters
◼ For each object o, assign an outlier score based on
its distance from its closest center co
◼ If dist(o, co)/avg_dist(co) is large, o is likely an outlier
Ex. Intrusion detection: Consider the similarity between
data points and the clusters in a training data set
◼
◼
Use a training set to find patterns of “normal” data, e.g., frequent
itemsets in each segment, and cluster similar connections into groups
Compare new data points with the clusters mined—Outliers are
possible attacks
757
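A sketch of Case 2 above using scikit-learn's KMeans: score each object by dist(o, co)/avg_dist(co); the number of clusters and the random seed are illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_outlier_scores(X, k=3):
    """Score each object by its distance to its closest cluster center,
    normalized by the average distance of that cluster's members to the center."""
    X = np.asarray(X, dtype=float)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    avg_dist = np.array([dists[km.labels_ == c].mean() for c in range(k)])
    return dists / avg_dist[km.labels_]   # large ratio -> likely outlier
```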
Clustering-Based Outlier Detection (3):
Detecting Outliers in Small Clusters
◼
FindCBLOF: Detect outliers in small clusters
◼
◼
◼
◼
◼
Find clusters, and sort them in decreasing size
To each data point, assign a cluster-based local
outlier factor (CBLOF):
If obj p belongs to a large cluster, CBLOF =
cluster_size X similarity between p and cluster
If p belongs to a small one, CBLOF = cluster size
X similarity betw. p and the closest large cluster
Ex. In the figure, o is an outlier since its closest large cluster is C1, but the
similarity between o and C1 is small. For any point in C3, its closest
large cluster is C2 but its similarity to C2 is low; in addition, |C3| = 3 is small
758
Clustering-Based Method: Strength and
Weakness
◼
◼
Strength
◼ Detect outliers without requiring any labeled data
◼
Work for many types of data
◼ Clusters can be regarded as summaries of the data
◼ Once the clusters are obtained, one need only compare any object
against the clusters to determine whether it is an outlier (fast)
Weakness
◼ Effectiveness depends highly on the clustering method used—they
may not be optimized for outlier detection
◼ High computational cost: Need to first find clusters
◼ A method to reduce the cost: Fixed-width clustering
◼ A point is assigned to a cluster if the center of the cluster is
within a pre-defined distance threshold from the point
◼ If a point cannot be assigned to any existing cluster, a new
cluster is created and the distance threshold may be learned
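A minimal sketch of the fixed-width clustering idea mentioned above (a single fixed distance threshold, with no learning of the threshold); the small clusters it produces are candidate outliers:

```python
import numpy as np

def fixed_width_clusters(X, width):
    """Assign each point to the first cluster whose center is within `width`;
    otherwise start a new cluster with the point as its center.
    Returns the cluster centers and the cluster sizes."""
    centers, members = [], []
    for x in np.asarray(X, dtype=float):
        for i, c in enumerate(centers):
            if np.linalg.norm(x - c) <= width:
                members[i].append(x)
                break
        else:                      # no existing center is close enough
            centers.append(x)
            members.append([x])
    return centers, [len(m) for m in members]
```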
Chapter 12. Outlier Analysis
◼
Outlier and Outlier Analysis
◼
Outlier Detection Methods
◼
Statistical Approaches
◼
Proximity-Based Approaches
◼
Clustering-Based Approaches
◼
Classification Approaches
◼
Mining Contextual and Collective Outliers
◼
Outlier Detection in High Dimensional Data
◼
Summary
760
Classification-Based Method I: One-Class
Model
◼
◼
◼
Idea: Train a classification model that can
distinguish “normal” data from outliers
A brute-force approach: Consider a training set
that contains samples labeled as “normal” and
others labeled as “outlier”
◼ But, the training set is typically heavily
biased: # of “normal” samples likely far
exceeds # of outlier samples
◼ Cannot detect unseen anomalies
One-class model: A classifier is built to describe only the normal class.
◼ Learn the decision boundary of the normal class using classification
methods such as SVM
◼ Any samples that do not belong to the normal class (not within the
decision boundary) are declared as outliers
◼ Adv: can detect new outliers that may not appear close to any outlier
objects in the training set
◼ Extension: Normal objects may belong to multiple classes
761
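A sketch of a one-class model using scikit-learn's OneClassSVM; the training data, the nu parameter, and the test points are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # assumed "normal" training data

# Learn the decision boundary of the normal class only
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_new = np.array([[0.1, -0.3], [6.0, 6.0]])               # one normal-looking point, one far away
print(clf.predict(X_new))    # +1 = inside the learned boundary (normal), -1 = outlier
```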
Classification-Based Method II: Semi-Supervised
Learning
◼
◼
◼
Semi-supervised learning: Combining classification-based
and clustering-based methods
Method
◼ Using a clustering-based approach, find a large
cluster, C, and a small cluster, C1
◼ Since some objects in C carry the label “normal”,
treat all objects in C as normal
◼ Use the one-class model of this cluster to identify
normal objects in outlier detection
◼ Since some objects in cluster C1 carry the label
“outlier”, declare all objects in C1 as outliers
◼ Any object that does not fall into the model for C
(such as a) is considered an outlier as well
Comments on classification-based outlier detection methods
◼ Strength: Outlier detection is fast
◼ Bottleneck: Quality heavily depends on the availability and quality of
the training set, but it is often difficult to obtain representative and high-quality training data
762
Chapter 12. Outlier Analysis
◼
Outlier and Outlier Analysis
◼
Outlier Detection Methods
◼
Statistical Approaches
◼
Proximity-Based Approaches
◼
Clustering-Based Approaches
◼
Classification Approaches
◼
Mining Contextual and Collective Outliers
◼
Outlier Detection in High Dimensional Data
◼
Summary
763
Mining Contextual Outliers I: Transform into
Conventional Outlier Detection
◼
◼
◼
◼
If the contexts can be clearly identified, transform the problem into conventional
outlier detection
1. Identify the context of the object using the contextual attributes
2. Calculate the outlier score for the object in the context using a
conventional outlier detection method
Ex. Detect outlier customers in the context of customer groups
◼ Contextual attributes: age group, postal code
◼ Behavioral attributes: # of trans/yr, annual total trans. amount
Steps: (1) locate c’s context, (2) compare c with the other customers
in the same group, and (3) use a conventional outlier detection
method
If the context contains very few customers, generalize contexts
◼ Ex. Learn a mixture model U on the contextual attributes, and
another mixture model V of the data on the behavior attributes
◼ Learn a mapping p(Vi|Uj): the probability that a data object o
belonging to cluster Uj on the contextual attributes is generated by
cluster Vi on the behavior attributes
◼ Outlier score:
764
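A sketch of the "transform into conventional detection" recipe above using pandas: group customers by their contextual attributes, then compute a within-group z-score of a behavioral attribute as a simple conventional detector. The column names are hypothetical:

```python
import pandas as pd

def contextual_outlier_scores(df, context_cols, behavior_col):
    """Group by the contextual attributes, then score each object by the
    z-score of its behavioral attribute within its own context group."""
    grouped = df.groupby(context_cols)[behavior_col]
    mean = grouped.transform("mean")
    std = grouped.transform("std").fillna(0).replace(0, 1e-9)
    return (df[behavior_col] - mean).abs() / std   # large score -> contextual outlier

# Example usage (hypothetical columns):
# scores = contextual_outlier_scores(customers,
#                                    ["age_group", "postal_code"],
#                                    "annual_total_amount")
```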
Mining Contextual Outliers II: Modeling
Normal Behavior with Respect to Contexts
◼
In some applications, one cannot clearly partition the data into
contexts
◼
◼
Model the “normal” behavior with respect to contexts
◼
◼
◼
◼
Ex. if a customer suddenly purchased a product that is unrelated to
those she recently browsed, it is unclear how many products
browsed earlier should be considered as the context
Using a training data set, train a model that predicts the expected
behavior attribute values with respect to the contextual attribute
values
An object is a contextual outlier if its behavior attribute values
significantly deviate from the values predicted by the model
Using a prediction model that links the contexts and behavior, these
methods avoid the explicit identification of specific contexts
Methods: A number of classification and prediction techniques can be
used to build such models, such as regression, Markov Models, and
Finite State Automaton
765
Mining Collective Outliers I: On
the Set of “Structured Objects”
◼
◼
◼
◼
◼
A group of objects is a collective outlier if the objects as a group deviate
significantly from the entire data set
Need to examine the structure of the data set, i.e., the
relationships between multiple data objects
Each of these structures is inherent to its respective type of data
◼ For temporal data (such as time series and sequences), we explore
the structures formed by time, which occur in segments of the time
series or subsequences
◼ For spatial data, explore local areas
◼ For graph and network data, we explore subgraphs
Difference from the contextual outlier detection: the structures are often
not explicitly defined, and have to be discovered as part of the outlier
detection process.
Collective outlier detection methods: two categories
◼ Reduce the problem to conventional outlier detection
◼ Identify structure units, treat each structure unit (e.g.,
subsequence, time series segment, local area, or subgraph) as
a data object, and extract features
◼ Then outlier detection on the set of “structured objects”
constructed as such using the extracted features
766
Mining Collective Outliers II: Direct
Modeling of the Expected Behavior of
Structure Units
◼
◼
◼
◼
Models the expected behavior of structure units directly
Ex. 1. Detect collective outliers in online social network of customers
◼ Treat each possible subgraph of the network as a structure unit
◼ Collective outlier: An outlier subgraph in the social network
◼ Small subgraphs that are of very low frequency
◼ Large subgraphs that are surprisingly frequent
Ex. 2. Detect collective outliers in temporal sequences
◼ Learn a Markov model from the sequences
◼ A subsequence can then be declared as a collective outlier if it
significantly deviates from the model
Collective outlier detection is subtle due to the challenge of exploring
the structures in data
◼ The exploration typically uses heuristics, and thus may be
application dependent
◼ The computational cost is often high due to the sophisticated
mining process
767
Chapter 12. Outlier Analysis
◼
Outlier and Outlier Analysis
◼
Outlier Detection Methods
◼
Statistical Approaches
◼
Proximity-Based Approaches
◼
Clustering-Based Approaches
◼
Classification Approaches
◼
Mining Contextual and Collective Outliers
◼
Outlier Detection in High Dimensional Data
◼
Summary
768
Challenges for Outlier Detection in High-Dimensional Data
◼
◼
◼
◼
Interpretation of outliers
◼ Detecting outliers without saying why they are outliers is not very
useful in high-D settings, because many features (or dimensions) are involved
in a high-dimensional data set
◼ E.g., which subspaces manifest the outliers, or an assessment
of the “outlier-ness” of the objects
Data sparsity
◼ Data in high-D spaces are often sparse
◼ The distance between objects becomes heavily dominated by
noise as the dimensionality increases
Data subspaces
◼ Adaptive to the subspaces signifying the outliers
◼ Capturing the local behavior of data
Scalable with respect to dimensionality
◼ # of subspaces increases exponentially
769
Approach I: Extending Conventional Outlier
Detection
◼
Method 1: Detect outliers in the full space, e.g., HilOut Algorithm
◼ Find distance-based outliers, but use the ranks of distance instead of
the absolute distance in outlier detection
◼ For each object o, find its k-nearest neighbors: nn1(o), . . . , nnk(o)
◼ The weight of object o: w(o) = Σi=1..k dist(o, nni(o)), the sum of its distances to its k nearest neighbors
All objects are ranked in weight-descending order
◼ Top-l objects in weight are output as outliers (l: a user-specified parameter)
◼ Employ space-filling curves for approximation: scalable in both time
and space w.r.t. data size and dimensionality
Method 2: Dimensionality reduction
◼ Works only when, in the lower-dimensional space, normal instances can still
be distinguished from outliers
◼ PCA: Heuristically, the principal components with low variance are
preferred because, on such dimensions, normal objects are likely
close to each other and outliers often deviate from the majority
◼
◼
770
Approach II: Finding Outliers in
Subspaces
◼
◼
◼
Extending conventional outlier detection: Hard for outlier interpretation
Find outliers in much lower dimensional subspaces: easy to interpret
why and to what extent the object is an outlier
◼ E.g., find outlier customers in certain subspace: average transaction
amount >> avg. and purchase frequency << avg.
Ex. A grid-based subspace outlier detection method
◼ Project data onto various subspaces to find an area whose density is
much lower than average
◼ Discretize the data into a grid with φ equi-depth regions per dimension (equi-depth so that each region holds the same fraction f = 1/φ of the data)
◼ Search for regions that are significantly sparse
◼ Consider a k-d cube C: k ranges on k dimensions, with n objects in total
◼ If objects are independently distributed, the expected number of
objects falling into a k-dimensional region is (1/φ)^k · n = f^k · n, and the
standard deviation is √( f^k · n · (1 – f^k) )
◼ The sparsity coefficient of cube C: S(C) = (n(C) – f^k · n) / √( f^k · n · (1 – f^k) ),
where n(C) is the number of objects in C
◼ If S(C) < 0, C contains fewer objects than expected
◼ The more negative, the sparser C is and the more likely the
objects in C are outliers in the subspace
771
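A sketch of the sparsity coefficient under the assumptions stated above (equi-depth regions, each covering a fraction f = 1/φ of the data per dimension, with objects independently distributed); the example numbers are illustrative:

```python
import numpy as np

def sparsity_coefficient(n_cell, n, k, phi):
    """S(C) = (n(C) - f^k * n) / sqrt(f^k * n * (1 - f^k)) with f = 1/phi.
    Strongly negative values mean the cell is much sparser than expected."""
    f = 1.0 / phi
    expected = (f ** k) * n
    return (n_cell - expected) / np.sqrt(expected * (1 - f ** k))

# e.g. a 3-dimensional cell holding 2 of 10,000 objects with phi = 10:
print(sparsity_coefficient(2, 10_000, 3, 10))   # strongly negative -> candidate outlier region
```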
Approach III: Modeling High-Dimensional
Outliers
◼
◼
◼
◼
◼
◼
Develop new models for high-dimensional outliers directly
Avoid proximity measures and adopt
new heuristics that do not deteriorate
in high-dimensional data
(Figure: a set of points forms a cluster, except c, which is an outlier)
Ex. Angle-based outliers: Kriegel, Schubert, and Zimek [KSZ08]
For each point o, examine the angle ∆xoy for every pair of points x, y.
◼ For a point in the center (e.g., a), the angles formed differ widely
◼ For an outlier (e.g., c), the angle variance is substantially smaller
Use the variance of angles for a point to determine whether it is an outlier
Combine angles and distance to model outliers
◼ Use the distance-weighted angle variance as the outlier score
◼ Angle-based outlier factor (ABOF):
◼
◼
An efficient approximate computation method has been developed
It can be generalized to handle arbitrary types of data
772
Chapter 12. Outlier Analysis
◼
Outlier and Outlier Analysis
◼
Outlier Detection Methods
◼
Statistical Approaches
◼
Proximity-Based Approaches
◼
Clustering-Based Approaches
◼
Classification Approaches
◼
Mining Contextual and Collective Outliers
◼
Outlier Detection in High Dimensional Data
◼
Summary
773
Summary
◼
Types of outliers
◼
◼
global, contextual & collective outliers
Outlier detection
◼
supervised, semi-supervised, or unsupervised
◼
Statistical (or model-based) approaches
◼
Proximity-based approaches
◼
Clustering-based approaches
◼
Classification approaches
◼
Mining contextual and collective outliers
◼
Outlier detection in high dimensional data
774
References (I)
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
B. Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time series. Biometrika, 66:229–248,
1979.
M. Agyemang, K. Barker, and R. Alhajj. A comprehensive survey of numeric and symbolic outlier mining
techniques. Intell. Data Anal., 10:521–538, 2006.
F. J. Anscombe and I. Guttman. Rejection of outliers. Technometrics, 2:123–147, 1960.
D. Agarwal. Detecting anomalies in cross-classified streams: a bayesian approach. Knowl. Inf. Syst., 11:29–44,
2006.
F. Angiulli and C. Pizzuti. Outlier mining in large high-dimensional data sets. TKDE, 2005.
C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. SIGMOD’01
R.J. Beckman and R.D. Cook. Outlier...s. Technometrics, 25:119–149, 1983.
I. Ben-Gal. Outlier detection. In Maimon O. and Rockach L. (eds.) Data Mining and Knowledge Discovery
Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic, 2005.
M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD’00
D. Barbar´a, Y. Li, J. Couto, J.-L. Lin, and S. Jajodia. Bootstrapping a data mining intrusion detection system.
SAC’03
Z. A. Bakar, R. Mohemad, A. Ahmad, and M. M. Deris. A comparative study for outlier detection techniques in
data mining. IEEE Conf. on Cybernetics and Intelligent Systems, 2006.
S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a
simple pruning rule. KDD’03
D. Barbara, N. Wu, and S. Jajodia. Detecting novel network intrusion using bayesian estimators. SDM’01
V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41:1–58,
2009.
D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data using negative selection
algorithm. In CEC’02
References (2)
◼
E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly
detection: Detecting intrusions in unlabeled data. In Proc. 2002 Int. Conf. of Data Mining for Security
Applications, 2002.
◼
E. Eskin. Anomaly detection over noisy data using learned probability distributions. ICML’00
◼
T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1:291–316, 1997.
◼
V. J. Hodge and J. Austin. A survey of outlier detection methdologies. Artif. Intell. Rev., 22:85–126, 2004.
◼
D. M. Hawkins. Identification of Outliers. Chapman and Hall, London, 1980.
◼
Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recogn. Lett., 24, June, 2003.
◼
W. Jin, K. H. Tung, and J. Han. Mining top-n local outliers in large databases. KDD’01
◼
W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood relationship.
PAKDD’06
◼
E. Knorr and R. Ng. A unified notion of outliers: Properties and computation. KDD’97
◼
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98
◼
E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. VLDB J., 8:237–
253, 2000.
◼
H.-P. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-dimensional data. KDD’08
◼
M. Markou and S. Singh. Novelty detection: A review—part 1: Statistical approaches. Signal Process., 83:2481–
2497, 2003.
◼
M. Markou and S. Singh. Novelty detection: A review—part 2: Neural network based approaches. Signal
Process., 83:2499–2521, 2003.
◼
C. C. Noble and D. J. Cook. Graph-based anomaly detection. KDD’03
References (3)
◼
S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using the local
correlation integral. ICDE’03
◼
A. Patcha and J.-M. Park. An overview of anomaly detection techniques: Existing solutions and latest
technological trends. Comput. Netw., 51, 2007.
◼
X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data
Eng., 19, 2007.
◼
Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric space. KDD’06
◼
N. Ye and Q. Chen. An anomaly detection technique based on a chi-square statistic for detecting intrusions
into information systems. Quality and Reliability Engineering International, 17:105–112, 2001.
◼
B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for coevolving time sequences. ICDE’00
Un-Used Slides
778
Statistical
Approaches
Assume a model of the underlying distribution that generates the data
set (e.g., normal distribution)
◼ Use discordancy tests depending on
◼ data distribution
◼ distribution parameter (e.g., mean, variance)
◼ number of expected outliers
◼ Drawbacks
◼ most tests are for single attribute
◼ In many cases, data distribution may not be known
779
Outlier Discovery: Distance-Based
Approach
◼
◼
◼
Introduced to counter the main limitations imposed by
statistical methods
◼ We need multi-dimensional analysis without knowing
data distribution
Distance-based outlier: A DB(p, D)-outlier is an object O in
a dataset T such that at least a fraction p of the objects in
T lies at a distance greater than D from O
Algorithms for mining distance-based outliers [Knorr & Ng,
VLDB’98]
◼ Index-based algorithm
◼ Nested-loop algorithm
◼ Cell-based algorithm
780
Density-Based Local
Outlier Detection
◼
M. M. Breunig, H.-P. Kriegel, R. Ng, J.
Sander. LOF: Identifying Density-Based
Local Outliers. SIGMOD 2000.
◼
Distance-based outlier detection is based
on global distance distribution
◼
It encounters difficulties in identifying
outliers if the data are not uniformly distributed
◼
◼
Ex. C1 contains 400 loosely distributed
points, C2 has 100 tightly condensed
points, and there are 2 outlier points o1, o2
◼
A distance-based method cannot identify o2
as an outlier
◼
Need the concept of a local outlier
◼
Local outlier factor (LOF)
◼ Assume outlier-ness is not crisp
◼ Each point has a LOF
781
Outlier Discovery: Deviation-Based
Approach
◼
◼
◼
Identifies outliers by examining the main characteristics
of objects in a group
Objects that “deviate” from this description are
considered outliers
Sequential exception technique
◼
◼
simulates the way in which humans can distinguish
unusual objects from among a series of supposedly
like objects
OLAP data cube technique
◼
uses data cubes to identify regions of anomalies in
large multidimensional data
782
References (1)
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
B. Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time series. Biometrika,
1979.
Malik Agyemang, Ken Barker, and Rada Alhajj. A comprehensive survey of numeric and symbolic
outlier mining techniques. Intell. Data Anal., 2006.
Deepak Agarwal. Detecting anomalies in cross-classified streams: a bayesian approach. Knowl. Inf.
Syst., 2006.
C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. SIGMOD'01.
M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. Optics-of: Identifying local outliers. PKDD '99
M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers.
SIGMOD'00.
V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 2009.
D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data using negative
selection algorithm. Computational Intelligence, 2002.
E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised
anomaly detection: Detecting intrusions in unlabeled data. In Proc. 2002 Int. Conf. of Data Mining
for Security Applications, 2002.
E. Eskin. Anomaly detection over noisy data using learned probability distributions. ICML’00.
T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1997.
R. Fujimaki, T. Yairi, and K. Machida. An approach to spacecraft anomaly detection problem using
kernel feature space. KDD '05
F. E. Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 1969.
783
References (2)
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
◼
V. Hodge and J. Austin. A survey of outlier detection methodologies. Artif. Intell. Rev., 2004.
Douglas M Hawkins. Identification of Outliers. Chapman and Hall, 1980.
P. S. Horn, L. Feng, Y. Li, and A. J. Pesce. Effect of Outliers and Nonhealthy Individuals on
Reference Interval Estimation. Clin Chem, 2001.
W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood
relationship. PAKDD'06
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98
M. Markou and S. Singh. Novelty detection: A review—part 1: Statistical approaches. Signal
Process., 83(12), 2003.
M. Markou and S. Singh. Novelty detection: A review—part 2: Neural network based approaches.
Signal Process., 83(12), 2003.
S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using
the local correlation integral. ICDE'03.
A. Patcha and J.-M. Park. An overview of anomaly detection techniques: Existing solutions and
latest technological trends. Comput. Netw., 51(12):3448–3470, 2007.
W. Stefansky. Rejecting outliers in factorial designs. Technometrics, 14(2):469–479, 1972.
X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl.
and Data Eng., 19(5):631–645, 2007.
Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric
space. KDD '06:
N. Ye and Q. Chen. An anomaly detection technique based on a chi-square statistic for detecting
intrusions into information systems. Quality and Reliability Engineering International, 2001.
784
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 13 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 13: Data Mining Trends and
Research Frontiers
◼
Mining Complex Types of Data
◼
Other Methodologies of Data Mining
◼
Data Mining Applications
◼
Data Mining and Society
◼
Data Mining Trends
◼
Summary
787
Mining Complex Types of Data
◼
Mining Sequence Data
◼
Mining Time Series
◼
Mining Symbolic Sequences
◼
Mining Biological Sequences
◼
Mining Graphs and Networks
◼
Mining Other Kinds of Data
788
Mining Sequence Data
◼
Similarity Search in Time Series Data
◼
Subsequence match, dimensionality reduction, query-based similarity
search, motif-based similarity search
◼
Regression and Trend Analysis in Time-Series Data
◼
long term + cyclic + seasonal variation + random movements
◼
Sequential Pattern Mining in Symbolic Sequences
◼
GSP, PrefixSpan, constraint-based sequential pattern mining
◼
Sequence Classification
◼
Feature-based vs. sequence-distance-based vs. model-based
◼
Alignment of Biological Sequences
◼
Pair-wise vs. multi-sequence alignment, substitution matrices, BLAST
◼
Hidden Markov Model for Biological Sequence Analysis
◼
Markov chain vs. hidden Markov models, forward vs. Viterbi vs. Baum-Welch algorithms
789
Mining Graphs and Networks
◼
Graph Pattern Mining
◼
Frequent subgraph patterns, closed graph patterns, gSpan vs. CloseGraph
◼
Statistical Modeling of Networks
◼
Small world phenomenon, power law (long-tail) distribution, densification
◼
Clustering and Classification of Graphs and Homogeneous Networks
◼
Clustering: Fast Modularity vs. SCAN
◼
Classification: model vs. pattern-based mining
◼
Clustering, Ranking and Classification of Heterogeneous Networks
◼
RankClus, RankClass, and meta path-based, user-guided methodology
◼
Role Discovery and Link Prediction in Information Networks
◼
PathPredict
◼
Similarity Search and OLAP in Information Networks: PathSim, GraphCube
◼
Evolution of Social and Information Networks: EvoNetClus
790
Mining Other Kinds of Data
◼
Mining Spatial Data
◼
Spatial frequent/co-located patterns, spatial clustering and classification
◼
Mining Spatiotemporal and Moving Object Data
◼
Spatiotemporal data mining, trajectory mining, periodica, swarm, …
◼
Mining Cyber-Physical System Data
◼
Applications: healthcare, air-traffic control, flood simulation
◼
Mining Multimedia Data
◼
Social media data, geo-tagged spatial clustering, periodicity analysis, …
◼
Mining Text Data
◼
Topic modeling, i-topic model, integration with geo- and networked data
◼
Mining Web Data
◼
Web content, web structure, and web usage mining
◼
Mining Data Streams
◼
Dynamics, one-pass, patterns, clustering, classification, outlier detection
791
Chapter 13: Data Mining Trends and
Research Frontiers
◼
Mining Complex Types of Data
◼
Other Methodologies of Data Mining
◼
Data Mining Applications
◼
Data Mining and Society
◼
Data Mining Trends
◼
Summary
792
Other Methodologies of Data Mining
◼
Statistical Data Mining
◼
Views on Data Mining Foundations
◼
Visual and Audio Data Mining
793
Major Statistical Data Mining Methods
◼
Regression
◼
Generalized Linear Model
◼
Analysis of Variance
◼
Mixed-Effect Models
◼
Factor Analysis
◼
Discriminant Analysis
◼
Survival Analysis
794
Statistical Data Mining (1)
◼
◼
There are many well-established statistical techniques for data
analysis, particularly for numeric data
◼ applied extensively to data from scientific experiments and data
from economics and the social sciences
Regression
predict the value of a response
(dependent) variable from one or
more predictor (independent) variables
where the variables are numeric
◼
forms of regression: linear, multiple,
weighted, polynomial, nonparametric,
and robust
◼
795
Scientific and Statistical Data Mining (2)
◼
◼
Generalized linear models
◼ allow a categorical response variable (or
some transformation of it) to be related
to a set of predictor variables
◼ similar to the modeling of a numeric
response variable using linear regression
◼ include logistic regression and Poisson
regression
Mixed-effect models
For analyzing grouped data, i.e. data that can be classified
according to one or more grouping variables
◼
Typically describe relationships between a response variable and
some covariates in data grouped according to one or more factors
◼
796
Scientific and Statistical Data Mining (3)
◼
Regression trees
◼ Binary trees used for classification
and prediction
◼
◼
◼
Similar to decision trees: tests are
performed at the internal nodes
In a regression tree the mean of the
objective attribute is computed and
used as the predicted value
Analysis of variance
◼ Analyze experimental data for two or
more populations described by a
numeric response variable and one or
more categorical variables (factors)
797
Statistical Data Mining (4)
◼
◼
Factor analysis
◼ determine which variables are
combined to generate a given factor
◼ e.g., for many psychiatric data, one
can indirectly measure other
quantities (such as test scores) that
reflect the factor of interest
Discriminant analysis
◼ predict a categorical response
variable, commonly used in social
science
◼ Attempts to determine several
discriminant functions (linear
combinations of the independent
variables) that discriminate among
the groups defined by the response
variable
www.spss.com/datamine/factor.htm
798
Statistical Data Mining (5)
◼
Time series: many methods such as autoregression,
ARIMA (Autoregressive integrated moving-average
modeling), long memory time-series modeling
◼
Quality control: displays group summary charts
◼
Survival analysis
❑
Predicts the probability
that a patient
undergoing a medical
treatment would
survive at least to time
t (life span prediction)
799
Other Methodologies of Data Mining
◼
Statistical Data Mining
◼
Views on Data Mining Foundations
◼
Visual and Audio Data Mining
800
Views on Data Mining Foundations (I)
◼
◼
Data reduction
◼
Basis of data mining: Reduce data representation
◼
Trades accuracy for speed in response
Data compression
◼
◼
Basis of data mining: Compress the given data by
encoding in terms of bits, association rules, decision
trees, clusters, etc.
Probability and statistical theory
◼
Basis of data mining: Discover joint probability
distributions of random variables
801
Views on Data Mining Foundations (II)
◼
Microeconomic view
◼
◼
A view of utility: Finding patterns that are interesting only to the
extent that they can be used in the decision-making process of
some enterprise
Pattern Discovery and Inductive databases
◼
◼
◼
◼
Basis of data mining: Discover patterns occurring in the database,
such as associations, classification models, sequential patterns, etc.
Data mining is the problem of performing inductive logic on
databases
The task is to query the data and the theory (i.e., patterns) of the
database
Popular among many researchers in database systems
802
Other Methodologies of Data Mining
◼
Statistical Data Mining
◼
Views on Data Mining Foundations
◼
Visual and Audio Data Mining
803
Visual Data Mining
◼
◼
Visualization: Use of computer graphics to create visual
images which aid in the understanding of complex, often
massive representations of data
Visual Data Mining: discovering implicit but useful
knowledge from large data sets using visualization
techniques
(Figure: visual data mining lies at the confluence of computer graphics, multimedia systems,
human-computer interfaces, pattern recognition, and high-performance computing)
804
Visualization
◼
Purpose of Visualization
◼
◼
◼
◼
◼
Gain insight into an information space by mapping data
onto graphical primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure, irregularities,
relationships among data.
Help find interesting regions and suitable parameters
for further quantitative analysis.
Provide a visual proof of computer representations
derived
805
Visual Data Mining & Data Visualization
◼
◼
Integration of visualization and data mining
◼ data visualization
◼ data mining result visualization
◼ data mining process visualization
◼ interactive visual data mining
Data visualization
◼ Data in a database or data warehouse can be viewed
◼ at different levels of abstraction
◼ as different combinations of attributes or
dimensions
◼ Data can be presented in various visual forms
806
Data Mining Result Visualization
◼
◼
Presentation of the results or knowledge obtained from
data mining in visual forms
Examples
◼
Scatter plots and boxplots (obtained from descriptive
data mining)
◼
Decision trees
◼
Association rules
◼
Clusters
◼
Outliers
◼
Generalized rules
807
Boxplots from Statsoft: Multiple
Variable Combinations
808
Visualization of Data Mining Results in
SAS Enterprise Miner: Scatter Plots
809
Visualization of Association Rules in
SGI/MineSet 3.0
810
Visualization of a Decision Tree in
SGI/MineSet 3.0
811
Visualization of Cluster Grouping in IBM
Intelligent Miner
812
Data Mining Process Visualization
◼
Presentation of the various processes of data mining in
visual forms so that users can see
◼
Data extraction process
◼
Where the data is extracted
◼
How the data is cleaned, integrated, preprocessed,
and mined
◼
Method selected for data mining
◼
Where the results are stored
◼
How they may be viewed
813
Visualization of Data Mining Processes
by Clementine
(Screenshot captions: “See your solution discovery process clearly”;
“Understand variations with visualized data”)
814
Interactive Visual Data Mining
◼
◼
Using visualization tools in the data mining process to
help users make smart data mining decisions
Example
◼
◼
Display the data distribution in a set of attributes
using colored sectors or columns (depending on
whether the whole space is represented by either a
circle or a set of columns)
Use the display to decide which sector should first be
selected for classification and where a good split point
for this sector may be
815
Interactive Visual Mining by
Perception-Based Classification (PBC)
816
Audio Data Mining
◼
◼
◼
◼
◼
Uses audio signals to indicate the patterns of data or
the features of data mining results
An interesting alternative to visual mining
This is the inverse of the task of mining audio (such as music)
databases, which is to find patterns from the audio data itself
Visual data mining may disclose interesting patterns
using graphical displays, but requires users to
concentrate on watching patterns
Instead, transform patterns into sound and music and
listen to pitches, rhythms, tune, and melody in order to
identify anything interesting or unusual
817
Chapter 13: Data Mining Trends and
Research Frontiers
◼
Mining Complex Types of Data
◼
Other Methodologies of Data Mining
◼
Data Mining Applications
◼
Data Mining and Society
◼
Data Mining Trends
◼
Summary
818
Data Mining Applications
◼
◼
Data mining: A young discipline with broad and diverse
applications
◼ There still exists a nontrivial gap between generic data
mining methods and effective and scalable data mining
tools for domain-specific applications
Some application domains (briefly discussed here)
◼ Data Mining for Financial data analysis
◼ Data Mining for Retail and Telecommunication
Industries
◼ Data Mining in Science and Engineering
◼ Data Mining for Intrusion Detection and Prevention
◼ Data Mining and Recommender Systems
819
Data Mining for Financial Data Analysis (I)
◼
◼
◼
Financial data collected in banks and financial institutions
are often relatively complete, reliable, and of high quality
Design and construction of data warehouses for
multidimensional data analysis and data mining
◼ View the debt and revenue changes by month, by
region, by sector, and by other factors
◼ Access statistical information such as max, min, total,
average, trend, etc.
Loan payment prediction/consumer credit policy analysis
◼ feature selection and attribute relevance ranking
◼ Loan payment performance
◼ Consumer credit rating
820
Data Mining for Financial Data Analysis (II)
◼
◼
Classification and clustering of customers for targeted
marketing
◼ multidimensional segmentation by nearest-neighbor,
classification, decision trees, etc. to identify customer
groups or associate a new customer to an appropriate
customer group
Detection of money laundering and other financial crimes
◼ integration of data from multiple DBs (e.g., bank
transactions, federal/state crime history DBs)
◼ Tools: data visualization, linkage analysis,
classification, clustering tools, outlier analysis, and
sequential pattern analysis tools (find unusual access
sequences)
821
Data Mining for Retail & Telcomm. Industries (I)
◼
◼
Retail industry: huge amounts of data on sales, customer
shopping history, e-commerce, etc.
Applications of retail data mining
◼
Identify customer buying behaviors
◼
Discover customer shopping patterns and trends
◼
Improve the quality of customer service
◼
Achieve better customer retention and satisfaction
◼
Enhance goods consumption ratios
◼
◼
Design more effective goods transportation and
distribution policies
Telcomm. and many other industries: Share many similar
goals and expectations of retail data mining
822
Data Mining Practice for Retail Industry
◼
◼
Design and construction of data warehouses
Multidimensional analysis of sales, customers, products, time, and
region
◼
Analysis of the effectiveness of sales campaigns
◼
Customer retention: Analysis of customer loyalty
◼
◼
◼
Use customer loyalty card information to register sequences of
purchases of particular customers
Use sequential pattern mining to investigate changes in customer
consumption or loyalty
Suggest adjustments on the pricing and variety of goods
◼
Product recommendation and cross-reference of items
◼
Fraud analysis and the identification of unusual patterns
◼
Use of visualization tools in data analysis
823
Data Mining in Science and Engineering
◼
Data warehouses and data preprocessing
◼
◼
Mining complex data types
◼
◼
Resolving inconsistencies or incompatible data collected in diverse
environments and different periods (e.g. eco-system studies)
Spatiotemporal, biological, diverse semantics and relationships
Graph-based and network-based mining
◼
Links, relationships, data flow, etc.
◼
Visualization tools and domain-specific knowledge
◼
Other issues
◼
◼
Data mining in social sciences and social studies: text and social
media
Data mining in computer science: monitoring systems, software
bugs, network intrusion
824
Data Mining for Intrusion Detection and
Prevention
◼
Majority of intrusion detection and prevention systems use
◼
◼
◼
Signature-based detection: use signatures, attack patterns that are
preconfigured and predetermined by domain experts
Anomaly-based detection: build profiles (models of normal
behavior) and detect those that deviate substantially from the
profiles
What data mining can help
◼
◼
◼
New data mining algorithms for intrusion detection
Association, correlation, and discriminative pattern analysis help
select and build discriminative classifiers
Analysis of stream data: outlier detection, clustering, model
shifting
◼
Distributed data mining
◼
Visualization and querying tools
825
Data Mining and Recommender Systems
◼
◼
◼
Recommender systems: Personalization, making product
recommendations that are likely to be of interest to a user
Approaches: Content-based, collaborative, or their hybrid
◼ Content-based: Recommends items that are similar to items the
user preferred or queried in the past
◼ Collaborative filtering: Consider a user's social environment,
opinions of other customers who have similar tastes or preferences
Data mining and recommender systems
◼ Users C × items S: extrapolate from known ratings to unknown ratings to
predict user-item combinations
◼ Memory-based method often uses k-nearest neighbor approach
◼ Model-based method uses a collection of ratings to learn a model
(e.g., probabilistic models, clustering, Bayesian networks, etc.)
◼ Hybrid approaches integrate both to improve performance (e.g.,
using ensemble)
826
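A minimal sketch of a memory-based (k-nearest-neighbor) collaborative filtering predictor over a user-item rating matrix; the cosine similarity and the variable names are illustrative choices, not a specific system's API:

```python
import numpy as np

def knn_user_cf(ratings, user, item, k=3):
    """Predict ratings[user, item] as a similarity-weighted average over the
    k most similar users who rated the item (0 is treated as 'unrated')."""
    R = np.asarray(ratings, dtype=float)
    norms = np.linalg.norm(R, axis=1) + 1e-9
    sims = R @ R[user] / (norms * norms[user])     # cosine similarity to every user
    candidates = np.where(R[:, item] > 0)[0]       # users who rated the item
    candidates = candidates[candidates != user]
    neighbors = candidates[np.argsort(sims[candidates])[-k:]]
    w = sims[neighbors]
    return float(np.dot(w, R[neighbors, item]) / (np.abs(w).sum() + 1e-9))
```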
Chapter 13: Data Mining Trends and
Research Frontiers
◼
Mining Complex Types of Data
◼
Other Methodologies of Data Mining
◼
Data Mining Applications
◼
Data Mining and Society
◼
Data Mining Trends
◼
Summary
827
Ubiquitous and Invisible Data Mining
◼
◼
Ubiquitous Data Mining
◼
Data mining is used everywhere, e.g., online shopping
◼
Ex. Customer relationship management (CRM)
Invisible Data Mining
◼
◼
◼
◼
◼
Invisible: Data mining functions are built in daily life operations
Ex. Google search: Users may be unaware that they are
examining results returned by data mining
Invisible data mining is highly desirable
Invisible mining needs to consider efficiency and scalability, user
interaction, incorporation of background knowledge and
visualization techniques, finding interesting patterns, real-time, …
Further work: Integration of data mining into existing business
and scientific technologies to provide domain-specific data mining
tools
828
Privacy, Security and Social Impacts of
Data Mining
◼
Many data mining applications do not touch personal data
◼
◼
◼
E.g., meteorology, astronomy, geography, geology, biology, and
other scientific and engineering data
Many DM studies are on developing scalable algorithms to find general
or statistically significant patterns, not touching individuals
The real privacy concern: unconstrained access of individual records,
especially privacy-sensitive information
◼
Method 1: Removing sensitive IDs associated with the data
◼
Method 2: Data security-enhancing methods
◼
◼
◼
Multi-level security model: permit to access to only authorized level
Encryption: e.g., blind signatures, biometric encryption, and
anonymous databases (personal information is encrypted and stored
at different locations)
Method 3: Privacy-preserving data mining methods
829
Privacy-Preserving Data Mining
◼
◼
Privacy-preserving (privacy-enhanced or privacy-sensitive) mining:
◼ Obtaining valid mining results without disclosing the underlying
sensitive data values
◼ Often needs trade-off between information loss and privacy
Privacy-preserving data mining methods:
◼ Randomization (e.g., perturbation): Add noise to the data in order
to mask some attribute values of records
◼ K-anonymity and l-diversity: Alter individual records so that they
cannot be uniquely identified
◼
◼
◼
◼
k-anonymity: Any given record maps onto at least k other records
l-diversity: enforcing intra-group diversity of sensitive values
Distributed privacy preservation: Data partitioned and distributed
either horizontally, vertically, or a combination of both
Downgrading the effectiveness of data mining: The output of data
mining may violate privacy
◼
Modify data or mining results, e.g., hiding some association rules or slightly
distorting some classification models
830
Chapter 13: Data Mining Trends and
Research Frontiers
◼
Mining Complex Types of Data
◼
Other Methodologies of Data Mining
◼
Data Mining Applications
◼
Data Mining and Society
◼
Data Mining Trends
◼
Summary
831
Trends of Data Mining
◼
Application exploration: Dealing with application-specific problems
◼
Scalable and interactive data mining methods
◼
Integration of data mining with Web search engines, database
systems, data warehouse systems and cloud computing systems
◼
Mining social and information networks
◼
Mining spatiotemporal, moving objects and cyber-physical systems
◼
Mining multimedia, text and web data
◼
Mining biological and biomedical data
◼
Data mining with software engineering and system engineering
◼
Visual and audio data mining
◼
Distributed data mining and real-time data stream mining
◼
Privacy protection and information security in data mining
832
Chapter 13: Data Mining Trends and
Research Frontiers
◼
Mining Complex Types of Data
◼
Other Methodologies of Data Mining
◼
Data Mining Applications
◼
Data Mining and Society
◼
Data Mining Trends
◼
Summary
833
Summary
◼
◼
We present a high-level overview of mining complex data types
Statistical data mining methods, such as regression, generalized linear
models, analysis of variance, etc., are popularly adopted
◼
Researchers also try to build theoretical foundations for data mining
◼
Visual/audio data mining has been popular and effective
◼
◼
◼
◼
Application-based mining integrates domain-specific knowledge with
data analysis techniques and provides mission-specific solutions
Ubiquitous data mining and invisible data mining are penetrating our
daily lives
Privacy and data security are important issues in data mining, and
privacy-preserving data mining has been developed recently
Our discussion on trends in data mining shows that data mining is a
promising, young field, with great, strategic importance
834
References and Further Reading
❖
The book lists many references for further reading; here we only list a few books
◼
E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011
◼
◼
◼
◼
◼
◼
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan
Kaufmann, 2002
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed., Wiley-Interscience, 2000
D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning about a Highly Connected
World. Cambridge University Press, 2010.
U. Fayyad, G. Grinstein, and A. Wierse (eds.), Information Visualization in Data Mining and Knowledge
Discovery, Morgan Kaufmann, 2001
J. Han, M. Kamber, J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed. 2011
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd ed., Springer-Verlag, 2009
◼
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
◼
B. Liu. Web Data Mining, Springer 2006.
◼
T. M. Mitchell. Machine Learning, McGraw Hill, 1997
◼
M. Newman. Networks: An Introduction. Oxford University Press, 2010.
◼
P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
◼
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd ed. 2005
835