2. Data Mining - Department of Computer Science and Engineering

UNIT:1
1. What Motivated Data Mining? Why Is It Important?
Data mining has attracted a great deal of attention in the information industry and in society as a
whole in recent years, due to the wide availability of huge amounts of data and the imminent
need for turning such data into useful information and knowledge. The information and
knowledge gained can be used for applications ranging from market analysis, fraud detection,
and customer retention, to production control and science exploration. Data mining can be
viewed as a result of the natural evolution of information technology.
Evolution of Database Technology
1960s and earlier:
Data Collection and Database Creation
 Primitive file processing
1970s - early 1980s:
Data Base Management Systems
 Hierarchical and network database systems
 Relational database Systems
 Query languages: SQL
 Transactions, concurrency control and recovery.
 On-line transaction processing (OLTP)
Mid -1980s - present:
 Advanced data models
Extended relational, object-relational
 Advanced application-oriented DBMS
spatial, scientific, engineering, temporal, multimedia, active, stream and
sensor, knowledge-based
Late 1980s - present:
 Advanced Data Analysis
Data warehouse and OLAP
Data mining and knowledge discovery
Advanced data mining applications
Data mining and society
1990s-present:
 XML-based database systems
 Integration with information retrieval
 Data and information integration
Present – future:
 New generation of integrated data and information systems.
2. Data Mining:
Data mining refers to extracting or “mining” knowledge from large amounts of data. The term is
actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold
mining rather than rock or sand mining. Thus, data mining should have been more appropriately
named “knowledge mining from data,” which is unfortunately somewhat long. “Knowledge
mining,” a shorter term, may not reflect the emphasis on mining from large amounts of data.
Many people treat data mining as a synonym for another popularly used term, Knowledge
Discovery from Data, or KDD. Alternatively, others view data mining as simply an essential step
in the process of knowledge discovery. Knowledge discovery as a process consists of an
iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on
some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present the mined knowledge to the user)
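For illustration, the seven steps above can be sketched as a small Python/pandas pipeline. This is only a minimal sketch: the column names (customer_id, item, amount), the frequency count standing in for a real mining algorithm, and the interestingness threshold are all hypothetical placeholders.

import pandas as pd

def kdd_pipeline(frames):
    # 1-2. Data cleaning and integration: combine sources, remove duplicates and noise
    data = pd.concat(frames, ignore_index=True).drop_duplicates()
    data = data.dropna(subset=["customer_id", "item"])        # cleaning (illustrative)
    # 3. Data selection: keep only the attributes relevant to the analysis task
    data = data[["customer_id", "item", "amount"]]
    # 4. Data transformation: summarize/aggregate into a mining-ready form
    summary = data.groupby("item")["amount"].agg(["count", "sum"])
    # 5. Data mining: rank items by frequency (placeholder for a real mining method)
    patterns = summary.sort_values("count", ascending=False)
    # 6. Pattern evaluation: keep only patterns above an interestingness threshold
    interesting = patterns[patterns["count"] >= 10]
    # 7. Knowledge presentation: return the result for display to the user
    return interesting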
3. Architecture of a typical data mining system
The architecture of a typical data mining system may have the following major components.
Database, data warehouse, World Wide Web, or other information repository: This is one or a set
of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data
cleaning and data integration techniques may be performed on the data.
Database or data warehouse server: The database or data warehouse server is responsible for
fetching the relevant data, based on the user’s data mining request.
Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to
organize attributes or attribute values into different levels of abstraction. Knowledge such as user
beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may
also be included. Other examples of domain knowledge are additional interestingness constraints
or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
Data mining engine: This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
Pattern evaluation module: This component typically employs interestingness measures and
interacts with the data mining modules so as to focus the search toward interesting patterns. It
may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the implementation
of the data mining method used. For efficient data mining, it is highly recommended to push the
evaluation of pattern interestingness as deep as possible into the mining process so as to confine
the search to only the interesting patterns.
User interface: This module communicates between users and the data mining system, allowing
the user to interact with the system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory data mining based on the
intermediate data mining results. In addition, this component allows the user to browse database
and data warehouse schemas or data structures, evaluate mined patterns, and visualize the
patterns in different forms.
From a data warehouse perspective, data mining can be viewed as an advanced stage of
on-line analytical processing (OLAP). However, data mining goes far beyond the narrow scope
of summarization-style analytical processing of data warehouse systems by incorporating more
advanced techniques for data analysis. Although there are many “data mining systems” on the
market, not all of them can perform true data mining. A data analysis system that does not handle
large amounts of data should be more appropriately categorized as a machine learning system, a
statistical data analysis tool, or an experimental system prototype. A system that can only
perform data or information retrieval, including finding aggregate values, or that performs
deductive query answering in large databases should be more appropriately categorized as a
database system, an information retrieval system, or a deductive database system. Data mining
involves an integration of techniques from multiple disciplines such as database and data
warehouse technology, statistics, machine learning, high-performance computing, pattern
recognition, neural networks, data visualization, information retrieval, image and signal
processing, and spatial or temporal data analysis.
The discovered knowledge can be applied to decision making, process control,
information management, and query processing. Therefore, data mining is considered one of the
most important frontiers in database and information systems and one of the most promising
interdisciplinary developments in information technology.
4. Data Mining: On What Kind of Data?
Advanced DB and information repositories
 Object-oriented and object-relational databases
 Spatial databases
 Time-series data and temporal data
 Text databases and multimedia databases
 Heterogeneous and legacy databases
 WWW
5. Data Mining Functionalities:
Concept description: Characterization and discrimination
 Data can be associated with classes or concepts
 Ex. In the AllElectronics store, classes of items for sale include computers and printers.
 A description of a class or concept is called a class/concept description.
 Data characterization
 Data discrimination
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns: patterns that occur frequently in data
Item sets, subsequences and substructures
Frequent item set
Sequential patterns
Structured patterns
Association Analysis
 Multi-dimensional vs. single-dimensional association
 age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]
 contains(T, “computer”) => contains(T, “software”) [support = 1%, confidence = 75%]
Classification and Prediction
 Finding models (functions) that describe and distinguish data classes or concepts,
in order to predict the class of objects whose class label is unknown
 E.g., classify countries based on climate, or classify cars based on gas mileage
 Models: decision-tree, classification rules (if-then), neural network
 Prediction: Predict some unknown or missing numerical values
Cluster analysis
 Unlike classification, which analyzes class-labeled data objects, clustering analyzes
data objects without consulting a known class label.
 Clustering based on the principle: maximizing the intra-class similarity and
minimizing the interclass similarity
Outlier analysis
 Outlier: a data object that does not comply with the general behavior or model
of the data
 It can be considered as noise or exception but is quite useful in fraud detection,
rare events analysis
Trend and evolution analysis
 Trend and deviation: regression analysis
 Sequential pattern mining, periodicity analysis
 Similarity-based analysis
6. Classification of Data Mining Systems
Classification according to the kinds of databases mined: A data mining system can be
classified according to the kinds of databases mined. Database systems can be classified
according to different criteria (such as data models, or the types of data or applications involved),
each of which may require its own data mining technique. Data mining systems can therefore be
classified accordingly. For instance, if classifying according to data models, we may have a
relational, transactional, object-relational, or data warehouse mining system. If classifying
according to the special types of data handled, we may have a spatial, time-series, text, stream
data, multimedia data mining system, or a World Wide Web mining system.
Classification according to the kinds of knowledge mined: Data mining systems can be
categorized according to the kinds of knowledge they mine, that is, based on data mining
functionalities, such as characterization, discrimination, association and correlation analysis,
classification, prediction, clustering, outlier analysis, and evolution analysis. A comprehensive
data mining system usually provides multiple and/or integrated data mining functionalities.
Moreover, data mining systems can be distinguished based on the granularity or levels of
abstraction of the knowledge mined, including generalized knowledge (at a high level of
abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels
(considering several levels of abstraction). An advanced data mining system should facilitate the
discovery of knowledge at multiple levels of abstraction.
Data mining systems can also be categorized as those that mine data regularities
(commonly occurring patterns) versus those that mine data irregularities (such as exceptions, or
outliers). In general, concept description, association and correlation analysis, classification,
prediction, and clustering mine data regularities, rejecting outliers as noise. These methods may
also help detect outliers.
Classification according to the kinds of techniques utilized: Data mining systems can be
categorized according to the underlying data mining techniques employed. These techniques can
be described according to the degree of user interaction involved (e.g., autonomous systems,
interactive exploratory systems, query-driven systems) or the methods of data analysis employed
(e.g., database-oriented or data warehouse– oriented techniques, machine learning, statistics,
visualization, pattern recognition, neural networks, and so on). A sophisticated data mining
system will often adopt multiple data mining techniques or work out an effective, integrated
technique that combines the merits of a few individual approaches.
Classification according to the applications adapted: Data mining systems can also be
categorized according to the applications they adapt. For example, data mining systems may be
tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on.
Different applications often require the integration of application-specific methods. Therefore, a
generic, all-purpose data mining system may not fit domain-specific mining tasks.
7. Integration of a Data Mining System with a Database or Data Warehouse
System
A good system architecture will facilitate the data mining system to make best use of the
software environment, accomplish data mining tasks in an efficient and timely manner,
interoperate and exchange information with other information systems, be adaptable to users’
diverse requirements, and evolve with time. A critical question in the design of a data mining
(DM) system is how to integrate or couple the DM system with a database (DB) system and/or a
data warehouse (DW) system. If a DM system works as a stand-alone system or is embedded in
an application program, there are no DB or DW systems with which it has to communicate. This
simple scheme is called no coupling, where the main focus of the DM design rests on developing
effective and efficient algorithms for mining the available data sets. However, when a DM
system works in an environment that requires it to communicate with other information system
components, such as DB and DW systems, possible integration schemes include no coupling,
loose coupling, semi tight coupling, and tight coupling. We examine each of these schemes, as
follows:
No coupling: No coupling means that a DM system will not utilize any function of a DB or DW
system. It may fetch data from a particular source (such as a file system), process data using
some data mining algorithms, and then store the mining results in another file. Such a system,
though simple, suffers from several drawbacks. First, a DB system provides a great deal of
flexibility and efficiency at storing, organizing, accessing, and processing data. Without using a
DB/DW system, a DM system may spend a substantial amount of time finding, collecting,
cleaning, and transforming data. In DB and/or DW systems, data tend to be well organized,
indexed, cleaned, integrated, or consolidated, so that finding the task-relevant, high-quality data
becomes an easy task. Second, there are many tested, scalable algorithms and data structures
implemented in DB and DW systems. It is feasible to realize efficient, scalable implementations
using such systems. Moreover, most data have been or will be stored in DB/DW systems.
Without any coupling of such systems, a DM system will need to use other tools to extract data,
making it difficult to integrate such a system into an information processing environment. Thus,
no coupling represents a poor design.
Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or
DW system, fetching data from a data repository managed by these systems, performing data
mining, and then storing the mining results either in a file or in a designated place in a database
or data warehouse. Loose coupling is better than no coupling because it can fetch any portion of
data stored in databases or data warehouses by using query processing, indexing, and other
system facilities. It incurs some advantages of the flexibility, efficiency, and other features
provided by such systems. However, many loosely coupled mining systems are main memory-based. Because mining does not explore data structures and query optimization methods
provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and
good performance with large data sets.
Semi tight coupling: Semi tight coupling means that besides linking a DM system to a DB/DW
system, efficient implementations of a few essential data mining primitives (identified by the
analysis of frequently encountered data mining functions) can be provided in the DB/DW
system. These primitives can include sorting, indexing, aggregation, histogram analysis, multi
way join, and pre computation of some essential statistical measures, such as sum, count, max,
min, standard deviation, and so on. Moreover, some frequently used intermediate mining results
can be pre computed and stored in the DB/DW system. Because these intermediate mining
results are either pre computed or can be computed efficiently, this design will enhance the
performance of a DM system.
Tight coupling: Tight coupling means that a DM system is smoothly integrated into the DB/DW
system. The data mining subsystem is treated as one functional component of an information
system. Data mining queries and functions are optimized based on mining query analysis, data
structures, indexing schemes, and query processing methods of a DB or DW system. With
further technology advances, DM, DB, and DW systems will evolve and integrate together as
one information system with multiple functionalities. This will provide a uniform information
processing environment. This approach is highly desirable because it facilitates efficient
implementations of data mining functions, high system performance, and an integrated
information processing environment.
With this analysis, it is easy to see that a data mining system should be coupled with a
DB/DW system. Loose coupling, though not efficient, is better than no coupling because it uses
both data and system facilities of a DB/DW system. Tight coupling is highly desirable, but its
implementation is nontrivial and more research is needed in this area. Semi tight coupling is a
compromise between loose and tight coupling. It is important to identify commonly used data
mining primitives and provide efficient implementations of such primitives in DB or DW
systems.
8. Major Issues in Data Mining
The major issues in data mining concern mining methodology, user interaction, performance, and
the diversity of data types.
Mining methodology and user interaction issues: These reflect the kinds of knowledge mined,
the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad hoc
mining, and knowledge visualization.
Mining different kinds of knowledge in databases: Because different users can be interested in
different kinds of knowledge, data mining should cover a wide spectrum of data analysis and
knowledge discovery tasks, including data characterization, discrimination, association and
correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis
(which includes trend and similarity analysis). These tasks may use the same database in different
ways and require the development of numerous data mining techniques.
Interactive mining of knowledge at multiple levels of abstraction: Because it is difficult to
know exactly what can be discovered within a database, the data mining process should be
interactive. For databases containing a huge amount of data, appropriate sampling techniques can
first be applied to facilitate interactive data exploration. Interactive mining allows users to focus
the search for patterns, providing and refining data mining requests based on returned results.
Specifically, knowledge should be mined by drilling down, rolling up, and pivoting through the
data space and knowledge space interactively, similar to what OLAP can do on data cubes. In
this way, the user can interact with the data mining system to view data and discovered patterns
at multiple granularities and from different angles.
Incorporation of background knowledge: Background knowledge, or information regarding the
domain under study, may be used to guide the discovery process and allow discovered patterns to
be expressed in concise terms and at different levels of abstraction. Domain knowledge related to
databases, such as integrity constraints and deduction rules, can help focus and speed up a data
mining process, or judge the interestingness of discovered patterns.
Data mining query languages and ad hoc data mining: Relational query languages (such as
SQL) allow users to pose ad hoc queries for data retrieval. In a similar vein, high-level data
mining query languages need to be developed to allow users to describe ad hoc data mining tasks
by facilitating the specification of the relevant sets of data for analysis, the domain knowledge,
the kinds of knowledge to be mined, and the conditions and constraints to be enforced on the
discovered patterns. Such a language should be integrated with a database or data warehouse
query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results: Discovered knowledge should be
expressed in high-level languages, visual representations, or other expressive forms so that the
knowledge can be easily understood and directly usable by humans. This is especially crucial if
the data mining system is to be interactive. This requires the system to adopt expressive
knowledge representation techniques, such as trees, tables, rules, graphs, charts, crosstabs,
matrices, or curves.
Handling noisy or incomplete data: The data stored in a database may reflect noise, exceptional
cases, or incomplete data objects. When mining data regularities, these objects may confuse the
process, causing the knowledge model constructed to overfit the data. As a result, the accuracy of
the discovered patterns can be poor. Data cleaning methods and data analysis methods that can
handle noise are required, as well as outlier mining methods for the discovery and analysis of
exceptional cases.
Pattern evaluation—the interestingness problem: A data mining system can uncover thousands
of patterns. Many of the patterns discovered may be uninteresting to the given user, either
because they represent common knowledge or lack novelty. Several challenges remain regarding
the development of techniques to assess the interestingness of discovered patterns, particularly
with regard to subjective measures that estimate the value of patterns with respect to a given user
class, based on user beliefs or expectations. The use of interestingness measures or user-specified
constraints to guide the discovery process and reduce the search space is another active area of
research.
Performance issues: These include efficiency, scalability, and parallelization of data mining
algorithms.
Efficiency and scalability of data mining algorithms: To effectively extract information from a
huge amount of data in databases, data mining algorithms must be efficient and scalable. In other
words, the running time of a data mining algorithm must be predictable and acceptable in large
databases. From a database perspective on knowledge discovery, efficiency and scalability are
key issues in the implementation of data mining systems. Many of the issues discussed above
under mining methodology and user interaction must also consider efficiency and scalability.
Parallel, distributed, and incremental mining algorithms: The huge size of many databases, the
wide distribution of data, and the computational complexity of some data mining methods are
factors motivating the development of parallel and distributed data mining algorithms. Such
algorithms divide the data into partitions, which are processed in parallel. The results from the
partitions are then merged. Moreover, the high cost of some data mining processes promotes the
need for incremental data mining algorithms that incorporate database updates without having to
mine the entire data again “from scratch.” Such algorithms perform knowledge modification
incrementally to amend and strengthen what was previously discovered.
Issues relating to the diversity of database types:
Handling of relational and complex types of data: Because relational databases and data
warehouses are widely used, the development of efficient and effective data mining systems for
such data is important. However, other databases may contain complex data objects, hypertext
and multimedia data, spatial data, temporal data, or transaction data. It is unrealistic to expect
one system to mine all kinds of data, given the diversity of data types and different goals of data
mining. Specific data mining systems should be constructed for mining specific kinds of data.
Therefore, one may expect to have different data mining systems for different kinds of data.
Mining information from heterogeneous databases and global information systems: Local- and wide-area computer networks (such as the Internet) connect many sources of data, forming
huge, distributed, and heterogeneous databases. The discovery of knowledge from different
sources of structured, semi structured, or unstructured data with diverse data semantics poses
great challenges to data mining. Data mining may help disclose high-level data regularities in
multiple heterogeneous databases that are unlikely to be discovered by simple query systems and
may improve information exchange and interoperability in heterogeneous databases. Web
mining, which uncovers interesting knowledge about Web contents, Web structures, Web usage,
and Web dynamics, becomes a very challenging and fast-evolving field in data mining.
9. Data Pre Processing:
Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data due
to their typically huge size (often several gigabytes or more) and their likely origin from
multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. “How
can the data be preprocessed in order to help improve the quality of the data and, consequently,
of the mining results? How can the data be preprocessed so as to improve the efficiency and ease
of the mining process?”
There are a number of data preprocessing techniques.
Data cleaning can be applied to remove noise and correct inconsistencies in the data.
Data integration merges data from multiple sources into a coherent data store, such as a data
warehouse.
Data transformations, such as normalization, may be applied. For example, normalization may
improve the accuracy and efficiency of mining algorithms involving distance measurements.
Data reduction can reduce the data size by aggregating, eliminating redundant features, or
clustering, for instance. These techniques are not mutually exclusive; they may work together.
For example, data cleaning can involve transformations to correct wrong data, such as by
transforming all entries for a date field to a common format. Data preprocessing techniques, when
applied before mining, can substantially improve the overall quality of the patterns mined and/or
the time required for the actual mining.
Data Discretization: Part of data reduction but with particular importance, especially for
numerical data.
Data Cleaning:
Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
Missing Data
Data is not always available
 E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of entry
 not register history or changes of the data
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing
Fill in the missing value manually
Use a global constant to fill in the missing value: ex. “unknown”
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in the missing
value
Use the most probable value to fill in the missing value: inference-based such as
Bayesian formula or decision tree
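A minimal sketch of some of the filling strategies above, using pandas; the DataFrame and its columns (class, income) are hypothetical examples.

import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30000, None, 52000, None, 48000],
})

# Use a global constant to fill in the missing value (turns the column into strings)
filled_const = df["income"].fillna("unknown")

# Use the attribute mean to fill in the missing value
filled_mean = df["income"].fillna(df["income"].mean())

# Use the attribute mean for all samples belonging to the same class
filled_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))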
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
How to Handle Noisy Data?
Binning method:
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries
Clustering
 detect and remove outliers
Regression
 smooth by fitting the data to a regression function, e.g., linear regression
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
 It divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the width of intervals
will be: W = (B-A)/N.
 The most straightforward
 But outliers may dominate presentation
 Skewed data is not handled well.
Equal-depth (frequency) partitioning:
 It divides the range into N intervals, each containing approximately same number
of samples
 Good data scaling
 Managing categorical attributes can be tricky.
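A small sketch of both partitioning schemes and of smoothing by bin means, using pandas; the sample prices and the number of bins are made up for illustration.

import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])
N = 3

# Equal-width partitioning: each interval has width W = (B - A) / N
width_bins = pd.cut(prices, bins=N)

# Equal-depth (frequency) partitioning: ~same number of samples per bin
depth_bins = pd.qcut(prices, q=N)

# Smoothing by bin means: replace each value by the mean of its (equal-depth) bin
smoothed = prices.groupby(depth_bins).transform("mean")
# here the three bins {4,8,15}, {21,21,24}, {25,28,34} are smoothed to 9, 22, 29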
Data Integration and Transformation
Data integration:
 combines data from multiple sources into a coherent store
Schema integration
 integrate metadata from different sources
 Entity identification problem: identify real world entities from multiple data
sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
 for the same real world entity, attribute values from different sources are different
 possible reasons: different representations, different scales, e.g., metric vs. British
units
Handling Redundant Data in Data Integration
Redundant data occur often when integration of multiple databases
 The same attribute may have different names in different databases
 One attribute may be a “derived” attribute in another table, e.g., annual revenue
Redundant data may be able to be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce/avoid redundancies
and inconsistencies and improve mining speed and quality
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
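A brief numeric sketch of the three normalization methods listed above; the sample values and the target range are illustrative only.

import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 900.0])

# Min-max normalization: scale to the range [new_min, new_max]
new_min, new_max = 0.0, 1.0
minmax = (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

# z-score normalization: (value - mean) / standard deviation
zscore = (x - x.mean()) / x.std(ddof=1)

# Normalization by decimal scaling: divide by 10^j, where j is the smallest
# integer such that max(|value|) / 10^j < 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / (10 ** j)          # here j = 3, so 900 becomes 0.9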
Data Reduction
Warehouse may store terabytes of data: Complex data analysis/mining may take a very
long time to run on the complete data set
Data reduction
 Obtains a reduced representation of the data set that is much smaller in volume
but yet produces the same (or almost the same) analytical results
Data reduction strategies
 Data cube aggregation
 Attribute subset selection
 Dimensionality reduction
 Numerosity reduction
 Discretization and concept hierarchy generation
Dimensionality Reduction
Feature selection (attribute subset selection):
 Select a minimum set of features such that the probability distribution of different
classes given the values for those features is as close as possible to the original
distribution given the values of all features
 reduces the number of patterns, making them easier to understand
Heuristic methods
 step-wise forward selection
 step-wise backward elimination
 combining forward selection and backward elimination
 decision-tree induction
Wavelet Transforms
Discrete wavelet transform (DWT): linear signal processing
Compressed approximation: store only a small fraction of the strongest wavelet
coefficients
Similar to discrete Fourier transform (DFT), but better lossy compression, localized in
space
Method:
 Length, L, must be an integer power of 2 (padding with 0s, when necessary)
 Each transform has 2 functions: smoothing, difference
 Applies to pairs of data, resulting in two sets of data of length L/2
 Applies two functions recursively, until reaches the desired length
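A minimal sketch of the recursive (smoothing, difference) passes described above, using the simple Haar pair with normalization constants omitted; the input values are arbitrary and the length is assumed to be a power of 2.

import numpy as np

def haar_dwt(x):
    # Recursively apply (smoothing, difference) to pairs until length 1
    x = np.asarray(x, dtype=float)
    output = []
    while len(x) > 1:
        pairs = x.reshape(-1, 2)
        smooth = (pairs[:, 0] + pairs[:, 1]) / 2.0   # smoothing function
        detail = (pairs[:, 0] - pairs[:, 1]) / 2.0   # difference function
        output.append(detail)                        # keep detail coefficients
        x = smooth                                   # recurse on the smoothed half
    output.append(x)                                 # final overall average
    return np.concatenate(output[::-1])

coeffs = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])          # length 8 = 2^3
# Lossy compression: keep only the k strongest coefficients, zero out the rest
k = 3
thresh = np.sort(np.abs(coeffs))[-k]
compressed = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)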
Attribute subset selection
Attribute subset selection reduces the data set size by removing irrelevant or redundant
attributes.
The goal is to find a minimum set of attributes
Uses basic heuristic methods of attribute selection
Numerosity Reduction
Parametric methods
 Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
 Log-linear models: obtain the value at a point in m-D space as the product of
appropriate marginal subspaces
Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling
Regression Analysis and Log-Linear Models
Linear regression: Y = α + β X
 Two parameters, α and β, specify the line and are to be estimated by using the
data at hand,
 applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2.
 Many nonlinear functions can be transformed into the above.
Log-linear models:
 The multi-way table of joint probabilities is approximated by a product of lower-order tables.
 Probability: p(a, b, c, d) = αab βac χad δbcd
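A small sketch of estimating α and β by least squares for the simple linear model above, and of how the fitted parameters act as a reduced (parametric) representation of the data; the x/y values are made up.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares estimates for Y = alpha + beta * X
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()

# Numerosity reduction: store only (alpha, beta) instead of the raw data,
# and reconstruct approximate values on demand
y_hat = alpha + beta * x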
Clustering
Partition data set into clusters, and one can store cluster representation only
Can be very effective if data is clustered but not if data is “smeared”
Can have hierarchical clustering and be stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms.
Sampling
Allows a large data set to be represented by a much smaller random sample (or subset) of the data.
Let a large data set D contain N tuples.
Methods to reduce data set D:
 Simple random sample without replacement (SRSWOR)
 Simple random sample with replacement (SRSWR)
 Cluster sample
 Stratified sample
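A sketch of these sampling schemes on a hypothetical list of tuple ids; the data set size, sample size, and strata are invented for illustration (a real stratified sample would use an actual class attribute such as age group).

import random

D = list(range(1000))   # a data set of N = 1000 tuple ids
n = 50                  # desired sample size

# Simple random sample without replacement (SRSWOR)
srswor = random.sample(D, n)

# Simple random sample with replacement (SRSWR)
srswr = [random.choice(D) for _ in range(n)]

# Stratified sample: sample proportionally within each stratum
strata = {"young": list(range(0, 400)), "senior": list(range(400, 1000))}
stratified = [t for label, tuples in strata.items()
              for t in random.sample(tuples, max(1, n * len(tuples) // len(D)))]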
Discretization and concept hierarchy generation
Discretization
Three types of attributes:
 Nominal — values from an unordered set
 Ordinal — values from an ordered set
 Continuous — real numbers
Discretization: divide the range of a continuous attribute into intervals
 Some classification algorithms only accept categorical attributes.
 Reduce data size by discretization
 Prepare for further analysis
Discretization and Concept hierarchy
Discretization
 reduce the number of values for a given continuous attribute by dividing the range
of the attribute into intervals. Interval labels can then be used to replace actual
data values.
Concept hierarchies
 reduce the data by collecting and replacing low level concepts (such as numeric
values for the attribute age) by higher level concepts (such as young, middle-aged,
or senior).
Discretization and concept hierarchy generation for numeric data
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
Discretization by intuitive partitioning
UNIT:2
What is Data Warehouse?
Defined in many different ways
 A decision support database that is maintained separately from the organization’s
operational database
 Support information processing by providing a solid platform of consolidated,
historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process.”—W. H. Inmon
Data warehousing:
 The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focusing on the modeling and analysis of data for decision makers, not on daily
operations or transaction processing.
Provide a simple and concise view around particular subject issues by excluding data that
are not useful in the decision support process.
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data sources
 relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
 Ensure consistency in naming conventions, encoding structures, attribute
measures, etc. among different data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
 When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant
The time horizon for the data warehouse is significantly longer than that of operational
systems.
 Operational database: current value data.
 Data warehouse data: provide information from a historical perspective (e.g., past
5-10 years)
Every key structure in the data warehouse
 Contains an element of time, explicitly or implicitly
 But the key of operational data may or may not contain “time element”.
Data Warehouse—Non-Volatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
 Does not require transaction processing, recovery, and concurrency control
mechanisms
 Requires only two operations in data accessing:
initial loading of data and access of data.
Data Warehouse vs. Operational DBMS
Distinct features (OLTP vs. OLAP):
 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical, consolidated
 Database design: ER + application vs. star + subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex queries
OLTP (on-line transaction processing)
 Major task of traditional relational DBMS
 Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll,
registration, accounting, etc.
OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making
A multi-dimensional data model
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
 Star schema: A fact table in the middle connected to a set of dimension tables
 Snowflake schema: A refinement of star schema where some dimensional
hierarchy is normalized into a set of smaller dimension tables, forming a shape
similar to snowflake
 Fact constellations: Multiple fact tables share dimension tables, viewed as a
collection of stars, therefore called galaxy schema or fact constellation
Data warehouse architecture
Steps for the Design and Construction of Data Warehouse
The design of a data warehouse: a business analysis framework
The process of data warehouse design
A three-tier data warehouse architecture
Four views regarding the design of a data warehouse
Top-down view
allows selection of the relevant information necessary for the data
warehouse
Data warehouse view
consists of fact tables and dimension tables
Data source view
exposes the information being captured, stored, and managed by
operational systems
Business query view
sees the perspectives of data in the warehouse from the viewpoint of the end user
Types of OLAP Servers
Relational OLAP (ROLAP)
 Use relational or extended-relational DBMS to store and manage warehouse data
and OLAP middle ware to support missing pieces
 Include optimization of DBMS backend, implementation of aggregation
navigation logic, and additional tools and services
 greater scalability
Multidimensional OLAP (MOLAP)
 Array-based multidimensional storage engine (sparse matrix techniques)
 fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
 User flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers
 specialized support for SQL queries over star/snowflake schemas
From data warehousing to data mining
Three kinds of data warehouse applications
 Information processing
supports querying, basic statistical analysis, and reporting using crosstabs,
tables, charts and graphs
 Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
 Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools.
Differences among the three tasks
From On-Line Analytical Processing to On Line Analytical Mining (OLAM)
Why online analytical mining?
 High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
 Available information processing structure surrounding data warehouses
ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP
tools
 OLAP-based exploratory data analysis
mining with drilling, dicing, pivoting, etc.
 On-line selection of data mining functions
integration and swapping of multiple mining functions, algorithms, and
tasks.
Architecture of OLAM
UNIT:3
Data Mining Task Primitives
A data mining task can be specified in the form of a data mining query, which is input to the data
mining system. A data mining query is defined in terms of data mining task primitives. These
primitives allow the user to interactively communicate with the data mining system during
discovery in order to direct the mining process, or examine the findings from different angles or
depths. The data mining primitives specify the following, as illustrated
The set of task-relevant data to be mined: This specifies the portions of the database or the set
of data in which the user is interested. This includes the database attributes or data warehouse
dimensions of interest (referred to as the relevant attributes or dimensions).
The kind of knowledge to be mined: This specifies the data mining functions to be performed,
such as characterization, discrimination, association or correlation analysis, classification,
prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about the
domain to be mined is useful for guiding the knowledge discovery process and for evaluating the
patterns found. Concept hierarchies are a popular form of background knowledge, which allow
data to be mined at multiple levels of abstraction.
The interestingness measures and thresholds for pattern evaluation: They may be used to
guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds
of knowledge may have different interestingness measures. For example, interestingness
measures for association rules include support and confidence. Rules whose support and
confidence values are below user-specified thresholds are considered uninteresting.
The expected representation for visualizing the discovered patterns: This refers to the form in
which discovered patterns are to be displayed, which may include rules, tables, charts, graphs,
decision trees, and cubes.
A data mining query language can be designed to incorporate these primitives, allowing
users to flexibly interact with data mining systems. Having a data mining query language
provides a foundation on which user-friendly graphical interfaces can be built. This facilitates a
data mining system’s communication with other information systems and its integration with the
overall information processing environment. Designing a comprehensive data mining language is
challenging because data mining covers a wide spectrum of tasks, from data characterization to
evolution analysis. Each task has different requirements. The design of an effective data mining
query language requires a deep understanding of the power, limitation, and underlying
mechanisms of the various kinds of data mining tasks.
A Data Mining Query Language (DMQL)
Motivation
 A DMQL can provide the ability to support ad-hoc and interactive data mining
 By providing a standardized language like SQL
to achieve an effect similar to that of SQL on relational databases
Foundation for system development and evolution
Facilitate information exchange, technology transfer, commercialization
and wide acceptance
Design
 DMQL is designed with the primitives
Syntax for DMQL
Syntax for specification of
 task-relevant data
 the kind of knowledge to be mined
 concept hierarchy specification
 interestingness measure
 pattern presentation and visualization
— a DMQL query
Syntax for task-relevant data specification
use database database_name, or use data warehouse data_warehouse_name
from relation(s)/cube(s) [where condition]
in relevance to att_or_dim_list
order by order_list
group by grouping_list
having condition
Syntax for specifying the kind of knowledge to be mined
Characterization
Mine_Knowledge_Specification ::= mine characteristics [as pattern_name]
analyze measure(s)
Discrimination
Mine_Knowledge_Specification ::= mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze measure(s)
Association
Mine_Knowledge_Specification ::=
mine associations [as pattern_name]
Classification
Mine_Knowledge_Specification ::=
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
Prediction
Mine_Knowledge_Specification ::=
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}
Syntax for concept hierarchy specification
To specify what concept hierarchies to use
use hierarchy <hierarchy> for <attribute_or_dimension>
use different syntax to define different type of hierarchies
 schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]
 set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
 operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
 rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost)< $50
level_1: medium-profit_margin < level_0: all
if ((price - cost) > $50) and ((price - cost) <= $250))
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
Syntax for interestingness measure specification
Interestingness measures and thresholds can be specified by the user with the statement:
with <interest_measure_name> threshold = threshold_value
Example:
with support threshold = 0.05
with confidence threshold = 0.7
Syntax for pattern presentation and visualization specification
syntax which allows users to specify the display of discovered patterns in one or more
forms
display as <result_form>
To facilitate interactive viewing at different concept level, the following syntax is
defined:
Multilevel_Manipulation ::= roll up on attribute_or_dimension
| drill down on attribute_or_dimension
| add attribute_or_dimension
| drop attribute_or_dimension
Design graphical user interfaces based on a data mining query language
What tasks should be considered in the design of GUIs based on a data mining query
language?
 Data collection and data mining query composition
 Presentation of discovered patterns
 Hierarchy specification and manipulation
 Manipulation of data mining primitives
 Interactive multilevel mining
 Other miscellaneous information
UNIT: 4
What is Concept Description?
Descriptive vs. predictive data mining
 Descriptive mining: describes concepts or task-relevant data sets in concise,
summarative, informative, discriminative forms
 Predictive mining: Based on data and analysis, constructs models for the
database, and predicts the trend and properties of unknown data
Concept description:
 Characterization: provides a concise and succinct summarization of the given
collection of data
 Comparison: provides descriptions comparing two or more collections of data
Concept Description vs. OLAP
Concept description:
 can handle complex data types of the attributes and their aggregations
 a more automated process
OLAP:
 restricted to a small number of dimension and measure types
 user-controlled process
Data Generalization and Summarization-based Characterization
Data generalization
 A process which abstracts a large set of task-relevant data in a database from
low conceptual levels to higher ones.
 Approaches:
Data cube approach(OLAP approach)
Attribute-oriented induction approach
Characterization: Data Cube Approach
Perform computations and store results in data cubes
Strength
 An efficient implementation of data generalization
 Computation of various kinds of measures
count( ), sum( ), average( ), max( )
 Generalization and specialization can be performed on a data cube by roll-up and
drill-down
Limitations
 handle only dimensions of simple nonnumeric data and measures of simple
aggregated numeric values.
 Lack of intelligent analysis; cannot tell which dimensions should be used and what
levels the generalization should reach
Attribute-Oriented Induction
Proposed in 1989 (KDD ‘89 workshop)
Not confined to categorical data nor particular measures.
How it is done?
 Collect the task-relevant data( initial relation) using a relational database query
 Perform generalization by attribute removal or attribute generalization.
 Apply aggregation by merging identical, generalized tuples and accumulating
their respective counts.
 Interactive presentation with users.
Basic Principles of Attribute-Oriented Induction
Data focusing
 task-relevant data, including dimensions, and the result is the initial relation.
Attribute-removal
 remove attribute A if there is a large set of distinct values for A but
 (1) there is no generalization operator on A, or
 (2) A’s higher level concepts are expressed in terms of other attributes.
Attribute-generalization
 If there is a large set of distinct values for A, and there exists a set of
generalization operators on A, then select an operator and generalize A.
Attribute-threshold control
Generalized relation threshold control
control the final relation/rule size.
Basic Algorithm for Attribute-Oriented Induction
Initial Relation
 Query processing of task-relevant data, deriving the initial relation.
Pre Generalization
Based on the analysis of the number of distinct values in each attribute, determine
generalization plan for each attribute: removal? or how high to generalize?
Prime Generalization
 Based on the PreGen plan, perform generalization to the right level to derive a
“prime generalized relation”, accumulating the counts.
Presentation
 User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules,
cross tabs, visualization presentations.
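A minimal sketch of the induction steps above on a hypothetical customer relation: the name attribute is removed (many distinct values, no useful generalization operator), age is generalized through a small invented concept hierarchy, and identical generalized tuples are merged while accumulating counts.

import pandas as pd

initial_relation = pd.DataFrame({
    "name": ["Ann", "Bob", "Cho", "Dev", "Eli"],
    "age":  [23, 27, 41, 55, 62],
    "city": ["Delhi", "Delhi", "Mumbai", "Delhi", "Mumbai"],
})

# Attribute removal: drop 'name' (large set of distinct values, no generalization operator)
relation = initial_relation.drop(columns=["name"])

# Attribute generalization: climb the concept hierarchy for 'age'
def age_group(age):
    return "young" if age < 30 else "middle_aged" if age < 60 else "senior"
relation["age"] = relation["age"].map(age_group)

# Aggregation: merge identical generalized tuples and accumulate counts
prime_generalized = (relation.groupby(["age", "city"])
                     .size().reset_index(name="count"))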
Implementation by Cube Technology
Construct a data cube on-the-fly for the given data mining query
 Facilitate efficient drill-down analysis
 May increase the response time
 A balanced solution: precomputation of “subprime” relation
Use a predefined & precomputed data cube
 Construct a data cube beforehand
 Facilitate not only the attribute-oriented induction, but also attribute relevance
analysis, dicing, slicing, roll-up and drill-down
 Cost of cube computation and the nontrivial storage overhead
Analytical characterization: Analysis of attribute relevance
Characterization vs. OLAP
Similarity:
 Presentation of data summarization at multiple levels of abstraction.
 Interactive drilling, pivoting, slicing and dicing.
Differences:
 Automated desired level allocation.
 Dimension relevance analysis and ranking when there are many relevant
dimensions.
 Sophisticated typing on dimensions and measures.
 Analytical characterization: data dispersion analysis.
Attribute Relevance Analysis
Why?




What?

Which dimensions should be included?
How high level of generalization?
Automatic vs. interactive
Reduce # attributes; easy to understand patterns
statistical method for preprocessing data
filter out irrelevant or weakly relevant attributes
retain or rank the relevant attributes
 relevance related to dimensions and levels
 analytical characterization, analytical comparison
Attribute relevance analysis
Data Collection
Analytical Generalization
Use information gain analysis to identify highly relevant dimensions and levels.
Relevance Analysis
Sort and select the most relevant dimensions and levels.
Attribute-oriented Induction for class description
 On selected dimension/level
 OLAP operations (drilling, slicing) on relevance rules
Relevance Measures
Quantitative relevance measure determines the classifying power of an attribute within a
set of data.
Methods
 information gain (ID3)
 gain ratio (C4.5)
 gini index
 χ2 contingency table statistics
 uncertainty coefficient
Information-Theoretic Approach
Decision tree
 each internal node tests an attribute
 each branch corresponds to attribute value
 each leaf node assigns a classification
ID3 algorithm
 build decision tree based on training objects with known class labels to classify
testing objects
 rank attributes with information gain measure
 minimal height
the least number of tests to classify an object
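A short sketch of the information gain measure that ID3 uses to rank attributes; the tiny training table (age, income, class label) is hypothetical, chosen so that one attribute separates the classes perfectly and the other gives no information.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    # Expected reduction in entropy from splitting on the given attribute
    base = entropy(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return base - remainder

rows   = [("young", "high"), ("young", "low"), ("senior", "low"), ("senior", "high")]
labels = ["no", "no", "yes", "yes"]
gain_age = information_gain(rows, 0, labels)      # 1.0: age perfectly separates the classes
gain_income = information_gain(rows, 1, labels)   # 0.0: income gives no information here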
Top-Down Induction of Decision Tree
Mining Class Comparisons
Comparison
 Comparing two or more classes.
Method
 Partition the set of relevant data into the target class and the contrasting classes
 Generalize both classes to the same high level concepts
 Compare tuples with the same high level descriptions
 Present for every tuple its description and two measures:
support - distribution within single class
comparison - distribution between classes
 Highlight the tuples with strong discriminant features
Relevance Analysis
 Find attributes (features) which best distinguish different classes.
Mining Data Dispersion Characteristics
Motivation
 To better understand the data: central tendency, variation and spread
Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities of precision
 Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
Boxplot Analysis
Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
Boxplot
 Data is represented with a box
 The ends of the box are at the first and third quartiles, i.e., the height of the box is
the IQR (interquartile range)
 The median is marked by a line within the box
 Whiskers: two lines outside the box extend to Minimum and Maximum
Mining Descriptive Statistical Measures in Large Databases
Variance:
s2 = (1/(n-1)) Σi=1..n (xi - x̄)2
   = (1/(n-1)) [ Σ xi2 - (1/n) (Σ xi)2 ]
Standard deviation: the square root of the variance
 Measures spread about the mean
 It is zero if and only if all the values are equal
 Both the deviation and the variance are algebraic
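A quick numeric check of the two equivalent forms of the variance formula above, and of the standard deviation as its square root; the sample values are arbitrary.

import math

x = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
n = len(x)
mean = sum(x) / n

# Definition form: s^2 = (1/(n-1)) * sum((xi - mean)^2)
s2_def = sum((xi - mean) ** 2 for xi in x) / (n - 1)

# One-pass (algebraic) form: s^2 = (1/(n-1)) * (sum(xi^2) - (1/n) * (sum(xi))^2)
s2_alg = (sum(xi ** 2 for xi in x) - (sum(x) ** 2) / n) / (n - 1)

std = math.sqrt(s2_def)
assert abs(s2_def - s2_alg) < 1e-9   # both forms agree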
Graphic Displays of Basic Statistical Descriptions
Histogram
Boxplot
Quantile plot: each value xi is paired with fi indicating that approximately 100 fi % of
data are ≤ xi
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against
the corresponding quantiles of another
Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
Loess (local regression) curve: add a smooth curve to a scatter plot to provide better
perception of the pattern of dependence
Incremental and Parallel Mining of Concept Description
Incremental mining: revision based on newly added data ΔDB
 Generalize ΔDB to the same level of abstraction in the generalized relation R to
derive ΔR
 Union R ∪ ΔR, i.e., merge counts and other statistical information to produce a
new relation R’
Similar philosophy can be applied to data sampling, parallel and/or distributed mining,
etc.
UNIT: 5
Association rule mining
• Association rule mining
– Finding frequent patterns, associations, correlations, or causal structures among
sets of items or objects in transaction databases, relational databases, and other
information repositories.
• Applications
– Basket data analysis, cross-marketing, catalog design, loss-leader analysis,
clustering, classification, etc.
• Rule form: prediction (Boolean variables) => prediction (Boolean variables)
[support, confidence]
– computer => antivirus_software [support = 2%, confidence = 60%]
– buys(x, “computer”) => buys(x, “antivirus_software”) [0.5%, 60%]
Basic Concepts
• Given a database of transactions, each transaction is a list of items (purchased by a
customer in a visit)
• Find all rules that correlate the presence of one set of items with that of another set of
items
• Find frequent patterns
• Example for frequent itemset mining is market basket analysis.
Association rule performance measures
• Confidence
• Support
• Minimum support threshold
• Minimum confidence threshold
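As a worked example with hypothetical counts: suppose a database holds 10,000 transactions, 6,000 of them contain computer, and 4,000 contain both computer and antivirus_software. Then for the rule computer => antivirus_software, support = 4,000 / 10,000 = 40% and confidence = 4,000 / 6,000 ≈ 67%. The rule is reported only if both values meet the user-specified minimum support and minimum confidence thresholds.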
Market Basket Analysis
• Shopping baskets
• Each item has a Boolean variable representing the presence or absence of that item.
• Each basket can be represented by a Boolean vector of values assigned to these variables.
• Identify patterns from Boolean vectors
• Patterns can be represented by association rules.
A Road Map
• Boolean vs. quantitative associations
– Based on the types of values handled
– buys(x, “SQLServer”) ^ buys(x, “DMBook”) => buys(x, “DBMiner”) [0.2%,
60%]
– age(x, “30..39”) ^ income(x, “42..48K”) => buys(x, “PC”) [1%, 75%]
• Single-dimensional vs. multidimensional associations
• Single-level vs. multiple-level analysis
Mining single-dimensional Boolean association rules from transactional
databases
Apriori Algorithm
• Single-dimensional, single-level, Boolean frequent itemsets
• Finding frequent itemsets using candidate generation
• Generating association rules from frequent itemsets
Mining Frequent Itemsets: the Key Step
• Find the frequent itemsets: the sets of items that have minimum support
– A subset of a frequent itemset must also be a frequent itemset
• i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent
itemsets
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules.
• Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
How to Generate Candidates
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
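Both steps above, together with the level-wise support counting from the earlier pseudo-code, can be sketched in Python. The code below is a simplified, illustrative implementation (the names apriori_gen and min_support and the toy baskets are assumptions; plain subset tests are used instead of a hash-tree):

from collections import defaultdict
from itertools import combinations

def apriori_gen(prev_frequent, k):
    # Self-join: union pairs of frequent (k-1)-itemsets that form a k-itemset.
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    # Prune: every (k-1)-subset of a candidate must itself be frequent.
    return {c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    counts = defaultdict(int)
    for t in transactions:                      # L1: frequent 1-itemsets
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        candidates = apriori_gen(frequent.keys(), k)
        counts = defaultdict(int)
        for t in transactions:                  # count candidates contained in t
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "beer"},
           {"bread", "beer"}, {"milk", "beer"}]
print(apriori(baskets, min_support=2))          # absolute support count of 2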
How to Count Supports of Candidates
• Why is counting supports of candidates a problem?
– The total number of candidates can be very large
– One transaction may contain many candidates
• Method
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a transaction
Methods to Improve Apriori’s Efficiency
• Hash-based itemset counting
– A k-itemset whose corresponding hashing bucket count is below the threshold
cannot be frequent
• Transaction reduction
– A transaction that does not contain any frequent k-itemset is useless in subsequent
scans
• Partitioning
– Any itemset that is potentially frequent in DB must be frequent in at least one of
the partitions of DB
Methods to Improve Apriori’s Efficiency
• Sampling
– mining on a subset of the given data; lower support threshold + a method to
determine the completeness
• Dynamic itemset counting
– add new candidate itemsets only when all of their subsets are estimated to be
frequent
Mining Frequent Patterns Without Candidate Generation
• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
– highly condensed, but complete for frequent pattern mining
– avoids costly repeated database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology: decompose mining tasks into smaller ones
– Avoid candidate generation: sub-database test only
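As a rough sketch of the FP-tree idea (tree construction only; the recursive FP-growth mining step is omitted), the Python code below builds the tree after one counting scan. The names FPNode and build_fp_tree and the toy baskets are illustrative assumptions:

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}                      # item -> FPNode

def build_fp_tree(transactions, min_support):
    # First scan: count item frequencies and keep only the frequent items.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_support}

    root = FPNode(None, None)
    header = defaultdict(list)                  # item -> node-links
    # Second scan: insert each transaction with items in descending frequency order.
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            else:
                child.count += 1
            node = child
    return root, header

baskets = [{"milk", "bread"}, {"milk", "bread", "beer"},
           {"bread", "beer"}, {"milk", "beer"}]
tree, header = build_fp_tree(baskets, min_support=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})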
Mining multilevel association rules from transactional databases
Mining various kinds of association rules
Mining Multilevel association rules
 Concepts at different levels
Mining Multidimensional association rules
 Involving more than one dimension
Mining Quantitative association rules
 Numeric attributes
Multi-level Association
• Uniform Support – the same minimum support for all levels
– + One minimum support threshold. No need to examine itemsets containing any
item whose ancestors do not have minimum support.
– – Lower-level items do not occur as frequently. If the support threshold is
• too high → miss low-level associations
• too low → generate too many high-level associations
• Reduced Support – reduced minimum support at lower levels
– There are 4 search strategies:
• Level-by-level independent
• Level-cross filtering by k-itemset
• Level-cross filtering by single item
• Controlled level-cross filtering by single item
Multi-level Association: Redundancy Filtering
• Some rules may be redundant due to “ancestor” relationships between items.
• Example
– milk → wheat bread [support = 8%, confidence = 70%]
– 2% milk → wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the “expected” value, based on the rule’s
ancestor.
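For instance, if (as an assumed figure) 2% milk accounts for roughly one quarter of all milk purchases, the expected support of the second rule would be about 8% × 1/4 = 2%; since its observed support is 2% and its confidence is close to that of its ancestor, the second rule carries no extra information and can be filtered out.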
Mining multidimensional association rules from transactional databases and
data warehouse
Multi-Dimensional Association
• Single-dimensional rules
buys(X, “milk”) → buys(X, “bread”)
• Multi-dimensional rules
– Inter-dimension association rules – no repeated predicates
age(X, “19-25”) ∧ occupation(X, “student”) → buys(X, “coke”)
– Hybrid-dimension association rules – repeated predicates
age(X, “19-25”) ∧ buys(X, “popcorn”) → buys(X, “coke”)
• Categorical attributes
– finite number of possible values, no ordering among values
• Quantitative attributes
– numeric, implicit ordering among values
Techniques for Mining MD Associations
• Search for frequent k-predicate sets:
– Example: {age, occupation, buys} is a 3-predicate set.
– Techniques can be categorized by how quantitative attributes such as age are treated.
1. Using static discretization of quantitative attributes
– Quantitative attributes are statically discretized by using predefined concept
hierarchies.
2. Quantitative association rules
– Quantitative attributes are dynamically discretized into “bins” based on the
distribution of the data.
3. Distance-based association rules
– This is a dynamic discretization process that considers the distance between data
points.
From association mining to correlation analysis
Interestingness Measurements
• Objective measures
– Two popular measures: support and confidence
• Subjective measures
– A rule (pattern) is interesting if
• it is unexpected (surprising to the user); and/or
• it is actionable (the user can do something with it)
Constraint-based association mining
• Interactive, exploratory mining
• Kinds of constraints
– Knowledge type constraint: classification, association, etc.
– Data constraint: SQL-like queries
– Dimension/level constraints
– Rule constraints
– Interestingness constraints
Rule Constraints in Association Mining
• Two kinds of rule constraints:
– Rule form constraints: meta-rule guided mining.
• P(x, y) ^ Q(x, w) → takes(x, “database systems”).
– Rule (content) constraints: constraint-based query optimization (Ng, et al.,
SIGMOD’98).
• sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
• 1-variable vs. 2-variable constraints
– 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown
above.
– 2-var: a constraint confining both sides (L and R).
• sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)
Constraint-Based Association Query
• Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
• A constrained association query (CAQ) is of the form {(S1, S2) | C},
– where C is a set of constraints on S1, S2, including a frequency constraint
• A classification of (single-variable) constraints:
– Class constraint: S ⊆ A, e.g., S ⊆ Item
– Domain constraint:
• S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., S.Price < 100
• v θ S, θ is ∈ or ∉, e.g., snacks ∉ S.Type
• V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}
– e.g., {snacks, sodas} ⊆ S.Type
– Aggregation constraint: agg(S) θ v, where agg ∈ {min, max, sum, count, avg}
and θ ∈ {=, ≠, <, ≤, >, ≥}
• e.g., count(S1.Type) = 1, avg(S2.Price) ≥ 100
Anti-monotone and Monotone Constraints
• A constraint Ca is anti-monotone iff, for any pattern S not satisfying Ca, none of the
super-patterns of S can satisfy Ca
• A constraint Cm is monotone iff, for any pattern S satisfying Cm, every super-pattern of S
also satisfies it
Succinct Constraint
• A subset of items Is is a succinct set if it can be expressed as σp(I) for some selection
predicate p, where σ is the selection operator
• SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I, s.t.
SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus
• A constraint Cs is succinct provided SATCs(I) is a succinct power set
Convertible Constraint
• Suppose all items in patterns are listed in a total order R
• A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint
implies that each suffix of S w.r.t. R also satisfies C
• A constraint C is convertible monotone iff a pattern S satisfying the constraint implies
that each pattern of which S is a suffix w.r.t. R also satisfies C
UNIT: 6
Classification and Prediction
Databases are rich with hidden information that can be used for intelligent decision
making. Classification and prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends. Such analysis can help
provide us with a better understanding of the data at large.
Whereas classification predicts categorical (discrete, unordered) labels, prediction
models continuous valued functions. For example, we can build a classification model to
categorize bank loan applications as either safe or risky, or a prediction model to predict the
expenditures in dollars of potential customers on computer equipment given their income and
occupation. Many classification and prediction methods have been proposed by researchers in
machine learning, pattern recognition, and statistics. Most algorithms are memory resident,
typically assuming a small data size. Recent data mining research has built on such work,
developing scalable classification and prediction techniques capable of handling large disk-resident data.
Preparing the Data for Classification and Prediction
The following preprocessing steps may be applied to the data to help improve the
accuracy, efficiency, and scalability of the classification or prediction process.
Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by
applying smoothing techniques, for example) and the treatment of missing values (e.g., by
replacing a missing value with the most commonly occurring value for that attribute, or with the
most probable value based on statistics). Although most classification algorithms have some
mechanisms for handling noisy or missing data, this step can help reduce confusion during
learning.
Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis
can be used to identify whether any two given attributes are statistically related. For example, a
strong correlation between attributes A1 and A2 would suggest that one of the two could be
removed from further analysis. A database may also contain irrelevant attributes. Attribute
subset selection can be used in these cases to find a reduced set of attributes such that the
resulting probability distribution of the data classes is as close as possible to the original
distribution obtained using all attributes. Hence, relevance analysis, in the form of correlation
analysis and attribute subset selection, can be used to detect attributes that do not contribute to
the classification or prediction task. Including such attributes may otherwise slow down, and
possibly mislead, the learning step.
Data transformation and reduction: The data may be transformed by normalization,
particularly when neural networks or methods involving distance measurements are used in the
learning step. Normalization involves scaling all values for a given attribute so that they fall
within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0. In methods that use distance
measurements, for example, this would prevent attributes with initially large ranges (like, say,
income) from outweighing attributes with initially smaller ranges (such as binary attributes). The
data can also be transformed by generalizing it to higher-level concepts. Concept hierarchies
may be used for this purpose. This is particularly useful for continuous valued attributes. For
example, numeric values for the attribute income can be generalized to discrete ranges, such as
low, medium, and high. Similarly, categorical attributes, like street, can be generalized to higher-level concepts, like city. Because generalization compresses the original training data, fewer
input/output operations may be involved during learning. Data can also be reduced by applying
many other methods, ranging from wavelet transformation and principal components analysis to
discretization techniques, such as binning, histogram analysis, and clustering.
Comparing Classification and Prediction Methods
Classification and prediction methods can be compared and evaluated according to the following
criteria:
Accuracy: The accuracy of a classifier refers to the ability of a given classifier to correctly
predict the class label of new or previously unseen data (i.e., tuples without class label
information). Similarly, the accuracy of a predictor refers to how well a given predictor can
guess the value of the predicted attribute for new or previously unseen data. Accuracy can be
estimated using one or more test sets that are independent of the training set. Estimation
techniques such as cross-validation and bootstrapping can be used for this purpose.
Speed: This refers to the computational costs involved in generating and using the given
classifier or predictor.
Robustness: This is the ability of the classifier or predictor to make correct predictions given
noisy data or data with missing values.
Scalability: This refers to the ability to construct the classifier or predictor efficiently given large
amounts of data.
Interpretability: This refers to the level of understanding and insight that is provided by the
classifier or predictor. Interpretability is subjective and therefore more difficult to assess.
Supervised vs. Unsupervised Learning
Supervised learning (classification)
 Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
 New data is classified based on the training set
Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc. with the aim of establishing the
existence of classes or clusters in the data
Classification by Decision Tree Induction
Decision tree
 A flow-chart-like tree structure
 Internal node denotes a test on an attribute
 Branch represents an outcome of the test
 Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
 Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
 Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
 Test the attribute values of the sample against the decision tree
Attribute Selection Measure
• Information gain (ID3/C4.5)
– All attributes are assumed to be categorical
– Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-valued
– Assume there exist several possible split values for each attribute
– May need other tools, such as clustering, to get the possible split values
– Can be modified for categorical attributes
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n elements of class N
– The amount of information needed to decide if an arbitrary example in S belongs
to P or N is defined as
$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
Information Gain in Decision Tree Induction
• Assume that using attribute A a set S will be partitioned into sets {S1, S2, …, Sv}
– If Si contains pi examples of P and ni examples of N, the entropy, or the expected
information needed to classify objects in all subtrees Si, is
$$E(A) = \sum_{i=1}^{v}\frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$
– The encoding information that would be gained by branching on A is
$$Gain(A) = I(p, n) - E(A)$$
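The two-class formulas above translate directly into a small Python sketch (the function names entropy and info_gain and the example counts are illustrative):

import math

def entropy(p, n):
    # I(p, n) for a node with p examples of class P and n of class N.
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:                               # treat 0 * log2(0) as 0
            frac = count / total
            result -= frac * math.log2(frac)
    return result

def info_gain(p, n, partitions):
    # Gain(A) = I(p, n) - E(A); partitions is a list of (p_i, n_i) pairs,
    # one per value of attribute A.
    total = p + n
    expected = sum((pi + ni) / total * entropy(pi, ni) for pi, ni in partitions)
    return entropy(p, n) - expected

# Example: 9 P / 5 N examples split by a three-valued attribute.
print(info_gain(9, 5, [(2, 3), (4, 0), (3, 2)]))    # approximately 0.246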
Enhancements to basic decision tree induction
• Allow for continuous-valued attributes
– Dynamically define new discrete-valued attributes that partition the continuous
attribute value into a discrete set of intervals
• Handle missing attribute values
– Assign the most common value of the attribute
– Assign a probability to each of the possible values
• Attribute construction
– Create new attributes based on existing ones that are sparsely represented
– This reduces fragmentation, repetition, and replication
Bayesian Classification
• Statistical classifiers
• Based on Bayes’ theorem
• Naïve Bayesian classification
• Class conditional independence
• Bayesian belief networks
Bayes’ Theorem
• Let X be a data tuple and H a hypothesis that X belongs to a specific class C.
• The posterior probability of the hypothesis H given X, P(H|X), follows Bayes’ theorem:
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
Naïve Bayesian Classification
• Naïve assumption: attribute independence
P(x1, …, xk | C) = P(x1 | C) · … · P(xk | C)
• If an attribute is categorical:
P(xi | C) is estimated as the relative frequency of samples having value xi for the i-th attribute in
class C
• If an attribute is continuous:
P(xi | C) is estimated through a Gaussian density function
• Computationally easy in both cases
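A minimal Python sketch of naïve Bayesian classification for categorical attributes, estimating each P(xi|C) as a relative frequency as described above (function names and toy data are assumptions; no smoothing is applied, so unseen values get probability zero):

from collections import Counter, defaultdict

def train_nb(rows, labels):
    # rows: list of attribute-value tuples; labels: class label of each row.
    class_counts = Counter(labels)
    value_counts = defaultdict(lambda: defaultdict(Counter))   # class -> attr index -> value counts
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[c][i][v] += 1
    return class_counts, value_counts

def predict_nb(x, class_counts, value_counts):
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                                         # prior P(C)
        for i, v in enumerate(x):
            score *= value_counts[c][i][v] / cc                # P(xi | C)
        if score > best_score:
            best, best_score = c, score
    return best

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "yes", "yes", "no"]
model = train_nb(rows, labels)
print(predict_nb(("sunny", "mild"), *model))                   # -> yes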
Classification by Back propagation
• A neural network learning algorithm
• Originated in work by psychologists and neurobiologists
• Neural network: a set of connected input/output units in which each connection has a
weight associated with it
Neural Networks
• Advantages
– prediction accuracy is generally high
– robust: works when training examples contain errors
– output may be discrete, real-valued, or a vector of several discrete or real-valued
attributes
– fast evaluation of the learned target function
• Limitations
– long training time
– difficult to understand the learned function (weights)
– not easy to incorporate domain knowledge
Network Pruning and Rule Extraction
• Network pruning
– A fully connected network is hard to articulate
– N input nodes, h hidden nodes, and m output nodes lead to h(m + N) weights
– Pruning: remove some of the links without affecting the classification accuracy of
the network
• Extracting rules from a trained network
– Discretize activation values; replace each individual activation value by the cluster
average, maintaining the network accuracy
– Enumerate the output from the discretized activation values to find rules between
activation values and output
– Find the relationship between the input and activation values
– Combine the above two to obtain rules relating the output to the input
Other Classification Methods
• k-nearest neighbor classifier
• Case-based reasoning
• Genetic algorithms
• Rough set approach
• Fuzzy set approaches
The k-Nearest Neighbor Algorithm
• All instances correspond to points in the n-dimensional space.
• Nearest neighbors are defined in terms of Euclidean distance.
• The target function may be discrete- or real-valued.
• For discrete-valued targets, k-NN returns the most common value among the k training
examples nearest to xq.
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training
examples.
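A compact Python sketch of k-NN for discrete-valued targets, returning the most common label among the k nearest training examples under Euclidean distance (the function name and toy data are assumptions):

import math
from collections import Counter

def knn_predict(xq, training, k=3):
    # training: list of (point, label) pairs; xq: query point as a tuple.
    neighbors = sorted(training, key=lambda pl: math.dist(pl[0], xq))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_predict((1.1, 1.0), training, k=3))    # -> A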
Case-Based Reasoning
• Also uses lazy evaluation and analysis of similar instances
• Difference: instances are not “points in a Euclidean space”
• Example: the water faucet problem in CADET (Sycara et al., 1992)
• Methodology
– Instances are represented by rich symbolic descriptions (e.g., function graphs)
– Multiple retrieved cases may be combined
– Tight coupling between case retrieval, knowledge-based reasoning, and problem
solving
• Research issues
– Indexing based on syntactic similarity measures and, on failure, backtracking and
adapting to additional cases
Genetic Algorithms
• GA: based on an analogy to biological evolution
• Each rule is represented by a string of bits
• An initial population is created consisting of randomly generated rules
– e.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100
• Based on the notion of survival of the fittest, a new population is formed consisting of
the fittest rules and their offspring
• The fitness of a rule is represented by its classification accuracy on a set of training
examples
• Offspring are generated by crossover and mutation
Rough Set Approach
• Rough sets are used to approximately or “roughly” define equivalence classes
• A rough set for a given class C is approximated by two sets: a lower approximation
(certain to be in C) and an upper approximation (cannot be described as not belonging to
C)
• Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but
a discernibility matrix can be used to reduce the computational intensity
Fuzzy Set Approaches
• Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership
(such as in a fuzzy membership graph)
• Attribute values are converted to fuzzy values
– e.g., income is mapped into the discrete categories {low, medium, high}, with
fuzzy values calculated
• For a given new sample, more than one fuzzy value may apply
• Each applicable rule contributes a vote for membership in the categories
• Typically, the truth values for each predicted category are summed
Prediction
• Prediction is similar to classification
– First, construct a model
– Second, use the model to predict unknown values
– The major method for prediction is regression (a small least-squares sketch follows below)
• Linear and multiple regression
• Non-linear regression
• Prediction is different from classification
– Classification predicts a categorical class label
– Prediction models continuous-valued functions
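As a rough sketch of the linear-regression case (closed-form least squares for a single predictor), with an illustrative function name and made-up data:

def fit_line(xs, ys):
    # Return (intercept a, slope b) minimizing squared error for y = a + b*x.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 6.2, 7.9, 10.1]
a, b = fit_line(xs, ys)
print(a, b)          # intercept ~0.24, slope ~1.96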
Predictive Modeling in Databases
• Predictive modeling: predict data values or construct generalized linear models based on
the database data
• One can only predict value ranges or category distributions
• Method outline:
– Minimal generalization
– Attribute relevance analysis
– Generalized linear model construction
– Prediction
• Determine the major factors which influence the prediction
– Data relevance analysis: uncertainty measurement, entropy analysis, expert
judgment, etc.
• Multi-level prediction: drill-down and roll-up analysis
Classification Accuracy: Estimating Error Rates
• Partition: training-and-testing
– use two independent data sets: a training set and a test set
– used for data sets with a large number of samples
• Cross-validation
– divide the data set into k subsamples
– use k-1 subsamples as training data and one subsample as test data (k-fold
cross-validation)
– for data sets of moderate size
• Bootstrapping (leave-one-out)
– for small data sets
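A small Python sketch of k-fold cross-validation as described above: the data set is split into k subsamples, and each subsample serves once as the test set. The names (k_fold_error, classifier_factory) and the trivial majority-class "classifier" are illustrative assumptions:

def k_fold_error(rows, labels, k, classifier_factory):
    # Return the average test-set error rate over k folds.
    n = len(rows)
    fold_size = n // k
    errors = []
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n
        test_X, test_y = rows[start:stop], labels[start:stop]
        train_X, train_y = rows[:start] + rows[stop:], labels[:start] + labels[stop:]
        model = classifier_factory(train_X, train_y)     # returns a predict function
        wrong = sum(model(x) != y for x, y in zip(test_X, test_y))
        errors.append(wrong / len(test_X))
    return sum(errors) / k

def majority_factory(train_X, train_y):
    most_common = max(set(train_y), key=train_y.count)
    return lambda x: most_common

rows = [(i,) for i in range(10)]
labels = ["A"] * 6 + ["B"] * 4
print(k_fold_error(rows, labels, k=5, classifier_factory=majority_factory))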
Boosting and Bagging
• Boosting increases classification accuracy
– Applicable to decision trees or Bayesian classifiers
• Learn a series of classifiers, where each classifier in the series pays more attention to the
examples misclassified by its predecessor
• Boosting requires only linear time and constant space
UNIT: 7
Cluster analysis
The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering. A cluster is a collection of data objects that are similar to one another
within the same cluster and are dissimilar to the objects in other clusters. A cluster of data
objects can be treated collectively as one group and so may be considered as a form of data
compression. Although classification is an effective means for distinguishing groups or classes
of objects, it requires the often costly collection and labeling of a large set of training tuples or
patterns, which the classifier uses to model each group. It is often more desirable to proceed in
the reverse direction: First partition the set of data into groups based on data similarity (e.g.,
using clustering), and then assign labels to the relatively small number of groups. Additional
advantages of such a clustering-based process are that it is adaptable to changes and helps single
out useful features that distinguish different groups.
Cluster analysis has been widely used in numerous applications, including market
research, pattern recognition, data analysis, and image processing. In business, clustering can
help marketers discover distinct groups in their customer bases and characterize customer groups
based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionality, and gain insight into structures inherent in
populations.
Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to house
type, value, and geographic location, as well as the identification of groups of automobile
insurance policy holders with a high average claim cost. It can also be used to help classify
documents on the Web for information discovery.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity. Clustering can also be used for
outlier detection, where outliers (values that are “far away” from any cluster) may be more
interesting than common cases. Applications of outlier detection include the detection of credit
card fraud and the monitoring of criminal activities in electronic commerce. As a data mining
function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of
data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for
further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as
characterization, attribute subset selection, and classification, which would then operate on the
detected clusters and the selected attributes or features.
Data clustering is under vigorous development. Contributing areas of research include
data mining, statistics, machine learning, spatial database technology, biology, and marketing.
Owing to the huge amounts of data collected in databases, cluster analysis has recently become a
highly active topic in data mining research. As a branch of statistics, cluster analysis has been
extensively studied for many years, focusing mainly on distance-based cluster analysis.
The following are typical requirements of clustering in data mining:
Scalability: Many clustering algorithms work well on small data sets containing fewer than
several hundred data objects; however, a large database may contain millions of
objects. Clustering on a sample of a given large data set may lead to biased results. Highly
scalable clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to cluster
interval-based (numerical) data. However, applications may require clustering other types of
data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters
based on Euclidean or Manhattan distance measures. Algorithms based on such distance
measures tend to find spherical clusters with similar size and density. However, a cluster could
be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
Minimal requirements for domain knowledge to determine input parameters: Many
clustering algorithms require users to input certain parameters in cluster analysis (such as the
number of desired clusters). The clustering results can be quite sensitive to input parameters.
Parameters are often difficult to determine, especially for data sets containing high-dimensional
objects. This not only burdens users, but it also makes the quality of clustering difficult to
control.
Ability to deal with noisy data: Most real-world databases contain outliers or missing,
unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead
to clusters of poor quality.
Incremental clustering and insensitivity to the order of input records: Some clustering
algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering
structures and, instead, must determine a new clustering from scratch. Some clustering
algorithms are sensitive to the order of input data. That is, given a set of data objects, such an
algorithm may return dramatically different clusterings depending on the order of presentation of
the input objects. It is important to develop incremental clustering algorithms and algorithms that
are insensitive to the order of input.
High dimensionality: A database or a data warehouse can contain several dimensions or
attributes. Many clustering algorithms are good at handling low-dimensional data, involving only
two to three dimensions. Human eyes are good at judging the quality of clustering for up to three
dimensions. Finding clusters of data objects in high dimensional space is challenging, especially
considering that such data can be sparse and highly skewed.
Constraint-based clustering: Real-world applications may need to perform clustering under
various kinds of constraints. Suppose that your job is to choose the locations for a given number
of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster
households while considering constraints such as the city’s rivers and highway networks, and the
type and number of customers per cluster. A challenging task is to find groups of data with good
clustering behavior that satisfy specified constraints.
Interpretability and usability: Users expect clustering results to be interpretable,
comprehensible, and usable. That is, clustering may need to be tied to specific semantic
interpretations and applications. It is important to study how an application goal may influence
the selection of clustering features and methods.
Types of Data in Cluster Analysis
• Interval-scaled variables
• Binary variables
• Categorical, ordinal, and ratio-scaled variables
• Variables of mixed types
Interval-valued variables
• Standardize data
– Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\big(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\big)$$
where
$$m_f = \frac{1}{n}\big(x_{1f} + x_{2f} + \cdots + x_{nf}\big)$$
– Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
• Using the mean absolute deviation is more robust than using the standard deviation
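A short sketch of the standardization step above, computing the mean absolute deviation and z-scores for one variable f (the values are illustrative):

def standardize(values):
    n = len(values)
    m_f = sum(values) / n                              # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n        # mean absolute deviation
    return [(x - m_f) / s_f for x in values]           # z-scores

print(standardize([2.0, 4.0, 4.0, 6.0]))   # mean 4, deviation 1 -> [-2.0, 0.0, 0.0, 2.0]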
Similarity and Dissimilarity Objects
• Distances are normally used to measure the similarity or dissimilarity between two data
objects
• Some popular ones include the Minkowski distance:
$$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a
positive integer
• If q = 1, d is the Manhattan distance:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
Similarity and Dissimilarity Objects
• If q = 2, d is the Euclidean distance:
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
– Properties
• d(i, j) ≥ 0
• d(i, i) = 0
• d(i, j) = d(j, i)
• d(i, j) ≤ d(i, k) + d(k, j)
• One can also use a weighted distance, the parametric Pearson product-moment correlation,
or other dissimilarity measures.
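A sketch of the Minkowski distance family described above; q = 1 gives the Manhattan distance and q = 2 the Euclidean distance (the function name is an illustrative assumption):

def minkowski(i, j, q=2):
    # i, j: p-dimensional points as sequences of numbers; q: positive integer.
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

p1, p2 = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(p1, p2, q=1))   # Manhattan distance: 7.0
print(minkowski(p1, p2, q=2))   # Euclidean distance: 5.0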
Categorical Variables
• A generalization of the binary variable in that it can take more than two states, e.g., red,
yellow, blue, green
• Method 1: simple matching
– d(i, j) = (p − m) / p, where m is the number of matches and p is the total number of
variables
• Method 2: use a large number of binary variables
– create a new binary variable for each of the M nominal states
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled variables
– replace xif by its rank rif ∈ {1, …, Mf}
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th
variable by zif = (rif − 1) / (Mf − 1)
– compute the dissimilarity using methods for interval-scaled variables
Ratio-Scaled Variables
• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at
exponential scale, such as Ae^(Bt) or Ae^(−Bt)
• Methods:
– treat them like interval-scaled variables
– apply a logarithmic transformation: yif = log(xif)
– treat them as continuous ordinal data and treat their ranks as interval-scaled
Major Clustering Approaches
• Partitioning algorithms
– Construct various partitions and then evaluate them by some criterion
• Hierarchical algorithms
– Create a hierarchical decomposition of the set of data (or objects) using some
criterion
• Density-based methods
– based on connectivity and density functions
• Grid-based methods
– based on a multiple-level granularity structure
• Model-based methods
– A model is hypothesized for each of the clusters, and the idea is to find the best fit
of the data to the given model
Partitioning Algorithms: Basic Concept
• Partitioning method: construct a partition of a database D of n objects into a set of k
clusters
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
– Global optimum: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means: each cluster is represented by the center of the cluster
– k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by
one of the objects in the cluster
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the current partition (the
centroid is the center, i.e., mean point, of the cluster)
– Assign each object to the cluster with the nearest seed point
– Go back to Step 2; stop when no more new assignments occur
• Strengths
– Relatively efficient: O(tkn), where n is the number of objects, k is the number of
clusters, and t is the number of iterations; normally, k, t << n
– Often terminates at a local optimum; the global optimum may be found using
techniques such as deterministic annealing and genetic algorithms
• Weaknesses
– Applicable only when the mean is defined; what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable for discovering clusters with non-convex shapes
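A compact Python sketch following the four k-means steps above (the function name and random initialization are illustrative assumptions; a real implementation would also guard against empty clusters):

import math
import random

def k_means(points, k, max_iter=100):
    # points: list of coordinate tuples; returns (centroids, assignments).
    centroids = random.sample(points, k)              # initial seed points
    assignments = [-1] * len(points)
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest seed point.
        new_assignments = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                           for p in points]
        if new_assignments == assignments:            # stop: no more new assignments
            break
        assignments = new_assignments
        # Recompute centroids as the mean point of each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5), (0.5, 1.2)]
print(k_means(data, k=2))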
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids and iteratively replaces one of the medoids by
one of the non-medoids if it improves the total distance of the resulting clustering
– PAM works effectively for small data sets, but does not scale well to large data
sets
• CLARA (Kaufmann & Rousseeuw, 1990)
• CLARANS (Ng & Han, 1994): randomized sampling
• Focusing + spatial data structure (Ester et al., 1995)
PAM (Partitioning Around Medoids) (1987)
• PAM (Kaufman and Rousseeuw, 1987), built into S-PLUS
• Uses real objects to represent the clusters
– Select k representative objects arbitrarily
– For each pair of a non-selected object h and a selected object i, calculate the total
swapping cost TCih
– For each pair of i and h,
• if TCih < 0, i is replaced by h
• then assign each non-selected object to the most similar representative
object
– Repeat steps 2-3 until there is no change
CLARA (Clustering Large Applications) (1990)
• CLARA (Kaufmann and Rousseeuw, 1990)
– Built into statistical analysis packages, such as S-PLUS
• It draws multiple samples of the data set, applies PAM on each sample, and gives the best
clustering as the output
• Strength: deals with larger data sets than PAM
• Weaknesses:
– Efficiency depends on the sample size
– A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased
CLARANS (“Randomized” CLARA) (1994)
• CLARANS (A Clustering Algorithm based on RANdomized Search) draws a sample of
neighbors dynamically
• The clustering process can be presented as searching a graph where every node is a
potential solution, that is, a set of k medoids
• If a local optimum is found, CLARANS starts with a new randomly selected node in
search of a new local optimum
• It is more efficient and scalable than both PAM and CLARA
• Focusing techniques and spatial access structures may further improve its performance
Hierarchical Methods
More on Hierarchical Clustering Methods
• Major weaknesses of agglomerative clustering methods
– do not scale well: time complexity of at least O(n²), where n is the number of total
objects
– can never undo what was done previously
• Integration of hierarchical with distance-based clustering
– BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
– CURE (1998): selects well-scattered points from the cluster and then shrinks them
towards the center of the cluster by a specified fraction
– CHAMELEON (1999): hierarchical clustering using dynamic modeling