Data Mining: Process and Techniques - UIC

Chapter 5: Clustering
UIC - CS 594
Searching for groups



Clustering is unsupervised or undirected.
Unlike classification, clustering has no preclassified data.
Search for groups or clusters of data
points (records) that are similar to one
another.

Similar points may represent, for example, similar
customers or products that will behave in
similar ways.
Group similar points together

Group points into classes using some
distance measures.


Within-cluster distance and between-cluster
distance
Applications:


As a stand-alone tool to get insight into data
distribution
As a preprocessing step for other algorithms
An Illustration
Examples of Clustering
Applications



Marketing: Help marketers discover distinct
groups in their customer bases, and then use this
knowledge to develop targeted marketing
programs
Insurance: Identifying groups of motor insurance
policy holders with some interesting
characteristics.
City-planning: Identifying groups of houses
according to their house type, value, and
geographical location
Concepts of Clustering


Clusters
Different ways of
representing clusters





Division with boundaries
Spheres
Probabilistic (membership probability of each instance in each cluster):
        1     2     3
  I1    0.5   0.2   0.3
  I2
  …
  In
Dendrograms
…
Clustering

Clustering quality





Inter-cluster distance: maximized
Intra-cluster distance: minimized
The quality of a clustering result depends on both
the similarity measure used by the method and its
application.
The quality of a clustering method is also
measured by its ability to discover some or all of
the hidden patterns
Clustering vs. classification


Which one is more difficult? Why?
There are a huge number of clustering techniques.
Dissimilarity/Distance Measure




Dissimilarity/Similarity metric: Similarity is
expressed in terms of a distance function, which
is typically a metric: d(i, j)
The definitions of distance functions are usually
very different for interval-scaled, boolean,
categorical, ordinal and ratio variables.
Weights should be associated with different
variables based on applications and data
semantics.
It is hard to define “similar enough” or “good
enough”. The answer is typically highly subjective.
Types of data in clustering
analysis

Interval-scaled variables

Binary variables

Nominal, ordinal, and ratio variables

Variables of mixed types
Interval-valued variables


Continuous measurements in a roughly linear
scale, e.g., weight, height, temperature, etc
Standardize data (depending on applications)

Calculate the mean absolute deviation:
$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$
where
$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$.
Calculate the standardized measurement (z-score):
$z_{if} = \frac{x_{if} - m_f}{s_f}$
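The standardization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `standardize` and the sample heights are made up, and the code follows the slide in dividing by the mean absolute deviation rather than the standard deviation.

```python
def standardize(values):
    """Standardize one variable f using the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


heights_cm = [150.0, 160.0, 170.0, 180.0]   # hypothetical measurements
z_scores = standardize(heights_cm)
print(z_scores)
```

For these values the mean is 165 and the mean absolute deviation is 10, so each z-score is the distance from 165 in units of 10.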
Similarity Between Objects


Distance: Measure the similarity or dissimilarity
between two data objects
Some popular ones include the Minkowski
distance:
$d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$
where $(x_{i1}, x_{i2}, \dots, x_{ip})$ and $(x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer

If q = 1, d is the Manhattan distance:
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$
Similarity Between Objects (Cont.)

If q = 2, d is the Euclidean distance:
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$

Properties





d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
Also, one can use weighted distance, and many
other similarity/distance measures.
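As a rough sketch, the Minkowski family can be written as one Python function, with q = 1 giving the Manhattan distance and q = 2 the Euclidean distance. The function name `minkowski` and the two sample points are illustrative assumptions.

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional objects for q >= 1."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


i = [1.0, 2.0]          # hypothetical data objects
j = [4.0, 6.0]
k = [4.0, 2.0]

manhattan = minkowski(i, j, 1)   # |1 - 4| + |2 - 6|
euclidean = minkowski(i, j, 2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

The third point `k` is there only so the triangle inequality d(i,j) ≤ d(i,k) + d(k,j) can be checked numerically.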
Binary Variables

A contingency table for binary data:

                     Object j
                     1       0       sum
  Object i    1      a       b       a + b
              0      c       d       c + d
              sum    a + c   b + d   p

Simple matching coefficient (invariant, if the
binary variable is symmetric):
$d(i, j) = \frac{b + c}{a + b + c + d}$
Jaccard coefficient (noninvariant if the binary
variable is asymmetric):
$d(i, j) = \frac{b + c}{a + b + c}$
Dissimilarity of Binary Variables

Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

gender is a symmetric attribute (not used below)
the remaining attributes are asymmetric attributes
let the values Y and P be set to 1, and the value N
be set to 0
01
d ( jack , mary ) 
UIC - CS 594
 0.33
2 01
11
d ( jack , jim ) 
 0.67
111
1 2
d ( jim , mary ) 
 0.75
11 2
14
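The three values in this example can be reproduced with a short Python sketch. The helper names (`contingency`, `simple_matching`, `jaccard`) are illustrative, and returning 0.0 when a + b + c = 0 is an assumption of this sketch, since the slides do not cover that case.

```python
def contingency(x, y):
    """Return the counts (a, b, c, d) of the 2x2 table for two binary vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Dissimilarity for symmetric binary variables: (b + c) / (a + b + c + d)."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard(x, y):
    """Dissimilarity for asymmetric binary variables: (b + c) / (a + b + c)."""
    a, b, c, _ = contingency(x, y)
    if a + b + c == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (b + c) / (a + b + c)


# Fever, Cough, Test-1 .. Test-4 with Y/P -> 1 and N -> 0 (gender is not used).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(jaccard(jack, mary), 2))
print(round(jaccard(jack, jim), 2))
print(round(jaccard(jim, mary), 2))
```

Because 0/0 matches are ignored, Jack and Mary come out as the most similar pair, even though all three patients share many negative test results.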
Nominal Variables


A generalization of the binary variable in that it
can take more than 2 states, e.g., red, yellow,
blue, green, etc
Method 1: Simple matching

m: # of matches, p: total # of variables
$d(i, j) = \frac{p - m}{p}$

Method 2: use a large number of binary variables

creating a new binary variable for each of the M
nominal states
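Both methods can be sketched briefly in Python. The function names and the sample records are hypothetical; the second helper shows the kind of binary encoding that Method 2 describes, one 0/1 variable per nominal state.

```python
def nominal_dissimilarity(x, y):
    """Method 1: simple matching, d(i, j) = (p - m) / p."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)   # number of matching variables
    return (p - m) / p


def to_binary(value, states):
    """Method 2: one new binary variable for each of the M nominal states."""
    return [1 if value == s else 0 for s in states]


item_i = ["red", "small", "round"]     # hypothetical objects
item_j = ["red", "large", "round"]
d_ij = nominal_dissimilarity(item_i, item_j)

colors = ["red", "yellow", "blue", "green"]
encoded = to_binary("blue", colors)
print(d_ij, encoded)
```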
Ordinal Variables

An ordinal variable can be discrete or continuous

Order is important, e.g., rank

Can be treated like interval-scaled (f is a variable)


replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$
map the range of each variable onto [0, 1] by replacing the
i-th object in the f-th variable by
$z_{if} = \frac{r_{if} - 1}{M_f - 1}$
compute the dissimilarity using methods for interval-scaled variables
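A minimal sketch of the rank mapping, assuming the ordered list of states is supplied by the caller; the function name and the grade labels are invented for illustration.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to z = (r - 1) / (M - 1), where r is the rank in 1..M."""
    m_f = len(ordered_states)
    if m_f < 2:
        raise ValueError("need at least two ordered states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m_f - 1) for v in values]


grades = ["fair", "good", "excellent", "good"]      # hypothetical ordinal data
z = ordinal_to_unit_interval(grades, ["fair", "good", "excellent"])
print(z)
```

The resulting values lie in [0, 1] and can then be compared with any interval-scaled distance, such as the Manhattan or Euclidean distance.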
Ratio-Scaled Variables


Ratio-scaled variable: a measurement on a
nonlinear scale, approximately at exponential
scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$, e.g., growth of a
bacteria population.
Methods:

treat them like interval-scaled variables—not a good idea!
(why?—the scale can be distorted)

apply logarithmic transformation
$y_{if} = \log(x_{if})$

treat them as continuous ordinal data and then treat their
ranks as interval-scaled
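To see why the logarithmic transformation helps, here is a small sketch with made-up parameters A and B: measurements of the form $Ae^{Bt}$ become equally spaced after taking logs, because $\log(Ae^{Bt}) = \log A + Bt$.

```python
import math

A, B = 2.0, 0.5                                   # hypothetical growth parameters
x = [A * math.exp(B * t) for t in range(5)]       # ratio-scaled measurements
y = [math.log(v) for v in x]                      # y = log(x)

# Gaps between consecutive raw values grow, gaps between logs stay constant.
raw_steps = [x[t + 1] - x[t] for t in range(len(x) - 1)]
log_steps = [y[t + 1] - y[t] for t in range(len(y) - 1)]
print(raw_steps)
print(log_steps)
```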
Variables of Mixed Types

A database may contain all six types of variables


symmetric binary, asymmetric binary, nominal,
ordinal, interval and ratio
One may use a weighted formula to combine
their effects
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$



f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled:
compute ranks $r_{if}$ and
$z_{if} = \frac{r_{if} - 1}{M_f - 1}$
and treat $z_{if}$ as interval-scaled
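The weighted combination can be sketched as follows. Several choices here are assumptions rather than something the slides specify: the per-variable weight is taken to be 1 when both values are present and 0 when either is missing (`None`), the "normalized distance" for interval variables is taken to be the absolute difference divided by the variable's range, and ordinal or ratio-scaled variables are assumed to have been converted to [0, 1] beforehand, so they are passed as interval variables with range 1.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Average of per-variable dissimilarities over variables with weight 1."""
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        if x[f] is None or y[f] is None:
            continue                      # weight 0: variable is skipped
        if kind == "nominal":             # also covers binary variables
            d_f = 0.0 if x[f] == y[f] else 1.0
        elif kind == "interval":
            d_f = abs(x[f] - y[f]) / ranges[f]
        else:
            raise ValueError("unknown variable kind: " + kind)
        numerator += d_f
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no comparable variables")
    return numerator / denominator


# Hypothetical records: color (nominal), age (interval), grade already mapped to [0, 1].
kinds = ["nominal", "interval", "interval"]
ranges = [None, 40.0, 1.0]
rec_i = ["red", 20.0, 0.0]
rec_j = ["blue", 30.0, 0.5]
rec_k = ["red", None, 0.5]

d_ij = mixed_dissimilarity(rec_i, rec_j, kinds, ranges)
d_ik = mixed_dissimilarity(rec_i, rec_k, kinds, ranges)
print(d_ij, d_ik)
```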
Major Clustering Techniques




Partitioning algorithms: Construct various
partitions and then evaluate them by some
criterion
Hierarchical algorithms: Create a hierarchical
decomposition of the set of data (or objects)
using some criterion
Density-based: based on connectivity and density
functions
Model-based: A model is hypothesized for each
of the clusters and the idea is to find the best fit
of the data to that model.
Partitioning Algorithms: Basic
Concept


Partitioning method: Construct a partition of a
database D of n objects into a set of k clusters
Given a k, find a partition of k clusters that
optimizes the chosen partitioning criterion

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms


k-means : Each cluster is represented by the center of
the cluster
k-medoids or PAM (Partition around medoids): Each
cluster is represented by one of the objects in the cluster
The K-Means Clustering

Given k, the k-means algorithm is as follows:
1) Choose k cluster centers to coincide with k
randomly-chosen points
2) Assign each data point to the closest cluster center
3) Recompute the cluster centers using the current
cluster memberships.
4) If a convergence criterion is not met, go to 2).
Typical convergence criteria are: no (or minimal)
reassignment of data points to new cluster centers, or
minimal decrease in squared error.
$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$
where p is a point and $m_i$ is the mean of
cluster $C_i$
Example

For simplicity, 1 dimensional data and k=2.
Data: 1, 2, 5, 6, 7

K-means:






Randomly select 5 and 6 as initial centroids;
=> Two clusters {1,2,5} and {6,7}; meanC1=8/3,
meanC2=6.5
=> {1,2}, {5,6,7}; meanC1=1.5, meanC2=6
=> no change.
Aggregate dissimilarity (squared error) = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2
= 2.5
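The k-means steps and the 1-dimensional example above can be traced with a small Python sketch. It is restricted to 1-D data to match the example, takes the initial centers as an argument instead of choosing them randomly, and stops when no point is reassigned; the function name `kmeans_1d` is an illustrative assumption.

```python
def kmeans_1d(points, centers, max_iter=100):
    """k-means on 1-D data; returns (centers, assignment, squared error E)."""
    centers = list(centers)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the closest cluster center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            for p in points
        ]
        # Step 4: stop when no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    error = sum((p - centers[a]) ** 2 for p, a in zip(points, assignment))
    return centers, assignment, error


data = [1, 2, 5, 6, 7]
centers, assignment, error = kmeans_1d(data, centers=[5, 6])
print(centers, assignment, error)
```

Starting from centers 5 and 6, the sketch passes through the clusters {1,2,5} and {6,7} before settling on {1,2} and {5,6,7}, as in the example. A different choice of initial centers may lead to a different local optimum.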
Comments on K-Means



Strength: efficient: O(tkn), where n is # data points, k is
# clusters, and t is # iterations. Normally, k, t << n.
Comment: Often terminates at a local optimum. The
global optimum may be found using techniques such as:
deterministic annealing and genetic algorithms
Weakness





Applicable only when mean is defined, difficult for categorical data
Need to specify k, the number of clusters, in advance
Sensitive to noisy data and outliers
Not suitable to discover clusters with non-convex shapes
Sensitive to initial seeds
Variations of the K-Means Method


A few variants of the k-means which differ in

Selection of the initial k seeds

Dissimilarity measures

Strategies to calculate cluster means
Handling categorical data: k-modes



Replacing means of clusters with modes
Using new dissimilarity measures to deal with
categorical objects
Using a frequency based method to update modes of
clusters
k-Medoids clustering method

The k-means algorithm is sensitive to outliers,
since an object with an extremely large value may
substantially distort the distribution of the data.
Medoid: the most centrally located point in a
cluster, used as a representative point of the cluster.
In contrast, a centroid is not necessarily one of the
objects in the cluster.
[Figure: an example showing the initial medoids]
Partition Around Medoids

PAM:
1. Given k
2. Randomly pick k instances as initial medoids
3. Assign each data point to the nearest medoid x
4. Calculate the objective function: the sum of dissimilarities of all
   points to their nearest medoids (squared-error criterion)
5. Randomly select a point y
6. Swap x by y if the swap reduces the objective function
7. Repeat (3-6) until no change
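A simplified sketch of the swap idea is given below. It deviates from the steps above in two labeled ways: the initial medoids are the first k points rather than a random pick, and every non-medoid point is tried as a swap candidate instead of one random point y. The objective is the plain sum of dissimilarities; the names `total_cost` and `pam` are illustrative.

```python
def total_cost(points, medoids, dist):
    """Objective: sum of dissimilarities of all points to their nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, k, dist):
    """Greedy medoid swapping; returns (medoids, cost)."""
    medoids = list(points[:k])            # assumption: first k points as seeds
    cost = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for idx in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Try replacing medoid idx with the candidate point.
                trial = medoids[:idx] + [candidate] + medoids[idx + 1:]
                trial_cost = total_cost(points, trial, dist)
                if trial_cost < cost:     # keep the swap only if it helps
                    medoids, cost = trial, trial_cost
                    improved = True
    return medoids, cost


data = [1, 2, 5, 6, 7]
medoids, cost = pam(data, k=2, dist=lambda a, b: abs(a - b))
print(sorted(medoids), cost)
```

Unlike the k-means centers 1.5 and 6 for the same data, both medoids are actual data points.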
Comments on PAM


PAM is more robust than k-means in the
presence of noise and outliers because a
medoid is less influenced by outliers or
other extreme values than a mean
(why?)
[Figure: an outlier 100 units away]
PAM works well for small data sets but
does not scale well to large data sets.

O(k(n-k)^2) for each change,
where n is # of data points, k is # of clusters
CLARA: Clustering Large Applications




CLARA: built into statistical analysis packages, such
as S+
It draws multiple samples of the data set, applies
PAM on each sample, and gives the best
clustering as the output
Strength: deals with larger data sets than PAM
Weakness:



Efficiency depends on the sample size
A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
There are other scale-up methods e.g., CLARANS
Hierarchical Clustering

Use distance matrix for clustering. This method
does not require the number of clusters k as an
input, but needs a termination condition
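[Figure: objects a, b, c, d, e. Agglomerative clustering runs from Step 0 to Step 4, merging the objects into ab, de, cde, and finally abcde; divisive clustering runs the same steps in the opposite direction, from Step 0 (abcde) to Step 4 (individual objects).]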
Agglomerative Clustering
At the beginning, each data point forms its own cluster
(also called a node).
Merge the nodes/clusters that have the least
dissimilarity.
Continue merging.
Eventually all nodes belong to the same cluster.
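The merging loop can be sketched in Python as below. The slides do not say how the dissimilarity between two clusters is measured, so this sketch assumes single linkage (the smallest distance between any two members), and it uses a target number of clusters as the termination condition. It is a naive version meant for reading, not for large data.

```python
def agglomerative(points, dist, num_clusters=1):
    """Repeatedly merge the two least dissimilar clusters (single linkage)."""
    clusters = [[p] for p in points]      # each point starts as its own cluster
    merges = []                           # record of (cluster, cluster, distance)
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges


data = [1, 2, 5, 6, 7]
clusters, merges = agglomerative(data, lambda a, b: abs(a - b), num_clusters=2)
print(clusters)
```

The recorded `merges` list holds the information a dendrogram displays: which clusters were joined and at what dissimilarity.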
A Dendrogram Shows How the
Clusters are Merged Hierarchically
Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting
the dendrogram at the desired level; each
connected component then forms a cluster.
Divisive Clustering

Inverse order of agglomerative clustering

Eventually each node forms a cluster on its own
More on Hierarchical Methods

Major weakness of agglomerative clustering
methods



do not scale well: time complexity at least O(n^2), where
n is the total number of objects
can never undo what was done previously
Integration of hierarchical with distance-based
clustering to scale up these clustering methods


BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
CURE (1998): selects well-scattered points from the
cluster and then shrinks them towards the center of the
cluster by a specified fraction
Summary





Cluster analysis groups objects based on their
similarity and has wide applications
Measure of similarity can be computed for various
types of data
Clustering algorithms can be categorized into
partitioning methods, hierarchical methods,
density-based methods, etc
Clustering can also be used for outlier detection,
which is useful for fraud detection
What is the best clustering algorithm?
Other Data Mining Methods
Sequence analysis


Market basket analysis analyzes things that
happen at the same time.
What about things that happen over time?
E.g., if a customer buys a bed, he/she is likely to
come back to buy a mattress later

Sequential analysis needs


A time stamp for each data record
customer identification
Sequence analysis (cont …)
The analysis shows which items come before, after,
or at the same time as other items.
Sequential patterns can be used for analyzing
cause and effect.
Other applications

Finding cycles in association rules




Some association rules hold strongly in certain periods
of time
E.g., every Monday people buy items X and Y together
Stock market prediction
Predicting possible failures in a network, etc
Discovering holes in data



Holes are empty (sparse) regions in the data
space that contain few or no data points. Holes
may represent impossible value combinations in
the application domain.
E.g., in a disease database, we may find that
certain test values and/or symptoms do not go
together, or when a certain medicine is used,
some test values never go beyond a certain range.
Such information could lead to significant
discovery: a cure to a disease or some biological
law.
Data and pattern visualization

Data visualization: Use computer graphics
to reveal the patterns in data:
2-D and 3-D scatter plots, bar charts, pie charts,
line plots, animation, etc.

Pattern visualization: Use good interface
and graphics to present the results of
data mining.
Rule visualizer, cluster visualizer, etc
Scaling up data mining
algorithms

Adapt data mining algorithms to work on
very large databases.



Data reside on hard disk (too large to fit in
main memory)
Make fewer passes over the data
Quadratic algorithms are too expensive

Many data mining algorithms are quadratic,
especially, clustering algorithms.