Clustering on Observations

Notes on Cluster Analysis
Overview
• Explanation of Observation Clustering
(versus Variable Clustering)
• Types of Clustering
• Preparing Data for Clustering
• Assessing Clustering Results
• Demonstrations
– Fisher's Iris Data
– Credit Data
STAT4330/8330 PRIESTLEY
Cluster Analysis
A few concepts:
1. Clustering is UNSUPERVISED.
2. Used for segmentation of observations
(versus variable clustering which is used
for variable and redundancy reduction).
3. You will ALWAYS find a mathematical
solution…which may or may not be
meaningful.
4. It may not be generalizable.
Cluster Analysis
Generally, we are trying to partition data
into groups that are as internally similar as
possible and as dissimilar as possible to
other groups.
Cluster Analysis
Types of questions answered by cluster
analysis:
1. Are there characteristics of patients that
define the state of a disease?
2. What sorts of complaints are most common
in a call center?
3. What kinds of cars are people buying? And
can I predict this?
4. Can I use credit attributes to create
profitability segments?
Cluster Analysis
Types of Clustering:
1. Hierarchical
a. Agglomerative
i. Each obs starts in its own cluster
ii. Merge the clusters that are the most similar
iii. Repeat
b. Divisive
i. All obs start in a single cluster
ii. Partition the observations that are the least similar
into a second cluster
iii. Repeat
2. Partitive (k-means)
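The agglomerative steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular procedure: it assumes "most similar" means the smallest Euclidean distance between cluster centroids, and the function names and toy points are invented for the example.

```python
import math
from itertools import combinations

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def agglomerate(points, n_clusters):
    """Merge the two closest clusters (by centroid distance) until n_clusters remain."""
    # Step i: each observation starts in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Step ii: find the pair of clusters whose centroids are closest.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                     centroid(clusters[ij[1]])),
        )
        # Merge cluster b into cluster a (a < b, so deleting b is safe).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        # Step iii: repeat until the requested number of clusters is reached.
    return clusters

# Hypothetical toy data: two tight pairs and one isolated point.
toy = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 1.0)]
result = agglomerate(toy, 3)
```

With three clusters requested, the two tight pairs are merged first and the isolated point stays on its own. Other definitions of similarity between clusters would change the merge order.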
Hierarchical Clustering
[Figure: iterations 1 to 4 of hierarchical clustering, labeled Agglomerative and Divisive.]
Partitive Clustering
[Figure: two panels, Initial State and Final State, showing the observations (X) and the “seeds” moving from their old locations to new locations.]
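The movement of seeds from an initial state to a final state can be sketched as a basic k-means loop. This is a simplified illustration under stated assumptions: seeds are supplied by the caller, distance is Euclidean, and the loop is the textbook assign-then-update scheme, which is not necessarily how any specific software procedure implements it. The function name and toy data are hypothetical.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Assign each point to its nearest seed, then move each seed to its cluster mean."""
    seeds = [tuple(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest seed for every observation.
        labels = [
            min(range(len(seeds)), key=lambda k: math.dist(p, seeds[k]))
            for p in points
        ]
        # Update step: each seed moves to the mean of its assigned observations.
        new_seeds = []
        for k in range(len(seeds)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_seeds.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_seeds.append(seeds[k])  # an empty cluster keeps its old seed
        if new_seeds == seeds:
            break  # final state: seeds no longer move
        seeds = new_seeds
    return labels, seeds

# Hypothetical toy data: two well-separated groups and two rough initial seeds.
obs = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, final_seeds = kmeans(obs, seeds=[(0.0, 0.0), (10.0, 10.0)])
```

Here the seeds start at (0, 0) and (10, 10) and end at the means of the two groups. Starting from different seeds can give a different final state, which is why seed selection matters.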
Cluster Analysis
While there are pros and cons of each technique,
Partitive (k-means) clustering is generally preferred
with large datasets.
However, Partitive clustering also:
1. Requires that you estimate the number of clusters
present in the data (trial and error).
2. Is heavily influenced by selection of the seed – so
outliers need to be “controlled”.
3. Is inappropriate for small datasets – because the
solution becomes sensitive to the order in which
the data is read.
Cluster Analysis
All of the clustering techniques depend
upon measurements of similarity or
“distance” to assess assignment of an
observation to a cluster.
While there are many types of distance
measurements, consider the common
Euclidean Distance measurement…
Euclidean Distance Similarity Metric
$$D_E = \sqrt{\sum_{i=1}^{d} (x_i - w_i)^2}$$

• Pythagorean Theorem: The square of
the hypotenuse is equal to the sum of
the squares of the other two sides.

$$h^2 = \sum_{i=1}^{2} x_i^2$$

[Figure: right triangle from the origin (0, 0) to the point (x1, x2), with legs x1 and x2 and hypotenuse h.]
Euclidean Distance
Consider the impact on distance between
RBAL and Age in their original units.
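A small numeric sketch makes the scale problem concrete. The values below are invented for illustration, and the example assumes RBAL is a balance measured in dollars while Age is measured in years; neither the numbers nor that interpretation come from the course data.

```python
import math

# Hypothetical accounts as (RBAL, Age); the values are invented for illustration.
a = (5000.0, 25.0)
b = (5200.0, 65.0)   # similar balance to a, very different age
c = (9000.0, 26.0)   # similar age to a, very different balance

def euclidean(x, w):
    """Euclidean distance between two points with the same number of coordinates."""
    return math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))

d_ab = euclidean(a, b)
d_ac = euclidean(a, c)

# Share of the squared distance between a and b that comes from RBAL alone.
rbal_share_ab = (a[0] - b[0]) ** 2 / d_ab ** 2
```

In original units, a 40-year age gap contributes only a small fraction of the distance between a and b, so a is judged far closer to b than to c. The variable with the larger numeric range dominates the distance.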
Cluster Analysis
To control the impact of scale, standardization of the variables is
recommended.
The STDIZE procedure (PROC STDIZE) provides a wide range of
standardization options, including:
Mean
Median
Sum
Euclidean Length
Standard Deviation (Z)
Range
MidRange (Range/2)
MaxABS – Max Absolute Value
IQR
MAD
AHUBER – Huber estimate
AWAVE – Wave estimate
L(p) – Minkowski distances
Cluster Analysis
Again, each of these options has pros and
cons. I encourage you to test multiple
options – much like we did with the
imputation options. For the current
exercises, we will use the Range option and
the Median option. Note that the Std (Z)
option is highly sensitive to outliers.
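As a rough sketch of what these standardizations do to a single variable, the Python functions below implement three common definitions. They are assumptions about the transforms, not a reproduction of PROC STDIZE: range standardization is taken as (x - min) / range, median standardization as subtracting the median without rescaling, and Z as (x - mean) / sample standard deviation. The function names and sample values are hypothetical.

```python
import statistics

def range_standardize(values):
    """Shift by the minimum and divide by the range, giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo

def median_center(values):
    """Shift by the median; the scale of the variable is left unchanged."""
    med = statistics.median(values)
    return [v - med for v in values]

def z_standardize(values):
    """Shift by the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical values for one variable.
sample = [2, 4, 6, 10]
by_range = range_standardize(sample)
by_median = median_center(sample)
by_z = z_standardize(sample)
```

Applying each transform to every clustering variable and comparing the resulting clusters is one way to test multiple options.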
Determining Optimal Number of Clusters
We will use three basic metrics to determine
the optimal number of clusters:
1. Cubic Clustering Criterion (CCC)
2. Pseudo-F Statistic (PSF)
3. Pseudo-T² Statistic (PST)
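Of the three metrics, the pseudo-F statistic is the simplest to compute by hand: it is the ratio of the between-cluster mean square to the within-cluster mean square. The sketch below computes it for a given assignment of observations to clusters. The function name and toy data are hypothetical, and it assumes at least two clusters and fewer clusters than observations.

```python
def pseudo_f(points, labels):
    """Pseudo-F: (between SS / (k - 1)) / (within SS / (n - k))."""
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    dims = range(len(points[0]))

    # Total sum of squares around the grand mean.
    grand = [sum(p[d] for p in points) / n for d in dims]
    total_ss = sum((p[d] - grand[d]) ** 2 for p in points for d in dims)

    # Within-cluster sum of squares around each cluster mean.
    within_ss = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = [sum(p[d] for p in members) / len(members) for d in dims]
        within_ss += sum((p[d] - center[d]) ** 2 for p in members for d in dims)

    between_ss = total_ss - within_ss
    return (between_ss / (k - 1)) / (within_ss / (n - k))

# Hypothetical toy data: four points forming two obvious groups.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
f_good = pseudo_f(pts, [0, 0, 1, 1])  # groups split along the wide axis
f_bad = pseudo_f(pts, [0, 1, 0, 1])   # groups split along the narrow axis
```

The assignment that keeps nearby points together yields the much larger pseudo-F. Larger values indicate clusters that are compact relative to their separation.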
FISHER’S IRIS DATA
The data set consists of 50 samples from
each of three species of Iris (Iris setosa, Iris
virginica, and Iris versicolor). Four features
of the flowers were measured from each
sample: the length and the width of the
sepals and the petals. Based on the
combination of these four features, Fisher
developed a discriminant model to
distinguish the species from each other.
Clustering Fisher's Data
Step 1: Standardize the Data
Step 2: PROC CLUSTER (hierarchical)
Step 3: Analyze the CCC, PSF, and PST to
determine the optimal number of clusters
Step 4: Use the information from Step 3 and
run PROC FASTCLUS (k-means)
Credit Data
Step 1: Standardize the Data (use ~ 10 variables)
Step 2: PROC CLUSTER (hierarchical)
Step 3: Analyze the CCC, PSF, and PST to determine the
optimal number of clusters
Step 4: Use the information from Step 3 and run PROC
FASTCLUS (k-means)
Step 5: Determine the profitability by cluster
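Step 5 amounts to summarizing a profit measure within each cluster once the cluster labels are known. A minimal sketch, assuming each account has a cluster label and a numeric profit value; the function name, labels, and profit figures below are invented for illustration and are not from the credit data.

```python
from collections import defaultdict
from statistics import mean

def mean_by_cluster(labels, values):
    """Average of `values` within each cluster label."""
    groups = defaultdict(list)
    for lab, v in zip(labels, values):
        groups[lab].append(v)
    return {lab: mean(vs) for lab, vs in sorted(groups.items())}

# Hypothetical cluster labels and per-account profit.
cluster_labels = [1, 1, 2, 2, 2]
profit = [100.0, 300.0, -50.0, 0.0, 20.0]
profit_by_cluster = mean_by_cluster(cluster_labels, profit)
```

Comparing the per-cluster averages shows whether the segments differ in profitability. Medians or totals could be substituted where the mean is distorted by a few extreme accounts.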