LISA (Laboratory for Interdisciplinary Statistical Analysis)

LISA Short Course Series
Multivariate Clustering Analysis in R
Yuhyun Song
Nov 03, 2015
LISA: Multivariate Clustering Analysis in R
Nov 3, 2015
Laboratory for Interdisciplinary
Statistical Analysis
LISA helps VT researchers benefit from the use of
Statistics
Collaboration:
Visit our website to request personalized statistical advice and assistance with:
Designing Experiments • Analyzing Data • Interpreting Results
Grant Proposals • Software (R, SAS, JMP, Minitab...)
LISA statistical collaborators aim to explain concepts in ways useful for your research.
Great advice right now: Meet with LISA before collecting your data.
LISA also offers:
Educational Short Courses: Designed to help graduate students apply statistics in their research
Walk-In Consulting: Available Monday-Friday from 1-3 PM in the Old Security Building (OSB); Tuesday, Thursday, and Friday from 10 AM-12 PM in the GLC; and Wednesday from 10 AM-12 PM in Hutcheson, for questions under 30 minutes.
All services are FREE for VT researchers. We assist with research—not class projects or homework.
www.lisa.stat.vt.edu
OUTLINE
1. Data
2. What is multivariate analysis?
3. What is clustering analysis?
4. Clustering algorithms
   - Hierarchical agglomerative clustering algorithm
   - Partitioning clustering algorithms
     - K-means clustering
     - Partitioning Around Medoids (PAM)
5. Cluster Validation
DATA: Twitter data
• Can be downloaded from http://www.rdatamining.com
• Contains 320 tweets by @RDataMining
DATA: Twitter data
• Since text data is not numeric, it needs several data-munging steps before clustering:
- Transforming text
  • Changing letters to lower case
  • Removing punctuation, numbers, and stop words
- Stemming words
- Building a term-document matrix containing word frequencies
We will carry out these steps before applying clustering algorithms to the data matrix.
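A base-R sketch of these transformation steps (the course code uses the tm package; the two tweets and the tiny stop-word list below are hypothetical examples, and stemming is omitted for brevity):

```r
# Base-R sketch of the text-munging steps above; "tweets" is a tiny
# hypothetical corpus and "stops" an illustrative stop-word list.
tweets <- c("Clustering tweets in R!", "R clustering: 2 examples")

clean <- tolower(tweets)                   # change letters to lower case
clean <- gsub("[[:punct:]]", "", clean)    # remove punctuation
clean <- gsub("[[:digit:]]", "", clean)    # remove numbers
stops <- c("in", "the", "a")               # illustrative stop-word list
tokens <- lapply(strsplit(clean, "\\s+"),
                 function(w) w[nzchar(w) & !(w %in% stops)])

# Term-document matrix: rows are words, columns are tweets
vocab <- sort(unique(unlist(tokens)))
tdm <- sapply(tokens, function(w) table(factor(w, levels = vocab)))
rownames(tdm) <- vocab
tdm
```

In the real workflow the same matrix is produced by tm's TermDocumentMatrix() after the corresponding tm_map() transformations.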
Multivariate Data Analysis
• Univariate Data Analysis
– used when one outcome variable is measured for each object.
• Multivariate Data Analysis
– used when more than one outcome variable is measured for each object.
– refers to any statistical technique used to analyze data arising from more than one variable.
– concerned with the study of associations among sets of measurements.
Multivariate Data Analysis

Method                            | Objective                                                  | Exploratory vs. Confirmatory
----------------------------------|------------------------------------------------------------|-----------------------------
Principal Components Analysis     | Dimension reduction                                        | Exploratory
Factor Analysis                   | Understand patterns of intercorrelation                    | Both
Multidimensional Scaling Analysis | Create a spatial representation from object similarities   | Mainly exploratory
Classification Analysis           | Build classification rules for predefined groups           | Both
Clustering Analysis               | Create groupings from object similarities                  | Exploratory
Clustering Analysis
• What is a natural grouping among characters?
• Segmenting characters into groups is subjective.

[Figure: the same set of characters grouped two ways — Villains vs. Heroes, or Males vs. Females]
Clustering Analysis
• Cluster: a collection of data objects
– Objects are similar to one another within the same cluster.
– Objects are dissimilar to the objects in other clusters.
• Cluster analysis
– Finding similarities between data according to the characteristics found in the data, and grouping a set of data objects in such a way that objects in the same group are more similar to one another than to objects in other groups.
– Goal: minimize intracluster distances and maximize intercluster distances.
• Unsupervised learning: no predefined classes
Two Types of Clustering Analysis
• Hierarchical Clustering: Objects are partitioned into nested groups that are organized as a hierarchical tree.
• Partitioning Clustering: Objects are partitioned into non-overlapping groups, and each object belongs to exactly one group.

[Figure: the same points clustered hierarchically (nested groups) and partitionally (flat groups)]
Data Structure
• Data matrix
– an n × p matrix, where n is the number of data objects and p is the number of variables
– most suitable for partitioning methods
• Similarity/dissimilarity (distance) matrix
– an n × n matrix calculated from the data matrix
– most suitable for hierarchical agglomerative methods
Dissimilarity (Distance) Measures
• A distance measure is a numerical measure of how different two objects are; the lower its value, the more similar the objects are.
• Given two data objects X1 and X2, the distance between X1 and X2 is a real number denoted by d(X1, X2).
• Common distance measures between objects i and j measured on p variables:
– Euclidean distance: d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)
– Manhattan distance: d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
– Minkowski distance: d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
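All three measures above are available in base R's dist(); a small sketch with a hypothetical pair of points:

```r
# Two 2-D points: (0, 0) and (3, 4)
x <- rbind(c(0, 0), c(3, 4))

d_euc <- as.numeric(dist(x, method = "euclidean"))         # sqrt(3^2 + 4^2) = 5
d_man <- as.numeric(dist(x, method = "manhattan"))         # |3| + |4| = 7
d_min <- as.numeric(dist(x, method = "minkowski", p = 3))  # (3^3 + 4^3)^(1/3)
c(d_euc, d_man, d_min)
```

Note that Minkowski distance with q = 2 reduces to Euclidean distance, and with q = 1 to Manhattan distance.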
Hierarchical Agglomerative Clustering
• Hierarchical agglomerative clustering produces a sequence of solutions (nested clusters) organized in a hierarchical tree structure.
• It uses a distance matrix for clustering, and the solution is visualized by a dendrogram.
• This method does not require the number of clusters K as an input.
Hierarchical Agglomerative Clustering
• Distance between clusters Ci and Cj:
– Single linkage (min): the smallest distance between an object in one cluster and an object in the other, i.e., d(Ci, Cj) = min over p, q of d(Xip, Xjq)
– Complete linkage (max): the largest distance between an object in one cluster and an object in the other, i.e., d(Ci, Cj) = max over p, q of d(Xip, Xjq)
– Average linkage: the average distance between an object in one cluster and an object in the other, i.e., d(Ci, Cj) = avg over p, q of d(Xip, Xjq)
Hierarchical Agglomerative Clustering
• Given a data set of n data objects, the hierarchical agglomerative clustering algorithm is implemented in the following steps:
Step 1. Calculate the distance matrix for the n data objects.
Step 2. Set each object as its own cluster.
Step 3. Repeat until only one cluster remains:
Step 3.1. Merge the two closest clusters.
Step 3.2. Update the distance matrix using the linkage function.
Hierarchical Agglomerative Clustering
Example: Given 5 data objects A, B, C, D, E with distance matrix

       A     B     C     D     E
A      0   0.71  2.69  3.20  6.4
B    0.71    0   2.06  2.6   5.7
C    2.69  2.06    0    1    3.9
D    3.20  2.6     1    0    3.2
E    6.4   5.7   3.9   3.2    0

[Figure: scatter plot of the five objects A-E]
Hierarchical Agglomerative Clustering
Update the distance matrix using the single linkage function. The closest pair, A and B (distance 0.71), is merged first:

d((A,B), C) = min(d(A,C), d(B,C)) = min(2.69, 2.06) = 2.06
d((A,B), D) = min(d(A,D), d(B,D)) = min(3.20, 2.6)  = 2.6
d((A,B), E) = min(d(A,E), d(B,E)) = min(6.4, 5.7)   = 5.7

        (A,B)    C     D     E
(A,B)     0    2.06   2.6   5.7
C       2.06     0     1    3.9
D       2.6      1     0    3.2
E       5.7     3.9   3.2    0
Hierarchical Agglomerative Clustering
Next, C and D (distance 1) are merged. Update the distance matrix using the single linkage function:

d((C,D), (A,B)) = min(d(C,(A,B)), d(D,(A,B))) = min(2.06, 2.6) = 2.06
d((C,D), E)     = min(d(C,E), d(D,E))         = min(3.9, 3.2)  = 3.2

        (A,B)  (C,D)   E
(A,B)     0    2.06   5.7
(C,D)   2.06     0    3.2
E       5.7    3.2     0
Hierarchical Agglomerative Clustering
Dendrogram
1. In the beginning we have 5 clusters.
2. We merge clusters A and B into (A, B) at distance 0.71.
3. We merge clusters C and D into (C, D) at distance 1.
4. We merge clusters (A, B) and (C, D) into ((A, B), (C, D)) at distance 2.06.
5. We merge clusters ((A, B), (C, D)) and E at distance 3.2.
6. The last cluster contains all the objects, which concludes the computation.

[Figure: dendrogram over A, B, C, D, E with merge heights 0.71, 1, 2.06, and 3.2]
Hierarchical Agglomerative Clustering
• How do we decide the number of clusters? Cut the tree: cutting the dendrogram just below height 3.2 gives K = 2, just below 2.06 gives K = 3, and just below 1 gives K = 4.

[Figure: the dendrogram with horizontal cut lines marking K = 2, 3, and 4]
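The worked A-E example can be reproduced in base R with hclust() and cutree(), entering the distance matrix directly:

```r
# Pairwise distances for objects A-E (from the worked example)
D <- matrix(c(0,    0.71, 2.69, 3.20, 6.4,
              0.71, 0,    2.06, 2.6,  5.7,
              2.69, 2.06, 0,    1,    3.9,
              3.20, 2.6,  1,    0,    3.2,
              6.4,  5.7,  3.9,  3.2,  0),
            nrow = 5, dimnames = list(LETTERS[1:5], LETTERS[1:5]))

hc <- hclust(as.dist(D), method = "single")  # single linkage
hc$height                                    # merge heights: 0.71, 1, 2.06, 3.2

cutree(hc, k = 2)  # cut for K = 2: {A, B, C, D} vs. {E}
plot(hc)           # draws the dendrogram
```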
R: Hierarchical Agglomerative Clustering
• Let's build a data matrix of word frequencies that counts the number of times each word occurs in each tweet (document) in R. Then we will cluster the words in the tweets with a hierarchical agglomerative clustering algorithm.
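A minimal sketch of that workflow, using a small hypothetical word-frequency matrix in place of the real tweet term-document matrix:

```r
# Hypothetical term-document matrix: rows are words, columns are tweets
m <- matrix(c(1, 0, 2,
              1, 0, 2,
              0, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("data", "mining", "r"), paste0("tweet", 1:3)))

hc <- hclust(dist(m), method = "single")  # cluster words by usage profiles
plot(hc)                                  # dendrogram of the words
cutree(hc, k = 2)                         # "data" and "mining" group together
```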
Partitioning Algorithm
• Partitioning method: construct a partition of n data objects into a set of K clusters. Given a pre-determined K, find the partition into K clusters that optimizes the chosen partitioning criterion.
– k-means (MacQueen, 1967): each cluster is represented by the center of the cluster.
– PAM, Partitioning Around Medoids (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the data objects in the cluster.
K-means clustering
• Given a set of observations, K-means clustering aims to partition the n observations into K clusters S = {S_1, ..., S_K} by minimizing the within-cluster sum of squares (WCSS):

arg min over S of sum_{i=1}^{K} sum_{x in S_i} ||x - mu_i||^2

where mu_i is the mean of the points in S_i.
• Each cluster is associated with a centroid.
– Each point is assigned to the cluster with the closest centroid.
– The initial K centroids are chosen randomly.
– The centroid is the mean of the points in the cluster.
• The number of clusters, K, must be specified.
K-means clustering
• Given K, the K-means algorithm is implemented in four steps:
Step 1. Partition the objects into K nonempty subsets.
Step 2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
Step 3. Assign each object to the cluster with the nearest seed point.
Step 4. Go back to Step 2; stop when no assignments change.
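The four steps are implemented by base R's kmeans(); a sketch on hypothetical 2-D data, where set.seed() fixes the random initial partition:

```r
set.seed(1)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),  # 10 points near (0, 0)
             matrix(rnorm(20, mean = 5), ncol = 2))  # 10 points near (5, 5)

km <- kmeans(pts, centers = 2, nstart = 10)  # nstart: several random starts
km$centers          # final centroids (cluster means)
table(km$cluster)   # cluster sizes
```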
K-means clustering
• How do we determine the number of clusters in K-means clustering?
– Fit K-means clusterings with different values of K and calculate the WSS for each.
– Draw a scree plot.
– Choose the number of clusters at which there is a sharp drop in WSS.
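A sketch of this scree-plot (elbow) heuristic on hypothetical data with three true groups:

```r
set.seed(2)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))

# Total within-cluster sum of squares (WSS) for K = 1..6
wss <- sapply(1:6, function(k) kmeans(pts, centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters K", ylab = "WSS")
# The WSS drops sharply up to K = 3 and flattens afterwards,
# so K = 3 would be chosen for these data.
```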
Clustering Analysis: K-means clustering
[Animation: "Kmeans animation withoutWatermark" by Incheol, licensed under CC BY-SA 4.0, via https://commons.wikimedia.org/wiki/File:Kmeans_animation_withoutWatermark.gif]
Partitioning Around Medoids (PAM)
• The PAM algorithm partitions the n objects into K clusters by finding the clustering solution that minimizes the overall dissimilarity between the representative of each cluster and its members.
• Each cluster is associated with a medoid.
– Each point is assigned to the cluster with the closest medoid.
– The K medoids are K representative data objects.
Partitioning Around Medoids (PAM)
• In PAM, the swapping cost is used as an objective function:
– For each pair of a medoid m and a non-medoid object h, measure whether h is better than m as a medoid.
– Use the squared-error criterion:

E = sum_{i=1}^{K} sum_{p in C_i} d(p, m_i)^2

– Compute E_h - E_m.
– If it is negative, swapping brings benefit.
• Choose the swap with the minimum swapping cost.
Partitioning Around Medoids (PAM)
• Given K, PAM is implemented in six steps:
Step 1. Randomly pick K data points as initial medoids.
Step 2. Assign each data point to the nearest medoid x.
Step 3. Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (the squared-error criterion).
Step 4. Randomly select a point y.
Step 5. Swap x with y if the swap reduces the objective function.
Step 6. Repeat Steps 3-5 until nothing changes.
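A sketch using pam() from the cluster package (a recommended package that ships with R); the data below are hypothetical:

```r
library(cluster)  # provides pam()

set.seed(3)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))

p <- pam(pts, k = 2)
p$medoids           # the two representative data objects (actual points)
table(p$clustering) # cluster sizes
```

Unlike K-means centroids, the medoids returned here are rows of the original data matrix, which makes PAM usable with any dissimilarity matrix, not just means of numeric data.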
Cluster Validation
• Why is cluster validation necessary?
– Clustering algorithms will define clusters even if there is no natural cluster structure. In higher dimensions, it is not easy to detect whether natural cluster structures exist. Thus, we need approaches to determine whether there is non-random structure in the data and how well the clustering results fit the data.
The Silhouette Coefficient
• The silhouette coefficient is a method of interpreting and validating the consistency within clusters of data. It quantifies the quality of a clustering.
• How to compute the SC for an individual point i:
– Calculate a = the average distance of i to the points in its own cluster.
– Calculate b = the minimum, over the other clusters, of the average distance of i to the points in that cluster.
– The silhouette coefficient for the point is then
s = 1 - a/b if a < b (or s = b/a - 1 if a >= b, which is not the usual case).
• We can calculate the average silhouette coefficient for a cluster or for a whole clustering; the closer to 1, the better.
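A sketch of computing the average silhouette coefficient with silhouette() from the cluster package, for a K-means solution on hypothetical well-separated data:

```r
library(cluster)  # provides silhouette()

set.seed(4)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))

km  <- kmeans(pts, centers = 2, nstart = 10)
sil <- silhouette(km$cluster, dist(pts))
plot(sil)                  # silhouette plot
mean(sil[, "sil_width"])   # average silhouette coefficient
```

For two clearly separated groups like these, the average silhouette width lands well above 0.5, i.e., in the "reasonable" to "strong" range of the interpretation table below.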
The Silhouette Coefficient

Range of avg. SC | Interpretation
-----------------|---------------------------------------------------------------
0.71 - 1.00      | A strong structure has been found
0.51 - 0.70      | A reasonable structure has been found
0.26 - 0.50      | The structure is weak and could be artificial; try additional methods of data analysis
< 0.26           | No substantial structure has been found
R: K-means clustering and PAM
• We will cluster tweets by K-means clustering
and PAM.
• Then, we will visualize the silhouette plot to
see the quality of clustering solutions.
Reference
• RDataMining: http://www.rdatamining.com
• Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Boston: Pearson Addison-Wesley, 2006.
• Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. New York: Springer, 2001.
Please don’t forget to fill in the sign-in sheet and to complete the survey that will be sent to you by email.
Thank you!