Clustering & Classification MU

Sugata Sen Roy
Department of Statistics,
University of Calcutta.
What is multivariate analysis?
It is the simultaneous study of m related variables
X = (x1, x2, …, xm)
 gives a complete profile of the unit under study
 allows the inter-relationships between the variables to come into play
Examples
 (Gender, age, nationality) of an individual
 (min temp., max temp., rainfall, humidity) on a day
 (GDP, life expectancy, literacy rate) of a country
 (blood pressure, blood sugar, hemoglobin, cholesterol, creatinine, albumin, …) of an individual
Example data : national track records by country (times for 100m–400m in seconds; 800m–marathon in minutes)

Country      100m   200m   400m   800m   1500m  5000m  10000m  Marathon
Argentina    10.39  20.81  46.84  1.81   3.7    14.04  29.36   137.72
Australia    10.31  20.06  44.84  1.74   3.57   13.28  27.66   128.3
Austria      10.44  20.81  46.82  1.79   3.6    13.26  27.72   135.9
Belgium      10.34  20.68  45.04  1.73   3.6    13.22  27.45   129.95
Bermuda      10.28  20.58  45.91  1.8    3.75   14.68  30.55   146.62
Brazil       10.22  20.43  45.21  1.73   3.66   13.62  28.62   133.13
Burma        10.64  21.52  48.3   1.8    3.85   14.45  30.28   139.95
Canada       10.17  20.22  45.68  1.76   3.63   13.55  28.09   130.15
Chile        10.34  20.8   46.2   1.79   3.71   13.61  29.3    134.03
China        10.51  21.04  47.3   1.81   3.73   13.9   29.13   133.53
Colombia     10.43  21.05  46.1   1.82   3.74   13.49  28.88   131.35
Cook Is.     12.18  23.2   52.94  2.02   4.24   16.7   35.38   164.7
Costa Rica   10.94  21.9   48.66  1.87   3.84   14.03  28.81   136.58
Czech        10.35  20.65  45.64  1.76   3.58   13.42  28.19   134.32
Denmark      10.56  20.52  45.89  1.78   3.61   13.5   28.11   130.78
Dom. Rep.    10.14  20.65  46.8   1.82   3.82   14.91  31.45   154.12
Finland      10.43  20.69  45.49  1.74   3.61   13.27  27.52   130.87
France       10.11  20.38  45.28  1.73   3.57   13.34  27.97   132.3
GDR          10.12  20.33  44.87  1.73   3.56   13.17  27.42   129.92
FRG          10.16  20.37  44.5   1.73   3.53   13.21  27.61   132.23
GB           10.11  20.21  44.93  1.7    3.51   13.01  27.51   129.13
Greece       10.22  20.71  46.56  1.78   3.64   14.59  28.45   134.6
Guatemala    10.98  21.82  48.4   1.89   3.8    14.16  30.11   139.33
Hungary      10.26  20.62  46.02  1.77   3.62   13.49  28.44   132.58
India        10.6   21.42  45.73  1.76   3.73   13.77  28.81   131.98
Multivariate analysis techniques
 can be extensions of the univariate techniques to the
simultaneous study of several variables
 can be techniques which are peculiar to the case of
several variables only
Three broad aspects
• Grouping of Individuals
• Dimension Reduction
• Cause-Effect Relationships
Grouping of Individuals
 Q : Can the individuals be grouped according to similarity ?
• Cluster Analysis
• Discriminant Analysis and Classification
Dimension Reduction
 Q : Is it necessary to look at all the variables, or is it possible
for a smaller set of variables to capture the same information ?
• Principal Component Analysis
• Factor Analysis
• Canonical Correlations
• Multidimensional Scaling
Cause-Effect Relationships
 Q : Do a set of variables have an effect on another set of
variables ? If so, how are they related ?
➢ Multivariate Linear Models
➢ Multivariate Regression
➢ MANOVA
➢ MANCOVA
Cluster Analysis
 given a group of individuals with a set of characteristics how
can we form groups according to their similarity
➢ easy if it is a single or may be two characteristics
➢ difficult for large individual profiles
How to arrange these hats on the shelves?
And how to group these children?
What is clustering?
 An automated statistical method which segregates multidimensional data according to similarity and dissimilarity.
 Leads to
➢ either a tree with closer individuals on nearer branches
➢ or homogeneous groups separated as far apart as possible
Types of Clustering
 Hierarchical Clustering
• here the individuals are clustered in a hierarchically arranged tree, with similar members closer to each other than dissimilar members
➢ Agglomerative
➢ Divisive
Example
 US Health Data – 1986
 States : ME NH VT MA RI CT NY NJ PA
 Variables : no. of deaths due to
(1) accidents
(2) cardiovascular diseases
(3) cancer
(4) pulmonary diseases
(5) pneumonia
(6) diabetes
(7) liver diseases
Hierarchical Clustering
[Dendrogram of the nine states (NJ, MA, ME, PA, NY, CT, VT, RI, NH); height axis from 10 to 70]
Partitioning
 the individuals are grouped such that
➢ the members within a cluster are as similar among
themselves as possible
➢ the members of a cluster are as different from members
of other clusters as possible
 Overlapping Clusters
Partitioning
➢ NH, VT, CT : low cardiovascular, cancer, diabetes, liver; high pulmonary diseases, accidents
➢ NY, PA : high cardiovascular, pneumonia, liver; low pulmonary diseases
➢ ME, MA, RI, NJ : high cancer; intermediate for all others
Distance
X = (x1, x2, …, xm) : the vector of m variables or characteristics
Xi , i = 1, …, n : the observations corresponding to n individuals
 Need to define how far the individuals are from each other, i.e. their Distance from the others.
Distance between observations
 Common distance measures between two individuals, say r and s :
 City-block :
$d_{rs} = \sum_{j=1}^{m} |x_{rj} - x_{sj}|$
 Euclidean :
$d_{rs} = \sqrt{\sum_{j=1}^{m} (x_{rj} - x_{sj})^{2}}$
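As an illustration of the two measures, a minimal Python sketch (not from the slides; the data points are made up):

```python
# City-block and Euclidean distances between two individuals r and s,
# each observed on the same m variables.

def city_block(xr, xs):
    """d_rs = sum_j |x_rj - x_sj|"""
    return sum(abs(a - b) for a, b in zip(xr, xs))

def euclidean(xr, xs):
    """d_rs = sqrt(sum_j (x_rj - x_sj)^2)"""
    return sum((a - b) ** 2 for a, b in zip(xr, xs)) ** 0.5

# Two hypothetical individuals measured on m = 3 characteristics
xr = [1.0, 2.0, 3.0]
xs = [4.0, 6.0, 3.0]
print(city_block(xr, xs))  # 7.0
print(euclidean(xr, xs))   # 5.0
```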
Distance matrix
$D_{n \times n} = \begin{pmatrix} 0 & & & & \\ d_{12} & 0 & & & \\ d_{13} & d_{23} & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d_{1n} & d_{2n} & d_{3n} & \cdots & 0 \end{pmatrix}$
Distance between Clusters
 Suppose one cluster (C1) has n1 members and another cluster (C2) has n2 members.
 What is the distance between the two clusters?
 Single Linkage : $\delta_{12} = \min\{ d_{rs} : r \in C_1, s \in C_2 \}$
 Complete Linkage : $\delta_{12} = \max\{ d_{rs} : r \in C_1, s \in C_2 \}$
More Distances between Clusters
 Average Linkage :
$\delta_{12} = \frac{1}{n_1 n_2} \sum_{r=1}^{n_1} \sum_{s=1}^{n_2} d_{rs}$
 Very often scaled distances are preferred :
Mahalanobis Distance :
$d_{rs} = (X_r - X_s)' S^{-1} (X_r - X_s)$,
where S is the variance-covariance matrix.
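The three linkage measures can be sketched directly from their definitions. A minimal Python illustration (the two clusters are made-up 2-D points, and Euclidean distance is assumed for d_rs):

```python
# Single, complete and average linkage between two clusters C1 and C2,
# each a list of observation vectors; d is the Euclidean distance d_rs.

def d(xr, xs):
    return sum((a - b) ** 2 for a, b in zip(xr, xs)) ** 0.5

def single_linkage(C1, C2):    # delta_12 = min over all cross-cluster pairs
    return min(d(r, s) for r in C1 for s in C2)

def complete_linkage(C1, C2):  # delta_12 = max over all cross-cluster pairs
    return max(d(r, s) for r in C1 for s in C2)

def average_linkage(C1, C2):   # delta_12 = (1 / n1 n2) * sum over all pairs
    return sum(d(r, s) for r in C1 for s in C2) / (len(C1) * len(C2))

C1 = [[0.0, 0.0], [0.0, 1.0]]
C2 = [[3.0, 0.0], [4.0, 0.0]]
print(single_linkage(C1, C2))    # 3.0
print(complete_linkage(C1, C2))
print(average_linkage(C1, C2))
```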
Hierarchical Clustering
 Start with n clusters of one member each.
 Find the closest two members and link them.
 Then find the next two closest members (these can be two individual members, or an individual and the first-formed group).
 Continue till all the members are linked.
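The steps above can be sketched in code. The slides later point to R's hclust()/agnes() for real analyses; as a language-neutral illustration, here is a naive agglomerative pass in Python, assuming single linkage on Euclidean distance (data are illustrative 1-D points):

```python
# Naive agglomerative clustering: start with n singleton clusters,
# repeatedly merge the closest pair (single linkage), record the merges.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerate(points):
    clusters = [[i] for i in range(len(points))]  # n clusters of one member
    merges = []
    while len(clusters) > 1:
        # closest pair of clusters under single linkage
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(dist(points[r], points[s])
                               for r in clusters[ab[0]] for s in clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]   # link the two clusters
        del clusters[j]
    return merges                                 # the merge order = the tree

pts = [[0.0], [0.2], [5.0], [5.1], [10.0]]
for a, b in agglomerate(pts):
    print(a, b)   # closest members merge first, singletons join last
```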
Breakfast Cereal Data

Cereal Brand        Protein (gms)  Carbohydrates (gms)  Fat (gms)  Calories (per oz.)  Vitamin A (%)
Life                6              19                   1          110                 0
Grape Nuts          3              23                   0          100                 25
Super Sugar Crisp   2              26                   0          110                 25
Special K           6              21                   0          110                 25
Rice Krispies       2              25                   0          110                 25
Raisin Bran         3              28                   1          120                 25
Product 19          2              24                   0          110                 100
Wheaties            3              23                   1          110                 25
Total               3              23                   1          110                 100
Puffed Rice         1              13                   0          50                  0
Sugar Corn Pops     1              26                   0          110                 25
Sugar Smacks        2              25                   0          110                 25
Hierarchical Clustering
[Dendrogram of the 12 cereal brands; height axis up to 60]
Partitioning
 Decide on the number of groups.
 Segregate the observations according to their closeness into these groups.
 Criterion : Since total variability T = W + B, where
➢ W = within-cluster variability
➢ B = between-cluster variability
 Segregate into the groups so as to minimize W/B or W/T.
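The decomposition T = W + B behind this criterion can be checked numerically. A minimal Python sketch, using made-up one-variable data and sums of squares as the variability measure:

```python
# For a given partition, the total sum of squares T splits into
# within-cluster W plus between-cluster B; a good partition has small W/B.

def sq(xs, c):
    """Sum of squared deviations of xs about the centre c."""
    return sum((x - c) ** 2 for x in xs)

data = {1: [1.0, 2.0, 3.0], 2: [8.0, 9.0, 10.0]}   # cluster -> observations
allx = [x for xs in data.values() for x in xs]
grand = sum(allx) / len(allx)                       # grand mean

W = sum(sq(xs, sum(xs) / len(xs)) for xs in data.values())          # within
B = sum(len(xs) * (sum(xs) / len(xs) - grand) ** 2
        for xs in data.values())                                    # between
T = sq(allx, grand)                                                 # total

print(W, B, T)   # W + B equals T; here W/B is small, a tight partition
```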
IRIS Data

Obs.  Sepal Length  Sepal Width  Petal Length  Petal Width
1     5.1           3.5          1.4           0.2
2     7.0           3.2          4.7           1.4
3     6.3           3.3          6.0           2.5
4     7.8           2.7          5.1           1.9
5     6.4           3.2          4.5           1.5
6     4.9           3.0          1.4           0.2
7     7.1           3.0          5.9           2.1
8     4.7           3.2          1.3           0.2
9     6.9           3.1          4.9           1.5
10    5.5           2.3          4.0           1.3
11    6.3           2.9          5.6           1.8
12    6.5           3.0          5.8           2.2
13    7.6           3.0          6.6           2.1
14    4.6           3.1          1.5           0.2
15    5.0           3.6          1.4           0.2
16    6.5           2.8          4.6           1.5
17    6.9           2.5          5.5           1.7
18    5.4           3.9          1.7           0.4
19    7.3           2.9          6.3           1.8
20    4.6           3.4          1.4           0.3
21    5.7           2.8          4.5           1.3
Results
➢ Cluster sizes :
8 7 6
➢ Clustering membership
2 3 1 1 3 2 1 2 3 3 1 1 1 2 2 3 1 2 1 2 3
Means

           Sepal Length  Sepal Width  Petal Length  Petal Width  Within Sum of Squares
Cluster 1  6.97          2.91         5.85          2.01         4.75
Cluster 2  4.90          3.38         1.44          0.24         1.24
Cluster 3  6.33          2.90         4.53          1.42         2.99
Partitioning
➢ Cluster 1 (obs. 3, 4, 7, 11, 12, 13, 17, 19) : Iris Virginica
➢ Cluster 2 (obs. 1, 6, 8, 14, 15, 18, 20) : Iris Setosa
➢ Cluster 3 (obs. 2, 5, 9, 10, 16, 21) : Iris Versicolor
Looking for the elbow
[Plot : within-cluster sum of squares against the number of clusters; the bend ("elbow") suggests a suitable number of clusters]
What after the clusters are formed?
 Suppose we have well-formed clusters
➢ identify what characterizes a particular cluster and how different it is from the other clusters
• Discriminant Analysis
➢ how to place a new entrant into one of the groups
• Classification
Example
 A new entrant Ohio (OH), with characteristics
(33.2, 443.1, 198.8, 27.4, 18.0, 18.9, 10.2),
has to be classified into one of the 3 groups.
 Problem :
➢ cardiovascular rate between Gr 1 and Gr 3
➢ accident rate close to Gr 2
➢ cancer rate close to Gr 1
➢ pulmonary rate close to Gr 3
➢ pneumonia rate very low – closest is Gr 3
➢ diabetes rate very high – closest are Gr 2 and Gr 3
➢ liver rate very low – closest is Gr 1
How to classify?
 Let there be two groups, A and B, characterized by probability functions fA(x) and fB(x).
 Let
pA = prior probability of coming from A
and pB = prior probability of coming from B
 Probabilities of misclassification :
➢ an individual from A is classified in B : P(B|A)
➢ an individual from B is classified in A : P(A|B)
Cost of Misclassification
 Cost of misclassification c(B|A) and c(A|B)
 Correct classifications :
P(A|A) pA and
P(B|B) pB
 Misclassifications :
P(B|A) pA and
P(A|B) pB
Minimize Expected Cost of Misclassification
 Expected cost of misclassification :
ECM = c(B|A) P(B|A) pA + c(A|B) P(A|B) pB
 Minimize the ECM to get the classification rule :
 Classify in A if
$\frac{f_A(x)}{f_B(x)} \ge \frac{c(A|B)\, p_B}{c(B|A)\, p_A}$
 Classify in B otherwise.
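A minimal sketch of this minimum-ECM rule in Python. The two univariate normal densities, the costs and the priors below are illustrative assumptions, not from the slides:

```python
# Classify x in A when f_A(x)/f_B(x) >= (c(A|B) p_B) / (c(B|A) p_A).
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def classify(x, pA=0.5, pB=0.5, cAB=1.0, cBA=1.0):
    fA = normal_pdf(x, mu=0.0, sigma=1.0)   # assumed group A density: N(0, 1)
    fB = normal_pdf(x, mu=3.0, sigma=1.0)   # assumed group B density: N(3, 1)
    return "A" if fA / fB >= (cAB * pB) / (cBA * pA) else "B"

# With equal costs and priors the threshold is 1, i.e. pick the larger density
print(classify(0.5))   # 'A'
print(classify(2.9))   # 'B'
```

Raising c(A|B) or pB shifts the threshold so that fewer points are assigned to A, exactly as the ratio rule dictates.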
Special Cases
 For equal misclassification costs :
$\frac{f_A(x)}{f_B(x)} \ge \frac{p_B}{p_A}$
 For equal misclassification costs and equal prior probabilities :
$\frac{f_A(x)}{f_B(x)} \ge 1$
 Apparent Error Rate :
AER = (# misclassified from A to B + # misclassified from B to A) / (total # of observations)
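The AER is just the misclassified fraction of the observations the rule was built on. A one-function Python sketch (the label vectors are made up for illustration):

```python
# Apparent Error Rate: share of observations whose predicted group
# differs from their true group.

def aer(true_labels, predicted_labels):
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)

truth = ["A", "A", "A", "B", "B", "B", "B", "B"]
pred  = ["A", "B", "A", "B", "B", "A", "B", "B"]
print(aer(truth, pred))  # 2 misclassified out of 8 -> 0.25
```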
Normal population
 Suppose for A, fA(x) ~ N(μA, ΣA), and for B, fB(x) ~ N(μB, ΣB).
 Assume ΣA = ΣB = Σ. Then the Linear Discriminant function is :
Allocate x to A if
$(\mu_A - \mu_B)' \Sigma^{-1} x \ge \frac{1}{2} (\mu_A - \mu_B)' \Sigma^{-1} (\mu_A + \mu_B) + \ln \frac{c(A|B)\, p_B}{c(B|A)\, p_A}$
Allocate x to B if
$(\mu_A - \mu_B)' \Sigma^{-1} x < \frac{1}{2} (\mu_A - \mu_B)' \Sigma^{-1} (\mu_A + \mu_B) + \ln \frac{c(A|B)\, p_B}{c(B|A)\, p_A}$
 Let
$C = \frac{1}{2} (\mu_A - \mu_B)' \Sigma^{-1} (\mu_A + \mu_B) + \ln \frac{c(A|B)\, p_B}{c(B|A)\, p_A}$
 Then allocate x to A if $(\mu_A - \mu_B)' \Sigma^{-1} x \ge C$, and allocate to B if $(\mu_A - \mu_B)' \Sigma^{-1} x < C$.
 If ΣA ≠ ΣB, then we get a Quadratic Discriminant function.
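The linear discriminant rule can be sketched with two variables. In the Python illustration below, equal costs and priors are assumed (so the ln-term is 0), the means and Σ are made up, and the 2x2 closed-form inverse stands in for a general matrix inverse:

```python
# Allocate x to A if (muA - muB)' Sigma^{-1} x >= C,
# where C = 0.5 (muA - muB)' Sigma^{-1} (muA + muB).

def inv2(S):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

muA, muB = [0.0, 0.0], [2.0, 2.0]        # illustrative group means
Sigma = [[1.0, 0.0], [0.0, 1.0]]         # illustrative common covariance

w = matvec(inv2(Sigma), [a - b for a, b in zip(muA, muB)])  # Sigma^{-1}(muA - muB)
C = 0.5 * dot(w, [a + b for a, b in zip(muA, muB)])

def allocate(x):
    return "A" if dot(w, x) >= C else "B"

print(allocate([0.2, 0.1]))  # near muA -> 'A'
print(allocate([1.8, 1.9]))  # near muB -> 'B'
```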
R-packages
 Agglomerative Clustering : Functions hclust() from stats
and agnes() from cluster
 Divisive Clustering : diana() from cluster
 Partitioning : kmeans() from stats and pam() from cluster
 Several other specialized packages
References
 Johnson R.A. & Wichern D.W. (2002) : Applied Multivariate Statistical Analysis, 5th ed., PHI Learning, New Delhi.
 Seber G.A.F. (2004) : Multivariate Observations, John Wiley, New York.
 Anderson T.W. (1958) : An Introduction to Multivariate Statistical Analysis, John Wiley, New York.
 Muirhead R.J. (1982) : Aspects of Multivariate Statistical Theory, John Wiley, New York.
 Rencher A.C. & Christensen W.F. (2012) : Methods of Multivariate Analysis, 3rd ed., John Wiley, New York.
 Kendall M.G. (1975) : Multivariate Analysis, Griffin, London.
 Everitt B.S. (1993) : Cluster Analysis, 3rd ed., Edward Arnold, London.
 Hand D.J. (1981) : Discrimination and Classification, John Wiley, New York.