Finding groups in a diagnostic plot
G. Brys
Faculty of Applied Economics, Universiteit Antwerpen (UA), Prinsstraat 13,
B-2000 Antwerp, Belgium, guy.brys@ua.ac.be
Summary. This paper has connections with both principal component analysis and clustering. ROBPCA, a robust alternative to principal component analysis (PCA), is studied extensively in [HRV05]. It is based on projection pursuit and robust covariance estimation. As a by-product, ROBPCA produces a diagnostic plot that displays and classifies the outliers. The aim of this paper is to find clusters of similar outliers in this diagnostic plot. A possible approach is to find the maximal average silhouette width over a range of possible numbers of clusters [KR90]. Although mostly preferable, this approach leads to wrong conclusions for some unbalanced designs. Therefore, we propose a minor adjustment to the stated rule which is mostly able to find the correct number of clusters. Some simulation results are presented.
1 Introduction
The subject of determining the number of clusters has a long history in the literature. It is often argued to be an open and unsolvable problem, mainly dependent on the situation and interpretation. Nevertheless, a large number of possible methods exist; they are discussed and compared in [MC85]. The aim of this paper is to find a suitable method to determine clusters of similar outliers in a diagnostic plot.
Here, we assume the diagnostic plot to be constructed from principal component analysis (PCA), a popular statistical method that tries to explain the covariance structure of data by means of a small number of components. Often, data reduction with PCA is used as a first step in other multivariate techniques such as discriminant analysis or independent component analysis. As classical principal component analysis searches for directions with maximal variance, it is sensitive to outlying values. Therefore, robust methods are required. A possible robustification is to eliminate outliers before applying the classical techniques by means of a rejection rule, like the ones proposed in [BHR05] in the context of robust independent component analysis. Other robust methods give little or zero weight to the outliers, but include them in the analysis. As such, ROBPCA is proposed in [HRV05]. It is a robust PCA which combines ideas of projection pursuit and robust scatter matrix estimation.
The diagnostic plot arises as a direct consequence of PCA and plots the orthogonal distances (Y-axis) versus the score distances (X-axis) as defined in [HRV05]. It distinguishes the regular observations from the outliers, which can be grouped into three categories. First, there are good leverage points, which lie close to the PCA space but far from the regular observations. This corresponds to a large score distance and a small orthogonal distance, i.e. points in the lower right corner of the diagnostic plot. Secondly, the vertical outliers cannot be seen when they are projected on the PCA space (small score distance). Their orthogonal distance, on the other hand, is large, and so they can be found in the upper left corner of the diagnostic plot. Thirdly, observations in the upper right corner of the diagnostic plot have both large score and orthogonal distances to the PCA space. They are called bad leverage points.
A diagnostic plot shows how to distinguish the outliers from the regular observations. Nevertheless, sometimes borderline observations are present in the regular data. Automated procedures may be required to detect and remove outliers objectively, and a strict cutoff value would exclude some regular data too. Therefore, we want to cluster the scatter dots of the diagnostic plot to obtain elliptical separations of the different types of outliers and the regular data. Strong clusters of outliers can then be recognised in advance and, if necessary, removed from the data.
There exists a wide variety of clustering algorithms, but here we will focus on the PAM method (Partitioning Around Medoids) of Kaufman and Rousseeuw [KR90]. We opted to apply this technique to the scaled robust orthogonal and score distances found by the ROBPCA algorithm, where the scaling consists of dividing each distance by its cutoff value. Consequently, the outliers in the presented diagnostic plots correspond to observations with one of the two scaled distances larger than one; a small sketch of this construction is given below. A drawback of PAM is that the number of clusters, which we will denote here by C, needs to be specified in advance.
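As a minimal illustration (not the original implementation), the following Python sketch builds the two-dimensional input for the clustering step. The arrays score_dist and orth_dist and the cutoffs c_sd and c_od stand in for ROBPCA output and are purely hypothetical here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for ROBPCA output: robust score distances (SD), orthogonal
# distances (OD) and their cutoff values; all values here are hypothetical.
score_dist = rng.chisquare(df=2, size=100)
orth_dist = rng.chisquare(df=2, size=100)
c_sd, c_od = 2.7, 2.5

# Scale each distance by its cutoff; observations with either scaled
# distance larger than one are the outliers in the diagnostic plot.
X = np.column_stack([score_dist / c_sd, orth_dist / c_od])
outliers = (X > 1).any(axis=1)
```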
In [KR90] it is proposed to determine the number of clusters by finding the maximal average silhouette width over a range of possible numbers of clusters. In the remainder of the paper, we will refer to this rule as ASW. This approach is argued in [PV02] to be preferable to several other methods. Nevertheless, the same authors confirm that it sometimes fails to find relatively small clusters in the presence of one or more larger clusters. Moreover, they present the mean split silhouette (MSS) as an objective function to overcome the previously stated problem. As the better clustering corresponds with a low MSS, we redefine MSS here as one minus itself.
In Section 2 we propose a minor adjustment to the rule of [KR90] to determine
the groups of outliers in a diagnostic plot. Section 3 compares the three methods
by means of a simulation study, while Section 4 illustrates the methods on a real
example. Finally, Section 5 concludes.
2 Weighted average silhouette width (WAS)
In case of unbalanced designs, ASW may fail because the average silhouette width weights all observations equally. When a small cluster of data has a very strong clustering structure, this may be masked by a relatively large number of data points in a larger cluster with weak structure. Therefore, we propose to give larger weight to data points lying in smaller clusters. Assume $c$ groups and $n$ observations $x_i$ in $p$ dimensions. Let $n_i$ be the number of elements in the cluster to which $x_i$ belongs and let $s_i$ be the silhouette width of $x_i$. Then, the weighted average silhouette width (WAS) is defined as:
$$\mathrm{WAS} \;=\; \frac{\sum_{i=1}^{n}\left(1-\frac{n_i}{n}\right)s_i}{\sum_{i=1}^{n}\left(1-\frac{n_i}{n}\right)}.$$
It is straightforward to see that WAS, like the silhouette widths themselves, takes values between minus one and one. A large WAS value indicates better clustering.
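As a minimal sketch (ours, not the author's implementation), the WAS criterion can be computed in Python using scikit-learn's silhouette_samples for the individual widths $s_i$:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def weighted_average_silhouette(X, labels):
    """Weighted average silhouette width (WAS): each observation is
    weighted by (1 - n_i / n), so smaller clusters count more."""
    s = silhouette_samples(X, labels)   # silhouette widths s_i
    n = len(labels)
    sizes = np.bincount(labels)         # cluster sizes n_i
    w = 1.0 - sizes[labels] / n         # weight (1 - n_i/n) per observation
    return np.sum(w * s) / np.sum(w)
```

For balanced designs all weights are equal, so this reduces to the ordinary average silhouette width.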
Consider for example the diagnostic plot in Figure 1 constructed from simulated
data. The three panels show the clusters found by PAM with c varying from two to
four. It is natural to say that C should be equal to three. Nevertheless, with ASW
and MSS the objective function is maximal for c = 2, while for WAS it is maximal
for c = 3.
Fig. 1. Diagnostic plot with clusters found by PAM for c = 2 (left panel), c = 3
(middle panel) and c = 4 (right panel).
3 Simulation study
We conducted a simulation study to compare the performance of ASW, MSS and
WAS. Similarly to [HRV05], we generated 1000 samples of size n from the contamination model
$$(1 - m\varepsilon)\, N_p(0, \Sigma) \;+\; \varepsilon \sum_{i=1}^{m} N_p(\tilde{\mu}, \tilde{\Sigma}) \qquad (1)$$

or

$$(1 - m\varepsilon)\, t_5(0, \Sigma) \;+\; \varepsilon \sum_{i=1}^{m} t_5(\tilde{\mu}, \tilde{\Sigma}) \qquad (2)$$
for different values of m, n, p, Σ, µ̃ and Σ̃. Here, we suppose ε = 0.05 and the
sum sign is used to represent the addition of observations drawn from different
distributions. The selected values are given in Table 1.
Here, due to the choice of Σ we will retain five principal components for (a) and three for (b) and (c). Next, we will contaminate the directions corresponding with the principal components.
Table 1. Selected values of n, p, Σ and µ̃.

      n    p   Σ                                          µ̃
(a)  50  100   diag(17, 13.5, 8, 3, 1, 0.95, ..., 0.01)   [c1, c1, c1, c1, c1, c2, 0, ..., 0]
(b) 100    4   diag(8, 4, 2, 1)                           [c1, c1, c1, c2]
(c) 500    4   diag(8, 4, 2, 1)                           [c1, c1, c1, c2]
Therefore, we multiply the first five (for (a)) or the first three (for (b) and (c)) diagonal entries of Σ by c3 to obtain Σ̃. Further, we consider four sets of (c1, c2, c3), in particular (0, 20, 1) (set A), (0, 10, 10) (set B), (10, 0, 1) (set C) and (0, 10, 100) (set D). In Table 2 we combine m sets (m = 1, 2, 3) to obtain seven different diagnostic plots which show clusters of vertical outliers (VO), bad leverage points (BL), good leverage points (GL), or combinations of them. Note that the cluster of VO, BL or GL contains no further substructure itself. As such, the correct number of clusters equals m + 1.
Table 2. Combinations of the sets used to obtain seven different diagnostic plots.

combination   sets      outliers
     1        A         VO
     2        B         BL
     3        C         GL
     4        A, B      VO, BL
     5        A, C      VO, GL
     6        B, C      BL, GL
     7        A, D, C   VO, BL, GL
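As an illustration of this setup, the following Python sketch generates one sample from model (1) under design (b) for combination 4 (sets A and B). The function and its defaults are our own reading of the description above, not the original simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_sample(n=100, eps=0.05, sets=((0, 20, 1), (0, 10, 10))):
    """Draw one sample from model (1) under design (b): p = 4 and
    Sigma = diag(8, 4, 2, 1).  Each set (c1, c2, c3) contributes
    eps * n outliers with mean [c1, c1, c1, c2] and covariance
    Sigma with its first three diagonal entries multiplied by c3."""
    sigma = np.array([8.0, 4.0, 2.0, 1.0])   # diagonal of Sigma
    m = len(sets)
    n_out = int(eps * n)                      # outliers per contaminating set
    parts = [rng.normal(0.0, np.sqrt(sigma), size=(n - m * n_out, 4))]
    for c1, c2, c3 in sets:
        mu = np.array([c1, c1, c1, c2])
        sigma_tilde = sigma * np.array([c3, c3, c3, 1.0])
        parts.append(mu + rng.normal(0.0, np.sqrt(sigma_tilde), size=(n_out, 4)))
    return np.vstack(parts)

X = generate_sample()   # combination 4 (sets A and B), so C = m + 1 = 3
```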
Next, we compare the performance of the three methods by measuring the percentage of correct detections of m + 1 clusters. To this end, we ran the PAM algorithm with c varying from two to five. The number of clusters is correctly detected when the objective function (ASW, MSS or WAS) reaches its maximal value for C = m + 1; a sketch of this selection rule is given below. The results are given in Table 3 for the normal distribution (1) and in Table 4 for the t5 distribution (2).
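A minimal sketch of the selection rule, assuming scikit-learn-extra's KMedoids as a stand-in for the original PAM implementation of [KR90]:

```python
from sklearn_extra.cluster import KMedoids  # k-medoids, standing in for PAM

def detect_n_clusters(X, criterion, c_range=range(2, 6)):
    """Cluster X for each candidate c and return the c that
    maximises the given objective function (e.g. ASW or WAS)."""
    scores = {}
    for c in c_range:
        labels = KMedoids(n_clusters=c, random_state=0).fit_predict(X)
        scores[c] = criterion(X, labels)
    return max(scores, key=scores.get)
```

With the weighted_average_silhouette function defined earlier, detect_n_clusters(X, weighted_average_silhouette) implements the WAS rule; substituting sklearn.metrics.silhouette_score gives ASW.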
The MSS method never achieves high performance and is preferable in only a single situation. From the simulations, it appears that the MSS objective function differs only slightly across the different chosen numbers of clusters. Both ASW and WAS are better able to find the correct number of clusters, although the ASW method often fails when different types of outliers are present, especially with the combination of vertical outliers and good leverage points. With the exception of some situations, the WAS method achieves the highest performance of the considered methods.
Table 3. Percentage of correct detections of m + 1 clusters (normal distribution).

    combination    1      2      3      4      5      6      7
(a) ASW          100.0   98.2  100.0   97.7   88.6   91.0   62.5
    MSS           23.6   31.3   27.2   37.1   36.9   20.0   45.9
    WAS          100.0   96.3   99.9   96.7   99.9   91.4   57.3
(b) ASW          100.0   99.8  100.0   98.1    2.1   71.6   52.5
    MSS           30.6   34.1   34.5   30.0   42.8   43.0   43.2
    WAS          100.0   99.2   99.9   99.4   67.6   98.3   58.3
(c) ASW          100.0  100.0  100.0  100.0    0.1   76.2   65.7
    MSS           22.5   35.0   33.5   31.6   38.4   41.9   47.0
    WAS          100.0  100.0  100.0  100.0   63.0  100.0   68.6
Table 4. Percentage of correct detections of m + 1 clusters (t5 distribution).

    combination    1      2      3      4      5      6      7
(a) ASW           98.7   94.5   99.4   77.4   71.5   38.1   29.7
    MSS            7.0   20.3   13.2   25.0   25.2   23.9   36.6
    WAS           97.0   73.8   92.3   76.4   92.0   50.1   35.5
(b) ASW           99.6   96.7   99.7   77.0    1.8   42.6   38.1
    MSS           36.2   34.2   33.5   41.1   43.1   35.9   30.8
    WAS           99.6   93.0   99.0   91.8   28.1   77.4   44.0
(c) ASW          100.0   99.9  100.0   92.0    0.1   28.5   49.4
    MSS           37.6   37.6   34.6   36.8   35.2   33.5   29.1
    WAS          100.0   99.7   99.5   98.5   13.7   85.9   52.7
4 The car data
To illustrate the use of the PAM method on a diagnostic plot, we consider the car data available in S-PLUS. For 111 cars, 11 characteristics were measured, including the length, width and height of the car. Similarly to [HRV05], we retained two principal components in the construction of the diagnostic plot with ROBPCA. Figure 2 shows the diagnostic plot with the clusterings into two, three and four clusters (c = 2, 3, 4). It is clear that there are two different types of outliers, and it appears from Table 5 that all methods found the correct number of clusters (C = 3).

Fig. 2. Diagnostic plot of the car data with clusters found by PAM for c = 2 (left panel), c = 3 (middle panel) and c = 4 (right panel).
5 Conclusions
Finding clusters of similar outliers in a diagnostic plot is sometimes hard due to the
unbalanced design of the different clusters. Indeed, for some situations the technique
of the maximal average silhouette width (ASW) or the mean split silhouette (MSS)
fails. Therefore, we proposed a slight modification of the ASW method by using the
weighted average silhouette width (WAS). In our simulations, it appeared to give
Table 5. Values of ASW, MSS and WAS for the car data.

  c      2      3      4
ASW   0.842  0.862  0.437
MSS   0.412  0.628  0.609
WAS   0.715  0.798  0.456
generally better results, especially when two or more outlying clusters were present.
Note that it is possible to apply WAS in any situation, not only when ROBPCA has been used. Moreover, it leads to the same results as ASW in the case of balanced designs, since all weights are then equal. The diagnostic plot with the implied clusters will be available in the LIBRA toolbox at:
http://www.wis.kuleuven.be/stat/robust.html.
References
[BHR05] Brys, G., Hubert, M., Rousseeuw, P.J. (2005) A robustification of independent component analysis. Journal of Chemometrics, 19, 1–12.
[HRV05] Hubert, M., Rousseeuw, P.J., Vanden Branden, K. (2005) ROBPCA: a new approach to robust principal component analysis. Technometrics, 47(1), 64–79.
[KR90] Kaufman, L., Rousseeuw, P.J. (1990) Finding Groups in Data. John Wiley and Sons, New York.
[MC85] Milligan, G.W., Cooper, M.C. (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.
[PV02] Pollard, K.S., van der Laan, M.J. (2002) New methods for identifying significant clusters in gene expression data. In: ASA Proceedings of the Joint Statistical Meetings, 2714–2719.