Finding groups in a diagnostic plot

G. Brys

Faculty of Applied Economics, Universiteit Antwerpen (UA), Prinsstraat 13, B-2000 Antwerp, Belgium, guy.brys@ua.ac.be

Summary. This paper has connections with both principal component analysis and clustering. ROBPCA, a robust alternative to principal component analysis (PCA), is studied extensively in [HRV05]. It is based on projection pursuit and robust covariance estimation. As a by-product, ROBPCA produces a diagnostic plot that displays and classifies the outliers. The aim of this paper is to find clusters of similar outliers in this diagnostic plot. A possible approach is to find the maximal average silhouette width over a range of possible numbers of clusters [KR90]. Although this approach is usually preferable, it leads to wrong conclusions for some unbalanced designs. We therefore propose a minor adjustment to this rule which is generally able to find the correct number of clusters. Some simulation results are presented.

1 Introduction

The problem of determining the number of clusters has a long history in the literature. It is often argued to be an open and unsolvable problem, depending mainly on the situation and its interpretation. Nevertheless, a large number of possible methods exist; they are discussed and compared in [MC85]. The aim of this paper is to find a suitable method to determine clusters of similar outliers in a diagnostic plot. Here, we assume the diagnostic plot to be constructed from principal component analysis (PCA), a popular statistical method that tries to explain the covariance structure of data by means of a small number of components. Data reduction with PCA is often used as a first step in other multivariate techniques such as discriminant analysis or independent component analysis.

As classical principal component analysis searches for directions with maximal variance, it is sensitive to outlying values. Therefore, robust methods are required. A possible robustification is to eliminate outliers before applying the classical techniques by means of a rejection rule like the ones proposed in [BHR05] in the context of robust independent component analysis. Other robust methods give little or no weight to the outliers, but include them in the analysis. ROBPCA, proposed in [HRV05], is such a method: a robust PCA which combines ideas of projection pursuit and robust scatter matrix estimation.

The diagnostic plot arises as a direct consequence of PCA and plots the orthogonal distances (Y-axis) versus the score distances (X-axis) as defined in [HRV05]. It distinguishes the regular observations from the outliers, which can be grouped into three categories. First, there are good leverage points, which lie close to the PCA space but far from the regular observations. This corresponds to a large score distance and a small orthogonal distance, i.e. points in the lower right corner of the diagnostic plot. Secondly, the vertical outliers cannot be seen when they are projected on the PCA space (small score distance). Their orthogonal distance, on the other hand, is large, and so they can be found in the upper left corner of the diagnostic plot. Thirdly, observations in the upper right corner of the diagnostic plot have both large score and large orthogonal distances to the PCA space. They are called bad leverage points. A diagnostic plot thus shows how to distinguish the outliers from the regular observations. Nevertheless, borderline observations are sometimes present in the regular data.
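To make the two axes of the diagnostic plot concrete, the following sketch computes score and orthogonal distances from a fitted PCA space and labels each observation with its region. It is only an illustration under stated assumptions: classical PCA is used as a simple stand-in for the ROBPCA fit the paper actually uses, the χ²-based score-distance cutoff follows common practice, and the orthogonal-distance cutoff od_cut is left as a user-supplied parameter since its robust estimation is beyond this sketch.

```python
import numpy as np
from scipy.stats import chi2

def diagnostic_distances(X, k):
    """Score and orthogonal distances w.r.t. a k-dimensional PCA space.
    Classical PCA is used here as a stand-in for ROBPCA."""
    mu = X.mean(axis=0)
    Xc = X - mu
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                                   # p x k loading matrix
    eigvals = s[:k] ** 2 / (len(X) - 1)            # variances of the scores
    T = Xc @ P                                     # n x k scores
    sd = np.sqrt((T ** 2 / eigvals).sum(axis=1))   # score distances
    od = np.linalg.norm(Xc - T @ P.T, axis=1)      # orthogonal distances
    return sd, od

def classify(sd, od, k, od_cut):
    """Label each point with its region in the diagnostic plot."""
    sd_cut = np.sqrt(chi2.ppf(0.975, k))           # cutoff for score distances
    labels = np.full(len(sd), "regular", dtype=object)
    labels[(sd > sd_cut) & (od <= od_cut)] = "good leverage"     # lower right
    labels[(sd <= sd_cut) & (od > od_cut)] = "vertical outlier"  # upper left
    labels[(sd > sd_cut) & (od > od_cut)] = "bad leverage"       # upper right
    return labels
```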
Automated procedures may be asked to detect and remove outliers from the regular data objectively. When a strict cutoff value is used, regular data will be excluded too. Therefore, we want to cluster the scatter points of the diagnostic plot to obtain elliptical separations of the different types of outliers and the regular data. It is preferable to recognise strong clusters of outliers in advance and, if necessary, to remove them from the data.

A wide variety of clustering algorithms exists, but here we will focus on the PAM method (Partitioning Around Medoids) of Kaufman and Rousseeuw [KR90]. We opted to apply this technique to the scaled robust orthogonal and score distances found by the ROBPCA algorithm. The scaling is done by dividing the distances by their cutoff values. Consequently, the outliers in the presented diagnostic plots correspond to observations with at least one of the two distances larger than one. A drawback of PAM is that the number of clusters must be specified beforehand; we denote the correct number of clusters by C. In [KR90] it is proposed to determine the number of clusters by finding the maximal average silhouette width over a range of possible numbers of clusters. In the remainder of the paper, we refer to this criterion as ASW. In [PV02] this approach is defended as preferable to several other methods. Nevertheless, the same authors confirm that it sometimes fails to find relatively small clusters in the presence of one or more larger clusters. Moreover, they present the mean split silhouette (MSS) as an objective function to overcome this problem. Since a better clustering corresponds to a low MSS, we redefine MSS here as one minus its original value, so that larger values again indicate a better clustering.

In Section 2 we propose a minor adjustment to the rule of [KR90] to determine the groups of outliers in a diagnostic plot. Section 3 compares the three methods by means of a simulation study, while Section 4 illustrates the methods on a real example. Finally, Section 5 concludes.

2 Weighted average silhouette width (WAS)

In the case of unbalanced designs, ASW may fail because the average silhouette width weights all observations equally. When a small cluster of data has a very strong clustering structure, this may be masked by a relatively large number of data points in a larger cluster with weak structure. Therefore, we propose to give larger weight to data points lying in smaller clusters. Assume c groups and n observations x_i in p dimensions. Let n_i be the number of elements in the cluster to which x_i belongs and let s_i be the silhouette width of x_i. Then, the weighted average silhouette width (WAS) is defined as:

$$\mathrm{WAS} = \frac{\sum_{i=1}^{n} \left(1 - \frac{n_i}{n}\right) s_i}{\sum_{i=1}^{n} \left(1 - \frac{n_i}{n}\right)}.$$

Since WAS is a weighted mean of silhouette widths, it takes values between minus one and one, like the silhouette width itself. A large WAS value indicates a better clustering. Consider for example the diagnostic plot in Figure 1, constructed from simulated data. The three panels show the clusters found by PAM with c varying from two to four. It is natural to say that C should be equal to three. Nevertheless, with ASW and MSS the objective function is maximal for c = 2, while for WAS it is maximal for c = 3.

Fig. 1. Diagnostic plot with clusters found by PAM for c = 2 (left panel), c = 3 (middle panel) and c = 4 (right panel).
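The WAS criterion is straightforward to implement. The sketch below is a minimal Python version, assuming silhouette widths as computed by scikit-learn; the function names are ours, and KMeans is used only as a convenient stand-in for PAM. The input X would be the two-column array of score and orthogonal distances, each divided by its cutoff value, as described above.

```python
import numpy as np
from sklearn.cluster import KMeans             # stand-in for PAM
from sklearn.metrics import silhouette_samples

def weighted_average_silhouette(X, labels):
    """WAS: silhouette widths weighted by (1 - n_i / n)."""
    s = silhouette_samples(X, labels)
    n = len(labels)
    sizes = np.bincount(labels)                # cluster sizes n_i
    w = 1.0 - sizes[labels] / n                # weight of each observation
    return (w * s).sum() / w.sum()

def choose_number_of_clusters(X, c_range=range(2, 6)):
    """Pick the c in c_range that maximises WAS."""
    best_c, best_was = None, -np.inf
    for c in c_range:
        labels = KMeans(n_clusters=c, n_init=10).fit_predict(X)
        was = weighted_average_silhouette(X, labels)
        if was > best_was:
            best_c, best_was = c, was
    return best_c, best_was
```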
3 Simulation study

We conducted a simulation study to compare the performance of ASW, MSS and WAS. As in [HRV05], we generated 1000 samples of size n from the contamination model

$$X \sim (1 - m\varepsilon)\, N_p(0, \Sigma) + \varepsilon \sum_{i=1}^{m} N_p(\tilde{\mu}_i, \tilde{\Sigma}_i) \qquad (1)$$

or

$$X \sim (1 - m\varepsilon)\, t_5(0, \Sigma) + \varepsilon \sum_{i=1}^{m} t_5(\tilde{\mu}_i, \tilde{\Sigma}_i) \qquad (2)$$

for different values of m, n, p, Σ, µ̃ and Σ̃. Here, we take ε = 0.05, and the sum sign represents the addition of observations drawn from the different contaminating distributions. The selected values are given in Table 1. Due to the choice of Σ, we retain five principal components for (a) and three for (b) and (c).

Table 1. Selected values of n, p, Σ and µ̃.

      n    p    Σ                                         µ̃
(a)   50   100  diag(17, 13.5, 8, 3, 1, 0.95, ..., 0.01)  [c1, c1, c1, c1, c1, c2, 0, ..., 0]
(b)   100  4    diag(8, 4, 2, 1)                          [c1, c1, c1, c2]
(c)   500  4    diag(8, 4, 2, 1)                          [c1, c1, c1, c2]

Next, we contaminate the directions corresponding to the principal components: we multiply the first five (for (a)) or the first three (for (b) and (c)) entries of Σ by c3 to obtain Σ̃. Further, we consider four sets of (c1, c2, c3), namely (0, 20, 1) (set A), (0, 10, 10) (set B), (10, 0, 1) (set C) and (0, 10, 100) (set D). In Table 2 we combine m sets (m = 1, 2, 3) to obtain seven different diagnostic plots which show clusters of vertical outliers (VO), bad leverage points (BL) and good leverage points (GL), or combinations of them. Note that a cluster of VO, BL or GL is not in itself clustered further. As such, the correct number of clusters equals m + 1.

Table 2. Combinations of the sets yielding seven different diagnostic plots.

combination  sets     outliers
1            A        VO
2            B        BL
3            C        GL
4            A,B      VO,BL
5            A,C      VO,GL
6            B,C      BL,GL
7            A,D,C    VO,BL,GL
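As a rough illustration of the simulation design, one sample from contamination model (1) could be generated as in the sketch below. The helper contaminated_sample and its argument layout are our own; it only mimics Table 1 under the stated assumptions (diagonal Σ, first k eigenvalues multiplied by c3).

```python
import numpy as np

def contaminated_sample(n, sigma_diag, mu_tildes, c3s, k, eps=0.05, seed=None):
    """One sample from model (1): (1 - m*eps) * N_p(0, Sigma) plus
    eps * N_p(mu_tilde_i, Sigma_tilde_i) for each of the m sets."""
    rng = np.random.default_rng(seed)
    p = len(sigma_diag)
    m = len(mu_tildes)
    n_out = int(round(eps * n))                    # outliers per set
    n_reg = n - m * n_out                          # regular observations
    parts = [rng.multivariate_normal(np.zeros(p), np.diag(sigma_diag), n_reg)]
    for mu_t, c3 in zip(mu_tildes, c3s):
        s_t = np.array(sigma_diag, dtype=float)
        s_t[:k] *= c3                              # contaminate the PC directions
        parts.append(rng.multivariate_normal(mu_t, np.diag(s_t), n_out))
    return np.vstack(parts)
```

For instance, setting (b) with combination 4 (sets A and B) would correspond to contaminated_sample(100, [8, 4, 2, 1], mu_tildes=[[0, 0, 0, 20], [0, 0, 0, 10]], c3s=[1, 10], k=3).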
Next, we compare the performance of the three methods by measuring the percentage of correct detections of m + 1 clusters. To this end, we ran the PAM algorithm with c varying from two to five. The number of clusters is correctly detected when the objective function (ASW, MSS or WAS) reaches its maximal value for C = m + 1. The results are given in Table 3 for the normal distribution (1) and in Table 4 for the t5 distribution (2). The MSS method never achieves high performance and is preferable in only a single situation. From the simulations, it appears that the objective function of MSS differs only slightly across the different chosen numbers of clusters. Both ASW and WAS are better able to find the correct number of clusters, although the ASW method often fails when different types of outliers are present, especially with the combination of vertical outliers and good leverage points. With the exception of a few situations, the WAS method achieves the highest performance of the considered methods.

Table 3. Percentage of correct detections of m + 1 clusters (normal distribution).

combination     1      2      3      4      5      6      7
(a) ASW     100.0   98.2  100.0   97.7   88.6   91.0   62.5
    MSS      23.6   31.3   27.2   37.1   36.9   20.0   45.9
    WAS     100.0   96.3   99.9   96.7   99.9   91.4   57.3
(b) ASW     100.0   99.8  100.0   98.1    2.1   71.6   52.5
    MSS      30.6   34.1   34.5   30.0   42.8   43.0   43.2
    WAS     100.0   99.2   99.9   99.4   67.6   98.3   58.3
(c) ASW     100.0  100.0  100.0  100.0    0.1   76.2   65.7
    MSS      22.5   35.0   33.5   31.6   38.4   41.9   47.0
    WAS     100.0  100.0  100.0  100.0   63.0  100.0   68.6

Table 4. Percentage of correct detections of m + 1 clusters (t5 distribution).

combination     1      2      3      4      5      6      7
(a) ASW      98.7   94.5   99.4   77.4   71.5   38.1   29.7
    MSS       7.0   20.3   13.2   25.0   25.2   23.9   36.6
    WAS      97.0   73.8   92.3   76.4   92.0   50.1   35.5
(b) ASW      99.6   96.7   99.7   77.0    1.8   42.6   38.1
    MSS      36.2   34.2   33.5   41.1   43.1   35.9   30.8
    WAS      99.6   93.0   99.0   91.8   28.1   77.4   44.0
(c) ASW     100.0   99.9  100.0   92.0    0.1   28.5   49.4
    MSS      37.6   37.6   34.6   36.8   35.2   33.5   29.1
    WAS     100.0   99.7   99.5   98.5   13.7   85.9   52.7

4 The car data

To illustrate the use of the PAM method on a diagnostic plot, we consider the car data set available in S-PLUS. For 111 cars, 11 characteristics were measured, including the length, width and height of the car. As in [HRV05], we retained two principal components in the construction of the diagnostic plot with ROBPCA. Figure 2 shows the diagnostic plot with two, three and four clusters drawn (c = 2, 3, 4). It is clear that there are two different types of outliers, and it appears from Table 5 that all methods find the correct number of clusters (C = 3).

Fig. 2. Diagnostic plot of the car data with clusters found by PAM for c = 2 (left panel), c = 3 (middle panel) and c = 4 (right panel).

Table 5. Values of ASW, MSS and WAS for the car data.

c        2      3      4
ASW  0.842  0.862  0.437
MSS  0.412  0.628  0.609
WAS  0.715  0.798  0.456

5 Conclusions

Finding clusters of similar outliers in a diagnostic plot is sometimes hard due to the unbalanced design of the different clusters. Indeed, in some situations the technique of the maximal average silhouette width (ASW) or the mean split silhouette (MSS) fails. Therefore, we proposed a slight modification of the ASW method: the weighted average silhouette width (WAS). In our simulations it generally gave better results, especially when two or more outlying clusters were present. Note that WAS can be applied in any situation, not only when ROBPCA has been used. Moreover, it leads to the same results as ASW in the case of balanced designs. The diagnostic plot with the implied clusters will be available in the LIBRA toolbox, available at http://www.wis.kuleuven.be/stat/robust.html.

References

[BHR05] Brys, G., Hubert, M., Rousseeuw, P.J. (2005) A robustification of independent component analysis. Journal of Chemometrics, 19, 1–12.
[HRV05] Hubert, M., Rousseeuw, P.J., Vanden Branden, K. (2005) ROBPCA: a new approach to robust principal component analysis. Technometrics, 47(1), 64–79.
[KR90] Kaufman, L., Rousseeuw, P.J. (1990) Finding Groups in Data. John Wiley and Sons, New York.
[MC85] Milligan, G.W., Cooper, M.C. (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.
[PV02] Pollard, K.S., van der Laan, M.J. (2002) New methods for identifying significant clusters in gene expression data. In: ASA Proceedings of the Joint Statistical Meetings, 2714–2719.