Clustering Procedure in SAS
Cheng Lei
Department of Electrical and Computer Engineering, University of Victoria, Canada
rexlei86@uvic.ca

I. Introduction

Clustering groups data into different groups by computing the distance between observations, where the distance measures how close the observations are to one another. In machine learning, clustering is also called unsupervised learning, because the data sets used do not have labeled classes or groups. The measures used to compute the distance range from simple to very complicated ones depending on the specific research purpose, and the only difference among the methods is how the distance between two clusters is computed. The general process of clustering is that each observation or data record begins in a cluster by itself; the two closest clusters are then merged to form a new cluster that replaces the two old ones; and this merging of the two closest clusters is repeated until only one cluster is left.

The CLUSTER procedure in SAS hierarchically clusters the data records or observations in a SAS data set by using one of eleven methods. The data can be coordinates or distances. If the data are coordinates, PROC CLUSTER computes (possibly squared) Euclidean distances. For coordinate data, Euclidean distances are computed from differences between coordinate values, and the use of differences has several important consequences: for the differences to be valid, the variables must have an interval or stronger scale of measurement, so ordinal or ranked data are generally not appropriate for cluster analysis; and for the Euclidean distances to be comparable, equal differences should have equal practical importance, since variables with large variances tend to have more effect on the resulting clusters than variables with small variances.

The eleven clustering methods supported in SAS are: average linkage, the centroid method, complete linkage, density linkage (including Wong's hybrid and kth-nearest-neighbor methods), maximum likelihood for mixtures of spherical multivariate normal distributions with equal variances but possibly unequal mixing proportions, the flexible-beta method, McQuitty's similarity analysis, the median method, single linkage, two-stage density linkage, and Ward's minimum-variance method. All the methods are based on the usual agglomerative hierarchical clustering procedure.

The CLUSTER procedure is not practical for very large data sets because its CPU time is roughly proportional to the square or cube of the number of observations, whereas the FASTCLUS procedure requires time only proportional to the number of observations. Therefore, for a very large data set, the FASTCLUS procedure can first be applied to do a preliminary clustering, and the CLUSTER procedure can then be used to cluster the preliminary clusters hierarchically. With some additional options, PROC CLUSTER can display a history of the clustering process, showing statistics useful for estimating the number of clusters in the population from which the data are sampled.

II. CLUSTER Procedure

The CLUSTER procedure has the following syntax in SAS:

PROC CLUSTER METHOD=name <options>;
BY variables;
COPY variables;
FREQ variable;
ID variable;
RMSSTD variable;
VAR variables;

The bold words are the keywords in SAS, while the italic words are the parameter names supplied to the procedure. In the above statements, only the PROC CLUSTER statement is required, except that the FREQ statement is required when the RMSSTD statement is used; otherwise the FREQ statement is optional.
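To make the syntax above concrete, the following minimal sketch runs PROC CLUSTER on the SASHELP.IRIS data set that ships with SAS (chosen here purely for illustration). The STD option standardizes the variables, the CCC and PSEUDO options request the cubic clustering criterion and pseudo F and t-squared statistics that help judge the number of clusters, and OUTTREE= saves the cluster tree to a data set:

/* Minimal sketch: average linkage on the standardized iris measurements */
proc cluster data=sashelp.iris method=average std ccc pseudo outtree=tree;
   var SepalLength SepalWidth PetalLength PetalWidth;  /* clustering variables */
   copy Species;              /* carried along to the OUTTREE= data set */
run;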
The BY statement is used to obtain separate analyses of observations in groups that are defined by the BY variables.

The variables in the COPY statement are copied from the input data set to the OUTTREE= data set. Observations in the OUTTREE= data set that represent clusters of more than one observation from the input data set have missing values for the COPY variables.

If one variable in the input data set represents the frequency of occurrence of the other values in the observation, specify the variable's name in a FREQ statement. PROC CLUSTER then treats the data set as if each observation appeared n times, where n is the value of the FREQ variable for the observation.

The values of the ID variable identify observations in the displayed cluster history and in the OUTTREE= data set. If the ID statement is omitted, each observation is denoted by OBn, where n is the observation number.

RMSSTD stands for root mean square standard deviation. If the coordinates in the DATA= data set represent cluster means, accurate statistics can be obtained in the cluster history for METHOD=AVERAGE, METHOD=CENTROID, or METHOD=WARD if the data set contains both of the following: a variable giving the number of original observations in each cluster, and a variable giving the root mean square standard deviation of each cluster.

The VAR statement lists the numeric variables to be used in the cluster analysis. If it is omitted, all numeric variables not listed in other statements are used.

III. Clustering Methods

There are eleven clustering methods provided in SAS.

The average linkage method (METHOD = AVERAGE or AVG) uses the average distance between pairs of observations, one in each cluster, as the distance between two clusters. Average linkage tends to join clusters with small variances, and it is slightly biased toward producing clusters with the same variance.

In the centroid method (METHOD = CENTROID), the distance between two clusters is defined as the (squared) Euclidean distance between their centroids or means. This method is more robust to outliers than most other hierarchical methods, but in other respects it might not perform as well as Ward's method or average linkage.

The complete linkage method (METHOD = COMPLETE) uses the maximum distance between an observation in one cluster and an observation in the other cluster as the distance between two clusters. It is strongly biased toward producing clusters with roughly equal diameters, and it can be severely distorted by moderate outliers.

Density linkage (METHOD = DENSITY) refers to a class of clustering methods that use nonparametric probability density estimates. It consists of two steps: first, a new dissimilarity measure based on density estimates and adjacencies is computed; second, a single linkage cluster analysis is performed. There are three types of density linkage: the kth-nearest-neighbor method, the uniform-kernel method, and Wong's hybrid method.

The EML method (METHOD = EML) joins clusters to maximize the likelihood at each level of the hierarchy under the following assumptions: a multivariate normal mixture, equal spherical covariance matrices, and unequal sampling probabilities. This method is similar to Ward's minimum-variance method but removes the bias toward equal-sized clusters.

The distance computed by the flexible-beta method (METHOD = FLEXIBLE) is

D_JM = (D_JK + D_JL)(1 - b)/2 + b D_KL

where b is the value of the BETA= option, or -0.25 by default (see the sketch below).

McQuitty's similarity analysis (METHOD = MCQUITTY) was developed independently by Sokal and Michener and by McQuitty.
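As a brief illustration of how b enters the flexible-beta distance above, the sketch below supplies it through the BETA= option; the data set SCORES and the variables x1-x4 are hypothetical stand-ins for the user's own coordinate data:

/* Sketch: flexible-beta clustering with b = -0.5 instead of the default -0.25 */
proc cluster data=scores method=flexible beta=-0.5 outtree=flex_tree;
   var x1-x4;   /* numeric variables assumed to exist in WORK.SCORES */
run;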
The median method (METHOD = MEDIAN) was developed by Gower in 1967.

The distance between two clusters in single linkage (METHOD = SINGLE) is the minimum distance between an observation in one cluster and an observation in the other cluster.

The option METHOD = TWOSTAGE is a modification of density linkage that ensures that all points are assigned to modal clusters before the modal clusters are permitted to join.

Ward's minimum-variance method (METHOD = WARD) computes the distance between two clusters as the ANOVA sum of squares between the two clusters added up over all the variables.
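Tying Ward's method back to the two-step strategy for very large data sets described in Section I, the following sketch (the data set BIG and its variables x1-x10 are hypothetical) lets PROC FASTCLUS form preliminary clusters, writing their means together with the _FREQ_ and _RMSSTD_ variables to the MEAN= data set; PROC CLUSTER then clusters these preliminary clusters hierarchically with Ward's method, and because _FREQ_ and _RMSSTD_ are present in the input, no FREQ or RMSSTD statements should be needed:

/* Step 1: preliminary clustering of the large data set into at most 100 clusters */
proc fastclus data=big maxclusters=100 mean=prelim noprint;
   var x1-x10;
run;

/* Step 2: hierarchical clustering of the preliminary cluster means with Ward's method */
proc cluster data=prelim method=ward ccc pseudo outtree=tree;
   var x1-x10;   /* _FREQ_ and _RMSSTD_ in PRELIM supply frequencies and RMS standard deviations */
run;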