Clustering Large Databases with Numeric and Nominal Values Using Orthogonal Projections

Boriana L. Milenova (boriana.milenova@oracle.com) and Marcos M. Campos (marcos.m.campos@oracle.com)
Oracle Data Mining Technologies, 10 Van de Graff Drive, Burlington, MA 01803, USA

Abstract

Clustering large high-dimensional databases has emerged as a challenging research area. A number of recently developed clustering algorithms have focused on overcoming either the "curse of dimensionality" or the scalability problems associated with large amounts of data. The majority of these algorithms operate only on numeric data, a few handle nominal data, and very few can deal with both numeric and nominal values. Orthogonal Partitioning Clustering (O-Cluster) was originally introduced as a fast, scalable solution for large multidimensional databases with numeric values. Here, we extend O-Cluster to domains with nominal and mixed values. O-Cluster uses a top-down partitioning strategy based on orthogonal projections to identify areas of high density in the input data space. The algorithm employs an active sampling mechanism and requires at most a single scan through the data. We demonstrate the high quality of the obtained clustering solutions, their explanatory power, and O-Cluster's good scalability.

1. Introduction

Clustering of large high-dimensional databases is an important problem with challenging performance and system resource requirements. There are a number of algorithms that are applicable to very large databases, and a few that address high-dimensional data. Although a large part of the data in data warehouses is nominal, few of these algorithms have addressed clustering nominal data and even fewer handle mixed data. A survey of numeric clustering algorithms for large, high-dimensional databases can be found in [MC02].

A common approach to clustering databases with nominal attributes (columns) is to convert them into numeric attributes and apply a numeric clustering algorithm. This is usually done by "exploding" the nominal attribute into a set of new binary numeric attributes, one for each distinct value in the original attribute. This approach suffers from a number of limitations. First, it increases the dimensionality of a problem that may already be high, which usually decreases the quality of clustering results [HAK00, HK99]. Second, the computational cost incurred may be considerable.

A number of recent algorithms have been developed to cluster nominal data [GGR99, GKR98, GRS99, Hua97a, Hua97b]. These algorithms can also be used to cluster databases with mixed data by transforming numeric attributes into nominal ones through discretization [Hua97b]. However, the discretization process can result in loss of important information available in the continuous values and lead to decreased accuracy. This has been found to be true for supervised models [VM95].

k-Prototypes [Hua97a] is a partition-based algorithm that uses a heterogeneous distance function to compute distance for mixed data. The heterogeneous distance function requires weighting the contribution of the numeric attributes versus that of the nominal ones. Another partitioning algorithm, k-Modes [CGC94, Hua97b], does not have the same limitation. k-Modes avoids the weighting problem by working only with nominal data; numeric attributes need to be discretized.
Both algorithms are susceptible to the curse of dimensionality common to distance-based algorithms in high-dimensional spaces [HAK00].

STIRR [GKR98] is an iterative algorithm based on non-linear dynamical systems. Nominal clustering is treated as graph partitioning. STIRR's running time is linear in the number of rows (records) and nearly linear in the number of attributes. The algorithm needs multiple scans of the data set in order to converge, which limits its usefulness to databases that fit into memory. The number of clusters does not need to be specified.

ROCK [GRS99] is an agglomerative algorithm that uses a similarity metric to find neighboring data points. It then defines links between two data points based on the number of neighbors they share. The algorithm attempts to maximize a goodness measure that favors merging pairs with a large number of common neighbors. ROCK's scaling with the number of rows is quadratic to cubic. In order to handle large databases, ROCK requires sampling the data, which can prevent the detection of smaller clusters that may contain important patterns. The exact form of the goodness measure also needs to be specified, which is not trivial.

CACTUS [GGR99] is an agglomerative algorithm that uses data summarization to achieve linear scaling in the number of rows. It requires only two scans of the data. However, it has exponential scaling in the number of attributes, which limits the algorithm's usefulness for clustering databases with a large number of attributes. In a variety of databases CACTUS was found to be 3 to 10 times faster than STIRR. Additionally, the number of clusters does not need to be specified.

The O-Cluster (Orthogonal Partitioning Clustering) algorithm [MC02] was originally introduced as a fast, scalable clustering solution for large high-dimensional numeric data. The algorithm uses axis-parallel uni-dimensional data projections. More general projections could be used; however, the current implementation aims for simplicity and efficient computation. The axis-parallel projection-based approach was shown to be very effective in high-dimensional numeric data [HK99]. O-Cluster builds upon the orthogonal projection concept introduced by OptiGrid [HK99] and addresses some of the limitations of OptiGrid's approach, such as dealing with a large number of records that do not fit in memory and identifying good partitions without relying on user-defined parameters. The algorithm requires at most a single scan through the data.

The work presented here extends O-Cluster's functionality to domains with nominal and mixed (numeric and nominal) values. Due to its top-down partitioning approach, O-Cluster has very good transparency and provides cluster descriptions in the form of compact rules. Other algorithms suitable for clustering nominal attributes do not share this useful feature.

Section 2 describes the algorithm. Section 3 analyzes the behavior of the algorithm on artificial data and discusses O-Cluster's complexity and scalability. Section 4 presents experiments with real data and comparisons with other algorithms. Section 5 concludes the paper and indicates directions for future work.

2. Orthogonal partitioning clustering

The objective of O-Cluster is to identify areas of high density in the data and separate them into individual clusters. The algorithm looks for splitting points along axis-parallel projections that would produce cleanly separable and preferably balanced clusters. The algorithm operates recursively by creating a binary tree hierarchy.
The number of leaf clusters is determined automatically and does not need to be specified in advance. The topology of the hierarchy, along with its splitting predicates, can be used to gain insights into the clustering solution. The following sections describe the partitioning strategy used with numeric, nominal, and mixed values, outline the active sampling method employed by O-Cluster, and summarize the main processing stages of the algorithm.

2.1 Numeric values

O-Cluster computes uni-dimensional histograms along individual input attributes. For each histogram, O-Cluster attempts to find the 'best' valid cutting plane, if any exist. A valid cutting plane passes through a bin of low density (a valley) in the histogram. Additionally, the bin of low density should have bins of high density (peaks) on each side. O-Cluster attempts to find a pair of peaks with a valley between them where the difference between the peak and valley histogram counts is statistically significant. Statistical significance is tested using a standard χ² test:

$\chi^2 = 2(\text{observed} - \text{expected})^2 / \text{expected} \geq \chi^2_{\alpha,1}$,

where the observed value is equal to the histogram count of the valley and the expected value is the average of the histogram counts of the valley and the lower peak. A 95% confidence level ($\chi^2_{0.05,1} = 3.843$) has been shown to produce reliable results. Since this test can produce multiple splitting points, O-Cluster chooses the one where the valley has the lowest histogram count and thus the cutting plane would go through the bin with lowest density. Alternatively, or in the case of a tie, the algorithm can favor splitting points that would produce balanced partitions.

It is sometimes desirable to prevent the separation of clusters with small peak density. This can be accomplished by introducing a baseline sensitivity level that excludes peaks below this count. It should be noted that with numeric attributes, sensitivity (ρ) is an optional parameter used solely for filtering the splitting point candidates. Sensitivity is a parameter in the [0, 1] range that is inversely proportional to the minimum count required for a histogram peak. A value of 0 corresponds to the global uniform level per attribute. The global uniform level reflects the average histogram count that would have been observed if the data points in the buffer were drawn from a uniform distribution. A value of 0.5 sets the minimum histogram count for a peak to 50% of the global uniform level. A value of 1 removes the restrictions on peak histogram counts, and the splitting point identification relies solely on the χ² test. A default value of 0.5 usually works satisfactorily. Figure 1 illustrates the splitting points identified in a one-dimensional histogram. This example shows the use of a sensitivity level (marked by the dashed line).

Figure 1: Numeric attribute splitting points.
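The split search described above can be condensed into a short sketch. This is a minimal illustration that assumes the equi-width bin counts are already available; the function name, the candidate-valley enumeration, and the way the lower peak is chosen are our own simplifications rather than the authors' implementation.

```python
CHI2_95 = 3.843  # chi-square cutoff at the 95% confidence level (1 df), as quoted above

def best_numeric_split(counts, sensitivity=0.5):
    """Return (valley_bin_index, valley_count) for the best valid cutting plane, or None.

    counts      -- histogram bin counts along one attribute within a partition
    sensitivity -- rho in [0, 1]; peaks below the resulting baseline are ignored
    """
    if len(counts) < 3:
        return None
    uniform = sum(counts) / float(len(counts))     # global uniform level
    min_peak = (1.0 - sensitivity) * uniform       # rho = 1 removes the restriction
    best = None
    for v in range(1, len(counts) - 1):            # candidate valley bins
        lower_peak = min(max(counts[:v]), max(counts[v + 1:]))
        if lower_peak < min_peak or counts[v] >= lower_peak:
            continue
        observed, expected = counts[v], (counts[v] + lower_peak) / 2.0
        if expected == 0:
            continue
        stat = 2.0 * (observed - expected) ** 2 / expected
        if stat >= CHI2_95 and (best is None or observed < best[1]):
            best = (v, observed)                   # prefer the lowest-density valley
    return best

# Example: best_numeric_split([40, 38, 3, 35, 42]) selects bin 2 as the valley.
```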
It is desirable to compute histograms that provide good resolution but also have data artifacts smoothed out. A number of studies have addressed the problem of how many equi-width bins can be supported by a given distribution [Sco79, Wan96]. Based on these studies, a reasonable, simple approach would be to make the number of equi-width bins inversely proportional to the standard deviation of the data along a given dimension and directly proportional to N^(1/3), where N is the number of points inside a partition. Alternatively, one can use a global binning strategy and coarsen the histograms as the number of points inside the partitions decreases. O-Cluster is robust with respect to different binning strategies as long as the histograms do not significantly undersmooth or oversmooth the distribution density. Data sets with a low number of records would require coarser binning and some resolution may potentially be lost. Large data sets have the advantage of supporting the computation of detailed histograms with good resolution.

2.2 Nominal values

Nominal values do not have an intrinsic order associated with them. Therefore it is impossible to apply the notion of histogram peaks and valleys as in the numeric case. The counts of individual values form a histogram, and bins with large counts can be interpreted as regions with high density. The clustering objective is to separate these high-density areas and effectively decrease the entropy of the data. O-Cluster identifies the histogram with highest entropy among the individual projections. For simplicity, we approximate the entropy measure as the number of bins above sensitivity level ρ (as defined in Section 2.1). O-Cluster places the two largest bins into separate partitions, thereby creating a splitting predicate. The remainder of the bins can be assigned randomly to the two resulting partitions. If these bins have low counts, they would not be able to influence O-Cluster's solution after the split. The leaf clusters are described in terms of their histograms and/or modes, and small bins are considered uninformative. If more than two bins have high counts in a histogram, subsequent splits would separate them into individual partitions. To avoid rapid data decimation, O-Cluster creates a binary tree rather than one where large bins fan out into individual branches. The top-down approach used by O-Cluster discovers co-occurrences of values, and each leaf encodes dense cells in a subspace defined by the splits in O-Cluster's hierarchy. Figure 2 depicts a nominal attribute histogram. The two largest bins (colored dark grey) will seed the two new partitions. Again, the sensitivity level is marked by a dashed line.

Figure 2: Nominal attribute partitioning.

When histograms are tied on the largest number of bins above the sensitivity level, O-Cluster favors the histogram where the top two bins have higher counts. Since the splits are binary, the optimal case would have all the partition data points equally distributed between these two top bins. We numerically quantify the suboptimality of the split as the difference between the count of the lower of the two peaks and half of the total number of points in the partition.

2.3 Mixed numeric and nominal values

O-Cluster searches for the 'best' splitting plane for numeric and nominal attributes separately. Then it compares two measures of density: the histogram count of the valley bin in the numeric split and the suboptimality of the nominal split. The algorithm chooses the split with the lower density.
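The nominal split selection and the numeric-versus-nominal comparison can be sketched as follows. The structure follows Sections 2.2 and 2.3, but the exact baseline used for nominal bins (the partition count divided by the number of distinct values) and the function names are assumptions made for illustration.

```python
def best_nominal_split(value_counts, n_points, sensitivity=0.5):
    """value_counts: dict mapping attribute value -> count within the partition.
    Returns ((top_value, runner_up_value), suboptimality), or None when fewer
    than two bins rise above the sensitivity baseline."""
    uniform = n_points / float(max(len(value_counts), 1))   # assumed baseline definition
    baseline = (1.0 - sensitivity) * uniform
    tall = sorted((c, v) for v, c in value_counts.items() if c > baseline)
    if len(tall) < 2:
        return None
    (c2, v2), (c1, v1) = tall[-2], tall[-1]                  # the two largest bins
    # Suboptimality: how far the lower of the two peaks falls short of holding
    # half of the partition's points (lower values indicate a better split).
    return (v1, v2), n_points / 2.0 - c2

def choose_mixed_split(numeric_valley_count, nominal_suboptimality):
    """Section 2.3: pick the split whose separating measure indicates lower density."""
    if numeric_valley_count is None:
        return 'nominal'
    if nominal_suboptimality is None:
        return 'numeric'
    return 'numeric' if numeric_valley_count <= nominal_suboptimality else 'nominal'
```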
2.4 Active sampling

O-Cluster uses an active sampling mechanism to handle databases that do not fit in memory. The algorithm operates on a data buffer of limited size. After processing an initial random sample, O-Cluster identifies data records that are of no further interest. Such records belong to 'frozen' partitions where further splitting is highly unlikely. These records are replaced with examples from 'ambiguous' regions where further information (additional data points) is needed to find good splitting planes and continue partitioning. A partition is considered ambiguous if a valid split can only be found at a lower confidence level. For a numeric attribute, if the difference between the lower peak and the valley is significant at the 90% level ($\chi^2_{0.1,1} = 2.706$), but not at the default 95% level, the partition is considered ambiguous. Analogously, for a nominal attribute, if the counts of at least two bins are above the sensitivity level but not to a significant degree (at the default 95% confidence level), the partition is labeled ambiguous.

Records associated with frozen partitions are marked for deletion from the buffer. They are replaced with records belonging to ambiguous partitions. The histograms of the ambiguous partitions are updated and splitting points are reevaluated.

2.5 The O-Cluster algorithm

The O-Cluster algorithm evaluates possible splitting points for all projections in a partition, selects the 'best' one, and splits the data into two new partitions. The algorithm proceeds by searching for good cutting planes inside the newly created partitions. Thus O-Cluster creates a binary tree structure that tessellates the input space into rectangular regions. Figure 3 provides an outline of O-Cluster's algorithm.

Figure 3: O-Cluster algorithm block diagram.

The main processing stages are:

1. Load buffer: If the entire data set does not fit in the buffer, a random sample is used. O-Cluster assigns all points from the initial buffer to a single active root partition.

2. Compute histograms for active partitions: The goal is to compute histograms along the orthogonal uni-dimensional projections for each active partition. Any partition that represents a leaf in the clustering hierarchy and is not explicitly marked ambiguous or 'frozen' is considered active.

3. Find 'best' splitting points for active partitions: For each histogram, O-Cluster attempts to find the 'best' valid cutting plane, if any exist. The algorithm examines the groups of numeric and nominal attributes separately and selects the best splitting plane.

4. Flag ambiguous and 'frozen' partitions: If no valid splitting points are found in a partition, O-Cluster checks whether the χ² test would have found a valid splitting point at a lower confidence level. If that is the case, the current partition is considered ambiguous; more data points are needed to establish the quality of the splitting point. If no splitting points were found and there is no ambiguity, the partition can be marked as 'frozen' and the records associated with it marked for deletion from the buffer.

5. Split active partitions: If a valid separator exists, the data points are split by the cutting plane, two new active partitions are created from the original partition, and the algorithm proceeds from Step 2.

6. Reload buffer: This step takes place after all recursive partitioning on the current buffer is completed. If all existing partitions are marked as 'frozen' and/or there are no more data points available, the algorithm exits. Otherwise, if some partitions are marked as ambiguous and additional unseen data records exist, O-Cluster proceeds with reloading the data buffer. The new data replace records belonging to 'frozen' partitions. When new records are read in, only data points that fall inside ambiguous partitions are placed in the buffer. New records falling within a 'frozen' partition are not loaded into the buffer and are discarded. If it is desirable to maintain statistics of the data points falling inside partitions (including the 'frozen' partitions), such statistics can be continuously updated with the reading of each new record. Loading of new records continues until either: 1) the buffer is filled again; 2) the end of the data set is reached; or 3) a reasonable number of records (e.g., equal to the buffer size) have been read, even if the buffer is not full and there are more data. The reason for the last condition is that if the buffer is relatively large and there are many points marked for deletion, it may take a long time to entirely fill the buffer with data from the ambiguous regions. To avoid excessive reloading time under these circumstances, the buffer reloading process is terminated after reading through a number of records equal to the data buffer size. Once the buffer reload is completed, the algorithm proceeds from Step 2.

The algorithm requires, at most, a single pass through the entire data set.
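The control flow of the six stages above can be summarized in a compact sketch. This is a hedged simplification that assumes numeric splits only (a nominal split would route records by set membership rather than a threshold); the Partition class, the find_split callback, and the _route helper are hypothetical names introduced here, not part of the authors' PL/SQL implementation.

```python
from dataclasses import dataclass, field
from itertools import islice

@dataclass
class Partition:
    rows: list                          # records currently buffered in this leaf
    state: str = 'active'               # 'active' | 'ambiguous' | 'frozen'
    predicate: tuple = None             # (attribute index, split value), numeric only here
    children: list = field(default_factory=list)

def o_cluster(stream, buffer_size, find_split):
    """stream: iterator over records; find_split(rows) -> (attr, value, ambiguous) or None."""
    root = Partition(rows=list(islice(stream, buffer_size)))    # step 1: load buffer
    leaves = [root]
    while True:
        # Steps 2-5: recursively split every active partition held in the buffer.
        progress = True
        while progress:
            progress = False
            for p in [q for q in leaves if q.state == 'active']:
                split = find_split(p.rows)                       # steps 2-3
                if split is None:
                    p.state = 'frozen'                           # step 4: no split, no ambiguity
                    continue
                attr, value, ambiguous = split
                if ambiguous:
                    p.state = 'ambiguous'                        # step 4: needs more data
                    continue
                left = Partition([r for r in p.rows if r[attr] <= value])
                right = Partition([r for r in p.rows if r[attr] > value])
                p.children, p.predicate, p.rows = [left, right], (attr, value), []
                leaves.remove(p)                                 # step 5
                leaves += [left, right]
                progress = True
        # Step 6: reload the buffer with records from ambiguous regions only.
        ambiguous = [p for p in leaves if p.state == 'ambiguous']
        if not ambiguous:
            break                                                # everything is frozen: exit
        fresh = list(islice(stream, buffer_size))
        if not fresh:
            break                                                # unseen data exhausted
        for row in fresh:
            leaf = _route(root, row)
            if leaf.state == 'ambiguous':
                leaf.rows.append(row)                            # frozen regions are discarded
        for p in ambiguous:
            p.state = 'active'
    return root

def _route(node, row):
    """Send a record down the binary tree to the leaf partition it belongs to."""
    while node.children:
        attr, value = node.predicate
        node = node.children[0] if row[attr] <= value else node.children[1]
    return node
```

A find_split callback along the lines of the best_numeric_split sketch in Section 2.1, extended to report ambiguity at the 90% level, would complete this outline.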
3. O-Cluster analysis on artificial data

The initial set of tests illustrates O-Cluster's behavior on artificial data. To graphically illustrate the algorithm, we used two-dimensional data. Further tests studied O-Cluster's scalability and the active sampling mechanism.

3.1 Numeric example

O-Cluster was used on a standard benchmark data set, DS3 [ZRL96]. The characteristics of this data set (a low number of dimensions and highly overlapping clusters of different variance and size) make the problem very challenging. Figure 4 depicts the partitions found by O-Cluster. The centers of the original clusters are marked with squares while the centroids of the points assigned to each partition are represented by stars. Although O-Cluster does not function optimally when the dimensionality is low, it produces a good set of partitions. O-Cluster finds cutting planes at different levels of density and successfully identifies nested clusters.

Axis-parallel splits in low dimensions can lead to the creation of artifacts where cutting planes have to cut through parts of a cluster and data points are assigned to incorrect partitions. Such artifacts can either result in centroid error or lead to further partitioning and the creation of spurious clusters. For example, in Figure 4, O-Cluster creates 73 partitions. Of these, 71 contain the centroid of at least one of the original clusters. The remaining 2 partitions resulted from artifacts created by splits going through clusters. In general, there are two potential sources of imprecision in the algorithm: 1) O-Cluster may fail to create partitions for all original clusters; and/or 2) O-Cluster may create spurious partitions that do not correspond to any of the original clusters. To measure these two effects separately, we use two metrics borrowed from the information retrieval domain: recall is defined as the percentage of the original clusters that were found and assigned to partitions; precision is defined as the percentage of the found partitions that contain at least one original cluster centroid. That is, in Figure 4 O-Cluster found 71 out of 100 original clusters (a recall of 71%), and 71 out of the 73 partitions created contained at least one centroid of the original clusters (a precision of 97%). The recall and precision measures reflect the tradeoff between identifying as many of the true clusters as possible vs. creating spurious clusters due to excessive partitioning. The use of recall and precision in clustering benchmarks is possible only when the correct number of clusters is available.
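The two metrics can be made concrete with a small sketch. Representing each found partition as a hyper-rectangle is an assumption consistent with the axis-parallel splits; the function name is illustrative.

```python
def recall_precision(partitions, true_centers):
    """partitions   -- list of (lower_bounds, upper_bounds) describing the
                       hyper-rectangular leaf regions found by the algorithm
    true_centers -- list of the original cluster centers (one point each)"""
    recovered = set()
    useful_partitions = 0
    for lo, hi in partitions:
        inside = [i for i, c in enumerate(true_centers)
                  if all(l <= x <= u for l, x, u in zip(lo, c, hi))]
        if inside:
            useful_partitions += 1
            recovered.update(inside)
    recall = len(recovered) / float(len(true_centers))
    precision = useful_partitions / float(len(partitions))
    return recall, precision

# With the Figure 4 numbers (71 of 100 centers recovered across 73 partitions),
# this yields recall = 0.71 and precision = 71/73, i.e., roughly 97%.
```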
In order to investigate the benefits of higher dimensionality, additional attributes were added to the DS3 data set. O-Cluster's accuracy (both recall and precision) improves dramatically with increased dimensionality. For example, five attributes produced a recall of 99% and a precision of 96%. Ten attributes or more resulted in perfect recall and precision (100%). Increasing the number of dimensions allowed O-Cluster to achieve high accuracy on data sets with significantly higher cluster variance than the original DS3 problem. The main reason for the remarkably good performance is that higher dimensionality allows O-Cluster to find cutting planes that do not produce splitting artifacts.

Figure 4: O-Cluster partitions on the DS3 data set.

3.2 Nominal example

The next example demonstrates O-Cluster's behavior on a simple two-dimensional data set where each attribute has five unique values. In Figure 5, a darker shade indicates higher density in the cells. O-Cluster places high-density cells in individual partitions. Examination of the attribute histograms within a partition, and in particular their modes, can be used to characterize the found clusters. Increasing the sensitivity level would result in an increased number of partitions since cells with lower counts would be separated into individual partitions.

Figure 5: O-Cluster partitions on a nominal data set.

3.3 O-Cluster complexity and scalability

We begin discussing O-Cluster's scalability behavior under the assumption that the entire data set can be loaded into the buffer. O-Cluster uses projections that are axis-parallel. The histogram computation step is of complexity O(N × d), where N is the number of data points in the buffer and d is the number of attributes. The selection of the best splitting point for a single attribute is O(b), where b is the average number of histogram bins in a partition. Choosing the best splitting point over all attributes is O(d × b). The assignment of data points to newly created partitions requires a comparison of an attribute value to the splitting predicate and has an upper bound of O(N). Loading new records into the data buffer requires their insertion into the relevant partitions. The complexity associated with scoring a record depends on the depth of the binary clustering tree (s). The upper limit for filling the whole buffer is O(N × s). The depth of the tree s depends on the data set. In general, N and d are the dominating factors and the total complexity can be approximated as O(N × d).

To validate this analysis, a series of tests was performed by increasing (within the limits of O-Cluster's buffer size) the numbers of records and attributes. All data sets used in the experiments consisted of 50 clusters with an equal number of points. All 50 clusters were correctly identified in each test. When measuring scalability with an increasing number of records, the number of attributes was set to 10. When measuring scalability with increasing dimensionality, the number of records was set to 100,000. Figure 6 shows a clear linear dependency of O-Cluster's processing time on both the number of records and the number of attributes.
The actual timing results shown can be improved significantly because the algorithm was implemented as a PL/SQL package in an Oracle 9i database. There is an overhead associated with the fact that PL/SQL is an interpreted language.

Figure 6: O-Cluster scalability: (a) with the number of records; (b) with the number of dimensions (attributes).

It should be noted that the linear scalability pattern for the number of records shown here characterizes O-Cluster's behavior only with respect to databases that can fit entirely in the buffer. For very large volumes of data, the dependency is usually strongly sub-linear because only a fraction of the data is processed through the active sampling mechanism. This sub-linearity is due to the fact that all partitions are likely to become 'frozen' before a full scan of the database is completed. The fraction of the total number of records processed depends on the nature of the data and the size of the buffer. For example, using the same data sets from the above scalability experiments, a fixed buffer size of 50,000 correctly identified all clusters and did not require additional refills. That is, data sets of larger size had the same clustering processing time and incurred additional cost only in randomizing the order of the data. The following section discusses the effect of varying the buffer size as a proportion of the total data.

3.4 Effect of buffer size

The next set of results illustrates O-Cluster's behavior when a small memory footprint is required. The buffer can thus contain only a fraction of the entire data set. This series of tests reuses a data set described in Section 3.3 (50 clusters, 2,000 points each, 10 attributes). Figure 7 shows the timing and recall results for different buffer sizes (0.5%, 0.8%, 1%, 5%, and 10% of the entire data set).

Figure 7: Buffer size: (a) time scalability; (b) recall.

Very small buffer sizes may require multiple refills. For example, the described experiment showed that when the buffer size was 0.5%, O-Cluster needed to refill it 5 times; when the buffer size was 0.8% or 1%, O-Cluster had to refill it once. For larger buffer sizes, no refills were necessary. As a result, using a 0.8% buffer proves to be slightly faster than using a 0.5% buffer. If no buffer refills were required (buffer size greater than 1%), O-Cluster followed a linear scalability pattern, as shown in the previous section. Regarding O-Cluster's accuracy, buffer sizes under 1% proved to be too small for the algorithm to find all existing clusters.
For a buffer size of 0.5%, O-Cluster found 41 out of 50 clusters (82% recall), and for a buffer size of 0.8%, O-Cluster found 49 out of 50 clusters (98% recall). Larger buffer sizes allowed O-Cluster to correctly identify all original clusters (100% recall). For all buffer sizes (including buffer sizes smaller than 1%) precision was 100%. O-Cluster functions optimally when the order of data presentation is random. Residual dependencies within the subsets can result in premature termination of the algorithm, and some of the statistically significant partitions may remain undiscovered.

4. O-Cluster analysis on real data

This section describes O-Cluster's results on a set of real-world databases. [MC02] demonstrated the high quality of O-Cluster's solution on high-dimensional multimedia data with numeric attributes. The focus here is on data with nominal and mixed (numeric and nominal) values. The clustering results are evaluated with respect to their accuracy and explanatory power. Comparisons to other algorithms are also provided.

4.1 Mushroom data set

The mushroom data set from the UCI repository is a popular benchmark for clustering nominal data. The data set consists of 8,124 records and there are 22 nominal multi-valued attributes. In a classification setting, the objective is to classify the mushrooms as edible or poisonous on the basis of their physical attributes. 51.8% of the mushroom entries are labeled as edible. Although clustering is an unsupervised task, the clusters discovered by the algorithm may be able to capture underlying structure that strongly correlates with the target labels.

Figure 8 depicts the hierarchical tree discovered by O-Cluster on the mushroom data (the target attribute was not included). The splitting predicates for each node are also included. For lack of space, not all branch conditions are listed; the ellipses (...) stand for all other attribute values that are not explicitly listed. The numbers in the leaves are the percentage of poisonous mushrooms in the partition. It can be seen that the majority of the clusters discovered by the algorithm differ significantly from the global distribution and can be labeled as either edible (circles) or poisonous (octagons). Cluster sizes are reasonably balanced (minimum of 288 and maximum of 656 points per cluster).

Figure 8: O-Cluster results on the mushroom data set (ρ = 0.65). Leaf numbers are the percentage of poisonous mushrooms in a cluster.

The nature of the hierarchy allows the extraction of simple descriptive rules. An example rule from Figure 8 is: if there is no odor, and the population is 'several', and the cap color is brown, then the mushrooms are edible. According to [GRS99], a standard agglomerative method results in clusters that follow the global distribution and show no correlation with the target class. ROCK [GRS99] uses an alternative bottom-up agglomerative approach and successfully finds clusters of very high target class purity. While O-Cluster's results do not have as high purity as the ones reported in [GRS99], the top-down approach used here is advantageous in terms of efficiency and explanatory power.
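Turning the cluster hierarchy into rules of this kind amounts to collecting the splitting predicates along each root-to-leaf path. The sketch below illustrates the idea on a hypothetical two-level tree; the tuple encoding and the sample tree are illustrative assumptions, not the actual tree of Figure 8.

```python
def extract_rules(node, conditions=()):
    """Yield (conditions, leaf_label) pairs for every leaf of the binary tree.
    A node is either ('leaf', label) or
    ('split', attribute, left_values, right_values, left_child, right_child)."""
    if node[0] == 'leaf':
        yield conditions, node[1]
        return
    _, attr, left_values, right_values, left, right = node
    yield from extract_rules(left, conditions + ((attr, left_values),))
    yield from extract_rules(right, conditions + ((attr, right_values),))

# Hypothetical tree for illustration only.
tree = ('split', 'odor', {'none'}, {'almond', 'anise', 'foul'},
        ('split', 'population', {'several'}, {'abundant', 'clustered'},
         ('leaf', 'mostly edible'), ('leaf', 'mixed')),
        ('leaf', 'mostly poisonous'))

for conds, label in extract_rules(tree):
    rule = ' AND '.join(f"{a} IN {sorted(v)}" for a, v in conds)
    print(f"IF {rule} THEN cluster is '{label}'")
```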
4.2 Adult data set

The Adult data set from the UCI repository contains both numeric and nominal values (6 numeric and 8 nominal attributes). There are a total of 48,842 records, split in a 2:1 ratio to form a train and a test set. The target class indicates whether a person had an income exceeding $50,000. The data set is unbalanced: only 23.9% of the records belong to the positive class (income above $50,000).

Figure 9 shows O-Cluster's results when the training data (excluding the target) were used as input. The algorithm found 14 leaf clusters. The numbers in the leaves represent the percentage of cases with income above $50,000. The distribution of positive cases in the leaves differs significantly from the global distribution. The tree built by O-Cluster is fairly unbalanced. Initially, the algorithm isolates clusters that are dominated by positive cases. Subsequently, it branches out to differentiate groups among the negative cases.

Figure 9: O-Cluster results on the Adult data set (ρ = 0.5). Leaf numbers are the percentage of people with income above $50,000 in a cluster.

Clusters with an above 50% concentration of positive cases are labeled positive. If we apply this 'classification' to the test data (that is, assign cases as positive or negative based on the cluster label they would fall in), the test set accuracy is 82.4%. State-of-the-art supervised algorithms have been reported to achieve accuracy in the range of 83-85%. To further validate the quality of the results, we used the same paradigm for a standard bi-secting k-Means algorithm. After exploding the nominal attributes into binary values, the training data were grouped into 14 clusters. Then these clusters were labeled with the predominant class. On the test data, k-Means accuracy was 77.2%. Also, the average cluster purity was 78.6% vs. 81% in the case of O-Cluster. The distance-based k-Means algorithm produced clusters that did not separate the rare class well from the dominant class.

4.3 Magazine marketing data set

The proprietary magazine marketing data set contains 111,000 records with mixed values (34 numeric and 10 nominal attributes). The attributes are based on financial and demographic information. The target class encodes whether a person subscribes to a magazine or not. The distribution of the two classes is balanced. O-Cluster produced 24 clusters. We compared the cluster purity (with respect to the target class) to that of bi-secting k-Means on the same data. O-Cluster found clusters that better differentiated the two target classes. Of the 24 clusters found, 22 were clearly dominated by one of the classes (11 each) and 2 were close to 50% (the global distribution). For k-Means, 16 clusters were dominated by the positive class, 2 by the negative class, and 6 were around 50%. O-Cluster's average cluster purity was 74.1% while k-Means' average cluster purity was 69.7%. Similar to the Adult data set results, O-Cluster outperformed k-Means in terms of target differentiation. Due to the explosion of nominal attributes into binary values, k-Means operates in a higher dimensional input space where the distance-based metric can become unreliable. A projection-based approach, such as O-Cluster, can be advantageous for these types of data, resulting in a superior clustering solution.
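The evaluation paradigm used in Sections 4.2 and 4.3 (label each cluster with its predominant class, then score held-out records by the label of the cluster they fall into) can be sketched as follows. The unweighted averaging in the purity measure and the helper names are assumptions for illustration.

```python
from collections import Counter

def _class_counts(assignments, targets):
    """Tally target classes per cluster id."""
    per_cluster = {}
    for cid, y in zip(assignments, targets):
        per_cluster.setdefault(cid, Counter())[y] += 1
    return per_cluster

def label_clusters(train_assignments, train_targets):
    """Map cluster id -> majority class observed among its training records."""
    return {cid: counts.most_common(1)[0][0]
            for cid, counts in _class_counts(train_assignments, train_targets).items()}

def average_purity(train_assignments, train_targets):
    """Average fraction of each cluster's records that share its majority class."""
    counts = _class_counts(train_assignments, train_targets).values()
    return sum(c.most_common(1)[0][1] / float(sum(c.values())) for c in counts) / len(counts)

def test_accuracy(test_assignments, test_targets, cluster_labels):
    """Fraction of test records whose cluster label matches their true class."""
    hits = sum(1 for cid, y in zip(test_assignments, test_targets)
               if cluster_labels.get(cid) == y)
    return hits / float(len(test_targets))
```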
5. Conclusions

The majority of existing clustering algorithms encounter serious scalability and/or accuracy related problems when used on databases with a large number of records and/or attributes. Only a few methods can handle numeric, nominal, and mixed data. O-Cluster is capable of clustering large, high-dimensional databases with both numeric and nominal values efficiently and effectively. O-Cluster relies on an active sampling approach to achieve scalability with large volumes of data and requires at most a single pass through the database. The algorithm uses an axis-parallel partitioning scheme to build a hierarchy and identify hyper-rectangular regions of uni-modal density in the input feature space. The top-down partitioning strategy ensures excellent scalability and explanatory power of the clustering solution. O-Cluster has good accuracy, uses a single tunable parameter, and can successfully operate with limited memory resources.

Currently we are extending O-Cluster in a number of ways, including: parallel implementation, probabilistic modeling, and scoring with missing values. These extensions will be reported in a future paper.

References

[CGC94] A. D. Chaturvedi, P. E. Green, and J. D. Carroll. K-Means, K-Medians, and K-Modes: Special Cases of Partitioning Multiway Data. Presented at the Classification Society of North America Meeting, 1994.

[GGR99] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS - Clustering Categorical Data Using Summaries. In Proc. 1999 Int. Conf. on Knowledge Discovery and Data Mining (KDD'99), 1999, pp. 73–83.

[GKR98] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. In Proc. 24th Int. Conf. on Very Large Data Bases (VLDB'98), 1998, pp. 311–323.

[GRS99] S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In Proc. IEEE Int. Conf. on Data Engineering, 1999, pp. 512–521.

[HAK00] A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What Is the Nearest Neighbor in High Dimensional Spaces? In Proc. 26th Int. Conf. on Very Large Data Bases (VLDB'00), 2000, pp. 506–515.

[HK99] A. Hinneburg and D. A. Keim. Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. In Proc. 25th Int. Conf. on Very Large Data Bases (VLDB'99), 1999, pp. 506–517.

[Hua97a] Z. Huang. A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. In Research Issues on Data Mining and Knowledge Discovery, 1997.

[Hua97b] Z. Huang. Clustering Large Data Sets with Mixed Numeric and Categorical Values. In Proc. First Pacific-Asia Conference on Knowledge Discovery and Data Mining, 1997.

[MC02] B. L. Milenova and M. M. Campos. O-Cluster: Scalable Clustering of Large High Dimensional Data Sets. In Proc. 2002 IEEE Int. Conf. on Data Mining (ICDM'02), 2002, pp. 290–297.

[Sco79] D. W. Scott. Multivariate Density Estimation. John Wiley & Sons, New York, 1979.

[VM95] D. Ventura and T. R. Martinez. An Empirical Comparison of Discretization Methods. In Proc. Tenth Int. Symp. on Computer and Information Sciences, 1995, pp. 147–176.

[Wan96] M. P. Wand. Data-Based Choice of Histogram Bin Width. The American Statistician, Vol. 51, 1996, pp. 59–64.

[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proc. 1996 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'96), 1996, pp. 103–114.