Clustering Large Databases with Numeric and
Nominal Values Using Orthogonal Projections
Boriana L. Milenova
Oracle Data Mining Technologies
10 Van de Graff Drive
Burlington, MA 01803, USA
boriana.milenova@oracle.com

Marcos M. Campos
Oracle Data Mining Technologies
10 Van de Graff Drive
Burlington, MA 01803, USA
marcos.m.campos@oracle.com
Abstract
Clustering large high-dimensional databases has
emerged as a challenging research area. A
number of recently developed clustering
algorithms have focused on overcoming either
the “curse of dimensionality” or the scalability
problems associated with large amounts of data.
The majority of these algorithms operate only on
numeric data, a few handle nominal data, and
very few can deal with both numeric and nominal values.
Orthogonal Partitioning Clustering (O-Cluster) was
originally introduced as a fast, scalable solution for large
multi-dimensional databases with numeric values.
Here, we extend O-Cluster to domains with
nominal and mixed values. O-Cluster uses a top-down
partitioning strategy based on orthogonal
projections to identify areas of high density in
the input data space. The algorithm employs an
active sampling mechanism and requires at most
a single scan through the data. We demonstrate
the high quality of the obtained clustering
solutions, their explanatory power, and O-Cluster’s good scalability.
1. Introduction
Clustering of large high-dimensional databases is an
important problem with challenging performance and
system resource requirements. There are a number of
algorithms that are applicable to very large databases, and
a few that address high-dimensional data. Although a
large part of the data in data warehouses is nominal, few
of these algorithms have addressed clustering nominal
data, and even fewer have addressed mixed data. A survey of numeric
clustering algorithms for large, high-dimensional
databases can be found in [MC02].
A common approach to cluster databases with nominal
attributes (columns) is to convert them into numeric
attributes and apply a numeric clustering algorithm. This
is usually done by “exploding” the nominal attribute into
a set of new binary numeric attributes, one for each
distinct value in the original attribute. This approach
suffers from a number of limitations. Firstly, it increases
the dimensionality of the problem which may potentially
be already high. This usually decreases the quality of
clustering results [HAK00, HK99]. Secondly, the
computational cost incurred may be considerable.
A number of recent algorithms have been developed to
cluster nominal data [GGR99, GKR98, GRS99, Hua97a,
Hua97b]. These algorithms can also be used to cluster
databases with mixed data by transforming numeric
attributes into nominal ones through discretization
[Hua97b]. However, the discretization process could
result in loss of important information available in the
continuous values and lead to decreased accuracy. This
has been found to be true for supervised models [VM95].
k-Prototypes [Hua97a] is a partition-based algorithm
that uses a heterogeneous distance function to compute
distance for mixed data. The heterogeneous distance
function requires weighting the contribution of the
numeric attributes versus that of the nominal ones.
Another partitioning algorithm, k-Modes [CGC94,
Hua97b], does not have the same limitation. k-Modes
solves the weighting problem by only working with
nominal data. Numeric attributes need to be discretized.
Both algorithms are susceptible to the curse of
dimensionality common to distance-based algorithms in
high-dimensional spaces [HAK00].
STIRR [GKR98] is an iterative algorithm based on
non-linear dynamical systems. Nominal clustering is
treated as graph partitioning. STIRR’s running time is linear
in the number of rows (records) and nearly linear in the number
of attributes. The algorithm needs multiple scans of the
data set in order to converge. This limits its usefulness to
databases that fit into memory. The number of clusters
does not need to be specified.
ROCK [GRS99] is an agglomerative algorithm which
uses a similarity metric to find neighboring data points. It
then defines links between two data points based on the
number of neighbors they share. The algorithm attempts
to maximize a goodness measure that favors merging
pairs with a large number of common neighbors. ROCK’s
scaling with the number of rows is quadratic to cubic. In
order to handle large databases ROCK requires sampling
the data. This can prevent the detection of smaller clusters
that may contain important patterns. The exact form of the
goodness measure also needs to be specified, which is not
trivial.
CACTUS [GGR99] is an agglomerative algorithm that
uses data summarization to achieve linear scaling in the
number of rows. It requires only two scans of the data.
However, it has exponential scaling in the number of
attributes. This limits the algorithm’s usefulness for
clustering databases with a large number of attributes. In a
variety of databases CACTUS was found to be 3 to 10
times faster than STIRR. Additionally, the number of
clusters does not need to be specified.
The O-Cluster (Orthogonal Partitioning Clustering)
algorithm [MC02] was originally introduced as a fast,
scalable clustering solution for large high-dimensional
numeric data. The algorithm uses axis-parallel uni-dimensional data projections. More general projections
could be used. However, the current implementation aims
for simplicity and efficient computation. The axis-parallel
projection-based approach was shown to be very effective
in high-dimensional numeric data [HK99]. O-Cluster
builds upon the orthogonal projection concept introduced
by OptiGrid [HK99] and addresses some of the limitations of
OptiGrid’s approach, such as dealing with a large number
of records that do not fit in memory and identifying good
partitions without relying on user-defined parameters. The
algorithm requires at most a single scan through the data.
The work presented here extends O-Cluster’s
functionality to domains with nominal and mixed
(numeric and nominal) values. O-Cluster, due to its top-down partitioning approach, has very good transparency
and provides cluster descriptions in the form of compact
rules. Other algorithms suitable for clustering nominal
attributes do not share this useful feature.
Section 2 describes the algorithm. Section 3 analyzes
the behavior of the algorithm on artificial data and discusses
O-Cluster’s complexity and scalability. Section 4 presents
experiments with real data and comparison with other
algorithms. Section 5 concludes the paper and indicates
directions for future work.
2. Orthogonal partitioning clustering
The objective of O-Cluster is to identify areas of high
density in the data and separate them into individual
clusters. The algorithm looks for splitting points along
axis-parallel projections that would produce cleanly
separable and preferably balanced clusters. The algorithm
operates recursively by creating a binary tree hierarchy.
The number of leaf clusters is determined automatically
and does not need to be specified in advance. The
topology of the hierarchy, along with its splitting
predicates, can be used to gain insights into the clustering
solution. The following sections describe the partitioning
strategy used with numeric, nominal, and mixed values,
outline the active sampling method employed by O-Cluster, and summarize the main processing stages of the
algorithm.
2.1 Numeric values
O-Cluster computes uni-dimensional histograms along
individual input attributes. For each histogram, O-Cluster
attempts to find the ‘best’ valid cutting plane, if any exist.
A valid cutting plane passes through a bin of low density
(a valley) in the histogram. Additionally, the bin of low
density should have bins of high density (peaks) on each
side. O-Cluster attempts to find a pair of peaks with a
valley between them where the difference between the
peak and valley histogram counts is statistically
significant. Statistical significance is tested using a
standard χ² test:
χ² = 2(observed − expected)² / expected ≥ χ²_{α,1},
where the observed value is equal to the histogram count
of the valley and the expected value is the average of the
histogram counts of the valley and the lower peak. A 95%
confidence level (χ²_{0.05,1} = 3.843) has been shown to
produce reliable results. Since this test can produce
multiple splitting points, O-Cluster chooses the one where
the valley has the lowest histogram count and thus the
cutting plane would go through the bin with lowest
density. Alternatively, or in the case of a tie, the algorithm
can favor splitting points that would produce balanced
partitions.
It is sometimes desirable to prevent the separation of
clusters with small peak density. This can be
accomplished by introducing a baseline sensitivity level
that excludes peaks below this count. It should be noted
that with numeric attributes, sensitivity (ρ) is an optional
parameter that is used solely for filtering of the splitting
point candidates. Sensitivity is a parameter in the [0, 1]
range that is inversely proportional to the minimum count
required for a histogram peak. A value of 0 corresponds to
the global uniform level per attribute. The global uniform
level reflects the average histogram count that would have
been observed if the data points in the buffer were drawn
from a uniform distribution. A value of 0.5 sets the
minimum histogram count for a peak to 50% of the global
uniform level. A value of 1 removes the restrictions on
peak histogram counts and the splitting point
identification relies solely on the χ² test. A default value
of 0.5 usually works satisfactorily. Figure 1 illustrates the
splitting points identified in a one dimensional histogram.
This example shows the use of a sensitivity level (marked
by the dashed line).
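To make the search concrete, here is a minimal Python sketch of the procedure just described. It is not the paper’s implementation: the array-based histogram, the shortcut of taking the highest bin on each side of a candidate valley as its flanking peak, and the mapping of ρ to a minimum peak count are our assumptions based on the text.

import numpy as np

CHI2_95 = 3.843  # the 95% critical value quoted in the text (1 df)

def find_numeric_split(hist, rho=0.5):
    """Sketch: pick the 'best' valley bin in a 1-D histogram, or None.

    hist: array of bin counts along one attribute projection.
    rho:  sensitivity in [0, 1]; 0 demands peaks at the global uniform
          level, 0.5 at half that level, 1 disables the filter.
    """
    hist = np.asarray(hist, dtype=float)
    uniform = hist.sum() / len(hist)       # global uniform level
    min_peak = (1.0 - rho) * uniform       # peak filter implied by rho
    best = None
    for v in range(1, len(hist) - 1):
        # simplification: treat the highest bin on each side as the peak
        lower_peak = min(hist[:v].max(), hist[v + 1:].max())
        if lower_peak < min_peak or hist[v] >= lower_peak:
            continue
        observed = hist[v]
        expected = (observed + lower_peak) / 2.0
        chi2 = 2.0 * (observed - expected) ** 2 / expected
        # keep the statistically significant valley with the lowest count
        if chi2 >= CHI2_95 and (best is None or observed < hist[best]):
            best = v
    return best

For example, find_numeric_split([9, 1, 10]) returns index 1: the valley count of 1 against a lower peak of 9 gives χ² = 6.4, which exceeds 3.843.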
Figure 1: Numeric attribute splitting points.
It is desirable to compute histograms that provide
good resolution but also have data artifacts smoothed out.
A number of studies have addressed the problem of how
many equi-width bins can be supported by a given
distribution [Sco79, Wan96]. Based on these studies, a
reasonable, simple approach would be to make the
number of equi-width bins inversely proportional to the
standard deviation of the data along a given dimension
and directly proportional to N^(1/3), where N is the number of
points inside a partition. Alternatively, one can use a
global binning strategy and coarsen the histograms as the
number of points inside the partitions decreases. O-Cluster is robust with respect to different binning
strategies as long as the histograms do not significantly
undersmooth or oversmooth the distribution density. Data
sets with a low number of records would require coarser
binning and some resolution may potentially be lost.
Large data sets have the advantage of supporting the
computation of detailed histograms with good resolution.
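As an illustration only, the rule above can be read as a Scott-style bin count; the constant 3.5 (Scott’s normal-reference factor [Sco79]) and the use of the attribute range for normalization are our assumptions.

import numpy as np

def equi_width_bins(x):
    """Hypothetical bin count for one attribute: inversely proportional
    to the standard deviation and directly proportional to N^(1/3).
    Dividing the attribute range by a Scott-style bin width makes the
    result scale-free."""
    x = np.asarray(x, dtype=float)
    spread = x.max() - x.min()
    if spread == 0 or x.std() == 0:
        return 1
    width = 3.5 * x.std() * len(x) ** (-1.0 / 3.0)  # assumed constant
    return max(1, int(round(spread / width)))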
2.2 Nominal values
Nominal values do not have an intrinsic order associated
with them. Therefore it is impossible to apply the notion
of histogram peaks and valleys as in the numeric case.
The counts of individual values form a histogram and bins
with large counts can be interpreted as regions with high
density. The clustering objective is to separate these high-density
areas and effectively decrease the entropy of the
data. O-Cluster identifies the histogram with the highest
entropy among the individual projections. For simplicity,
we approximate the entropy measure as the number of
bins above sensitivity level ρ (as defined in Section 2.1).
O-Cluster places the two largest bins into separate
partitions, thereby creating a splitting predicate. The
remainder of the bins can be assigned randomly to the two
resulting partitions. If these bins have low counts, they
would not be able to influence O-Cluster’s solution after
the split. The leaf clusters are described in terms of their
histograms and/or modes, and small bins are considered
uninformative. If more than two bins have high counts in
a histogram, subsequent splits would separate them into
individual partitions. To avoid rapid data decimation, O-Cluster
creates a binary tree rather than one where large
bins fan out into individual branches. The top-down
approach used by O-Cluster discovers co-occurrences of
values, and each leaf encodes dense cells in a subspace
defined by the splits in O-Cluster’s hierarchy. Figure 2
depicts a nominal attribute histogram. The two largest
bins (colored dark grey) will seed the two new partitions.
Again, the sensitivity level is marked by a dashed line.
Figure 2: Nominal attribute partitioning.
When histograms are tied on the largest number of
bins above the sensitivity level, O-Cluster favors the
histogram where the top two bins have higher counts.
Since the splits are binary, the optimal case would have
all the partition data points equally distributed between
these two top bins. We numerically quantify the
suboptimality of the split as the difference between half
of the total number of points in the partition and the
count of the lower of the two peaks.
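A compact sketch of the nominal split selection follows. The dictionary-based histograms and the tie-breaking key are assumed details; the bin-counting entropy approximation and the suboptimality measure follow the text directly.

def find_nominal_split(histograms, n_points, rho=0.5):
    """Sketch: choose the nominal attribute to split on, or None.

    histograms: {attribute: {value: count}} for one partition.
    Returns (attribute, two seed values, suboptimality of the split).
    """
    best = None
    for attr, hist in histograms.items():
        counts = sorted(hist.values(), reverse=True)
        uniform = n_points / len(hist)        # global uniform level
        active = sum(1 for c in counts if c >= (1.0 - rho) * uniform)
        if active < 2:
            continue                          # no valid binary split here
        # entropy approximated by the number of bins above sensitivity;
        # ties broken by the mass of the top two bins (assumed key)
        key = (active, counts[0] + counts[1])
        if best is None or key > best[0]:
            # suboptimality: shortfall of the lower seed bin from an
            # ideal 50/50 split of the partition
            best = (key, attr, n_points / 2.0 - counts[1])
    if best is None:
        return None
    _, attr, subopt = best
    seeds = sorted(histograms[attr], key=histograms[attr].get,
                   reverse=True)[:2]
    return attr, seeds, subopt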
2.3 Mixed numeric and nominal values
O-Cluster searches for the ‘best’ splitting plane for
numeric and nominal attributes separately. It then
compares two measures of density: the histogram count of the
valley bin in the numeric split and the suboptimality of
the nominal split. The algorithm chooses the split with
the lower density.
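A sketch of this comparison, assuming each candidate split is packaged as a tuple whose last element is its count-valued density measure:

def choose_mixed_split(numeric_split, nominal_split):
    """Sketch: both candidates carry a count-valued density measure in
    their last slot (valley count vs. suboptimality); pick the lower."""
    candidates = [s for s in (numeric_split, nominal_split)
                  if s is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s[-1])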
2.4 Active sampling
O-Cluster uses an active sampling mechanism to handle
databases that do not fit in memory. The algorithm
operates on a data buffer of a limited size. After
processing an initial random sample, O-Cluster identifies
data records that are of no further interest. Such records
belong to ‘frozen’ partitions where further splitting is
highly unlikely. These records are replaced with examples
from ‘ambiguous’ regions where further information
(additional data points) is needed to find good splitting
planes and continue partitioning. A partition is considered
ambiguous if a valid split can only be found at a lower
confidence level. For a numeric attribute, if the difference
between the lower peak and the valley is significant at the
90% level (χ²_{0.1,1} = 2.706), but not at the default 95%
level, the partition is considered ambiguous. Analogously,
for a nominal attribute, if the counts of at least two bins
are above the sensitivity level but not to a significant
degree (at the default 95% confidence level), the partition
is labeled ambiguous.

Figure 3: O-Cluster algorithm block diagram.
Records associated with frozen partitions are marked
for deletion from the buffer. They are replaced with
records belonging to ambiguous partitions. The
histograms of the ambiguous partitions are updated and
splitting points are reevaluated.
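A sketch of this replacement step under assumed data structures: a list-based buffer, a partition_of function that routes a record to its leaf, and sets of ‘frozen’ and ambiguous leaf identifiers. The cap on the number of records scanned anticipates the reload rule detailed in Section 2.5.

def refill_buffer(buffer, buffer_size, partition_of,
                  frozen, ambiguous, stream):
    """Sketch of the buffer reload: drop 'frozen' records, then read
    unseen records, keeping only those that fall into ambiguous
    leaves. Reading stops when the buffer is full, the stream ends,
    or buffer_size records have been scanned."""
    buffer[:] = [r for r in buffer if partition_of(r) not in frozen]
    reads = 0
    for record in stream:
        reads += 1
        if partition_of(record) in ambiguous:
            buffer.append(record)
        if len(buffer) >= buffer_size or reads >= buffer_size:
            break
    return buffer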
2.5 The O-Cluster Algorithm
The O-Cluster algorithm evaluates possible splitting
points for all projections in a partition, selects the ‘best’
one, and splits the data into two new partitions. The
algorithm proceeds by searching for good cutting planes
inside the newly created partitions. Thus O-Cluster
creates a binary tree structure that tessellates the input
space into rectangular regions. Figure 3 provides an
outline of O-Cluster’s algorithm. The main processing
stages are:
1. Load buffer: If the entire data set does not fit in
the buffer, a random sample is used. O-Cluster assigns all
points from the initial buffer to a single active root
partition.
2. Compute histograms for active partitions: The
goal is to compute histograms along the orthogonal uni-dimensional projections for each active partition. Any
partition that represents a leaf in the clustering hierarchy
and is not explicitly marked ambiguous or ‘frozen’ is
considered active.
3. Find ‘best’ splitting points for active partitions:
For each histogram, O-Cluster attempts to find the ‘best’
valid cutting plane, if any exist. The algorithm examines
separately the groups of numeric and nominal attributes
and selects the best splitting plane.
4. Flag ambiguous and ‘frozen’ partitions: If no
valid splitting points are found in a partition, O-Cluster
checks whether the χ² test would have found a valid
splitting point at a lower confidence level. If that is the
case, the current partition is considered ambiguous. More
data points are needed to establish the quality of the
splitting point. If no splitting points were found and there
is no ambiguity, the partition can be marked as ‘frozen’
and the records associated with it marked for deletion
from the buffer.
5. Split active partitions: If a valid separator exists,
the data points are split by the cutting plane, two new
active partitions are created from the original partition,
and the algorithm proceeds from Step 2.
6. Reload buffer: This step takes place after all
recursive partitioning on the current buffer is completed.
If all existing partitions are marked as ‘frozen’ and/or
there are no more data points available, the algorithm
exits. Otherwise, if some partitions are marked as
ambiguous and additional unseen data records exist, O-Cluster proceeds with reloading the data buffer. The new
data replace records belonging to ‘frozen’ partitions.
When new records are read in, only data points that fall
inside ambiguous partitions are placed in the buffer. New
records falling within a ‘frozen’ partition are not loaded
into the buffer and are discarded. If it is desirable to
maintain statistics of the data points falling inside
partitions (including the ‘frozen’ partitions), such
statistics can be continuously updated with the reading of
each new record. Loading of new records continues until
either: 1) the buffer is filled again; 2) the end of the data
set is reached; or 3) a reasonable number of records (e.g.,
equal to the buffer size) have been read, even if the buffer
is not full and there are more data. The reason for the last
condition is that if the buffer is relatively large and there
are many points marked for deletion, it may take a long
time to entirely fill the buffer with data from the
ambiguous regions. To avoid excessive reloading time
under these circumstances, the buffer reloading process is
terminated after reading through a number of records
equal to the data buffer size. Once the buffer reload is
completed, the algorithm proceeds from Step 2. The
algorithm requires, at most, a single pass through the
entire data set.
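The control flow of Figure 3 can be compressed into a short driver. This is a structural sketch only: the split search and the ambiguity test are injected as callables, and the buffer reload of Step 6 is elided (see the sketch in Section 2.4).

def o_cluster_pass(stream, buffer_size, find_split, is_ambiguous):
    """Sketch of Steps 1-5 on a single buffer load. find_split(records)
    returns a boolean predicate over records (the cutting plane) or
    None; is_ambiguous(records) reports a lower-confidence split."""
    buffer = []
    for record in stream:                 # Step 1: load buffer
        buffer.append(record)
        if len(buffer) >= buffer_size:
            break
    active, frozen, ambiguous = [buffer], [], []
    while active:
        part = active.pop()
        split = find_split(part)          # Steps 2-3: histograms + search
        if split is not None:             # Step 5: split active partition
            active.append([r for r in part if split(r)])
            active.append([r for r in part if not split(r)])
        elif is_ambiguous(part):          # Step 4: flag for more data
            ambiguous.append(part)
        else:
            frozen.append(part)
        # Step 6 (reload) would replace frozen records with records
        # from ambiguous regions and re-enter the loop; omitted here
    return frozen, ambiguous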
3. O-Cluster analysis on artificial data
The initial set of tests illustrates O-Cluster’s behavior on
artificial data. To graphically illustrate the algorithm, we
used two-dimensional data. Further tests studied O-Cluster’s scalability and the active sampling mechanism.
3.1 Numeric example
O-Cluster was used on a standard benchmark data set DS3
[ZRL96]. The characteristics of this data set - low number
of dimensions, highly overlapping clusters of different
variance and size – make the problem very challenging.
Figure 4 depicts the partitions found by O-Cluster. The
centers of the original clusters are marked with squares
while the centroids of the points assigned to each partition
are represented by stars. Although O-Cluster does not
function optimally when the dimensionality is low, it
produces a good set of partitions. O-Cluster finds cutting
planes at different levels of density and successfully
identifies nested clusters. Axis-parallel splits in low
dimensions can lead to creation of artifacts where cutting
planes have to cut through parts of a cluster and data
points are assigned to incorrect partitions. Such artifacts
can either result in centroid error or lead to further
partitioning and creation of spurious clusters. For
example, in Figure 4, O-Cluster creates 73 partitions. Of
these, 71 contain the centroids of at least one of the
original clusters. The remaining 2 partitions resulted from
artifacts created by splits going through clusters.
In general, there are two potential sources of
imprecision in the algorithm: 1) O-Cluster may fail to
create partitions for all original clusters; and/or 2) O-Cluster may create spurious partitions that do not
correspond to any of the original clusters. To measure
these two effects separately, we use two metrics borrowed
from the information retrieval domain: Recall is defined
as the percentage of the original clusters that were found
and assigned to partitions; Precision is defined as the
percentage of the found partitions that contain at least one
original cluster centroid. That is, in Figure 4 O-Cluster
found 71 out of 100 original clusters (resulting in recall of
71%), and 71 out of the 73 partitions created contained at
least one centroid of the original clusters (a precision of
97%). The recall and precision measures reflect the trade-off between identifying as many as possible of the true
clusters vs. creating spurious clusters due to excessive
partitioning. The use of recall and precision in clustering
benchmarks is possible only when the correct number of
clusters is available.
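In code, the two metrics reduce to a few lines; the containment test of a centroid against a partition’s splitting predicates is left as an injected function.

def recall_precision(true_centers, partitions, contains):
    """Sketch: recall = fraction of original cluster centers captured by
    some partition; precision = fraction of partitions that capture at
    least one center. contains(partition, center) tests membership."""
    found = sum(1 for c in true_centers
                if any(contains(p, c) for p in partitions))
    pure = sum(1 for p in partitions
               if any(contains(p, c) for c in true_centers))
    return found / len(true_centers), pure / len(partitions)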
In order to investigate the benefits of higher
dimensionality, additional attributes were added to the
DS3 data set. O-Cluster’s accuracy (both recall and
precision) improves dramatically with increased
dimensionality. For example, five attributes produced
recall of 99% and precision of 96%. Ten attributes or
more resulted in perfect recall and precision (100%).
Increasing the number of dimensions allowed O-Cluster
to achieve high accuracy on data sets with significantly
higher cluster variance than the original DS3 problem.
The main reason for the remarkably good performance is
that higher dimensionality allows O-Cluster to find
cutting planes that do not produce splitting artifacts.

Figure 4: O-Cluster partitions on the DS3 data set.
3.2 Nominal example
The next example demonstrates O-Cluster’s behavior on a
simple two-dimensional data set where each attribute has
five unique values. In Figure 5, darker shade indicates
higher density in the cells. O-Cluster places high-density
cells in individual partitions. Examination of the attribute
histograms within a partition, and in particular their
modes, can be used to characterize the found clusters.
Increasing the sensitivity level would result in increasing
the number of partitions since cells with lower counts
would be separated into individual partitions.
Figure 5: O-Cluster partitions on a nominal data set.
3.3 O-Cluster complexity and scalability
We begin discussing O-Cluster’s scalability behavior
under the assumption that the entire data set can be loaded
into the buffer. O-Cluster uses projections that are axis-parallel. The histogram computation step is of complexity
O(N x d) where N is the number of data points in the
buffer and d is the number of attributes. The selection of
best splitting point for a single attribute is O(b) where b is
the average number of histogram bins in a partition.
Choosing the best splitting point over all attributes is O(d
x b). The assignment of data points to newly created
partitions requires a comparison of an attribute value to
the splitting predicate and the complexity has an upper
bound of O(N). Loading new records into the data buffer
requires their insertion into the relevant partitions. The
complexity associated with scoring a record depends on
the depth of the binary clustering tree (s). The upper limit
for filling the whole buffer is O(N x s). The depth of the
tree s depends on the data set. In general, N and d are the
dominating factors and the total complexity can be
approximated as O(N x d).
To validate this analysis, a series of tests was
performed by increasing (within the limits of O-Cluster’s
buffer size) the numbers of records and attributes. All data
sets used in the experiments consisted of 50 clusters with
equal number of points. All 50 clusters were correctly
identified in each test. When measuring scalability with
increasing number of records, the number of attributes
was set to 10. When measuring scalability with increasing
dimensionality, the number of records was set to 100,000.
Figure 6 shows a clear linear dependency of O-Cluster’s
processing time on both the number of records and
number of attributes. The actual timing results shown can
be improved significantly because the algorithm was
implemented as a PL/SQL package in an ORACLE 9i
database; there is an overhead associated with the fact
that PL/SQL is an interpreted language.

Figure 6: O-Cluster Scalability: (a) with number of records; (b) with number of dimensions (attributes).
It should be noted that the linear scalability pattern for
the number of records shown here characterizes O-Cluster’s behavior only with respect to databases that can
fit entirely in the buffer. For very large volumes of data,
the dependency is usually strongly sub-linear because
only a fraction of the data is processed through the active
sampling mechanism. This sub-linearity is due to the fact
that all partitions are likely to become ‘frozen’ before a
full scan of the database is completed. The fraction of the
total number of records processed depends on the nature
of the data and the size of the buffer. For example, using
the same data sets from the above scalability experiments,
a fixed buffer size of 50,000 records sufficed to correctly identify all
clusters and did not require additional refills. That is, data
sets of larger size had the same clustering processing time
and incurred additional cost only in randomizing the order
of the data. The following section discusses the effect of
varying the buffer size as a proportion of the total data.
3.4 Effect of buffer size
The next set of results illustrates O-Cluster’s behavior
when a small memory footprint is required. The buffer
can thus contain only a fraction of the entire data set. This
series of tests reuses a data set described in Section 3.3
(50 clusters, 2,000 points each, 10 attributes).
Figure 7 shows the timing and recall results for
different buffer sizes (0.5%, 0.8%, 1%, 5%, and 10% of
the entire data set). Very small buffer sizes may require
multiple refills. For example, the described experiment
showed that when the buffer size was 0.5%, O-Cluster
needed to refill it 5 times; when the buffer size was 0.8%
or 1%, O-Cluster had to refill it once. For larger buffer
Figure 7: Buffer size: (a) time scalability; (b) recall.
odor
almond
anise
...
none
population
abundant
clustered
...
cap color
gill color
several
buff
grey
...
cap shape
brown
buff
grey
...
1
spore print
color
bell
flat
...
stalk color
above ring
cap color
brown
0
spore print
color
0
buff
grey
...
convex
bell
flat
...
1
10
spore print
color
0
buff
white grey
brown
...
49
brown
grey
buff
green
...
cap shape
cap shape
convex
pink
grey
brown
buff
red
...
bell
flat
...
91
convex
81
cap surface
cap shape
convex
0
bell
flat
...
convex
0
100
bell
flat
...
stalk surface
above ring
100
46
brown
0
cap color
stalk color
above ring
cap shape
white
buff
pink
white
buff
grey
...
white
cap shape
convex
white
brown
brown
grey
...
smooth
scaly
silky
...
91
100
bell
flat
...
100
smooth
grooves
scaly
...
85
87
Figure 8: O-Cluster results on mushroom data set (ρ
ρ = 0.65). Leaf numbers are the percentage of poisonous
mushrooms in a cluster.
sizes, no refills were necessary. As a result, using 0.8%
buffer proves to be slightly faster than using 0.5% buffer.
If no buffer refills were required (buffer size greater than
1%), O-Cluster followed a linear scalability pattern, as
shown in the previous section. Regarding O-Cluster’s
accuracy, buffer sizes under 1% proved to be too small for
the algorithm to find all existing clusters. For buffer size
of 0.5%, O-Cluster found 41 out of 50 clusters (82%
recall) and for buffer size of 0.8%, O-Cluster found 49 out
of 50 clusters (98% recall). Larger buffer sizes allowed O-Cluster to correctly identify all original clusters (100%
recall). For all buffer sizes (including buffer sizes smaller
than 1%) precision was 100%.
O-Cluster functions optimally when the order of data
presentation is random. Residual dependencies within the
subsets can result in premature termination of the
algorithm and some of the statistically significant
partitions may remain undiscovered.
4. O-Cluster analysis on real data
This section describes O-Cluster’s results on a set of real
world databases. [MC02] demonstrated the high quality of
O-Cluster’s solution on high-dimensional multimedia data
with numeric attributes. The focus here is on data with
nominal and mixed (numeric and nominal) values. The
clustering results are evaluated with respect to their
accuracy and explanatory power. Comparisons to other
algorithms are also provided.
4.1 Mushroom data set
The mushroom data set from the UCI repository is a
popular benchmark for clustering nominal data. The data
set consists of 8,124 records and there are 22 nominal
multi-valued attributes. In a classification setting, the
objective is to classify the mushrooms as edible or
poisonous on the basis of their physical attributes. 51.8%
of the mushroom entries are labeled as edible. Although
clustering is an unsupervised task, the clusters discovered
by the algorithm may be able to capture some underlying
structure that strongly correlates with the target labels.

Figure 8: O-Cluster results on the mushroom data set (ρ = 0.65). Leaf numbers are the percentage of poisonous mushrooms in a cluster.

Figure 8 depicts the hierarchical tree discovered by O-Cluster on the mushroom data (the target attribute was not
included). The splitting predicates for each node are also
included. For lack of space, not all branch conditions are
listed - the ellipses (...) stand for all other attribute values
that are not explicitly listed. The numbers in the leaves are
the percentage of poisonous mushrooms in the partition. It
can be seen that the majority of the clusters discovered by
the algorithm differ significantly from the global
distribution and can be labeled as either edible (circles) or
poisonous (octagons). Cluster sizes are reasonably
balanced (minimum of 288 and maximum of 656 points
per cluster). The nature of the hierarchy allows the
extraction of simple descriptive rules. An example rule
from Figure 8 is: If there is no odor, and the
population is ‘several’, and the cap color
is brown, then the mushrooms are edible.
According to [GRS99], a standard agglomerative
method results in clusters that follow the global
distribution and there is no correlation with the target
class. ROCK [GRS99] uses an alternative bottom-up
agglomerative approach and successfully finds clusters of
very high target class purity. While O-Cluster’s results do
not have as high purity as the ones reported in [GRS99],
the top-down approach used here is advantageous in terms
of efficiency and explanatory power.
4.2 Adult data set
The adult data set from the UCI repository contains both
numeric and nominal values (6 numeric and 8 nominal
attributes). There are a total of 48,842 records, split in a
2:1 ratio to form a train and test set. The target class
indicates whether a person had an income exceeding
$50,000. The data set is unbalanced – only 23.9% of the
records belong to the positive class (income above
$50,000).

Figure 9: O-Cluster results on the Adult data set (ρ = 0.5). Leaf numbers are the percentage of people with income above $50,000 in a cluster.

Figure 9 shows O-Cluster’s results when the
training data (excluding target) was used as an input. The
algorithm found 14 leaf clusters. The numbers in the
leaves represent the percentage of cases with income
above $50,000. The distribution of positive cases in the
leaves differs significantly from the global distribution.
The tree built by O-Cluster is fairly unbalanced. Initially,
the algorithm isolates clusters that are dominated by
positive cases. Subsequently, it branches out to
differentiate groups among the negative cases. Clusters
with above 50% concentration of positive cases are
labeled positive. If we apply this ‘classification’ to the test
data (that is, assign cases as positive or negative based on
the cluster label they would fall in) the test set accuracy is
82.4%. State-of-the-art supervised algorithms have been
reported to achieve accuracy in the range of 83-85%. To
validate further the quality of the results, we used the
same paradigm for a standard bi-secting k-Means
algorithm. After exploding the nominal attributes into
binary values, the training data were grouped into 14
clusters. Then these clusters were labeled with the
predominant class. On the test data, k-Means accuracy
was 77.2%. Also, the average cluster purity was 78.6%
vs. 81% in the case of O-Cluster. The distance-based k-Means
algorithm produced clusters that did not separate
the rare class well from the dominant class.
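The labeling paradigm used in this comparison is straightforward to reproduce. A minimal sketch, assuming cluster assignments and target labels are available as parallel sequences:

from collections import Counter

def cluster_label_accuracy(train_clusters, train_y,
                           test_clusters, test_y):
    """Sketch: label each cluster with its predominant training class,
    then score held-out records by the label of their cluster."""
    votes = {}
    for c, y in zip(train_clusters, train_y):
        votes.setdefault(c, Counter())[y] += 1
    label = {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}
    hits = sum(label.get(c) == y for c, y in zip(test_clusters, test_y))
    return hits / len(test_y)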
4.3 Magazine marketing data set
The proprietary magazine marketing data set contains
111,000 records with mixed values (34 numeric and 10
nominal attributes). The attributes are based on financial
and demographic information. The target class encodes
whether a person subscribes to a magazine or not. The
distribution of the two classes is balanced. O-Cluster
produced 24 clusters. We compared the cluster purity
(with respect to the target class) to that of bi-secting k-Means on the same data. O-Cluster found clusters that
better differentiated the two target classes. Of the 24
clusters found, 22 were clearly dominated by one of
the classes (11 each) and 2 were close to 50% (global
distribution). For k-Means, 16 clusters were dominated by
the positive class, 2 by the negative class, and 6 were
around 50%. O-Cluster’s average cluster purity was
74.1% while k-Means average cluster purity was 69.7%.
Similar to the adult data set results, O-Cluster
outperformed k-Means in terms of target differentiation.
Due to the explosion of nominal attributes into binary
values, k-Means operates in a higher dimensional input
space where the distance-based metric can become
unreliable. A projection-based approach, such as O-Cluster,
can be advantageous for these types of data,
resulting in a superior clustering solution.
5. Conclusions
The majority of existing clustering algorithms encounter
serious scalability and/or accuracy related problems when
used on databases with a large number of records and/or
attributes. Only a few methods can handle numeric,
nominal, and mixed data. O-Cluster is capable of
efficiently and effectively clustering large, high-dimensional
databases with both numeric and nominal values.

O-Cluster relies on an active sampling approach to
achieve scalability with large volumes of data and
requires at most a single pass through the database. The
algorithm uses an axis-parallel partitioning scheme to
build a hierarchy and identify hyper-rectangular regions
of uni-modal density in the input feature space. The top-down
partitioning strategy ensures excellent scalability
and explanatory power of the clustering solution. O-Cluster
has good accuracy, uses a single tunable
parameter, and can successfully operate with limited
memory resources.

Currently we are extending O-Cluster in a number of
ways, including: parallel implementation, probabilistic
modeling, and scoring with missing values. These
extensions will be reported in a future paper.

References
[CGC94] A. D. Chaturvedi, P. E. Green, and J. D. Carroll. K-Means, K-Medians, and K-Modes: Special Cases of Partitioning Multiway Data. Presented at the Classification Society of North America Meeting, 1994.
[GGR99] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS - Clustering Categorical Data Using Summaries. In Proc. 1999 Int. Conf. on Knowledge Discovery and Data Mining (KDD’99), 1999, pp. 73–83.
[GKR98] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. In Proc. 24th Int. Conf. on Very Large Data Bases (VLDB’98), 1998, pp. 311–323.
[GRS99] S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In Proc. IEEE Int. Conf. on Data Engineering, 1999, pp. 512–521.
[HAK00] A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What Is the Nearest Neighbor in High Dimensional Spaces? In Proc. 26th Int. Conf. on Very Large Data Bases (VLDB’00), 2000, pp. 506–515.
[HK99] A. Hinneburg and D. A. Keim. Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. In Proc. 25th Int. Conf. on Very Large Data Bases (VLDB’99), 1999, pp. 506–517.
[Hua97a] Z. Huang. A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. Research Issues on Data Mining and Knowledge Discovery, 1997.
[Hua97b] Z. Huang. Clustering Large Data Sets with Mixed Numeric and Categorical Values. In Proc. First Pacific-Asia Conference on Knowledge Discovery and Data Mining, 1997.
[MC02] B. L. Milenova and M. M. Campos. O-Cluster: Scalable Clustering of Large High Dimensional Data Sets. In Proc. 2002 IEEE Int. Conf. on Data Mining (ICDM’02), 2002, pp. 290–297.
[Sco79] D. W. Scott. Multivariate Density Estimation. John Wiley & Sons, New York, 1979.
[VM95] D. Ventura and T. R. Martinez. An Empirical Comparison of Discretization Methods. In Proc. Tenth Int. Symp. on Computer and Information Sciences, 1995, pp. 147–176.
[Wan96] M. P. Wand. Data-Based Choice of Histogram Bin Width. The American Statistician, Vol. 51, 1996, pp. 59–64.
[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proc. 1996 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD’96), 1996, pp. 103–114.