BIRCH: An Efficient Data Clustering Method for Very Large Databases

Tian Zhang                      Raghu Ramakrishnan              Miron Livny*
Computer Sciences Dept.         Computer Sciences Dept.         Computer Sciences Dept.
Univ. of Wisconsin-Madison      Univ. of Wisconsin-Madison      Univ. of Wisconsin-Madison
zhang@cs.wisc.edu               raghu@cs.wisc.edu               miron@cs.wisc.edu
Abstract

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs.

This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively.

We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.

1 Introduction

In this paper, we examine data clustering, which is a particular kind of data mining problem. Given a large set of multi-dimensional data points, the data space is usually not uniformly occupied. Data clustering identifies the sparse and the crowded places, and hence discovers the overall distribution patterns of the dataset. Besides, the derived clusters can be visualized more efficiently and effectively than the original dataset [Lee81, DJ80].

Generally, there are two types of attributes involved in the data to be clustered: metric and nonmetric(1). In this paper we consider metric attributes, as in most of the Statistics literature, where the clustering problem is formalized as follows: given the desired number of clusters K, a dataset of N points, and a distance-based measurement function (e.g., the weighted total/average distance between pairs of points in clusters), we are asked to find a partition of the dataset that minimizes the value of the measurement function. This is a nonconvex discrete optimization problem. Due to an abundance of local minima, there is typically no way to find a global minimal solution without trying all possible partitions.

We adopt the problem definition used in Statistics, but with an additional, database-oriented constraint: the amount of memory available is limited (typically, much smaller than the dataset size) and we want to minimize the time required for I/O. A related point is that it is desirable to be able to take into account the amount of time that a user is willing to wait for the results of the clustering algorithm.

We present a clustering method named BIRCH, and demonstrate that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points, and can typically produce a good clustering with a single scan of the data; one or more additional passes can (optionally) be used to improve the quality further. By evaluating BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments, we argue that BIRCH is the best available clustering method for handling very large datasets. BIRCH's architecture also offers opportunities for parallelism, and for interactive or dynamic performance tuning based on knowledge about the dataset, gained over the course of the execution. Finally, BIRCH is the first clustering algorithm proposed in the database area that addresses outliers (intuitively, data points that should be regarded as "noise") and proposes a plausible solution.

(1) Informally, a metric attribute is an attribute whose values satisfy the requirements of Euclidian space, i.e., self identity (for any value X, X = X) and the triangular inequality (for any values X1, X2, X3, d(X1,X2) + d(X2,X3) >= d(X1,X3)).

* This research has been supported by NSF Grant IRI-9057562 and NASA Grant 144-EC78.
1.1 Outline of the Paper

The rest of the paper is organized as follows. Sec. 2 surveys related work and summarizes BIRCH's contributions. Sec. 3 presents some background material. Sec. 4 introduces the concepts of clustering feature (CF) and CF tree, which are central to BIRCH. The details of the BIRCH algorithm are described in Sec. 5, and a preliminary performance study of BIRCH is presented in Sec. 6. Finally, Sec. 7 presents our conclusions and directions for future research.

2 Summary of Relevant Research

Data clustering has been studied in the Statistics [DH73, Lee81, DJ80, Mur83, KR90], Machine Learning [CKS88, Fis87, Fis95, Leb87] and Database [NH94, EKX95a, EKX95b] communities, with different methods and different emphases. Previous approaches, probability-based (like most approaches in Machine Learning) or distance-based (like most work in Statistics), do not adequately consider the case that the dataset can be too large to fit in main memory. In particular, they do not recognize that the problem must be viewed in terms of how to work with a limited amount of resources (e.g., memory that is typically much smaller than the size of the dataset) to do the clustering as accurately as possible while keeping the I/O costs low.

Probability-based approaches [CKS88, Fis87, Leb87]: They typically make the assumption that the probability distributions on separate attributes are statistically independent of each other. In reality, this is far from true: correlation between attributes exists, and sometimes this kind of correlation is exactly what we are looking for. The probability representations of clusters also make updating and storing the clusters very expensive, because the complexities depend not only on the number of attributes, but also on the number of values for each attribute. A related problem is that, frequently (e.g., in [Fis87]), the probability-based tree that is built to identify clusters is not height-balanced for skewed input data, which may cause the time and space performance to degrade dramatically.

Distance-based approaches [DH73, KR90, Mur83]: They assume that all data points are given in advance and can be scanned frequently. They totally or partially ignore the fact that not all data points in the dataset are equally important with respect to the clustering purpose, and that data points which are close and dense should be considered collectively instead of individually. They are global or semi-global methods at the granularity of data points: for each clustering decision, they inspect all data points or all currently existing clusters equally, no matter how close or far away they are, and they use global measurements, which require scanning all data points or all currently existing clusters. Hence none of them scales linearly with stable quality.

With exhaustive enumeration (EE), finding the partition of N data points into K clusters that minimizes the measurement function requires trying approximately K^N/K! possible partitions [DH73], which is infeasible except for extremely small problems. Iterative optimization [DH73, KR90] starts with an initial partition, and then tries all possible moving or swapping of data points from one group to another to see if such a move improves the value of the measurement function. It can find a local minimum, but the quality of the local minimum is very sensitive to the initially selected partition, and there is no guarantee that the local minimum found is close to a global minimum. Hierarchical clustering (HC) [DH73, KR90, Mur83] does not try to find "best" clusters, but keeps merging the closest pair (or splitting the farthest pair) of objects to form clusters. With a reasonable distance measurement, the best time complexity of a practical HC algorithm is O(N^2), which is still too expensive for large datasets.

Clustering has been recognized as a useful spatial data mining method recently. [NH94] presents CLARANS, a clustering algorithm derived from the K-medoid approaches in Statistics. A cluster is represented by its medoid, the most centrally located data point within the cluster, and the clustering process is formalized as searching a graph in which each node is a K-partition represented by a set of K medoids, and two nodes are neighbors if they differ by only one medoid. CLARANS starts with a randomly selected node. For the current node, it checks at most maxneighbor neighbors randomly; if a better neighbor is found, it moves to that neighbor and continues; otherwise it records the current node as a "local minimum", and restarts from another randomly selected node to search for another local minimum. CLARANS stops after numlocal local minima have been found, and returns the best of these. The experiments in [NH94] show that CLARANS outperforms the traditional K-medoid algorithms (such as CLARA [KR90]). However, CLARANS suffers from the same drawbacks as the other distance-based methods: it works at the granularity of individual data points, it is sensitive to the randomly selected starting nodes, and for datasets that do not fit in memory its repeated scans make it very expensive.

To deal with data objects that may reside on disks, [EKX95a] and [EKX95b] later propose focusing techniques (based on R*-trees) to improve CLARANS's efficiency: (1) clustering a sample of the dataset that is drawn from each R*-tree data page; and (2) focusing on relevant data points for distance and quality updates. Their experiments show that the efficiency is improved with a small loss of clustering quality.
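To make the search concrete, here is a minimal Python sketch of a CLARANS-style randomized medoid search as described above. It is our illustration, not the code of [NH94]; the `dist` function and all parameter values are supplied by the caller, and a real implementation would also guard against duplicate medoids.

```python
import random

def clarans(points, K, numlocal, maxneighbor, dist):
    """Hedged sketch of CLARANS-style randomized K-medoid search.

    A "node" is a set of K medoid indices; two nodes are neighbors
    if they differ in exactly one medoid.
    """
    def cost(medoids):
        # Total distance from each point to its closest medoid.
        return sum(min(dist(p, points[m]) for m in medoids) for p in points)

    best, best_cost = None, float("inf")
    for _ in range(numlocal):            # search numlocal local minima
        current = random.sample(range(len(points)), K)
        current_cost = cost(current)
        checked = 0
        while checked < maxneighbor:     # examine random neighbors
            neighbor = list(current)     # swap one medoid at random
            neighbor[random.randrange(K)] = random.randrange(len(points))
            c = cost(neighbor)
            if c < current_cost:         # move to the better neighbor
                current, current_cost, checked = neighbor, c, 0
            else:
                checked += 1
        if current_cost < best_cost:     # keep the best local minimum
            best, best_cost = current, current_cost
    return best, best_cost
```

Note that every call to `cost` scans all N points, which illustrates why this family of methods is expensive when the dataset does not fit in memory.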
2.1 Contributions of BIRCH

An important contribution is our formulation of the clustering problem in a way that is appropriate for very large datasets, by making the time and memory constraints explicit. In addition, BIRCH has the following advantages over previous distance-based approaches:

- BIRCH is local (as opposed to global or semi-global), in that each clustering decision is made without scanning all data points or all currently existing clusters. It uses measurements that reflect the natural closeness of points and, at the same time, can be maintained incrementally during the clustering process.

- BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence that not every data point is equally important for clustering purposes. A dense region of points is treated collectively as a single subcluster, and points in sparse regions are (optionally) treated as outliers and removed.

- BIRCH makes full use of available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency). The clustering and reducing process is organized and characterized by the use of an in-memory, height-balanced and highly-occupied tree structure. Due to its incremental nature, BIRCH does not require the whole dataset in advance, and it scans the dataset only once.
3 Background

Assuming that readers are familiar with the terminology of vector spaces, we begin by defining centroid, radius and diameter for a cluster. Given N d-dimensional data points in a cluster, {X_i} where i = 1, 2, ..., N, the centroid X0, radius R and diameter D of the cluster are defined as:

X0 = \frac{\sum_{i=1}^{N} X_i}{N}    (1)

R = \left( \frac{\sum_{i=1}^{N} (X_i - X0)^2}{N} \right)^{1/2}    (2)

D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)} \right)^{1/2}    (3)

R is the average distance from member points to the centroid, and D is the average pairwise distance within the cluster. They are two alternative measures of the tightness of the cluster around the centroid.

Next, between two clusters, we define five alternative distances for measuring their closeness. Given the centroids of two clusters, X01 and X02, the centroid Euclidian distance D0 and centroid Manhattan distance D1 of the two clusters are defined as:

D0 = \left( (X01 - X02)^2 \right)^{1/2}    (4)

D1 = |X01 - X02| = \sum_{i=1}^{d} |X01^{(i)} - X02^{(i)}|    (5)

Given N1 d-dimensional data points in a cluster, {X_i} where i = 1, ..., N1, and N2 data points in another cluster, {X_j} where j = N1+1, N1+2, ..., N1+N2, the average inter-cluster distance D2, average intra-cluster distance D3 and variance increase distance D4 of the two clusters are defined as:

D2 = \left( \frac{\sum_{i=1}^{N1} \sum_{j=N1+1}^{N1+N2} (X_i - X_j)^2}{N1 \, N2} \right)^{1/2}    (6)

D3 = \left( \frac{\sum_{i=1}^{N1+N2} \sum_{j=1}^{N1+N2} (X_i - X_j)^2}{(N1+N2)(N1+N2-1)} \right)^{1/2}    (7)

D4 = \left( \sum_{k=1}^{N1+N2} (X_k - X0_{12})^2 - \sum_{i=1}^{N1} (X_i - X01)^2 - \sum_{j=N1+1}^{N1+N2} (X_j - X02)^2 \right)^{1/2}    (8)

D3 is actually the diameter D of the merged cluster, and D4 is the increase of the sum of squared deviations from the centroid caused by merging the two clusters (X0_{12} denotes the centroid of the merged cluster). X0, R, D, D0, D1, D2, D3 and D4 are all defined with the Euclidian metric. Users can optionally preprocess the data by weighting or shifting it along different dimensions without affecting the relative placement of points; for the sake of clarity, we treat all dimensions equally in the rest of the paper.
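For concreteness, definitions (1) through (5) translate directly into code. The following is a small Python sketch of ours (not part of the original paper):

```python
import math

def centroid(X):
    """X0: component-wise mean of the points (eq. 1)."""
    N, d = len(X), len(X[0])
    return [sum(x[j] for x in X) / N for j in range(d)]

def radius(X):
    """R: average distance from member points to the centroid (eq. 2)."""
    x0 = centroid(X)
    return math.sqrt(sum(sum((xj - cj) ** 2 for xj, cj in zip(x, x0))
                         for x in X) / len(X))

def diameter(X):
    """D: average pairwise distance within the cluster (eq. 3).
    Including the zero i == j terms does not change the sum."""
    N = len(X)
    s = sum(sum((ai - bi) ** 2 for ai, bi in zip(a, b))
            for a in X for b in X)
    return math.sqrt(s / (N * (N - 1)))

def d0(x01, x02):
    """Centroid Euclidian distance (eq. 4)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x01, x02)))

def d1(x01, x02):
    """Centroid Manhattan distance (eq. 5)."""
    return sum(abs(a - b) for a, b in zip(x01, x02))
```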
4 Clustering Feature and CF Tree

The concepts of Clustering Feature (CF) and CF tree are at the core of BIRCH's incremental clustering. A CF vector is a triple that summarizes the information we maintain about a subcluster of data points.

4.1 Clustering Feature

Definition 4.1 Given N d-dimensional data points in a cluster, {X_i} where i = 1, 2, ..., N, the Clustering Feature (CF) vector of the cluster is defined as a triple, CF = (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the N data points, i.e., \sum_{i=1}^{N} X_i, and SS is the square sum of the N data points, i.e., \sum_{i=1}^{N} X_i^2.

Theorem 4.1 (CF Additivity Theorem) Assume that CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF vectors of two disjoint clusters. Then the CF vector of the cluster that is formed by merging the two disjoint clusters is:

CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)    (9)

The proof consists of straightforward algebra.

From the CF definition and the additivity theorem, we know that the CF vectors of clusters can be stored and calculated incrementally and accurately as clusters are merged. One can also prove that, given the CF vectors of clusters, the corresponding X0, R, D, D0, D1, D2, D3 and D4, as well as the usual quality metrics (such as the weighted total/average diameter of clusters), can all be calculated easily. Thus one can think of a cluster as a set of data points, of which only the CF vector is stored as a summary. This CF summary is not only efficient, because it stores much less than all the data points in the cluster, but also accurate, because it is sufficient for calculating all the measurements that we need for making clustering decisions in BIRCH.
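As an illustration (our sketch, not the paper's implementation), a CF vector can be carried as an (N, LS, SS) triple. Additivity is a component-wise sum, and R, D and the inter-cluster distances follow from the triple alone, using the identities R^2 = SS/N - |LS/N|^2 and D^2 = (2N*SS - 2|LS|^2)/(N(N-1)), both derived by expanding equations (2) and (3):

```python
import math

class CF:
    """Clustering Feature: (N, linear sum LS, square sum SS)."""
    def __init__(self, N, LS, SS):
        self.N, self.LS, self.SS = N, LS, SS

    @classmethod
    def of_point(cls, x):
        return cls(1, list(x), sum(v * v for v in x))

    def __add__(self, other):
        # CF Additivity Theorem (eq. 9): merge two disjoint clusters.
        return CF(self.N + other.N,
                  [a + b for a, b in zip(self.LS, other.LS)],
                  self.SS + other.SS)

    def centroid(self):
        return [v / self.N for v in self.LS]

    def radius(self):
        # R^2 = SS/N - |LS/N|^2, derived from eq. (2).
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.SS / self.N - c2, 0.0))

    def diameter(self):
        # D^2 = (2N*SS - 2|LS|^2) / (N(N-1)), derived from eq. (3).
        if self.N < 2:
            return 0.0
        ls2 = sum(v * v for v in self.LS)
        return math.sqrt(max((2 * self.N * self.SS - 2 * ls2)
                             / (self.N * (self.N - 1)), 0.0))

def d0_cf(cf1, cf2):
    """Centroid Euclidian distance computed from CF vectors only."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(cf1.centroid(), cf2.centroid())))
```

Merging two subclusters and testing the diameter of the result therefore never touches the original points, which is exactly what makes the CF summary sufficient for clustering decisions.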
4.2 CF Tree

A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T. Each nonleaf node contains at most B entries of the form [CFi, childi], where i = 1, 2, ..., B, "childi" is a pointer to its i-th child node, and CFi is the CF vector of the subcluster represented by this child. So a nonleaf node represents a cluster made up of all the subclusters represented by its entries. A leaf node contains at most L entries, each of the form [CFi], where i = 1, 2, ..., L. In addition, each leaf node has two pointers, "prev" and "next", which are used to chain all leaf nodes together for efficient scans. A leaf node also represents a cluster made up of all the subclusters represented by its entries, but all entries in a leaf node must satisfy a threshold requirement with respect to a threshold value T: the diameter (or radius) of each leaf entry has to be less than T.

The tree size is a function of T: the larger T is, the smaller the tree. We require a node to fit in a page of memory of size P, where P is a parameter of BIRCH. Once the dimension d of the data space is given, the sizes of leaf and nonleaf entries are known, and then B and L are determined by P. So P can be varied for performance tuning.

Such a CF tree will be built dynamically as new data objects are inserted. It is used to guide a new insertion into the closest subcluster for clustering purposes, just as a B+-tree is used to guide a new insertion into the correct position for sorting purposes. The CF tree is a very compact representation of the dataset, because each entry in a leaf node is not a single data point but a subcluster (which absorbs as many data points as the threshold T allows).

Insertion into a CF Tree

We now present the algorithm for inserting an entry "Ent" (a single data point, or a subcluster) into a CF tree:

1. Identifying the appropriate leaf: Starting from the root, it recursively descends the CF tree by choosing the closest child node according to a chosen distance metric: D0, D1, D2, D3 or D4, as defined in Sec. 3.

2. Modifying the leaf: When it reaches a leaf node, it finds the closest leaf entry, say Li, and then tests whether Li can "absorb" "Ent" without violating the threshold condition(2). If so, the CF vector for Li is updated to reflect this. If not, a new entry for "Ent" is added to the leaf. If there is space on the leaf for this new entry, we are done; otherwise we must split the leaf node. Node splitting is done by choosing the farthest pair of entries as seeds, and redistributing the remaining entries based on the closest criteria.

3. Modifying the path to the leaf: After inserting "Ent" into a leaf, we must update the CF information for each nonleaf entry on the path from the root to the leaf. In the absence of a split, this simply involves adding CF vectors to reflect the addition of "Ent". A leaf split requires us to insert a new nonleaf entry into the parent node, to describe the newly created leaf. If the parent has space for this entry, at all higher levels we only need to update the CF vectors to reflect the addition of "Ent". In general, however, we may have to split the parent as well, and so on up to the root. If the root is split, the tree height increases by one.

(2) That is, the cluster merged from Li and "Ent" must satisfy the threshold condition. Note that the CF vector of the merged cluster can be computed from the CF vectors of Li and "Ent" using the additivity theorem.
4. A Merging Refinement: Splits are caused by the page size, which is independent of the clustering properties of the data. In the presence of skewed data input order, this can affect the clustering quality and also reduce space utilization. A simple additional merging step often helps ameliorate these problems. Suppose that there is a leaf split, and the propagation of this split stops at some nonleaf node Nj, i.e., Nj can accommodate the additional entry resulting from the split. We now scan node Nj to find the two closest entries. If they are not the pair corresponding to the split, we try to merge them and the corresponding two child nodes. If there are more entries in the two child nodes than one page can hold, we split the merging result again. During the resplitting, in case one of the seeds attracts enough merged entries to fill a page, we just put the rest of the entries with the other seed. In summary, if the merged entries fit on a single page, we free a node space for later use and create one more entry space in node Nj, thereby increasing space utilization and postponing future splits; otherwise we improve the distribution of entries in the two closest children.

Since each node can hold only a limited number of entries due to its size, it does not always correspond to a natural cluster. Occasionally, two subclusters that should have been in one cluster are split across nodes. Depending upon the order of data input and the degree of skew, it is also possible that two subclusters that should not be in one cluster are kept in the same node. These infrequent but undesirable anomalies caused by page size are remedied with a global (or semi-global) algorithm that rearranges leaf entries across nodes (Phase 3, discussed in Sec. 5). Another undesirable artifact is that if the same data point is inserted twice, but at two different times, the two copies might be entered into two distinct leaf entries; in other words, occasionally with a skewed input order, a point might enter a leaf entry that it should not have entered. This problem can be addressed with further refinement passes over the data (Phase 4, discussed in Sec. 5).
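The following Python sketch condenses steps 1 and 2 (and the CF updates of step 3) into a recursive insert. It is our own simplification, reusing the CF class and d0_cf from the sketch in Sec. 4.1: root splits are left to the caller, and the merging refinement and the prev/next leaf chain are omitted.

```python
class Node:
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.cfs = []        # one CF vector per entry
        self.children = []   # one child Node per entry (nonleaf only)

def closest(cfs, ent):
    return min(range(len(cfs)), key=lambda i: d0_cf(cfs[i], ent))

def sum_cfs(cfs):
    total = cfs[0]
    for cf in cfs[1:]:
        total = total + cf
    return total

def split(node):
    """Split a node: farthest pair of entries become the seeds,
    remaining entries go to the closer seed."""
    pairs = [(d0_cf(node.cfs[i], node.cfs[j]), i, j)
             for i in range(len(node.cfs)) for j in range(i)]
    _, a, b = max(pairs)
    left, right = Node(node.is_leaf), Node(node.is_leaf)
    for k, cf in enumerate(node.cfs):
        dest = left if d0_cf(cf, node.cfs[a]) <= d0_cf(cf, node.cfs[b]) else right
        dest.cfs.append(cf)
        if not node.is_leaf:
            dest.children.append(node.children[k])
    return left, right

def insert(node, ent, T, B, L):
    """Insert CF `ent` into the subtree rooted at `node`.  Returns None,
    or the pair of nodes replacing `node` after a split.  If the root
    returns a pair, the caller creates a new root above them."""
    if node.is_leaf:
        if node.cfs:
            i = closest(node.cfs, ent)
            merged = node.cfs[i] + ent
            if merged.diameter() < T:        # threshold test: absorb
                node.cfs[i] = merged
                return None
        node.cfs.append(ent)                 # new leaf entry
        return split(node) if len(node.cfs) > L else None
    i = closest(node.cfs, ent)               # descend to closest child
    result = insert(node.children[i], ent, T, B, L)
    if result is None:
        node.cfs[i] = node.cfs[i] + ent      # step 3: update CF on path
        return None
    left, right = result                     # child split: replace entry
    node.cfs[i:i + 1] = [sum_cfs(left.cfs), sum_cfs(right.cfs)]
    node.children[i:i + 1] = [left, right]
    return split(node) if len(node.cfs) > B else None
```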
5 The BIRCH Clustering Algorithm

Fig. 1 presents the overview of BIRCH.

[Figure 1: BIRCH Overview. Data -> Phase 1: Load into memory by building a CF tree -> Phase 2 (optional): Condense into desirable range by building a smaller CF tree -> Phase 3: Global Clustering -> Phase 4 (optional): Cluster Refining]

The main task of Phase 1 is to scan all data and build an initial in-memory CF tree, using the given amount of memory and recycling space on disk. This CF tree tries to reflect the clustering information of the dataset as fine as possible under the memory limit: crowded data points are grouped into fine subclusters, and sparse data points are removed as outliers. The details of Phase 1 will be discussed in Sec. 5.1. After Phase 1, subsequent computations in later phases will be:

1. fast, because (a) no I/O operations are needed, and (b) the problem of clustering the original data is reduced to a smaller problem of clustering the subclusters in the leaf entries;
2. accurate, because (a) a lot of outliers are eliminated, and (b) the remaining data is reflected with the finest granularity that can be achieved given the available memory;
3. less order sensitive, because the leaf entries of the initial tree form an input order with better data locality compared with the arbitrary original input order.

Phase 2 is optional. We have observed that the existing global or semi-global clustering methods applied in Phase 3 have different input size ranges within which they perform well in terms of both speed and quality. So potentially there is a gap between the size of the Phase 1 result and the best performance range of the Phase 3 algorithm. Phase 2 serves as a cushion and bridges this gap: similar to Phase 1, it scans the leaf entries in the initial CF tree to rebuild a smaller CF tree, while removing more outliers and grouping more crowded subclusters into larger ones.

The undesirable effect of the splitting triggered by page size (Sec. 4.2) is that the tree may be unfaithful to the actual clustering patterns in the data. We remedy this problem in Phase 3 by applying a global or semi-global algorithm to cluster all the leaf entries. We observe that existing clustering algorithms that work on a set of data points can be readily adapted to work on a set of subclusters, each described by its CF vector. For example, naively, we could treat each subcluster as a single data point located at its centroid and apply the algorithm directly; a better way is to modify the algorithm to take the full CF vectors into account. Currently, we have adapted an agglomerative hierarchical clustering algorithm, applying it directly to the subclusters represented by their CF vectors. It uses an accurate distance metric (D2 or D4), which can be calculated from the CF vectors, and it has a complexity of O(N^2) wrt. its input size. It also offers the flexibility of letting the user specify either the desired number of clusters or the desired diameter (or radius) threshold for clusters.

After Phase 3, we obtain a set of clusters that captures the major distribution pattern in the data. However, minor and localized inaccuracies might exist, because of the rare misplacement problem (Sec. 4.2), and because Phase 3 is applied on a coarse summary of the data. Phase 4 is optional, and entails the cost of additional passes over the data to correct those inaccuracies and refine the clusters further. Note that, up to this point, the original data has only been scanned once, although the tree and outlier information may have been scanned multiple times.

Phase 4 uses the centroids of the clusters produced by Phase 3 as seeds, and redistributes each data point to its closest seed to obtain a set of new clusters. Not only does this allow points belonging to a cluster to migrate, but also it ensures that all copies of a given data point go to the same cluster. Phase 4 can be extended with additional passes if desired by the user, and it has been proved to converge to a minimum [GG92]. As a bonus, during this pass each data point can be labeled with the cluster that it belongs to, if we wish to identify the data points in each cluster. Phase 4 also provides us with the option of discarding outliers: a point that is too far from its closest seed can be treated as an outlier and not included in the result.
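A single Phase 4-style pass is essentially one step of centroid-seeded reassignment. The sketch below is ours, not the paper's code; `outlier_cutoff` is a hypothetical knob standing in for the optional outlier-discarding described above.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def refine(points, seeds, outlier_cutoff=None):
    """One refinement pass: redistribute each point to its closest seed
    (a cluster centroid from Phase 3); optionally treat points farther
    than `outlier_cutoff` from every seed as outliers.  Repeating the
    pass with the returned seeds converges to a minimum [GG92]."""
    clusters = [[] for _ in seeds]
    outliers = []
    for p in points:
        i = min(range(len(seeds)), key=lambda k: euclid(p, seeds[k]))
        if outlier_cutoff is not None and euclid(p, seeds[i]) > outlier_cutoff:
            outliers.append(p)               # optional outlier-discarding
        else:
            clusters[i].append(p)
    d = len(seeds[0])
    new_seeds = [[sum(x[j] for x in c) / len(c) for j in range(d)]
                 for c in clusters if c]
    return new_seeds, clusters, outliers
```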
5.1 Phase 1 Revisited

Fig. 2 shows the details of Phase 1. It starts with an initial threshold value, scans the data, and inserts points into the tree. If it runs out of memory before it finishes scanning the data, it increases the threshold value and rebuilds a new, smaller CF tree by re-inserting the leaf entries of the old tree into it. After the old leaf entries have been re-inserted, the scanning of the data (and insertion into the new tree) is resumed from the point at which it was interrupted.
[Figure 2: Control Flow of Phase 1. Start a CF tree t1 of initial threshold T; continue scanning the data and inserting into t1. When out of memory: (1) increase T; (2) rebuild a CF tree t2 of the new T from CF tree t1: if a leaf entry of t1 is a potential outlier and disk space is available, write it out to disk, otherwise use it to rebuild t2; (3) t1 <- t2. When out of disk space, or when the scan finishes: re-absorb potential outliers into t1.]
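The control flow of Fig. 2 can be summarized at pseudocode level as follows. Every helper here (`empty_tree`, `insert_point`, `out_of_memory`, `next_threshold`, `rebuild`, `reabsorb_potential_outliers`) is a hypothetical stand-in for the machinery of Secs. 5.1.1 through 5.1.4, not an actual BIRCH API.

```python
def phase1(data_scan, initial_T, memory_limit, disk):
    """Sketch of the Phase 1 driver loop (our rendering of Fig. 2)."""
    T = initial_T
    tree = empty_tree(T)                          # hypothetical helper
    for point in data_scan:
        insert_point(tree, point)                 # CF-tree insertion (Sec. 4.2)
        if out_of_memory(tree, memory_limit):
            T = next_threshold(tree, T)           # (1) increase the threshold
            # (2) rebuild a smaller tree from the old leaf entries,
            #     pushing sparse entries to disk as potential outliers
            tree = rebuild(tree, T, disk)
            if disk.full():                       # out of disk space
                reabsorb_potential_outliers(tree, disk)
    reabsorb_potential_outliers(tree, disk)       # final outlier verification
    return tree
```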
5.1.1 Reducibility

Assume ti is a CF tree of threshold Ti. Its height is h, and its size (number of nodes) is Si. Given Ti+1 >= Ti, we want to use all the leaf entries of ti to rebuild a CF tree, ti+1, of threshold Ti+1, such that the size of ti+1 is not larger than Si, and such that the rebuilding needs at most h extra pages of memory. The rebuilding algorithm works as follows.

Within each node of CF tree ti, the entries are labeled contiguously from 0 to nk - 1, where nk is the number of entries in that node. A path from an entry in the root (level 1) to a leaf node (level h) can then be uniquely represented by (i1, i2, ..., i_{h-1}), where ij (j = 1, ..., h-1) is the label of the j-th level entry on that path. So, naturally, path (i1^(1), ..., i_{h-1}^(1)) is before (or <) path (i1^(2), ..., i_{h-1}^(2)) if i1^(1) = i1^(2), ..., i_{j-1}^(1) = i_{j-1}^(2), and ij^(1) < ij^(2) for some j (0 <= j <= h-1). It is obvious that a leaf node corresponds to a path uniquely, and we will use path and leaf node interchangeably from now on.

The algorithm scans and frees the old tree, path by path, and at the same time creates the new tree, path by path. The new tree starts with NULL, and "OldCurrentPath" is initially the leftmost path in the old tree. For "OldCurrentPath", the algorithm proceeds as follows:

1. Create the corresponding "NewCurrentPath" in the new tree: nodes are added to the new tree exactly as in the old tree, so that there is no chance that the new tree ever becomes larger than the old tree.
2. Insert leaf entries in "OldCurrentPath" into the new tree: with the new threshold, each leaf entry in "OldCurrentPath" is tested against the new tree to see if it can fit in(3) the "NewClosestPath" that is found top-down with the closest criteria in the new tree. If yes, and "NewClosestPath" is before "NewCurrentPath", it is inserted into "NewClosestPath", and the space in "NewCurrentPath" is left available for later use; otherwise it is inserted into "NewCurrentPath" without creating any new node.
3. Free space in "OldCurrentPath" and "NewCurrentPath": once all leaf entries in "OldCurrentPath" are processed, the un-needed nodes along "OldCurrentPath" can be freed. It is also likely that some nodes along "NewCurrentPath" are empty, because leaf entries that originally corresponded to this path have been "pushed forward"; in this case the empty nodes are freed too.
4. "OldCurrentPath" is set to the next path in the old tree, if one exists, and the above steps are repeated.

(3) That is, either absorbed by an existing leaf entry, or created as a new leaf entry without splitting.

[Figure 3: Rebuilding the CF Tree]

Since Ti+1 >= Ti, each leaf entry of the old tree is at least as easy to absorb in the new tree as it was in the old one, and the new tree never allocates a node that does not correspond to one in the old tree; the only extra space needed during the transformation is the h pages occupied by "NewCurrentPath". So we have the following theorem:

Reducibility Theorem: Assume we rebuild CF tree ti+1 of threshold Ti+1 from CF tree ti of threshold Ti by the above algorithm, and let Si and Si+1 be the sizes of ti and ti+1 respectively. If Ti+1 >= Ti, then Si+1 <= Si, and the transformation from ti to ti+1 needs at most h extra pages of memory, where h is the height of ti.
5.1.2 Threshold Values

A good choice of threshold value can greatly reduce the number of rebuilds. Since the initial threshold value T0 is increased dynamically, we can adjust for its being too low. But if the initial T0 is too high, we will obtain a less detailed CF tree than is feasible with the available memory. So T0 should be set conservatively. BIRCH sets it to zero by default; a knowledgeable user could change this.

Suppose that Ti turns out to be too small, and that we subsequently run out of memory after Ni data points have been scanned and Ci leaf entries have been formed (each satisfying the threshold condition wrt. Ti). Based on the portion of the data that we have scanned and the tree that we have built up so far, we need to estimate the next threshold value Ti+1. This estimation is a difficult problem, and a full solution is beyond the scope of this paper. Currently, we use the following heuristic approach:

1. We try to choose Ti+1 so that Ni+1 = Min(2Ni, N). That is, whether or not N is known, we choose Ti+1 so that, at most, twice the data seen so far can be absorbed before memory runs out again.

2. Intuitively, we want to increase the threshold based on some measure of volume. There are two distinct notions of volume that we use in estimating the threshold. The first is average volume, defined as Va = r^d, where r is the average radius of the root cluster of the CF tree and d is the dimensionality of the space. Intuitively, this is a measure of the space occupied by the portion of the data seen thus far (the "footprint" of the seen data). A second notion of volume is packed volume, defined as Vp = Ci * Ti^d, where Ci is the number of leaf entries and Ti^d is the maximal volume of a leaf entry. Intuitively, this is a measure of the actual volume occupied by the leaf clusters. Since Ci is essentially the same whenever we run out of memory (we always use up all the available memory to build the CF tree), Vp grows in proportion to Ti^d. We make the assumption that r grows with the number of data points Ni. By maintaining a record of r and Ni, we estimate ri+1 using least squares linear regression, and define the expansion factor f = Max(1.0, ri+1/ri) as a heuristic measure of how the data footprint is growing. The use of Max is motivated by our observation that, for most large datasets, the observed footprint becomes a constant quite quickly (unless the input order is skewed).

3. We traverse a path from the root to a leaf in the CF tree, always going to the child with the most points, in a "greedy" attempt to find the most crowded leaf node, and we calculate the distance Dmin between the closest two entries on this leaf. Since we want to build a more condensed tree, it is reasonable to expect the threshold value to grow to at least Dmin, so that these two entries can be merged.

4. We combine these heuristics: we scale Ti in proportion to the data to be absorbed, multiply by the expansion factor f, and take the larger of the result and Dmin as Ti+1. In the degenerate case that this estimate is not larger than Ti (unlikely, but possible), we simply increase Ti by a small constant factor, so that the rebuilt tree is guaranteed to be smaller.

5.1.3 Outlier-Handling Option

Optionally, we can use R bytes of disk space for handling outliers, which are leaf entries of low density that are judged to be unimportant wrt. the overall clustering pattern. When we rebuild the CF tree by re-inserting the old leaf entries, the size of the new tree is reduced in two ways. First, we increase the threshold value, thereby allowing each leaf entry to "absorb" more points. Second, we treat some leaf entries as potential outliers and write them out to disk. An old leaf entry is considered to be a potential outlier if it has "far fewer" data points than the average leaf entry; "far fewer" is, of course, another heuristic (the exact criterion is a parameter, cf. Sec. 6.3).

Periodically, the disk space may run out, and the potential outliers are scanned to see if they can be re-absorbed into the current tree without causing the tree to grow in size. An increase in the threshold value, or a change in the data distribution due to data read after a potential outlier was written out, could well mean that the potential outlier no longer qualifies as an outlier. When all data has been scanned, the potential outliers left in the disk space must be scanned one last time to verify whether they are indeed outliers; a potential outlier that cannot be absorbed at this last chance is very likely a real outlier, and can be removed.

Note that the entire cycle (insufficient memory triggering a rebuilding of the tree, insufficient disk space triggering a re-absorbing of outliers, and so on) could be repeated several times before the dataset is fully scanned. This effort must be considered, in addition to the cost of scanning the data, in order to assess the cost of Phase 1 accurately.

5.1.4 Delay-Split Option

When we run out of main memory, it may well be the case that still more data points could fit in the current CF tree without changing the threshold, but that some of the data points we read would require us to split a node. A simple idea is to write such data points out to disk (in a manner similar to how outliers are written out), and to proceed reading the data until we run out of disk space as well. The advantage of this approach is that, in general, more data points can fit in the tree before a rebuild is required.
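Putting the four heuristics together, the threshold update can be sketched as below. This is our reconstruction of the heuristic's flavor only; the exact combination rule and constants used by BIRCH may differ.

```python
def ls_extrapolate(history, n_next):
    """Least squares line r = a*n + b through (n, r) observations."""
    m = len(history)
    sx = sum(h[0] for h in history)
    sy = sum(h[1] for h in history)
    sxx = sum(h[0] ** 2 for h in history)
    sxy = sum(h[0] * h[1] for h in history)
    denom = m * sxx - sx * sx
    if denom == 0:
        return history[-1][1]
    a = (m * sxy - sx * sy) / denom
    b = (sy - a * sx) / m
    return a * n_next + b

def next_threshold(Ti, Ni, N, d, r_history, dmin):
    """Estimate T(i+1) from the data seen so far (Sec. 5.1.2 heuristics).
    r_history is a list of (Ni, ri) root-radius observations; dmin is the
    distance between the two closest entries on the most crowded leaf."""
    Ni1 = min(2 * Ni, N) if N is not None else 2 * Ni   # heuristic 1
    ri1 = ls_extrapolate(r_history, Ni1)                # heuristic 2
    f = max(1.0, ri1 / r_history[-1][1])                # expansion factor
    # Heuristics 3 and 4: scale by the data to absorb and by f,
    # and never grow to less than dmin.
    Ti1 = max(dmin, f * Ti * (Ni1 / Ni) ** (1.0 / d))
    # Degenerate case: make sure the threshold really grows.
    return Ti1 if Ti1 > Ti else 1.01 * Ti
```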
6 Performance Studies

We present a complexity analysis, and then discuss the experiments that we have conducted on BIRCH (and CLARANS) using synthetic as well as real datasets.

6.1 Analysis

First we analyze the cpu cost of Phase 1. The maximal size of the tree is M/P. To insert a point, we need to follow a path from root to leaf, touching about 1 + log_B(M/P) nodes. At each node we must examine B entries, looking for the "closest" one, and the cost per entry is proportional to the dimension d. So the cost of inserting all data points is O(d * N * B(1 + log_B(M/P))). In case we must rebuild the tree, the number of rebuilds is about log2(N/N0), where N0 is the number of data points loaded into memory with the initial threshold T0. Each rebuild re-inserts at most all the current leaf entries, so the cpu cost of all the re-insertions is bounded by O(log2(N/N0) * d * (M/P) * B(1 + log_B(M/P))) times the number of entries per page. This analysis is pessimistic, because the tree is far smaller than its maximal size during most of the rebuilds. The costs of Phase 2 and Phase 3 are similar in form but bounded, since their inputs are at most the (bounded number of) leaf entries rather than the whole dataset. Phase 4 scans the dataset again and puts each data point into the proper cluster; the time taken is proportional to N * K, although with the "nearest neighbor" techniques in [GG92] it can be improved significantly.

As for I/O: Phase 1 scans the original data once. With the delay-split and outlier-handling options on, there is some additional I/O for writing out and reading back potential outliers and delayed data points, bounded by the size of the available disk space; so the I/O cost of Phase 1 is not significantly more than the cost of reading in the dataset, at worst about twice that cost. Phase 2 reads only the leaf entries, Phase 3 requires no I/O, and the (optional) Phase 4 scans the dataset again. Since the cost of Phases 1, 2 and 3 is linear in N, and Phase 4 consists of linear scans, BIRCH's running time should scale up linearly with the dataset size. The experiments in Sec. 6.6 confirm this.
6.2 Synthetic Dataset Generator

To study the sensitivity of BIRCH to a wide range of data characteristics, we have used a collection of synthetic datasets produced by a generator that we have developed. The data generation is controlled by a set of parameters that are summarized in Table 1.

Each dataset consists of K clusters of 2-d data points. A cluster is characterized by the number of data points in it (n), its radius (r), and its center (c). n is in the range of [nl, nh], and r is in the range of [rl, rh](4). Once placed, the clusters cover a range of values in both the x and y dimensions; we refer to these ranges as the overview of the dataset.

The location of the center of each cluster is determined by the pattern parameter. Three patterns, grid, sine and random, are currently supported by the generator. When the grid pattern is used, the cluster centers are placed on a sqrt(K) x sqrt(K) grid; the distance between the centers of neighboring clusters on the same row/column is controlled by kg, and is set to kg(rl + rh)/2. The sine pattern places the cluster centers on a curve of a sine function: the x location of the center of cluster i grows linearly with i (so the overview of the dataset on the x dimension is proportional to K), whereas its y location follows the sine function, with the number of cycles controlled by the parameter nc. When the random pattern is used, the x and y locations of the cluster centers are both randomly distributed within the range [0, K].

Once the characteristics of each cluster are determined, the data points for the cluster are generated according to a 2-d independent normal distribution whose mean is the center c and whose variance in each dimension is determined by the radius r. Note that, due to the properties of the normal distribution, the maximum distance between a point in the cluster and the center is unbounded; so a data point that belongs to cluster A may actually be closer to the center of cluster B than to the center of A. We refer to such points as "outsiders".

In addition to the clustered data points, noise in the form of data points uniformly distributed throughout the overview of the dataset can be added. The parameter rn controls the percentage of data points in the dataset that are noise.

The order of the data points in the dataset is controlled by the parameter o. When the randomized option is used, the data points of all clusters and the noise are randomized throughout the dataset, whereas when the ordered option is used, the data points of each cluster are placed together, the clusters are placed in sequence, and the noise is placed at the end.

Table 1: Data Generation Parameters and Their Values or Ranges

Parameter                     Values or Ranges
pattern                       grid, sine, random
K (number of clusters)        4 .. 256
nl (lower bound of n)         0 .. 2500
nh (higher bound of n)        50 .. 2500
rl (lower bound of r)         0 .. sqrt(2)
rh (higher bound of r)        0 .. 4
rn (noise rate)               0% .. 10%
o (input order)               ordered, randomized

(4) Note that nl = nh implies that the number of points in each cluster is fixed, and rl = rh implies that the radius of each cluster is fixed.
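A generator in the spirit of Sec. 6.2 can be sketched as follows. This is our stand-in, not the authors' generator; the per-dimension standard deviation r/sqrt(2), the fixed sine cycle count, and the noise bounding box are assumptions made for illustration.

```python
import math
import random

def generate(K, n_range, r_range, kg, noise_rate, pattern="grid"):
    """Sketch of a 2-d synthetic dataset generator (grid or sine pattern)."""
    nl, nh = n_range
    rl, rh = r_range
    step = kg * (rl + rh) / 2.0                 # spacing of grid centers
    side = int(math.isqrt(K))
    points = []
    for i in range(K):
        n = random.randint(nl, nh)              # cluster size in [nl, nh]
        r = random.uniform(rl, rh)              # cluster radius in [rl, rh]
        if pattern == "grid":                   # centers on a sqrt(K) grid
            c = ((i % side) * step, (i // side) * step)
        else:                                   # centers on a sine curve
            c = (2 * math.pi * i,               # 4 cycles: an assumption
                 K * math.sin(2 * math.pi * i / (K / 4.0)))
        for _ in range(n):                      # 2-d independent normal
            points.append((random.gauss(c[0], r / math.sqrt(2)),
                           random.gauss(c[1], r / math.sqrt(2))))
    for _ in range(int(len(points) * noise_rate)):   # uniform noise
        points.append((random.uniform(0, side * step),
                       random.uniform(0, side * step)))
    random.shuffle(points)                      # o = randomized input order
    return points
```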
6.3 Parameters and Default Setting

BIRCH is capable of working under various settings. Table 2 lists the parameters of BIRCH, their effecting scopes and their default values.

Table 2: BIRCH Parameters and Their Default Settings

Scope     Parameter              Default Value
Global    Memory (M)             80x1024 bytes
Global    Disk space (R)         20% of M
Global    Distance def.          D2
Global    Quality def.           D (weighted average diameter of the clusters)
Phase 1   Initial threshold      0.0
Phase 1   Page size (P)          1024 bytes
Phase 1   Outlier-handling       on
Phase 1   Outlier def.           leaf entry containing fewer than a quarter of the average number of points per leaf entry
Phase 1   Delay-split            on
Phase 3   Input range            no more than 1000 subclusters
Phase 3   Algorithm              adapted HC

Unless stated explicitly otherwise, an experiment is conducted under this default setting. The memory M is 80 kbytes, which is about 5% of the dataset size in the base workload. The quality metric D is the weighted average diameter of the clusters, where the diameter of each cluster is weighted by the number of data points in it; the smaller D is, the better the quality. The outlier-handling option is on, and a leaf entry is treated as an outlier if it contains fewer than a quarter of the average number of data points per leaf entry. The delay-split option is also on, so that BIRCH makes full use of the disk space R. In Phase 2, the leaf entries are condensed into no more than 1000 subclusters, which, based on our study of the adapted HC algorithm used in Phase 3, is within the input size range in which it performs well. Phase 4 refines the clusters with an additional pass over the data, and its discard-outlier option is off, so that all data points will be counted in the quality measurement(5).

(5) From now on, we refer to the clusters generated by the generator as the "actual clusters", and to the clusters identified by BIRCH as the "BIRCH clusters".

6.4 Base Workload Performance

Our first set of experiments evaluates the ability of BIRCH to cluster three large synthetic datasets, one for each pattern, used as the base workload. Their generator settings are given in Table 3.

Table 3: Datasets Used as the Base Workload

DS1:  grid,   K = 100, nl = nh = 1000, rl = rh = sqrt(2), kg = 4, rn = 0%, o = randomized
DS2:  sine,   K = 100, nl = nh = 1000, rl = rh = sqrt(2), rn = 0%, o = randomized
DS3:  random, K = 100, nl = 0, nh = 2000, rl = 0, rh = 4, rn = 0%, o = randomized

Fig. 6 visualizes the actual clusters of DS1 by plotting each cluster as a circle whose center is the centroid, whose radius is the cluster radius, and whose label is the number of points in the cluster. The BIRCH clusters of DS1 are presented the same way in Fig. 7. We observe that the BIRCH clusters are very similar to the actual clusters in terms of location, number of points, and radii. The maximal and average distances between the centroids of an actual cluster and its corresponding BIRCH cluster are 0.17 and 0.07 respectively. The number of points in a BIRCH cluster is no more than 4% different from that of the corresponding actual cluster. The radii of the BIRCH clusters (ranging from 1.25 to 1.40 with an average of 1.32) are close to, but slightly smaller than, the actual radius (1.41); this is because BIRCH assigns the "outsiders" of an actual cluster to a proper BIRCH cluster based on distance. Similar conclusions were reached by analyzing the visual presentations of DS2 and DS3 (the corresponding figures are omitted here due to the lack of space).

As summarized in Table 4, it took BIRCH less than 50 seconds (on an HP 9000/720 workstation) to cluster 100,000 data points of each dataset, and the quality D of the BIRCH clusters is close to that of the actual clusters. To study input order sensitivity, we also ran BIRCH on DS1o, DS2o and DS3o, which correspond to DS1, DS2 and DS3 respectively, except that the generator order parameter o is set to ordered. As Table 4 also shows, changing the order of the data points had almost no impact on the performance of BIRCH, in terms of either time or quality.

[Table 4: BIRCH performance (running time in seconds and weighted average diameter D) on DS1, DS2, DS3 and on their ordered versions DS1o, DS2o, DS3o]
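The quality metric D used in the tables is straightforward to compute from the CF vectors of the final clusters (reusing the CF sketch of Sec. 4.1); the snippet below is our rendering of the definition:

```python
def weighted_average_diameter(clusters):
    """Quality metric D: cluster diameters averaged, weighted by the
    number of points per cluster.  `clusters` is a list of CF vectors
    from the sketch in Sec. 4.1; smaller D means better quality."""
    total = sum(cf.N for cf in clusters)
    return sum(cf.N * cf.diameter() for cf in clusters) / total
```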
6.5 Sensitivity to Parameters

We have studied the sensitivity of BIRCH's performance to some of the major parameters. Due to the lack of space, we present only the main conclusions here (for details, see [ZRL95]).

Initial threshold T0: (1) BIRCH's performance is stable as long as the initial threshold is not excessively high wrt. the dataset. (2) T0 = 0.0 works well, with a little extra running time. (3) If a user does know a good T0, then she/he can be rewarded by a saving of up to 10% in running time.

Page size P: In Phase 1, a smaller (larger) P tends to require a higher (lower) ending threshold and to produce fewer but coarser (more but finer) leaf entries, and hence degrades (improves) the quality at the end of Phase 3. However, with the refinement in Phase 4, the experiments suggest that for P from 256 to 4096, although the qualities at the end of Phase 3 are different, the final qualities after the refinement are almost the same.

Outlier options: BIRCH was tested on "noisy" datasets with all the outlier options on, and off, respectively. With the outlier options on, BIRCH is not slower but faster, and at the same time its quality is much better.
6.6 Time Scalability

Two distinct ways of increasing the dataset size N were used to test the scalability of BIRCH.

Increasing the number of points per cluster: For each of DS1, DS2 and DS3, we created a range of datasets by keeping the generator settings the same, except for changing nl and nh (and hence n) to change N. The running times for the first 3 phases, as well as for all 4 phases, are plotted against N in Fig. 4; both grow linearly wrt. N, consistently for all three patterns.

Increasing the number of clusters: For each of DS1, DS2 and DS3, we created a range of datasets by keeping the generator settings the same, except for changing K to change N. The running times for the first 3 phases are plotted against N in Fig. 5, and are again linear wrt. N for all three patterns. The total time of all 4 phases, however, is no longer linear, because as K increases, the cost of Phase 4 grows as O(K*N); it can be improved to be almost linear wrt. N in the future, e.g., by using the "nearest neighbor" techniques of [GG92].
the
Workload
the
running
confirmed
all
three
to
grow
patterns.
value
time
same
71, and
hence
as well
size
cluster
points
radii
actual
(but
N
a range
For
Table
is at
each
tive
of datasets
same
time
except
for
0(1<
in the future),
for
against
and
*N)
the
K
the
the
the first
are plotted
N
for
are
(can
total
no more
datasets
pattern
C’LA RANS
BIRCH
and
DS30
the
time
with
pattern
is distorted.
(2)
cluster
the
show
that
(2)
is much
larger
In conclusion,
much
less memory,
of
for
hut
compared
the
BIRCH,
results
data
CLARAN,S
the base
is faster,
with
1.15
of the
DS2 and
of
dataset.
when
(3)
from
those
for
than
The
cluster.
DS3
of space).
of
(3)
of
as
can be observed
slower
clusters
for
number
largely
performance
the
clusters
be as many
of the base workload,
quality
cally.
The
than
clusters
numlocal
clusters
in the actual
lack
of
of the location
behaviors
to the
Its
RANS’
can
varies
its
100 (newly
actual
of 1.44 (larger
clusters.
and
the
set
(instead
than
(1) The
Similar
we
50
CLA
more
CLARAN,S
by Ng).
the
of CLARAN,S
15 times
the
order-sensitive
112
but
clusters
due
for
of
for
much
time,
larger
the number
5 summarizes
least
to
that:
1.41).
here
order
of
First
enough
needs
the
them
an average
omitted
is
it
recommended
from
clusters,
memory
In
8 visualizes
of CLA RANS
the visualization
wrt.
be
centers
to 1.94 with
workload.
running
in a CLARAN,S
For all three
ers:
limit
2. Fig.
we can observe
The
as
to
DS 1. Comparing
of the
of
the
linearly
upper
57% different
a range
does.
performance
base
so
acceptable
value
is still
data
Cluster:
dataset
BIRCH
an
CLARANS
the
the
the
1.2.5~o of K(N-K),
enforced
another
size are used
to grow
that
than
and
on
clataset,
7naxnezghbor
250)
BIRCH
whole
after
and
compare
assumes
the
stop
BIRCH
we
and
memory
to
of
experiment
C’LARAN,$
phase,
and
per
is now
and
exactly
the
linearly
compensated
In
both
II T;,>
. .me
”.
256
inaccuracy
3 phases,
running
all 4 phases
wrt.
D%30
next
the
be
the
Workload
2.56
but
Clwst
Base
3.36
patterns.
we create
generator
3 phases,
size
three
on
15’20.2
only
the
Performance
arid171put Order
777.5
but
settings
against
BIRCH
DS3
more
to change
are shown
3.26
(2)
we create
first
Number
changing
dataset
nk
the
plotted
of them
consistently
DS 1, DS2
for
48.4
rebuilt,
Points
DS3,
DS30
DS’2
its quality
memory
generator
711 and
time
4 phases
in Fig.
and
the
3.39
~
in
of BIRCH.
by
49.5
CLARANS
the clataset
of
1.99
DS3
44:
DSI
Number
46.4
degrades
quality.
scalability
DS20
Time,
for
of increasing
1.99
DS ’10
Scalability
Increasing
clatasets
final
between
47.5
holding
(3)
can
1.87
DS2
2.11
size
4 refinements.
tradeoff
similar
Two
to test
can
Phase
L)
47’.4
increases
to feed
quality;
Time
DS1O
on, BIRCH
time
per
Dataset
1.87
&39.5
o~.
time,
memory;
memory
by
BIRCH
tree
D
47.1
DS1
on
1, as memory
in
Time
DS1
Dat==-~
.-.
of Phase
options
Dataset
D
and
the running
generated
insufficient
extent
2.00
4.18
Time
refinement
on,
and at the same
in better
ra7ut07nized
Dataset
P
end
tested
all the outlier
is clone
0 =
(more)
less
refinement
the
options
Phase
070,
Workload
Table
requires
hence
that
after
a larger
it
subclusters
growing,
(larger)
the
at the
qualities
increases,
because
word,
qualities
In
size)
=
10%
In
slightly
finer
and
with
as Base
a
better.
maximal
4, rn
a good
time,
produces
entries,
outlier
=
is
with
up
smaller
running
suggest
the
is not slower
the
Userl
same.
with
results
1,
However
Options:
datasets
by
by
the
experiments
the
Outlier
of
know
leaf
, although
are different,
N
(3) If a user does
Phase
quality.
~~,
excessively
well
threshold,
(finer)”
the
to 4096
For
is not
0.0 works
(increase)
ending
“coarser
(improves)
to
To =
be rewarded
P:
to ciecrease
higher
and
=
time.
tends
but
r~
3: Datasets
performance
threshold
(2)
time.
she/he
BIRCH’S
initial
dataset.
running
then
of the
(1)
as the
r~ =
[
2.00
randomized
A“ = 100, nt = O, n}, = 2000, rt = O, rh = 4, T-n = rn = 0~0, o = randomized
Table
stable
1000,
=
for
points
and
The
than
is sensi~
value
that
for
are ordered,
dramati-
BIRCH
accurate,
C’LARAN,S,
RAN,$.
DS 10, DS20,
degrade
workload,
more
(’LA
(1) (TLARAN,5’
uses
and less
[Figure 4: Scalability wrt. Increasing Number of Points per Cluster: running time (seconds) vs. number of tuples N, for DS1, DS2 and DS3, Phases 1-3 and Phases 1-4]

[Figure 5: Scalability wrt. Increasing Number of Clusters: running time (seconds) vs. number of tuples N, for DS1, DS2 and DS3, Phases 1-3 and Phases 1-4]

[Figure 6: Actual Clusters of DS1]

[Figure 7: BIRCH Clusters of DS1]
6.8 Application to Real Datasets

BIRCH has been used for filtering real images. Fig. 9 shows two similar images of trees with a partly cloudy sky as the background, taken in two different wavelengths. The top one is in the near-infrared band (NIR), and the bottom one is in the visible wavelength band (VIS). Each image contains 512x1024 pixels, and each pixel has a pair of brightness values corresponding to NIR and VIS. Soil scientists receive hundreds of such image pairs and try to first filter the trees from the background, and then filter the trees into sunlit leaves, shadows and branches for statistical analysis.

We applied BIRCH to the (NIR, VIS) value pairs of all pixels of an image (512x1024 2-d tuples) by using (1) 400 kbytes of memory (about 5% of the dataset size) and 80 kbytes of disk space (about 20% of the memory size), (2) weighting NIR and VIS equally, and (3) asking for 5 clusters. It took 284 seconds to obtain 5 clusters that correspond to (1) the very bright part of the sky, (2) the ordinary part of the sky, (3) the clouds, (4) the sunlit leaves, and (5) the tree branches and the shadows on the trees.

However, the branches and the shadows were too similar to be distinguished from each other in this first clustering, although they could be pulled apart from the other categories. So we pulled out the part of the data corresponding to cluster (5) (146707 2-d tuples) and used BIRCH again. This time, (1) the amount of memory used was proportionally smaller, hence the threshold was finer; (2) NIR was weighted 10 times heavier than VIS, because we observed that the branches and the shadows were easier to tell apart in NIR than in VIS; and (3) we asked for 2 clusters. It took 71 seconds to obtain the two clusters corresponding to the branches and to the shadows on the trees.

Fig. 10 shows the parts of the image that correspond to the sunlit leaves, the tree branches, and the shadows on the trees, obtained by clustering and filtering with BIRCH. Visually, it is a satisfactory filtering of the image according to the user's intention.

[Figure 9: The images taken in NIR and VIS]

[Figure 10: The sunlit leaves, branches, and shadows of the trees, obtained by clustering and filtering with BIRCH]
7 Summary and Future Research

BIRCH is a clustering method for very large datasets. It makes a large clustering problem tractable by concentrating on densely occupied portions of the data space, and by using a compact summary. It utilizes measurements that capture the natural closeness of data; these measurements can be stored and updated incrementally in a height-balanced tree. BIRCH can work with any given amount of memory, and its I/O complexity is a little more than one scan of the data. Experimentally, BIRCH is shown to perform very well on several large datasets, and is significantly superior to CLARANS in terms of quality, speed and order-sensitivity.

Proper parameter setting is important to BIRCH's efficiency. In the near future, we will concentrate on studying: (1) more reasonable ways of increasing the threshold dynamically; (2) dynamic adjustment of the outlier criteria; (3) more accurate quality measurements; and (4) data parameters that are good indicators of how well BIRCH is likely to perform. We will also explore BIRCH's architecture for opportunities of parallelism, and for interactive learnings. As BIRCH is able to read data directly from a tape drive, or from a network by matching its clustering speed with the data reading speed, we will study how to make good use of this ability. Finally, we will study how the clustering obtained by BIRCH can help to solve problems such as storage or query optimization, and incremental and approximate data compression.
References

[CKS88] Peter Cheeseman, James Kelly, Matthew Self, et al., AutoClass: A Bayesian Classification System, Proc. of the 5th Int'l Conf. on Machine Learning, Morgan Kaufman, Jun. 1988.

[DH73] Richard Duda and Peter E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.

[DJ80] R. Dubes and A.K. Jain, Clustering Methodologies in Exploratory Data Analysis, Advances in Computers, Edited by M.C. Yovits, Vol. 19, Academic Press, New York, 1980.

[EKX95a] Martin Ester, Hans-Peter Kriegel, Xiaowei Xu, A Database Interface for Clustering in Large Spatial Databases, Proc. of the 1st Int'l Conf. on Knowledge Discovery and Data Mining, 1995.

[EKX95b] Martin Ester, Hans-Peter Kriegel, Xiaowei Xu, Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification, Proc. of the 4th Int'l Symposium on Large Spatial Databases, Portland, Maine, U.S.A., 1995.

[Fis87] Douglas H. Fisher, Knowledge Acquisition via Incremental Conceptual Clustering, Machine Learning, 2(2), 1987.

[Fis95] Douglas H. Fisher, Iterative Optimization and Simplification of Hierarchical Clusterings, Technical Report CS-95-01, Dept. of Computer Science, Vanderbilt University, Nashville, TN, 1995.

[GG92] A. Gersho and R. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, Boston, MA, 1992.

[KR90] Leonard Kaufman and Peter J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Series in Probability and Mathematical Statistics, Wiley, 1990.

[Leb87] Michael Lebowitz, Experiments with Incremental Concept Formation: UNIMEM, Machine Learning, 1987.

[Lee81] R.C.T. Lee, Clustering Analysis and Its Applications, Advances in Information Systems Science, Edited by J.T. Tou, Vol. 8, pp. 169-292, Plenum Press, New York, 1981.

[Mur83] F. Murtagh, A Survey of Recent Advances in Hierarchical Clustering Algorithms, The Computer Journal, 1983.

[NH94] Raymond T. Ng and Jiawei Han, Efficient and Effective Clustering Methods for Spatial Data Mining, Proc. of VLDB, 1994.

[Ols93] Clark F. Olson, Parallel Algorithms for Hierarchical Clustering, Technical Report, Computer Science Division, Univ. of California at Berkeley, Dec. 1993.

[ZRL95] Tian Zhang, Raghu Ramakrishnan, and Miron Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, Technical Report, Computer Sciences Dept., Univ. of Wisconsin-Madison, 1995.