FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets

Christos Faloutsos*
AT&T Bell Laboratories
Murray Hill, NJ

King-Ip (David) Lin
Dept. of Computer Science
Univ. of Maryland, College Park
Abstract

A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in some k-dimensional space, using k feature-extraction functions provided by a domain expert [25]. Thus, we can subsequently use highly fine-tuned spatial access methods (SAMs) to answer several types of queries, including the 'Query By Example' type (which translates to a range query), the 'all pairs' query (which translates to a spatial join [8]), the nearest-neighbor or best-match query, etc.

However, designing feature-extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information, though, it is not obvious how to map objects into points.

This is exactly the problem we solve in this paper. We describe a fast algorithm to map objects into points in some k-dimensional space (k is user-defined), such that the dis-similarities are preserved. There are two benefits from this mapping: (a) efficient retrieval, in conjunction with a SAM, as discussed before, and (b) visualization and data-mining: the objects can now be plotted as points in 2-d or 3-d space, revealing potential clusters, correlations among attributes and other regularities that data-mining is looking for.

We introduce an older method from pattern recognition, namely Multi-Dimensional Scaling (MDS) [51]; although unsuitable for indexing, we use it as a yardstick for the quality of our proposed algorithm. Then, we propose a much faster algorithm to solve the problem in hand, which in addition allows for indexing. Experiments on real and synthetic data show that the proposed algorithm is significantly faster than MDS (being linear, as opposed to quadratic, on the database size N), while it manages to preserve distances and the overall structure of the data-set.

*On leave from the Univ. of Maryland, College Park. This work was partially supported by the National Science Foundation under Grants No. EEC-94-02384, IRI-9205273 (with matching funds from Empress Software Inc. and Thinking Machines Inc.), IRI-8958546 and CDR-8803012, and by the Institute for Systems Research, University of Maryland, College Park.
1 Introduction

The objective of this paper is to provide a retrieval and visualization tool for large collections of traditional as well as 'exotic' and multimedia datasets. An excellent idea, suggested by Jagadish [25], is to rely on a domain expert to provide k feature-extraction functions, which map each object into a point in a k-dimensional space. Then the problem is reduced to storing, retrieving and displaying k-dimensional points, a problem for which there is a plethora of spatial access methods (SAMs) as well as visualization and data-mining tools.

However, it is not always obvious how to design such feature-extraction functions. Consider, eg., typed English words, where the distance function is the editing distance: the minimum number of insertions, deletions and substitutions that are needed to transform one string into the other. It is not clear at all which features to extract in this setting. Similarly, in matching digitized voice excerpts, we typically have to do some time-warping to align the two excerpts, which again makes it difficult to design feature-extraction functions.

Overcoming these difficulties is exactly the motivation behind the present work. Generalizing the approach by Jagadish, we try to map objects into k-dimensional points assuming only that a domain expert has provided a distance (or dis-similarity) function D(*, *). Notice that this setting includes the case of feature vectors as a special case: there, D(*, *) can be, eg., the Euclidean distance between the feature vectors of the two corresponding objects.
Given a set of objects and a distance/dis-similarity function, users would like (a) to find the objects that are most similar to a given query object, (b) to find the pairs of objects that are most similar to each other, as well as (c) to visualize the distribution of the objects in some appropriately chosen space, in order to check for clusters and other regularities.

Next, we shall use the following terminology:
Definition 1 The k-dimensional point P_i that corresponds to the object O_i will be called 'the image' of object O_i. That is, P_i = (x_{i,1}, x_{i,2}, ..., x_{i,k}).

Definition 2 The k-dimensional space containing the 'images' will be called the target space.

Some applications that motivated the present work are listed next, along with typical distance functions:

- Multimedia and image databases [25] [33] [35], containing, eg., 2-d shapes, color images, voice or music clips and video clips. For a collection of 2-d shapes we would like to find the shapes that are similar to a given one; features such as the moments of inertia of a shape have been used [25]. For color images, features based on color, texture and shape have been proposed, and a color histogram with a Euclidean-like distance has been used for query-by-content [35]. Search-by-content in such databases should allow users to quickly retrieve, eg., similar music scores or video clips.

- Medical databases, where 1-d objects (eg., ECGs), 2-d images (eg., X-rays) and 3-d images (eg., MRI brain scans) are stored [5]. The ability to retrieve quickly past cases with similar symptoms would be valuable for diagnosis, as well as for medical teaching and research purposes. Notice that the distance functions here are complicated, typically requiring some warping of the two images, to make sure that the anatomical structures (eg., bones) are properly aligned before we measure the differences.

- Time series, eg., financial data (stock prices, sales numbers) or scientific databases (weather, geological, environmental, astrophysics data [11], etc.). Typical queries would be 'find companies whose stock prices move similarly', or 'find past days in which the solar magnetic wind showed patterns similar to today's pattern' [53]. The goal is to aid forecasting, by examining similar patterns that may have appeared in the past; a popular distance function is the Euclidean distance (sum of squared errors) [1].

- Similarity searching in string databases, as in the case of spelling, typing and OCR error correction [26] [30]: given a wrongly typed string, we should search a dictionary to find the closest strings to it. Conceptually identical is the problem of approximate matching in DNA databases, where a string over a four-letter alphabet (A, G, C, T) has to be matched against a large collection of such strings [4]. The typical distance function is the editing distance mentioned above (a minimal sketch of it is given right after this list).

- Data mining and visualization applications [2] [3]. For example, given records of patients (with attributes like gender, age, blood-pressure, etc.), we would like to help the physician detect any clusters, or correlations among symptoms, demographic data and diseases.
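Since the editing distance plays the role of D(*, *) in the string applications above, here is a minimal Python sketch of it, as an illustration only (the standard dynamic-programming formulation; the function name is my own, not from the paper):

    def edit_distance(s, t):
        """Minimum number of insertions, deletions and substitutions
        needed to transform string s into string t."""
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            cur = [i]
            for j, ct in enumerate(t, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (cs != ct)))   # substitution / match
            prev = cur
        return prev[len(t)]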
In all of the above applications, two types of queries are of interest. Specifically:

Definition 3 The term 'similarity query' (or, equivalently, 'query-by-example') will signify queries of the following form: Given a collection of objects and a desirable query object, find the objects that are within a user-defined distance ε from the query object.

Definition 4 The term 'all pairs' query (or, equivalently, 'spatial join') will signify queries of the form: In a collection of objects, find the pairs of objects that are within distance ε from each other. Again, ε is user-defined.

A method that maps objects into points in a k-d space provides two major benefits:

1. It can accelerate the search time for queries. The reason is that we can employ highly fine-tuned Spatial Access Methods (SAMs), like the R*-trees [7] and the z-ordering [37]. These methods provide fast searching for range queries as well as for spatial joins [8].

2. It can help with visualization, clustering and data-mining. Plotting objects as points in k=2 or 3 dimensions can reveal much of the structure of the dataset, such as the existence of major clusters, the shape of the distribution (linear versus curvilinear versus Gaussian), etc. These observations can provide powerful insights in formulating hypotheses and discovering rules.
We shall refer to the general version of the problem as the 'distance' case, to highlight the fact that only the distance function is defined:

General Problem ('distance' case):
Given N objects and distance information about them (eg., an N x N distance matrix, or simply the distance function D(*, *) between two objects),
find N points in a k-dimensional space such that the distances are maintained as well as possible.

We expect that the distance function D() is non-negative, symmetric and obeys the triangular inequality. In the 'target' (k-d) space we use the Euclidean distance, because it is invariant under rotations; alternative metrics could be any of the L_p metrics, like the L_1 ('city-block' or 'Manhattan') metric.

An important special case arises when the objects are already vectors with n attributes each ('features' case). Even then we usually want to map them into a lower-dimensionality space, because n is typically large and spatial access methods suffer when there are too many dimensions (the 'dimensionality curse'):

Problem ('features' case):
Given N vectors with n attributes each, and a distance function (eg., the Euclidean distance),
find N vectors in a k-dimensional space (k < n) such that the distances are maintained as well as possible.

We still distinguish the 'features' case because specialized dimensionality-reduction methods apply to it, as discussed next.

In either setting, the ideal mapping algorithm should fulfill the following requirements:

1. It should be fast: O(N) or O(N log N), but not O(N^2) or higher, because the cost would be prohibitive for large databases.
2. It should preserve distances, leading to small discrepancies (low 'stress'; see (Eq. 1) for the definition of the stress function).
3. It should be able to map a new, arbitrary object into a point quickly (eg., in O(1) or O(log N) time), after the first N objects have been mapped. This requirement is vital for 'queries-by-example'.

The rest of the paper is organized as follows. In section 2 we present a brief survey of Multi-Dimensional Scaling (MDS) and of related, older attempts to solve the problem, including dimensionality-reduction methods (K-L transform, SVD) and spatial access methods. In section 3 we present the proposed algorithm. In section 4 we give experimental results on real and synthetic datasets. In section 5 we list the conclusions.
2 Survey

In this section we provide some background information on Multi-Dimensional Scaling (MDS) [28], which attacks the 'distance' case of the problem; on dimensionality-reduction methods (the Karhunen-Loeve transform and the closely related Singular Value Decomposition), which attack the 'features' case; and on spatial access methods and related retrieval algorithms.

2.1 Multi-Dimensional Scaling (MDS)

Multi-Dimensional Scaling (MDS) is used to discover the underlying (spatial) structure of a set of data items from the (dis)similarity information among them. There are several versions; the basic method (eg., [29]) is described next. Following the terminology introduced above, the method expects as input (a) a set of N items, (b) their pair-wise (dis)similarities and (c) the desirable dimensionality k. Then, the algorithm maps each item to a point in a k-dimensional space, trying to minimize the 'stress' function:

    stress = \sqrt{ \sum_{i,j} (\hat{d}_{ij} - d_{ij})^2 \; / \; \sum_{i,j} d_{ij}^2 }        (1)

where d_{ij} is the given dissimilarity between items i and j, and \hat{d}_{ij} is the (Euclidean) distance between their images, the k-d points P_i and P_j. The 'stress' gives the relative error that the distances in the k-d space suffer from, on the average.

To achieve its goal, the basic MDS algorithm works as follows: it starts by assigning each item to a k-d point (eg., using some heuristic, or even at random); then, it examines every pair of points, computes the distance between them and checks how well it matches the desired dissimilarity; and it iteratively moves the points, using a 'steepest descent' procedure, so as to improve the 'stress'. Intuitively, MDS treats each pair-wise distance as a 'spring' between the two points and tries to re-arrange the positions of the N points so as to minimize the 'stress' of the springs; the iterations stop when no further improvement is possible. Several generalizations have been proposed, such as non-metric MDS, which treats the given dissimilarities only qualitatively [48].

MDS has been used in numerous, diverse applications, mainly as a visualization and data-analysis tool in the social and physical sciences [55] [41]: examples include the analysis of observers' perception of the semantic difference between English words (eg., how close 'warm' is perceived to be to 'trusting'), of people's perception of personality traits (recognizing which traits tend to go together), of ideological shifts and relationships in political science (operating on, eg., 60 different measures), of texture perception [40], and of gamma-ray spectra of nuclear spins in physics.

However, MDS suffers from two drawbacks that make it unsuitable for our applications:

- It requires O(N^2) computation, because it examines all pair-wise distances; for large databases (where N may be in the tens of thousands) this is prohibitive, since MDS was designed for settings where N is small (typically, 10-100 items).
- Its use for indexing and 'queries by example' is problematic: when a query item arrives, it has to be mapped into a point in the k-d space, which would require re-running the algorithm, or, at best, solving a new optimization problem involving the N already-mapped points; either way, answering a query would be as slow as sequential scanning.

Despite these two drawbacks, MDS is a natural yardstick against which to measure the quality ('stress') of a mapping, and this is exactly how we use it in this paper.
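To make the 'stress' measure of (Eq. 1) concrete, here is a minimal Python sketch that computes it for a given matrix of target dissimilarities and a candidate set of images (the helper name and the use of numpy are my own choices, not part of the original paper):

    import numpy as np

    def stress(orig_dist, points):
        """Relative error ('stress', Eq. 1) between the original pair-wise
        dissimilarities and the Euclidean distances of the k-d images.

        orig_dist : (N, N) array of target dissimilarities d_ij
        points    : (N, k) array of images P_i
        """
        diff = points[:, None, :] - points[None, :, :]   # pair-wise differences
        dhat = np.sqrt((diff ** 2).sum(axis=-1))          # Euclidean \hat{d}_ij
        num = ((dhat - orig_dist) ** 2).sum()
        den = (orig_dist ** 2).sum()
        return np.sqrt(num / den)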
2.2 Dimensionality reduction ('features' case)

The Karhunen-Loeve ('K-L') transform is the optimal way to reduce the dimensionality of a set of feature vectors, in the sense that it minimizes the mean squared projection error; it is often used in pattern matching [17] to choose the most important features (actually, linear combinations of features) for a given set of vectors. Figure 1 illustrates the intuition: given a set of 2-d points, the K-L transform computes the 'best' directions x' and y' to project on; for k=1, the best direction is x', and the corresponding error is the sum of the squared lengths of the projections on y'. In general, the K-L transform computes the eigenvectors of the covariance matrix of the data, sorts them in decreasing order of their eigenvalues, and approximates each data vector with its projections on the first k eigenvectors. The K-L transform is closely related to the Singular Value Decomposition (SVD) of the object-feature matrix [49, 39, 19], which has been studied extensively in statistical analysis, pattern recognition and matrix algebra. Our implementation uses the SVD; the (Mathematica [54]) source code is listed in the Appendix and is also available electronically, along with 'mosaic' pointers to it (URL: ftp://olympos.cs.umd.edu/pub/SRC/kl.m).

The main drawback of the K-L transform for our purposes is that it can be applied only to the 'features' case: it needs the feature vectors themselves, and is therefore inapplicable when only a distance function D(*, *) is available. Moreover, computing the covariance matrix (or the SVD) can be expensive when the original dimensionality n is large.

Figure 1: Illustration of the Karhunen-Loeve (K-L) transformation: the 'best' axis to project a set of 2-d points on is x'.
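As an illustration of the K-L idea only (not the paper's own Mathematica implementation, which is reproduced in the Appendix), here is a numpy sketch that projects mean-centered 'feature' vectors on their top-k principal directions via the SVD; the function name is an assumption of mine:

    import numpy as np

    def kl_transform(vectors, k=2):
        """Project n-dimensional feature vectors on their first k
        principal axes (Karhunen-Loeve / SVD), as a rough sketch.

        vectors : (N, n) data matrix, one object per row
        returns : (N, k) coordinates along the top-k right singular vectors
        """
        centered = vectors - vectors.mean(axis=0)          # remove the mean vector
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return centered @ vt[:k].T                          # projections on top-k axes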
2.3 Retrieval and Spatial Access Methods

As mentioned before, the reason we want to map objects into k-dimensional points is that we can then employ highly fine-tuned Spatial Access Methods (SAMs) to accelerate the retrieval. Numerous SAMs have been proposed; they form three broad classes: (a) methods that use space-filling curves, such as the z-ordering [37] or, equivalently, linear quadtrees [18] and Hilbert curves [14, 23]; (b) tree-based methods, such as R-trees [20] and their most successful variants, like the R+-tree [45], the R*-tree [7] and the Hilbert R-tree [27]; and (c) methods based on grid-files [36, 22] and related structures like the hB-tree [31]. All of these methods operate on points (or rectangles, or other shapes) in a k-dimensional space; none of them can handle the case where only a 'distance' function is available.

For the 'distance' case, there are retrieval methods that exploit the triangular inequality in order to prune the search on a range query (eg., the method of Burkhard and Keller [10], fixed-queries trees [6], and related best-match file-searching techniques [46] [47]). However, these methods map nothing into points, and therefore provide no tool for visualization or data-mining.

Finally, our work is also related to research on clustering; a fast mapping of objects into points could be beneficial to clustering algorithms as well. See, eg., [21] [32] for surveys of clustering algorithms, [34] for clustering in spatial databases, and [52] for applications in Information Retrieval.
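As a small illustration of the z-ordering idea mentioned above (bit interleaving along a space-filling curve), here is a generic Python sketch; it is not code from the paper or from any particular SAM implementation:

    def z_value(coords, bits=8):
        """Z-ordering (Morton) key of a point with non-negative integer
        coordinates: interleave the bits of the coordinates, so that
        nearby points tend to receive nearby keys."""
        key = 0
        for b in range(bits - 1, -1, -1):       # most to least significant bit
            for c in coords:
                key = (key << 1) | ((c >> b) & 1)
        return key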
3 Proposed Method

In this section we describe the proposed algorithm: a fast method to map a set of N objects into k-dimensional points so that the given distances between the objects are preserved as well as possible. We first give the necessary definitions and the intuition behind the algorithm, and then the algorithm itself, along with a pivot-choosing heuristic. Table 1 lists the symbols used and their definitions.

Table 1: Summary of Symbols and Definitions

    N          number of objects in the database
    n          dimensionality of the original space ('features' case only)
    k          dimensionality of the 'target space'
    D(*, *)    the distance function between two objects
    O_i        the i-th object
    O_a, O_b   the 'pivot objects' of the current recursive call
    x_i        the coordinate of object O_i on the current axis
    d_{a,b}    shorthand for the distance D(O_a, O_b)
3.1 The algorithm

The goal, in the 'distance' case, is to find N points in a k-dimensional space whose pair-wise Euclidean distances match the given distances between the N objects as well as possible. The key observation is the following: pretend that the objects are indeed points in some unknown, n-dimensional space; then we can compute the projections of these points on k carefully chosen, mutually orthogonal directions, using only the distance function D(*, *). This observation is the heart of the proposed algorithm.

The idea is to project the objects on a carefully selected 'line'. To do that, we choose two objects O_a and O_b (referred to as the 'pivot objects' from now on), and consider the 'line' that passes through them in the hypothetical n-d space. The projection of any object O_i on this line can be computed from the three pair-wise distances d_{a,b}, d_{a,i} and d_{b,i}, thanks to the cosine law:

Lemma 1 (Cosine Law) In the triangle O_a O_i O_b, the cosine law gives

    d_{b,i}^2 = d_{a,i}^2 + d_{a,b}^2 - 2 x_i d_{a,b}        (2)

See Figure 2 for an illustration. Eq. 2 can be solved for x_i, the projection of object O_i on the line (O_a, O_b):

    x_i = ( d_{a,i}^2 + d_{a,b}^2 - d_{b,i}^2 ) / ( 2 d_{a,b} )        (3)

Figure 2: Illustration of the 'cosine law': projection of object O_i on the line O_a O_b of the pivot objects.

Notice that the computation of x_i needs only the distances between objects, which are given. Observe also that, thanks to Eq. 3, we can map the objects into points on a line while preserving some of the distance information: for example, if O_i is close to the pivot O_a, its x_i will be small. Thus we have solved the problem for k=1.

The question is whether we can extend the method to obtain a second, a third and eventually a k-th coordinate. The answer is affirmative, and the idea is as follows: pretending that the objects are points in an n-d space, consider the (n-1)-dimensional hyper-plane H that is perpendicular to the line (O_a, O_b) of the pivot objects; project the objects on this hyper-plane, and treat the projections O_i' as if they were the original objects, recursively applying the previous step. The problem is thus reduced to the same problem, with the dimensionality of the hypothetical space decreased by one.

Two issues remain: how to choose the pivot objects (this is discussed later; see Figure 4), and, more importantly, how to compute the distances between the projections O_i' and O_j' on the hyper-plane H, given only the original distance function D(). The next lemma handles the second issue:

Lemma 2 On the hyper-plane H, the distance D'() between the projections O_i' and O_j' can be computed from the original distance D() as follows:

    ( D'(O_i', O_j') )^2 = ( D(O_i, O_j) )^2 - ( x_i - x_j )^2        i, j = 1, ..., N        (4)

where x_i and x_j are the projections of O_i and O_j on the line (O_a, O_b), as given by Eq. 3. See Figure 3 for an illustration.

Figure 3: Projection of the objects O_i, O_j on the hyper-plane H, perpendicular to the line O_a O_b of the pivot objects.
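A minimal Python sketch of the two formulas above (function names are mine; the caller is assumed to supply the pair-wise distances):

    import math

    def project_on_pivot_line(d_ai, d_bi, d_ab):
        """Coordinate x_i of object O_i on the pivot line (cosine law, Eq. 3)."""
        return (d_ai**2 + d_ab**2 - d_bi**2) / (2.0 * d_ab)

    def residual_distance(d_ij, x_i, x_j):
        """Distance D'() between the projections of O_i, O_j on the
        hyper-plane perpendicular to the pivot line (Eq. 4).
        The max(..., 0) guard against rounding is my own addition."""
        return math.sqrt(max(d_ij**2 - (x_i - x_j)**2, 0.0))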
Proof: From the Pythagorean theorem on the triangle O_i C O_j of Figure 3 (with the right angle at 'C'), we have

    ( O_i' O_j' )^2 = ( C O_j )^2 = ( O_i O_j )^2 - ( O_i C )^2        (5)

Since (O_i C) = |x_i - x_j|, the result follows. Here (AB) indicates the length of the line segment AB. This completes the proof.

The ability to compute the distance D'() between projections allows us to project on a second line, lying on the hyper-plane H and therefore orthogonal to the first line (O_a, O_b) by construction; and so on, recursively, k times in total.

Choosing the pivot objects. Ideally, we would like to use the two objects that are farthest apart, so that the projections are as spread out as possible; finding them exactly, however, would require O(N^2) distance computations. Thus we propose the linear heuristic choose-distant-objects() of Figure 4:

Algorithm 1 choose-distant-objects( O, dist() )
begin
  1) choose an object arbitrarily, and let it be the second pivot object O_b;
  2) let O_a = (the object that is farthest apart from O_b), according to the distance function dist();
  3) let O_b = (the object that is farthest apart from O_a);
  4) report the objects O_a and O_b as the pivot objects.
end

Figure 4: Heuristic to choose two distant objects as pivots.

In our implementation, the middle two steps are repeated a constant number of times (5 iterations); each iteration requires a linear number of distance calculations.

The main algorithm, 'FastMap', is given in Figure 5. It accepts as input a set O of N objects, the distance function D() and the desired dimensionality k, and it uses the following global variables:

    X[]     an N x k array; on termination, the i-th row is the image of the i-th object,
            P_i = ( X[i,1], X[i,2], ..., X[i,k] );
    PA[]    a 2 x k pivot array, which stores the ids of the pivot objects, one pair per
            recursive call; its purpose is to facilitate the mapping of query objects later;
    col#    an integer, initialized to 0, which points to the column of X[] currently being updated.

Algorithm 2 FastMap( k, D(), O )
begin
  1) if (k <= 0) { return; } else { col#++; }
  2) /* choose the pivot objects */
     let O_a and O_b be the result of choose-distant-objects( O, D() );
  3) /* record the ids of the pivot objects */
     PA[1, col#] = a;  PA[2, col#] = b;
  4) if ( D(O_a, O_b) == 0 )
        set X[i, col#] = 0 for every i and return;    /* since all inter-object distances are 0 */
  5) /* project the objects on the line (O_a, O_b) */
     for each object O_i, compute x_i using Eq. 3 and update the global array: X[i, col#] = x_i;
  6) /* consider the projections of the objects on a hyper-plane perpendicular to the
        line (O_a, O_b); the distance function D'() between two projections is given by Eq. 4 */
     call FastMap( k - 1, D'(), O );
end

Figure 5: Algorithm 'FastMap'.
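To make the pseudo-code of Figure 5 concrete, here is a compact Python sketch of the same recursion (the names, the fixed number of heuristic iterations and the computation of D'() on the fly from the already-stored coordinates are my own choices; treat it as a sketch rather than the paper's implementation):

    import math

    def fastmap(objects, dist, k, iterations=5):
        """Map `objects` into k-d points so that `dist` is roughly preserved.

        objects : list of arbitrary items
        dist    : dist(a, b) -> non-negative dissimilarity
        returns : (coords, pivots), where coords[i] is the k-d image of
                  objects[i] and pivots holds one pivot-index pair per axis
        """
        n = len(objects)
        coords = [[0.0] * k for _ in range(n)]
        pivots = []

        def d_prime(i, j, col):
            """Residual distance after removing the first `col` coordinates (Eq. 4)."""
            d2 = dist(objects[i], objects[j]) ** 2
            for c in range(col):
                d2 -= (coords[i][c] - coords[j][c]) ** 2
            return math.sqrt(max(d2, 0.0))

        def choose_pivots(col):
            """Linear heuristic of Figure 4: alternate 'farthest object' scans."""
            a, b = 0, 0
            for _ in range(iterations):
                a = max(range(n), key=lambda i: d_prime(i, b, col))
                b = max(range(n), key=lambda i: d_prime(i, a, col))
            return a, b

        for col in range(k):
            a, b = choose_pivots(col)
            pivots.append((a, b))
            d_ab = d_prime(a, b, col)
            if d_ab == 0.0:                      # all residual distances are zero
                break
            for i in range(n):                   # projection step (Eq. 3)
                d_ai = d_prime(a, i, col)
                d_bi = d_prime(b, i, col)
                coords[i][col] = (d_ai**2 + d_ab**2 - d_bi**2) / (2.0 * d_ab)
        return coords, pivots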
Notice that the algorithm records the 'pivot objects' of each recursive call (in the PA[] array). The reason is to facilitate 'queries by example': when a query object arrives, it must be mapped into a point in the 'target space' with respect to the same pivot objects. That is, we repeat steps 5 and 6 of the algorithm for the query object only, projecting it on each of the k lines defined by the stored pivot pairs. Thus mapping a query object is an O(1) operation with respect to the database size N: it requires only O(k) distance calculations (a few per recursive call, namely the distances of the query object to the two pivot objects of that call), regardless of N.

The complexity of FastMap itself is O(Nk) distance calculations: at each of the k recursive calls, both the pivot-choosing heuristic and the projection step require a linear number of distance calculations. Therefore the algorithm is linear on the database size N, as desired. If the distances D'() between projections are computed on the fly (from the original distance function and the coordinates computed so far, via Eq. 4), each such calculation needs O(k) additional arithmetic operations; since k is a small constant, this does not change the overall picture. Throughout, we assume that the distance function is non-negative, symmetric and obeys the triangular inequality, as stated in the problem definition.

Due to space limitations, we omit a detailed, step-by-step example of how the method works (eg., on a small collection of ASCII documents); the details are given in the full report [15], which is also available electronically.
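A sketch of the 'query-by-example' step just described: a new object is projected on the k stored pivot pairs with only a handful of distance evaluations per axis, independently of N. It reuses the (coords, pivots) output of the fastmap() sketch shown earlier; names are again my own.

    import math

    def residual(d, xs_u, xs_v, col):
        """Residual distance after removing the first `col` shared coordinates (Eq. 4)."""
        d2 = d ** 2
        for c in range(col):
            d2 -= (xs_u[c] - xs_v[c]) ** 2
        return math.sqrt(max(d2, 0.0))

    def map_query(query, objects, dist, coords, pivots):
        """Image of a new query object, reusing the stored pivot pairs."""
        q = []
        for col, (a, b) in enumerate(pivots):
            d_qa = residual(dist(query, objects[a]), q, coords[a], col)
            d_qb = residual(dist(query, objects[b]), q, coords[b], col)
            d_ab = residual(dist(objects[a], objects[b]), coords[a], coords[b], col)
            q.append(0.0 if d_ab == 0.0 else
                     (d_qa**2 + d_ab**2 - d_qb**2) / (2.0 * d_ab))
        return q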
4 Experiments

We implemented our method in 'C++' on a DECStation 5000/25 running UNIX(TM) and ran several experiments on real and synthetic datasets. For traditional MDS we used the MSIDV routine of the IMSL STAT/LIBRARY FORTRAN package. To assess the quality of the output of each method we used the 'stress' function (Eq. 1); to assess the speed, we measured the response time.

The first group of datasets is synthetic, designed to illustrate the abilities of the method; the second group consists of real data. Specifically:

GAUSSIAN5D: a synthetic dataset of N=120 points in 5-dimensional space, forming clusters of 20 points around the centers (10,0,0,0,0), (0,10,0,0,0), (0,0,10,0,0), (0,0,0,10,0), (0,0,0,0,10) and (0,0,0,0,0). Within each cluster the points follow a Gaussian distribution with standard deviation 1 on each axis and covariance mu_{i,j} = 0 for i != j. The distance is the Euclidean distance.

SPIRAL: a synthetic dataset of 30 points on a 3-d spiral, as suggested in the pattern-recognition textbook of Duda and Hart [12, p. 243]:

    x_1(i) = cos x_3(i)
    x_2(i) = sin x_3(i)
    x_3(i) = i / \sqrt{2},        i = 0, 1, ..., 29        (6)

Again, the distance is the Euclidean distance.

WINE: a real dataset of N=154 records, with the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents in each wine; after normalizing each attribute to the unit interval, we used the Euclidean distance. The dataset comes from the UC-Irvine repository of machine-learning databases (ics.uci.edu://ftp/pub/machine-learning-databases/wine). Since the records belong to three classes, we expect to see 3 clusters.

DOCS: a collection of 35 text documents (ASCII files), forming 7 groups of 5 documents each:
  ABS: abstracts of computer science technical reports;
  BBR: reports about basketball games;
  CAL: 'calls for papers' for technical conferences;
  MAT: portions of the Gospel of Matthew (King James' Version of the Bible);
  REC: cooking recipes;
  SAL: sale advertisements for computers and software;
  WOR: 'world news': documents about the Middle East (October 1994).
The documents were taken from various electronic sources on the Internet (newsgroups and text repositories, eg., wuarchive.wustl.edu). The distance function is derived from the 'cosine similarity' of the document (term-count) vectors; it is closely related, but not identical, to the Euclidean distance between the vectors after normalization (for more details, see the full report [15]). We expect to see 7 groups.
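For concreteness, a short Python sketch that generates the two synthetic datasets roughly as described above (Eq. 6 for the spiral); the random seed and the helper names are my own assumptions:

    import numpy as np

    def spiral(n=30):
        """3-d points on a spiral, in the spirit of Eq. 6 (Duda & Hart)."""
        i = np.arange(n)
        x3 = i / np.sqrt(2.0)
        return np.column_stack([np.cos(x3), np.sin(x3), x3])

    def gaussian5d(points_per_cluster=20, sigma=1.0, seed=0):
        """Gaussian clusters in 5-d around the GAUSSIAN5D centers."""
        rng = np.random.default_rng(seed)
        centers = 10.0 * np.vstack([np.eye(5), np.zeros(5)])   # 6 cluster centers
        return np.vstack([c + sigma * rng.standard_normal((points_per_cluster, 5))
                          for c in centers])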
4.1 Comparison with MDS

In the first group of experiments we compare the proposed method against traditional MDS, with respect to speed and to the quality ('stress') of the output.

First we study the response time as a function of the database size N. We used subsets of the WINE dataset of varying sizes (N = 45, 60, 75, 90 and 105 records) and ran both methods with k=2. Figure 6 plots the response time of each method versus N, in logarithmic scales on both axes. We also plot two straight lines, labeled 'O(x)' and 'O(x^2)', with slopes 1 and 2 respectively; they are intended as visual aids, to highlight the fact that the run time of FastMap grows roughly linearly with N, while the run time of MDS follows a quadratic curve. The important conclusion is that the savings of FastMap over MDS become more and more dramatic as the database size N increases.

Figure 6: Response time vs. database size N, for MDS and FastMap on subsets of the WINE dataset (k=2); both axes logarithmic. The 'O(x)' and 'O(x^2)' lines are visual aids.

Next, we study the response time as a function of the dimensionality k of the target space. We used a 60-point subset of the WINE dataset (N=60) and varied k from 2 to 6. Figure 7 shows the results, again in logarithmic scales. As expected, the response time increases with k for both methods; again, FastMap remains dramatically faster than MDS for every value of k, even for small k.

Figure 7: Response time vs. dimensionality k of the target space, on the N=60 subset of WINE; MDS (solid line) vs. FastMap (dashed line); both axes logarithmic.

Finally, we estimate the 'price/performance' of each method, that is, how much output quality ('stress') it achieves in a given amount of time. Figure 8 plots the final 'stress' against the response time for the two methods on the WINE dataset, for k=2 and 3. The 'ideal' method would be at the origin (0,0), achieving zero stress in zero time; thus the closer a method is to the origin, the better. The conclusion is that FastMap achieves almost the same 'stress' levels as MDS, while being faster by an order of magnitude, even for these small datasets.

Figure 8: 'Price/performance' comparison: output 'stress' vs. response time for MDS and FastMap on the WINE dataset, k=2,3; both axes logarithmic.
4.2 Clustering/visualization properties of FastMap

The goal of this group of experiments is to illustrate the ability of FastMap to help with visualization and clustering, on synthetic as well as real datasets. Unless otherwise stated, we used k=3 dimensions and we present scatter-plots of the resulting 'FastMap-attributes' f1, f2, f3 of each object.

4.2.1 Synthetic data

The first experiment involves the GAUSSIAN5D dataset (N=120 points, 20 points per cluster). Figure 9 shows the result of mapping the dataset into k=3 dimensions: (a) the scatter-plot of f2 vs. f1, (b) the scatter-plot of f3 vs. f1, and (c) the 3-d scatter-plot of (f1, f2, f3). Each point is plotted with a letter indicating the cluster it belongs to, one letter per cluster. Notice that even in the first scatter-plot (a), using only two of the FastMap dimensions, we can detect at least 4 disjoint clusters; the 3-d scatter-plot (c) confirms the previous observation and shows that the 6 clusters are completely separated in the 3-d target space. Although the mapping uses a fictitious 3-d space, it preserves most of the clustering structure of the original 5-d points; thus FastMap can reduce the dimensionality of a dataset with little loss of information.

Figure 9: FastMap on the GAUSSIAN5D dataset (one letter per cluster): (a) scatter-plot of f2 vs. f1, (b) scatter-plot of f3 vs. f1, (c) the 3-d scatter-plot of (f1, f2, f3).

The next experiment involves the SPIRAL dataset. Figure 10(a) plots the original 3-d points of the dataset; Figure 10(b) shows the result of FastMap for k=2 dimensions. Notice that the 2-d projections give much information about the original dataset: the points are mapped on what is roughly a 1-d curve, with some type of oscillation reflecting the winding of the spiral.

Figure 10: (a) the 3-d points of the SPIRAL dataset and (b) the result of FastMap for k=2.
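A minimal plotting sketch of the kind of scatter-plot used throughout this section (the choice of matplotlib and the function name are mine, purely for illustration):

    import matplotlib.pyplot as plt

    def scatter_fastmap(coords, labels):
        """Scatter-plot of the first two FastMap coordinates f1, f2,
        with one marker group per cluster label (as in Figure 9(a))."""
        f1 = [c[0] for c in coords]
        f2 = [c[1] for c in coords]
        for lab in sorted(set(labels)):
            xs = [x for x, l in zip(f1, labels) if l == lab]
            ys = [y for y, l in zip(f2, labels) if l == lab]
            plt.scatter(xs, ys, label=str(lab))
        plt.xlabel("f1"); plt.ylabel("f2"); plt.legend()
        plt.show()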
4.2.2 Real data

Next, we present the results on the two real datasets, WINE and DOCS.

Figure 11 shows the scatter-plots of the 'FastMap-coordinates' for the WINE dataset: (a) f2 vs. f1 and (b) f3 vs. f1. Each point is plotted with a symbol denoting its class (one of the three cultivars). Notice that the mapping separates the classes well: one class is almost completely separated from the other two by the first coordinates alone, and the three expected clusters are clearly visible in the 2-d layouts, despite the fact that the original records have 13 attributes.

The last experiment, in Figure 12, illustrates the results on the DOCS dataset with k=3. Part (a) shows the 3-d scatter-plot of the whole dataset; part (b) shows the contents of the dashed box of (a), after zooming into the center. Notice that the documents of each class cluster together, and that the 7 classes are almost completely separated from each other in only k=3 dimensions. Thus FastMap, combined with a distance function on documents, manages to cluster and separate the document groups well, and provides a useful tool for inferring the 'big picture' of a document collection.

Figure 11: FastMap on the WINE dataset: (a) scatter-plot of f2 vs. f1; (b) scatter-plot of f3 vs. f1.

Figure 12: FastMap on the DOCS dataset (k=3): (a) the 3-d scatter-plot of the whole dataset; (b) the contents of the dashed box of (a), after zooming into the center.
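The distance function for DOCS is based on the cosine similarity of term-count vectors (Section 4); the exact formula used in the paper is given only in the full report [15], so the following Python sketch is an assumption of mine, showing one common way to turn cosine similarity into a dissimilarity:

    import math
    from collections import Counter

    def doc_distance(text_a, text_b):
        """Illustrative cosine-style document dissimilarity:
        1 - cos(angle between the term-count vectors)."""
        va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
        dot = sum(va[t] * vb[t] for t in va)
        na = math.sqrt(sum(c * c for c in va.values()))
        nb = math.sqrt(sum(c * c for c in vb.values()))
        return 1.0 if na == 0 or nb == 0 else 1.0 - dot / (na * nb)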
5 Conclusions

We have proposed a fast algorithm, 'FastMap', that maps objects into points in a k-dimensional space so that the given distances (dis-similarities) between the objects are preserved as well as possible. The algorithm needs only a distance function D(*, *), which is much easier for a domain expert to provide than a set of feature-extraction functions. FastMap fulfills all the design goals stated in the introduction:

1. It solves the general problem ('distance' case), of which the 'features' case is a special case.
2. It is linear on the database size N, and therefore much faster than Multi-Dimensional Scaling (MDS), which is quadratic on N.
3. At any time after the first N objects have been mapped, it can map a new, arbitrary object into a point using only O(k) distance calculations, regardless of the database size N. This makes it suitable for 'queries by example' and, therefore, for indexing.

The algorithm uses only elementary theorems from geometry (such as the cosine law): at each of the k recursive calls it chooses two 'pivot objects' and projects every object, first on the line of the pivots and then on the hyper-plane perpendicular to it. With respect to output quality, measured by the 'stress' function, we experimented with real and synthetic datasets; the result is that FastMap achieves roughly the same 'stress' levels as MDS, in a fraction of the time (an order of magnitude or more, even for small datasets), with the savings increasing with the database size N.

This work makes two main contributions. Firstly, the proposed linear algorithm opens the door to indexing and similarity searching in traditional, non-traditional and multimedia databases where the domain expert can provide only a distance function: once the objects have become points in k-d space, several highly optimized spatial access methods (R-trees [20], R*-trees [7], etc.) and algorithms become readily available, to accelerate 'query-by-example' (range) queries [9, 8], nearest-neighbor queries [42], 'all pairs' queries (spatial joins), and so on. Secondly, the mapping is useful for visualization and data-mining: the resulting k-d scatter-plots can help a human analyst detect clusters and other regularities, as demonstrated in our experiments. A related contribution is the introduction of Multi-Dimensional Scaling (MDS) and of the Karhunen-Loeve transform / Singular Value Decomposition (SVD) as general tools for this problem; although they have been used extensively in diverse fields (the social sciences, pattern recognition, matrix algebra), MDS is quadratic on N and unable to handle 'queries by example' efficiently, and the K-L transform / SVD can handle the 'features' case only, which is exactly what motivated the design of FastMap.

Future work includes the application of FastMap to more multimedia and non-traditional database applications, and the study of methods to determine automatically a good value for the dimensionality k of the target space.
Acknowledgments

We would like to thank Dr. Joseph B. Kruskal of AT&T Bell Labs for providing the source code for MDS and for answering several questions about it; Patrick M. Murphy and David W. Aha for maintaining the UC-Irvine Repository of Machine Learning Databases and Domain Theories; and Prof. Howard Elman and Doug Oard for their help with the SVD code and with the document datasets.
Appendix: the K-L transform in Mathematica

This is the code for the Karhunen-Loeve (K-L) transform, in Mathematica [54]. Given a matrix with $n$ vectors (one per row) of $m$ attributes each, it computes the first $k$ most 'important' axes and the K-L expansions (projections) of the $n$ vectors on them.

    (* given a matrix mat_ with $n$ vectors (rows) of $m$ attributes each,
       it creates a matrix with $n$ vectors of $k$ attributes,
       ie., the K-L expansions of these $n$ vectors on the first,
       most 'important' $k$ axes *)
    KLexpansion[ mat_, k_:2 ] := mat . Transpose[ KL[mat, k] ];

    (* given a matrix mat_ with $n$ vectors of $m$ dimensions,
       it computes the first $k$ singular vectors,
       ie., the axes of the first $k$ K-L expansion *)
    KL[ mat_, k_:2 ] := Module[
        {n, m, avgvec, newmat, val, vec},
        {n, m} = Dimensions[mat];
        avgvec = Apply[Plus, mat] / n // N;
        (* translate the vectors, so that their mean is = 0 *)
        newmat = Table[ mat[[i]] - avgvec, {i, 1, n} ];
        {val, vec} = Eigensystem[ Transpose[newmat] . newmat ];
        vec[[ Range[1, k] ]]
    ];
References

[1] Rakesh Agrawal, Christos Faloutsos, and Arun Swami. Efficient similarity search in sequence databases. In Foundations of Data Organization and Algorithms (FODO) Conference, Evanston, Illinois, October 1993. Also available through anonymous ftp from olympos.cs.umd.edu: /pub/TechReports/fodo.ps.
[2] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. Proc. ACM SIGMOD, pages 207-216, May 1993.
[3] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. Proc. of VLDB Conf., pages 487-499, Santiago, Chile, September 1994.
[4] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410, 1990.
[5] Manish Arya, William Cody, Christos Faloutsos, Joel Richardson, and Arthur Toga. QBISM: a prototype 3-d medical image database system. IEEE Data Engineering Bulletin, 16(1):38-42, March 1993.
[6] Ricardo Baeza-Yates, Walter Cunto, Udi Manber, and Sun Wu. Proximity matching using fixed-queries trees. In M. Crochemore and D. Gusfield, editors, Combinatorial Pattern Matching (CPM), LNCS 807, pages 198-212. Springer-Verlag, Asilomar, CA, June 1994.
[7] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. ACM SIGMOD, pages 322-331, May 1990.
[8] Thomas Brinkhoff, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. Multi-step processing of spatial joins. ACM SIGMOD, pages 197-208, May 1994.
[9] Thomas Brinkhoff, Hans-Peter Kriegel, and Bernhard Seeger. Efficient processing of spatial joins using R-trees. Proc. of ACM SIGMOD, pages 237-246, May 1993.
[10] W.A. Burkhard and R.M. Keller. Some approaches to best-match file searching. Comm. of the ACM (CACM), 16(4):230-236, April 1973.
[11] Committee on Physical, Mathematical and Engineering Sciences. Grand Challenges: High Performance Computing and Communications, 1992. The FY 1992 U.S. Research and Development Program. National Science Foundation.
[12] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[13] Susan T. Dumais. Latent semantic indexing (LSI) and TREC-2. In D.K. Harman, editor, The Second Text REtrieval Conference (TREC-2), pages 105-115, Gaithersburg, MD, March 1994. NIST Special Publication 500-215.
[14] Christos Faloutsos and Shari Roseman. Fractals for secondary key retrieval. Eighth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), pages 247-252, March 1989. Also available as UMIACS-TR-89-47 and CS-TR-2242.
[15] Christos Faloutsos and King-Ip (David) Lin. FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. Dept. of Computer Science, Univ. of Maryland, College Park, 1994. CS-TR-3383, UMIACS-TR-94-132; also available as ftp://olympos.cs.umd.edu/pub/TechReports/sigmod95.ps.
[16] Peter W. Foltz and Susan T. Dumais. Personalized information delivery: an analysis of information filtering methods. Comm. of the ACM (CACM), 35(12):51-60, December 1992.
[17] Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 2nd edition, 1990.
[18] I. Gargantini. An effective way to represent quadtrees. Comm. of the ACM (CACM), 25(12):905-910, December 1982.
[19] G.H. Golub and C.F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, 2nd edition, 1989.
[20] A. Guttman. R-trees: a dynamic index structure for spatial searching. Proc. ACM SIGMOD, pages 47-57, June 1984.
[21] J.A. Hartigan. Clustering Algorithms. John Wiley & Sons, 1975.
[22] K. Hinrichs and J. Nievergelt. The grid file: a data structure to support proximity queries on spatial objects. Proc. of the WG'83 (Intern. Workshop on Graph Theoretic Concepts in Computer Science), pages 100-113, 1983.
[23] H.V. Jagadish. Linear clustering of objects with multiple attributes. ACM SIGMOD Conf., pages 332-342, May 1990.
[24] H.V. Jagadish. Spatial search with polyhedra. Proc. Sixth IEEE Int'l Conf. on Data Engineering, February 1990.
[25] H.V. Jagadish. A retrieval technique for similar shapes. Proc. ACM SIGMOD Conf., pages 208-217, May 1991.
[26] Mark A. Jones, Guy A. Story, and Bruce W. Ballard. Integrating multiple knowledge sources in a bayesian OCR post-processor. First International Conference on Document Analysis and Recognition, Saint-Malo, France, September 1991.
[27] Ibrahim Kamel and Christos Faloutsos. Hilbert R-tree: an improved R-tree using fractals. In Proc. of VLDB Conference, pages 500-509, Santiago, Chile, September 1994.
[28] Joseph B. Kruskal and Myron Wish. Multidimensional Scaling. SAGE publications, Beverly Hills, 1978.
[29] Joseph B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29:1-27, 1964.
[30] Karen Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377-440, December 1992.
[31] David B. Lomet and Betty Salzberg. The hB-tree: a multiattribute indexing method with good guaranteed performance. ACM TODS, 15(4):625-658, December 1990.
[32] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354-359, 1983.
[33] A. Desai Narasimhalu and Stavros Christodoulakis. Multimedia information systems: the unfolding of a reality. IEEE Computer, 24(10):6-8, October 1991.
[34] Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. Proc. of VLDB Conf., pages 144-155, Santiago, Chile, September 1994.
[35] Wayne Niblack, Ron Barber, Will Equitz, Myron Flickner, Eduardo Glasman, Dragutin Petkovic, Peter Yanker, Christos Faloutsos, and Gabriel Taubin. The QBIC project: querying images by content using color, texture and shape. SPIE 1993 Int. Symposium on Electronic Imaging: Science and Technology, Conf. 1908, Storage and Retrieval for Image and Video Databases, February 1993. Also available as IBM Research Report RJ 9203 (81511), Feb. 1993.
[36] J. Nievergelt, H. Hinterberger, and K.C. Sevcik. The grid file: an adaptable, symmetric multikey file structure. ACM TODS, 9(1):38-71, March 1984.
[37] J. Orenstein. Spatial query processing in an object-oriented database system. Proc. ACM SIGMOD, pages 326-336, May 1986.
[39] William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling. Numerical Recipes in C. Cambridge University Press, 1988.
[40] A. Ravishankar Rao and Jerry Lohse. Identifying high level features of texture perception. SPIE Conference, San Jose, February 1992.
[41] A. Kimball Romney, Roger N. Shepard, and Sara Beth Nerlove, editors. Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, vol. I. Seminar Press, New York, 1972.
[42] N. Roussopoulos, Steve Kelley, and F. Vincent. Nearest neighbor queries. Proc. 1995 ACM-SIGMOD, San Jose, CA, May 1995, to appear.
[43] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[44] David Sankoff and Joseph B. Kruskal. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, 1983.
[45] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: a dynamic index for multi-dimensional objects. In Proc. 13th International Conference on VLDB, pages 507-518, England, September 1987. Also available as SRC-TR-87-32, UMIACS-TR-87-3, CS-TR-1795.
[46] M. Shapiro. The choice of reference points in best-match file searching. Comm. of the ACM (CACM), 20(5):339-343, May 1977.
[47] Dennis Shasha and Tsong-Li Wang. New techniques for best-match retrieval. ACM TOIS, 8(2):140-158, April 1990.
[48] R.N. Shepard. The analysis of proximities: multidimensional scaling with an unknown distance function, I and II. Psychometrika, 27:125-140 and 219-246, 1962.
[49] Gilbert Strang. Linear Algebra and its Applications. Academic Press, 2nd edition, 1980.
[50] A.W. Toga, P.K. Banerjee, and E.M. Santori. Warping 3d models for interbrain comparisons. Neurosc. Abs., 16:247, 1990.
[51] W.S. Torgerson. Multidimensional scaling: I. theory and method. Psychometrika, 17:401-419, 1952.
[52] C.J. Van Rijsbergen. Information Retrieval. Butterworths, London, England, 2nd edition, 1979.
[53] Dimitris Vassiliadis. The input-state space approach to the prediction of auroral geomagnetic activity from solar wind variables. Int. Workshop on Applications of Artificial Intelligence in Solar Terrestrial Physics, September 1993.
[54] Stephen Wolfram. Mathematica. Addison-Wesley, 2nd edition, 1991.
[55] Forrest W. Young. Multidimensional Scaling: History, Theory and Applications. Lawrence Erlbaum associates, Hillsdale, New Jersey, 1987.