Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence

Learning from Spatial Overlap

Michael H. Coen(1,2), M. Hidayath Ansari(1), and Nathanael Fillmore(1)
(1) Dept. of Computer Sciences
(2) Dept. of Biostatistics and Medical Informatics
University of Wisconsin-Madison, Madison, WI 53706, USA
Abstract

This paper explores a new measure of similarity between point sets in arbitrary metric spaces. The measure is based on the spatial overlap of the shapes and densities of these point sets. It is applicable in any domain where point sets are a natural representation for data. Specifically, we show examples of its use in natural language processing, object recognition in images, and multidimensional point set classification. We provide a geometric interpretation of this measure and show that it is well-motivated, intuitive, parameter-free, and straightforward to use. We further demonstrate that it is computationally tractable and applicable to both supervised and unsupervised learning problems.
Figure 1: We consider the two point sets in Example A to be far more similar to one another than those in Example B, even though the sets in Example A occupy far more area in absolute terms and would be deemed further apart by many distance metrics.
1 Introduction
What does it mean for two things to be similar? This type of
question is commonplace in computational sciences but its
interpretation varies widely. For example, we may represent
proteins, documents, movies, and images as collections of
atoms, words, reviews, and edges respectively. For each of
these representations, we often want to find distance measures that enable meaningful comparisons between sample
instances.
Our contribution in this paper is to formulate and examine
a new measure, similarity distance, that provides an intuitive
basis for understanding such comparisons. Here, our things are finite, weighted point sets of varying cardinality.
The notion of similarity presented here refers to a measure of
the spatial overlap between these point sets. Namely, when
we consider the similarity of two objects, we are asking: to
what degree do their point set representations occupy the
same region in a metric space? The goal of this paper is to
formalize and answer this question; to compare our solution
to other approaches; and to demonstrate its utility in solving
real-world problems.
It is easiest to begin with an intuitive, visual presentation
of the problem and definition.
1.1 Problem Statement

In this paper, we focus on the concept of spatial overlap as a measure of similarity. In other words, we would like to
define a distance function with a range over [0, 1], where
a value of 0 means two point sets perfectly overlap and a
value near 1 means they occupy extremely different regions
of space. To turn this into a similarity function (instead of a
distance) we simply subtract the distance from 1. We make
no assumptions about the cardinality of each set or how they
were generated. Nor do we care about the sizes of the regions of space involved, e.g., the hyper-volumes of their
convex hulls.
An image is useful for illustrating this idea. Consider the
two examples in Figure 1. Each shows two overlapping samples (shown in red and blue respectively) drawn from Gaussian distributions; we would like to compare the similarity
of these samples, each of which is commonly called a point
set. Our intention is that the point sets in Example A should
be judged much more similar than those in Example B based
on their degree of spatial overlap, despite the points in Example A covering orders of magnitude more area than those
in Example B. We discuss the relationship between similarity and distance below, but we note that the relatively tiny
distances involved in Example B would lead many distance
metrics to indicate they are “closer” to one another; this is
the opposite of what we would like to find.
2 Similarity Distance
Similarity distance (dS) is derived from the Kantorovich-Wasserstein metric (dKW) (Kantorovich 1942; Deza and Deza 2009), which provides a solution to the Transportation Problem posed by Monge in 1781. This problem may be
stated: What is the optimal way to move a set of masses from
suppliers to receivers, who are some distance away? Optimal in this definition means minimizing the amount of total
work performed, where work is defined as mass × distance.
For example, we might imagine a set of factories that stock
a set of warehouses, and we would like to situate them
to minimize the amount of driving necessary between the
two. This problem has been rediscovered in many guises,
most recently in a modified form as the Earth Mover’s Distance (Rubner, Tomasi, and Guibas 2000), which has become popular in computer vision.
It is useful to view the Kantorovich-Wasserstein distance
as the maximally cooperative way to transport masses between sources and sinks. Here, cooperative means that the
sources “agree” to transport their masses with a globally
minimal cost. In other words, they communicate to determine how to minimize the amount of shipping required.
Let us contrast this optimal view with the notion that each
source delivers its mass to all sinks independently of any
other sources, in proportion to its production. We will call
this naive transportation distance (dNT ). In other words, the
sources do not communicate. Each simply makes its own
deliveries to every sink proportionally. Note this is not the worst (i.e., most inefficient) transportation scheme. It is simply what occurs when the sources are oblivious to one another and do not take advantage of the potential savings that could be gained by cooperation.
2.1 Preliminary Definitions

We define a weighted point set A as a finite collection of points {a_i ∈ X}, where each point has an associated weight ω_i ∈ [0, 1], such that Σ_i ω_i = 1. Thus, ω corresponds to a discrete probability distribution over some domain; for example, X could be R^d.

The similarity distance dS(A, B) between two such point sets A and B is simply the ratio of these two metrics, namely:

    d_S(A, B) = \frac{d_{KW}(A, B)}{d_{NT}(A, B)}

By this definition, dS(A, B) measures the optimization gained by adding cooperation when moving the source A onto the sink B.[1] Thus, it is a dimensionless quantity that ranges between zero and one. For clarity, let us examine dS at its two extremes. If dS(A, B) = 0, then dKW(A, B) = 0, implying the maximally cooperative distance between A and B is zero. This can occur only when A = B, namely, when they perfectly overlap; this means each source is co-located with a sink expecting precisely as much mass as it produces.

In contrast, suppose dS(A, B) → 1. This tells us that cooperation does not help during transportation. This occurs when A and B are so far apart that the points in A are much closer to other points in A than to those in B, and vice versa. Thus, cooperation does not yield any significant benefit. In this case, dKW(A, B) → dNT(A, B), implying dS(A, B) → 1.

As dNT(A, B) ≥ dKW(A, B) by definition, this provides the upper bound for dS(A, B) of 1. We see this in Figure 2, where the similarity distance between the two illustrated point sets quickly approaches 1 as they are separated. Conversely, as the point sets increasingly overlap, their similarity distance approaches zero rapidly.[2]

Figure 2: The graph on the right plots similarity distance as a function of the separating distance Δ between the two point sets shown on the left. As can be seen, similarity distance grows non-linearly as the distance between the point sets increases and then quickly approaches its asymptotic limit of 1.

[1] Note that dNT = 0 iff both point sets contain exactly the same singleton point. In this case, dS is undefined.
[2] Code implementing our approach and all data used in this paper are freely available at http://biocomp.wisc.edu/data.

2.2 Formal Definitions

Kantorovich-Wasserstein Distance. Consider two point sets A = {a_1, ..., a_m} with associated weights p_i and B = {b_1, ..., b_n} with associated weights q_j, with both sets of weights summing to one. Treating A and B as random variables taking values {a_i} and {b_j} with probabilities {p_i} and {q_j} respectively, dKW is obtained by minimizing the expected distance between A and B over all joint distributions F = (f_ij) of A and B. The discrete formulation of dKW is easily obtained through the discrete version of the Mallows Distance (Levina and Bickel 2001), and computing dKW(A, B) thus corresponds to the following minimization problem:

    E_F \|A - B\| = \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} \, \|a_i - b_j\|_2 = \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} \, d_{ij}    (1)

where F is subject to:

    f_{ij} \ge 0, \quad 1 \le i \le m, \; 1 \le j \le n    (2)

    \sum_{j=1}^{n} f_{ij} = p_i, \quad 1 \le i \le m    (3)

    \sum_{i=1}^{m} f_{ij} = q_j, \quad 1 \le j \le n    (4)

    \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \sum_{i=1}^{m} p_i = \sum_{j=1}^{n} q_j = 1    (5)

Once so formulated, this optimization problem may be solved using the transportation simplex algorithm. Although this algorithm is known to have exponential worst-case runtime, it is remarkably efficient on most inputs and therefore widely used. We discuss runtime complexity and an approximation technique for enormous point sets in sections 2.3 and 2.4.
Naive Transportation Distance. We now define a naive solution to the transportation problem. Here, each "supply" point is individually responsible for delivering its mass proportionally to each "receiving" point. In this instance, none of the shippers cooperate, leading to inefficiency in shipping the overall mass from one probability distribution to the other. Over weighted point sets corresponding to discrete distributions, we define naive transportation distance dNT as:

    d_{NT}(A, B) = \sum_{i=1}^{m} p_i \, d_{KW}(\{a_i\}, B) = \sum_{i=1}^{m} \sum_{j=1}^{n} p_i q_j \, d(a_i, b_j)    (6)

The naive distance is therefore the weighted sum of the "ground" distances d between individual points. It is straightforward to directly calculate dNT in O(k^2) time, where k = max(m, n).
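The definitions above translate directly into a small amount of code. The following is a minimal sketch, not the authors' released implementation (which uses the transportation simplex algorithm); it solves the transportation problem of equations (1)-(5) with SciPy's generic LP solver, evaluates dNT from equation (6), and returns their ratio. The function and variable names are ours.

import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def kantorovich_wasserstein(A, p, B, q):
    # d_KW: minimal expected ground distance over all couplings F of (p, q).
    m, n = len(A), len(B)
    D = cdist(A, B)                              # ground distances d_ij = ||a_i - b_j||_2
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):                           # row sums of F equal p_i (eq. 3)
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):                           # column sums of F equal q_j (eq. 4)
        A_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

def naive_transportation(A, p, B, q):
    # d_NT: every source ships to every sink in proportion to the sink's weight (eq. 6).
    return float(p @ cdist(A, B) @ q)

def similarity_distance(A, p, B, q):
    # d_S = d_KW / d_NT: 0 for perfectly overlapping sets, near 1 for well-separated ones.
    return kantorovich_wasserstein(A, p, B, q) / naive_transportation(A, p, B, q)

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(60, 2)); p = np.full(60, 1 / 60)
B = rng.normal(0.0, 1.0, size=(80, 2)); q = np.full(80, 1 / 80)
print(similarity_distance(A, p, B, q))           # small value: the two samples overlap heavily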
2.3 Computational Complexity

The complexity of computing similarity distance is dominated by computation of the Kantorovich-Wasserstein distance dKW, which is a well-studied problem; using the Hungarian method has worst-case complexity O(n^3) (Li 2010) in unrestricted metric spaces. Recently a number of linear or sublinear time approximation algorithms have been developed for this problem and several variations, e.g., (Li 2010; Ba et al. 2009). We have tested our implementation, which uses the transportation simplex algorithm, over several hundred thousand pairs of point sets drawn from standard statistical distributions and real-world data sets. The runtime has expected time complexity of (1.38 × 10^-7) n^2.6 seconds, fit with an R^2 value of 1, where n is the size of the larger of the two point sets being compared. (We are particular to provide the fitted coefficient, rather than describe the runtime using order notation, as its small value is what allows this approach to be used on larger-scale problems.)

2.4 Hyperclustering

Because similarity distance measures the relative density differences between point sets, it is not overly sensitive to their exact numbers or locations. We use this intuition to approximate similarity distance by grouping nearby points into a single weighted point.

We call these groups of nearby points "hyperclusters" and construct them by recursively splitting the original point sets via k-means (with random initialization) until the maximum interpoint distance within each hypercluster is less than a specified threshold. In Figure 3, we show how the error and runtime change for a pair of point sets as the number of hyperclusters changes. Empirically, this technique allows similarity distance to be approximated closely for sets of millions of points. For example, precisely computing similarity distance for point sets of size 100,000 would take almost 16 days, but an approximate answer can be computed in 46.9 seconds to within 0.01 of the true value.[3] In extensive experimentation with this approximate form of similarity distance, errors of up to 0.05 have little effect and correspond to natural variation in samples drawn from the same distribution.

[3] We determined this by solving similarity distance analytically for several common distributions, thereby providing a way to evaluate approximations.

Figure 3: Error in similarity distance when approximated by hyperclustering, averaged over 30 runs. (A) Here, we sample two sets of size 1000 from the same distribution. Their exact similarity distance is 0.108, which takes 14.3 seconds to compute precisely. We vary the number of hyperclusters, corresponding to a reduction in problem size, and plot the error and overall computation time. (B) Here, we sample two sets of size 100 from poorly-overlapping distributions. The actual similarity distance is 0.879, which takes 16.12 seconds to compute precisely. Note in both cases there is negligible loss in accuracy even when the point set size is reduced by up to 80%.
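To make the scale concrete, the runtime fit above gives 1.38 × 10^-7 × (10^5)^2.6 ≈ 1.4 × 10^6 seconds, i.e., roughly the 16 days quoted for point sets of size 100,000, which is what motivates the hypercluster approximation. The sketch below shows one way the construction described above could be implemented; it assumes Euclidean points with uniform weights, and the function names are ours. The returned representatives and weights can be fed to the similarity_distance sketch given earlier in place of the raw points.

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist

def hyperclusters(X, max_diameter):
    groups, queue = [], [np.asarray(X, dtype=float)]
    while queue:
        G = queue.pop()
        if len(G) == 1 or pdist(G).max() <= max_diameter:
            groups.append(G)                       # diameter small enough: keep as one hypercluster
            continue
        labels = kmeans2(G, 2, minit="points")[1]  # split in two with randomly initialized k-means
        parts = [G[labels == k] for k in (0, 1)]
        if all(len(P) > 0 for P in parts):
            queue.extend(parts)
        else:                                      # degenerate split: fall back to an arbitrary halving
            queue.extend([G[: len(G) // 2], G[len(G) // 2:]])
    reps = np.array([G.mean(axis=0) for G in groups])             # one weighted point per hypercluster
    weights = np.array([len(G) for G in groups], dtype=float) / len(X)
    return reps, weights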
3 Related Work
Prior work on quantifying similarity between point sets or
measuring a distance between them generally falls within
one of three categories, most of which are not specifically
designed to measure overlap or similarity. These measures
are also very sensitive to their parameters, which often requires extensive search for a given problem, making their
use problematic in unsupervised learning problems.
3.1 Point-set Distance extensions
The first set of approaches is inspired by point-set and
Hausdorff distances. Point-set distance is defined between a
single point x and a set of points A as inf y∈A d(x, y). Hausdorff distance is an extension of this concept. The directed
Hausdorff distance DHaus (A, B) between two sets of points
A and B is supx∈A inf y∈B d(x, y) and the Hausdorff distance
between sets A and B is the larger of DHaus (A, B) and
DHaus (B, A). Other metrics inspired from point-set distance
are discussed in Deza and Deza (2009). We discuss this class
of distances further in section 3.4.
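As a small, self-contained illustration of the Hausdorff distance just defined (not part of the paper's method), SciPy provides the directed variant directly; the toy points below are ours. Note how a single outlying point dominates the symmetric result.

import numpy as np
from scipy.spatial.distance import directed_hausdorff

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[0.1, 0.1], [1.1, 0.1], [5.0, 5.0]])
d_AB = directed_hausdorff(A, B)[0]   # sup over x in A of inf over y in B of d(x, y)
d_BA = directed_hausdorff(B, A)[0]
hausdorff = max(d_AB, d_BA)          # dominated entirely by the far-away point (5, 5) in B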
3.2 Root Mean Square Distance
A second method of computing distances between point sets
is to assume an order between the points in them and align
them using an algorithm such as Kabsch (1976) or Procrustes (Goodall 1991). Once an alignment is found, a distortion measure (such as least root mean square distance)
can be calculated by summing up distances between corresponding pairs of points. Clearly this method can only work
for point sets of the same cardinality and is susceptible to
disproportionate influence by outlying points. While modifications exist to overcome these problems, these general
methods of summing distances between pairs of points yield
little information about similarity or shape congruence.
3.3 Normalizations of dKW
Another possible approach to measuring similarity between
two point sets would be to first normalize the point sets and
then apply the Kantorovich-Wasserstein distance. Natural
examples of normalization schemes include linear scaling,
mean-variance normalization, and rank normalization (Stolcke, Kajarekar, and Ferrer 2008). These normalizations can
be useful in various circumstances, but Figure 4 and the
Pearson correlation coefficients in the table show that they
do not capture any notion of spatial overlap.
Figure 4: All six examples in this figure were constructed to have the same dKW and Earth Mover's Distance (= 0.337) between the blue and red point sets, while having markedly different spatial properties from each other. This is reflected in their similarity distances, as shown above each example. The table further illustrates that one cannot simply normalize dKW to obtain the measure provided by dS. The final column shows Pearson correlation coefficients of each normalization with similarity distance, demonstrating that none of them capture the notion of spatial overlap.

                  Panel 1   Panel 2   Panel 3   Panel 4   Panel 5   Panel 6        ρ
dKW                 0.337     0.337     0.337     0.337     0.337     0.337    0.000
Lin. rescaling      0.205     0.478     0.325     0.871     0.282     0.162    0.097
Mean-var norm.      0.914     1.698     1.176     1.991     1.599     0.801   -0.108
Rank norm.          0.205     0.501     0.320     0.500     0.339     0.182   -0.053
dS                  0.656     0.834     0.619     0.056     0.873     0.443    1.000

3.4 Match Kernels
Pyramid match (Grauman and Darrell 2007) and other match kernels have been developed as efficient ways to determine similarity between point sets, especially with vision applications in mind. The focus of match kernels, however, is to find similarity while not penalizing non-similarity. These
kernels find closest pairs among individual points and only
take into account these pairs for the kernel computation.
Thus, such methods do not capture a notion of the “shape” of
the point sets, but instead only their intersection, regardless
of the importance of their non-overlap.
3.5 Others

Kondor and Jebara (2003) propose a kernel which takes into account the density of point sets. They require a Gaussian
distribution to be fit to each point set (or another distribution) and define a kernel based on a probabilistic divergence
measure such as Bhattacharyya distance. This approach is
further kernelized by mapping the elements of each point set
to a new Hilbert space before fitting the parametric model.
The two main issues with this approach are that it assumes
a fixed distribution and is quite inefficient due to expensive
computations involving matrix multiplications, inverses and
determinants.
Many of the approaches mentioned above are lossy in the
sense they rely on only some of the pairwise interactions between points. In doing so, they collapse the problem into
calculating the distances between small sets of the original
points. However, this provides little information about how
similar the overall shapes of the entire point sets are. They
are also neither bounded nor scale-invariant, making absolute judgments of similarity difficult.
4 Applications

In this section we examine applications of similarity distance, used in isolation and as a kernel, to a variety of supervised and unsupervised machine learning problems.

4.1 Document Classification

By modeling the topic of a document as a shape, we can use similarity distance for text classification. We demonstrate this using the 20 Newsgroups dataset (Lang 1995) as a testbed. The task here is to determine which of two newsgroups a given message came from. We do this by mapping the words in each message to points in a "semantic space" so that similar sets of words (documents) have similar shapes (see Pado and Lapata (2007) for an overview of work on semantic spaces). The basis for this space is chosen by selecting a set of reference words occurring in documents that have high mutual information for predicting the source newsgroup. Each word is mapped to a vector consisting of its similarities with each of these reference words, with the similarity between two words being defined by their pointwise mutual information (PMI) (Terra and Clarke 2003). We estimate these PMIs using ratios of the number of hits reported by Google for individual words and pairs of words (Turney and Littman 2005). This construction has a distinct advantage compared to the standard bag-of-words approach because it makes use of semantic relations between words.

For our experiment we chose 30 articles at random from each of two newsgroups, alt.atheism and sci.med, and selected 6 reference words: (christian, doctor, god, medical, say, atheists). We mapped each word w to a vector in R^6 as follows:

    f(w) = (PMI(christian, w), ..., PMI(atheists, w))

We performed classification using k-nearest neighbors (kNN) and support vector machines, using the pyramid match kernel and a kernel derived from similarity distance (1 − dS) to compare documents. To establish a baseline we also used C4.5, Naive Bayes, random forest, and SVMs with common kernels on indicator bag-of-words vectors. Classification metrics in Table 1 show that similarity distance is able to exploit semantic relationships between words (reflected by their mutual information) to successfully classify samples in this experiment. Additionally, similarity distance provides an easy way to visualize and understand the results, something which is uncommon in many classification tasks; an example is shown in Figure 5.

Figure 5: In the example above, point sets corresponding to two documents are plotted in the semantic subspace defined by god and medical. In each plot, one document is displayed in a blue italic font and the other is displayed in a red non-italic font. On the left, the two documents are from the same newsgroup whereas on the right they are from different newsgroups. Similarity distance captures the intuitive notion of spatial overlap corresponding to these classifications. (Only two of 6 dimensions are visualized here.)

Table 1: Results of the text experiment using 10-fold cross validation. The lower group of rows uses the semantic space representation; the dS-based classifiers are our approach, the best of which is the SVM with the (1 − dS) kernel. See text for details.

Classifier                                   Accuracy   Precision   Recall   F-Measure
Baseline (bag-of-words)
  C4.5 (J48)                                  73.33%     0.763       0.733    0.726
  Naive Bayes                                 75.00%     0.789       0.750    0.741
  Random forest                               78.33%     0.784       0.783    0.783
  SVM (RBF kernel)                            76.67%     0.800       0.767    0.760
  SVM (polynomial kernel)                     83.33%     0.847       0.833    0.832
Semantic space
  SVM (Pyramid match kernel)                  75.36%     0.742       0.719    0.730
  1-nearest neighbor (dS)                     85.00%     0.860       0.850    0.849
  2-, 3-, 4-nearest neighbor (dS)             85.00%     0.854       0.850    0.850
  5-nearest neighbor (dS)                     81.67%     0.835       0.817    0.814
  SVM ((1 − dS) kernel)                       92.75%     0.909       0.938    0.923
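The (1 − dS) kernel used above can be handed to an off-the-shelf SVM as a precomputed Gram matrix. The following sketch shows one way to do so with scikit-learn; it reuses the similarity_distance sketch from Section 2, and train, test, y_train, and the per-document point sets are placeholders rather than artifacts of the paper. We use (1 − dS) here simply as a similarity score and make no claim about positive semi-definiteness.

import numpy as np
from sklearn.svm import SVC

def gram_matrix(docs_a, docs_b):
    # Pairwise (1 - d_S) similarities between two lists of weighted point sets.
    K = np.zeros((len(docs_a), len(docs_b)))
    for i, (A, p) in enumerate(docs_a):
        for j, (B, q) in enumerate(docs_b):
            K[i, j] = 1.0 - similarity_distance(A, p, B, q)
    return K

# train/test: lists of (points, weights) pairs, one per document; y_train: newsgroup labels.
K_train = gram_matrix(train, train)
clf = SVC(kernel="precomputed").fit(K_train, y_train)
y_pred = clf.predict(gram_matrix(test, train))   # rows index test documents, columns index training documents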
4.2 Object Recognition in Images
We applied similarity distance to an image classification task
on a subset of the publicly available ETH-80 dataset (Leibe
and Schiele 2003), using the data and experimental setup
of (Grauman and Darrell 2007). The subset contains 5 views
of each object in the database. Our experiment used a total
of 256 descriptors in 128 dimensions per image. We trained
an SVM classifier using a variety of kernels on the following
problem: how well can the category of a holdout object be
identified after training on the rest of the data including other
instances of objects from that category? Accuracy results are
shown in Figure 7.
Figure 7: Example images and classification results from the ETH-80 dataset. Two instances from the 8 classes are shown.

Algorithm                                Accuracy
Similarity distance kernel                   94%
Match Kernel (Wallraven et al. 2003)         90%
Pyramid Match Kernel                         89%
4.3 Classification of Synthetic Data

A common assumption in machine learning is that data from different classes come from different underlying distributions. It may be the case that instances come in "bags" of points from the same distribution (for example, multiple observations at a single time point). We simulate an example where we sample sets of points from two different multivariate statistical distributions and see how well similarity distance can classify instances. The density functions of the two distributions are plotted along the x-axis for one dimension and along the y-axis for the other dimension in Figure 6(A).

We sample point sets from each distribution with varying numbers of points and train an SVM to separate them, using Pyramid Match and (1 − dS) as similarity functions. Similarity distance was able to achieve a 79.4% 10-fold cross-validated classification accuracy, whereas Pyramid Match achieved an accuracy of 68.3%. Note that point sets from each class appear very similar (an example from each class is shown in Figure 6(B)), and it is the relative density at various locations that separates them. In this particular case, the means and variances of the two distributions are nearly identical along each dimension.

Figure 6: (A) Probability density function plots along each dimension for point sets sampled in Experiment 2. (B) Examples of two point sets that are sampled from distributions shown in red and blue in (A). (C) All points in all point sets used in Experiment 2. The points in one example point set are connected with lines. (D) This panel shows the same data as (C), but in a 2-D space reconstructed via approximate multidimensional scaling using pairwise similarity distances between the point sets. Each point in this panel thus represents an entire point set from the original data, and distances between points in this panel correspond to similarity distances between the point sets they represent. The separating line is an imaginary separator that a support vector machine might create using a kernel based on similarity distance.
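The multidimensional scaling view in Figure 6(D) is straightforward to reproduce for any collection of point sets. The sketch below (our own illustration, reusing the similarity_distance function from the earlier sketch) builds the matrix of pairwise similarity distances and embeds each point set as a single 2-D point with scikit-learn's metric MDS.

import numpy as np
from sklearn.manifold import MDS

def embed_point_sets(point_sets):
    # point_sets: list of (points, weights) pairs; returns one 2-D coordinate per set.
    n = len(point_sets)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            (A, p), (B, q) = point_sets[i], point_sets[j]
            D[i, j] = D[j, i] = similarity_distance(A, p, B, q)
    return MDS(n_components=2, dissimilarity="precomputed").fit_transform(D)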
4.4 Clustering
Similarity distance has been used in clustering (Coen 2005;
2006), e.g., in learning the vowel structure of an unknown
language, and in comparing different clusterings (Coen,
Ansari, and Fillmore 2010). In the latter of these, set theoretic approaches have long dominated partitional analyses
of cluster assignments. Similarity distance lets us compare
clusterings spatially in terms of their actual geometric arrangements in addition to their category assignments.
5 Conclusion
This paper has formally examined a new measure of similarity between point sets that is based on their spatial overlap.
It captures an inherent mathematical relationship between the
datasets that has strong intuitive appeal. In measuring overlap, it takes no parameters, making it suited for both supervised and unsupervised learning problems. Its spatial dependence also suggests how to approach various problems, i.e.,
by mapping instances to shapes that can be distinguished.
Thus, it is well-suited to problems that can be viewed spatially and has a number of surprising mathematical properties that we are currently investigating.
6 Acknowledgments
This work has been supported by the Department of Biostatistics and Medical Informatics, the Department of Computer Sciences, the Wisconsin Alumni Research Foundation, the Vilas Trust, and the School of Medicine and Public Health, at the University of Wisconsin-Madison. The
authors thank Grace Wahba for helpful discussion and the
anonymous reviewers for their comments.
References

Ba, K. D.; Nguyen, H. L.; Nguyen, H. N.; and Rubinfeld, R. 2009. Sublinear time algorithms for earth mover's distance. arXiv abs/0904.0292.

Coen, M. H.; Ansari, M. H.; and Fillmore, N. 2010. Comparing clusterings in space. In ICML 2010: Proceedings of the 27th International Conference on Machine Learning.

Coen, M. H. 2005. Cross-modal clustering. In AAAI'05: Proceedings of the 20th National Conference on Artificial Intelligence, 932–937. AAAI Press.

Coen, M. H. 2006. Self-supervised acquisition of vowels in American English. In AAAI'06: Proceedings of the 21st National Conference on Artificial Intelligence, 1451–1456. AAAI Press.

Deza, M. M., and Deza, E. 2009. Encyclopedia of Distances. Springer.

Goodall, C. 1991. Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society, Series B (Methodological) 53(2):285–339.

Grauman, K., and Darrell, T. 2007. The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research 8.

Kantorovich, L. V. 1942. On the transfer of masses. Dokl. Akad. Nauk 37. Translated in Management Science (1959) 4:1–4.

Lang, K. 1995. Newsweeder: Learning to filter netnews. In ICML 1995: Proceedings of the 12th International Conference on Machine Learning, 331–339.

Leibe, B., and Schiele, B. 2003. Analyzing appearance and contour based methods for object categorization. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, 409–415.

Levina, E., and Bickel, P. 2001. The Earth Mover's Distance is the Mallows Distance: Some insights from statistics. In IEEE International Conference on Computer Vision, volume 2, 251.

Li, S. 2010. On constant factor approximation for earth mover distance over doubling metrics. arXiv abs/1002.4034.

Pado, S., and Lapata, M. 2007. Dependency-based construction of semantic space models. Computational Linguistics 33(2):161–199.

Rubner, Y.; Tomasi, C.; and Guibas, L. 2000. The Earth Mover's Distance as a metric for image retrieval. International Journal of Computer Vision 40(2):99–121.

Stolcke, A.; Kajarekar, S.; and Ferrer, L. 2008. Nonparametric feature normalization for SVM-based speaker verification. In ICASSP 2008: IEEE International Conference on Acoustics, Speech and Signal Processing, 1577–1580.

Terra, E. L., and Clarke, C. L. A. 2003. Frequency estimates for statistical word similarity measures. In Proceedings of the 2003 Human Language Technology Conference of NAACL.

Turney, P. D., and Littman, M. L. 2005. Corpus-based learning of analogies and semantic relations. Machine Learning 60(1–3):251–278.