Proceedings, The Third AAAI Conference on Human Computation and Crowdsourcing (HCOMP-15)

Crowdsourced Nonparametric Density Estimation Using Relative Distances

Antti Ukkonen, Finnish Inst. of Occupational Health, Helsinki, Finland, antti.ukkonen@ttl.fi
Behrouz Derakhshan, Rovio Entertainment, Espoo, Finland, behrouz.derakhshan@gmail.com
Hannes Heikinheimo, Reaktor, Helsinki, Finland, hannes.heikinheimo@reaktor.com
Abstract

In this paper we address the following density estimation problem: given a number of relative similarity judgements over a set of items D, assign a density value p(x) to each item x ∈ D. Our work is motivated by human computation applications where density can be interpreted e.g. as a measure of the rarity of an item. While humans are excellent at solving a range of different visual tasks, assessing the absolute similarity (or distance) of two items (e.g. photographs) is difficult. Relative judgements of similarity, such as "A is more similar to B than to C", are on the other hand substantially easier to elicit from people. We provide two novel methods for density estimation that use only relative expressions of similarity. We give both theoretical justifications and empirical evidence that the proposed methods produce good estimates.
Introduction
A common application of crowdsourcing is to collect training labels for machine learning algorithms. However, human
computation can also be employed to solve computational
problems directly (Amsterdamer et al. 2013; Chilton et al.
2013; Trushkowsky et al. 2013; Parameswaran et al. 2011;
Bragg, Mausam, and Weld 2013). Some of this work is
concerned with solving machine learning problems with
the crowd (Gomes et al. 2011; Tamuz et al. 2011; van der
Maaten and Weinberger 2012; Heikinheimo and Ukkonen
2013).
In this paper we focus on the problem of nonparametric
density estimation given a finite sample from an underlying distribution. This is a fundamental problem in statistics
and machine learning that has many applications, such as
classification, regression, clustering, and outlier detection.
Density can be understood simply as “the number of data
points that are close to a given data point”. Any item in a
high density region should thus be very similar to a fairly
large number of other items. We argue that in the context
of crowdsourcing, density can be viewed for instance as a
measure of “commonality” of the items being studied. That
is, all items in a high density region can be thought of as
having a large number of common aspects or features. Likewise, items from low-density regions are outliers or in some sense unusual.

To give an example, consider a collection of photographs of galaxies (see e.g. (Lintott et al. 2008)). It is reasonable to assume that some types of galaxies are fairly common, while others are relatively rare. Suppose our task is to find the rare galaxies. In the absence of prior knowledge this can be tricky. One approach is to let workers label each galaxy as either common or rare according to the workers' expertise. We argue that this has two drawbacks. First, evaluating commonality in absolute terms in a consistent manner may be difficult. Second, the background knowledge of the workers may be inconsistent with the data distribution: perhaps a galaxy that would be considered rare under general circumstances is extremely common in the given data. We argue that in such circumstances density estimation may yield a more reliable method for identifying the rare galaxies. A galaxy should be considered rare if there are very few (or no) other galaxies that are similar to it in the studied data.

The textbook approach to nonparametric density estimation is the kernel density estimator (Hastie, Tibshirani, and Friedman 2009, p. 208ff). These methods usually consider the absolute distances between data points. That is, the distance between items A and B is given on some (possibly arbitrary) scale. Absolute distances between data points are also used in other elementary machine learning techniques, such as hierarchical clustering or dimensionality reduction. However, in the context of human computation, it is considerably easier to obtain information about relative distances. For example, statements of the form "the distance between items A and B is shorter than the distance between items C and D" are substantially easier to elicit from people than absolute distances. Moreover, such statements can be collected in an efficient manner via crowdsourcing using appropriately formulated HITs (Wilber, Kwak, and Belongie 2014). It is thus interesting to study the expressive power of relative distances: are absolute distances even needed to solve some problems?

A common application of relative distances are algorithms for computing low-dimensional embeddings (representations of the data in R^m), either directly (van der Maaten and Weinberger 2012) or via semi-supervised metric learning (Schultz and Joachims 2003; Davis et al. 2007; Liu et al. 2012) if features are available. While embeddings can be used for various purposes, including density estimation, we are more interested in understanding what can be done using relative distances directly, without first computing a representation of the items in some (arbitrary) feature space. This is because, despite being powerful, embeddings can be tricky to use in practical situations, e.g. if the true dimensionality of the data is unknown. Moreover, an embedding may have to be re-computed whenever new data is added to the model. We hope to avoid this by operating directly with relative distance judgements.
Our contributions

We consider two approaches for density estimation using relative distances. In particular, we provide the following contributions:

1. We describe how a standard kernel density estimator can be expressed using relative distance judgements, and describe how embeddings to R^m can be used to construct this estimator.

2. We propose a second estimator that is explicitly designed for human computation. The main upside of this estimator is that it is easier to learn and use. While the estimator is biased, the bias can be reduced at the cost of using more work to build the estimator.

3. We conduct experiments on simulated data to compare the two methods, and show that the second estimator is in general easier to learn than the first one, and that it produces better estimates.

High-level overview of the proposed methods

We propose two methods for density estimation using relative distances. The first is a simple variant of a standard kernel density estimator, while the other is a novel method that has some practical advantages over the first one. We call our estimators "triplet density estimators", because the relative distance judgements they use are defined on sets of three items.

Our input is a dataset D, assumed to have been sampled iid. from some unknown density p, and we want to estimate the density p(x) at an arbitrary point x. Both of the proposed methods are based on the observation that the density at a given point x is proportional to the number of close neighbours of x that belong to D. If there are many other points in D in close proximity to x, we say that x is from a high-density region. The question is thus how to identify and count the points that are close to x without measuring absolute distances.

The basic idea of the first method, called TDE1, is illustrated in the left panel of Figure 1. Suppose we have placed a circle with a short radius at every observed data point u_i ∈ D. Given x, we can count how many of these circles cover x, i.e., the number of u_i that are close to x. Observe that to do this we only need to know whether the distance between x and some u_i is shorter or longer than the radius of the circle around u_i. This is a problem of comparing two distances, and thus does not require knowing the absolute distances.

We can thus pair every u_i ∈ D with a reference point u′_i ∈ D so that the distance between u_i and u′_i equals the radius of the circle. However, as the circles at every u_i must have at least roughly the same radius, finding an appropriate reference point for every u_i can be difficult. We also need a means to find the reference points without access to absolute distances. This problem can be solved to some degree using a low-dimensional embedding, but it adds to the complexity of the method.

The second method we propose, called TDE2, is simpler in this regard. Rather than having to find a reference point for every u_i ∈ D, it only uses a random set of pairs of points from D. The only constraint is that the pairs should be "short". We say a pair is short if it is short in comparison to all possible pairs of items from D. The intuition is that short pairs occur more frequently in high-density regions of the density function p. The second method thus counts the number of short pairs that are close to x.

Here the notion of "closeness" also depends on the length of the pair. This is illustrated in the right panel of Figure 1. The pairs are shown in red, with gray circles around their endpoints. The radius of the circles is equal to the length of the respective pair. The pair is considered to be close to x if x is covered by at least one of the circles that are adjacent to the pair. Very short pairs must be very close to x in order to cover it, while longer pairs also cover points that are further away.

It is easy to see that TDE1 is essentially a standard kernel density estimator that uses the uniform kernel. The TDE2 estimator is designed for relative distances from the ground up and thus has some advantages over TDE1. In particular, TDE2 is simpler to construct.
Figure 1: Left: Example of the TDE1 estimator we propose. Right: Example of the TDE2 estimator that we propose.
TDE1: A simple triplet density estimator
Our first method can be viewed as a standard kernel density estimator that does not use absolute distances. First, we recall the basic definition of a kernel density estimator as originally proposed (Rosenblatt 1956; Parzen 1962), and as usually given in textbooks, e.g. (Hastie, Tibshirani, and Friedman 2009, p. 208ff). Then, we describe how a simple instance of this can be implemented using only relative distances.
Basics of kernel density estimation
Let D = {u_1, ..., u_n} denote a data set of size n drawn from a distribution having the density function p. For the moment, assume that D consists of points in R^m. Given D, the kernel density estimate (KDE) of p at any point x is given by

KDE_n(x) = (1 / (n h^m)) Σ_{i=1}^{n} K((x − u_i)/h),   (1)

where K is the kernel function and h is the bandwidth parameter. Fairly simple conditions on the kernel function K and the bandwidth h are sufficient to guarantee that the kernel density estimate is an unbiased and consistent estimate of p. That is, as the size of D increases, we have

lim_{n→∞} E_{D∼p}[KDE_n(x)] = p(x),   (2)

where the expectation is taken over random samples from p of size n. Most importantly, the kernel function K must satisfy

K(y) ≥ 0 for all y, and ∫_{R^m} K(y) dy = 1.   (3)

In this paper we consider radially symmetric kernels. In addition to Equation 3 these satisfy K(y) = K(y′) ⇔ ∥y∥ = ∥y′∥. That is, the kernel is a function of only the length of its input vector. Also, we want K(y) to be non-increasing in ∥y∥. Now, assume that there exists a distance function d between the items in D. For the moment, let d(x, y) = ∥x − y∥. We can rewrite Equation 1 as follows:

KDE_n(x) = (1 / (n h^m)) Σ_{i=1}^{n} K(d(x, u_i)/h).   (4)

The bandwidth h is a parameter that must be specified by the user. There are a number of methods for choosing h in an appropriate manner. In practice the bandwidth h should decrease as n increases. In fact, we must have lim_{n→∞} h(n) = 0 for Equation 2 to hold. In the experiments we use cross-validation to select an appropriate value of h when needed.
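To make Equations 1 to 4 concrete, the following is a minimal sketch of a kernel density estimator for points in R^m. Python with numpy is our choice of illustration here and in the sketches below, not something the paper prescribes; the Gaussian kernel and the bandwidth value are likewise illustrative assumptions.

import numpy as np

def kde(x, D, h):
    """Kernel density estimate at x (Equation 4) for data D of shape (n, m)."""
    D = np.asarray(D, dtype=float)
    n, m = D.shape
    # A radially symmetric Gaussian kernel: K(y) = exp(-||y||^2 / 2) / (2 pi)^(m/2).
    # It is non-negative and integrates to 1 over R^m, as Equation 3 requires.
    dists = np.linalg.norm(x - D, axis=1)              # d(x, u_i)
    k_vals = np.exp(-0.5 * (dists / h) ** 2) / (2 * np.pi) ** (m / 2)
    return k_vals.sum() / (n * h ** m)                 # (1/(n h^m)) sum_i K(d(x,u_i)/h)

# Example: density of a standard normal sample at the origin.
rng = np.random.default_rng(0)
sample = rng.standard_normal((200, 1))
print(kde(np.zeros(1), sample, h=0.5))   # true density at 0 is ~0.399; the
                                         # estimate is close, up to smoothing bias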
Normalised densities and relative distances

Before continuing, we point out a property of density functions and relative distances. Recall that the integral of a density function over its entire domain must by definition be equal to 1. This means that the value p(x) is not scale invariant: the precise value of p(x) depends on the "unit" we use to measure distance. If we measure distance in different units (say, feet instead of meters), the absolute value of p(x) must change accordingly. This is reflected in Equation 1 through the bandwidth parameter h, which must have the same unit as the distance d(u, v).

However, when dealing with relative distances there is no distance unit at all. We only know that some distance is shorter (or longer) than some other distance; the unit in which the distances are measured is in fact irrelevant. An important consequence of this is that it is not possible to devise a normalised density estimator based on relative distances. In other words, Equation 2 cannot hold with equality. The best we can hope for is to find an estimator that is proportional to p(x). Later we show how to derive a normalised estimator for one-dimensional data, but this makes use of absolute distances in the normalisation constant, much in the same way that h appears in the normalisation constant of Equation 1. In practice we must thus consider non-normalised estimates of p(x) when using relative distances.
Uniform kernel with relative distances

The simplest kernel function K that satisfies the requirements given above is the uniform kernel. It takes some constant positive value if x is at most at distance h from the data point u_i ∈ D at which the kernel is located, and is otherwise equal to zero. Formally, the m-dimensional uniform kernel is defined as

K^unif(y) = V(m)^{−1} if ∥y∥ ≤ 1, and K^unif(y) = 0 if ∥y∥ > 1,   (5)

where V(m) is the volume of the m-dimensional unit sphere, defined as V(m) = π^{m/2} / Γ(m/2 + 1), where Γ is the Gamma function.

In this case the density estimate at x is simply proportional to the number of data points that are at most at a distance of h from x. We say the point u covers x with bandwidth h whenever d(x, u)/h ≤ 1, i.e., whenever K^unif(d(x, u)/h) is positive. This can be seen as a straightforward generalisation of a histogram that has unspecified bin boundaries. In Figure 1 (left) we can say that the density at x is proportional to the number of "discs" that cover x. It is easy to see that as n increases and h goes to zero (i.e., there are more points covered by smaller and smaller discs), this quantity will indeed approach the underlying density p.

We continue by designing a triplet density estimator that uses the uniform kernel K^unif. To compute K^unif, we must only determine whether a data point u_i is near enough to x. However, we cannot evaluate d(u, v) explicitly for any u, v ∈ D. Instead, we only have access to an "oracle" that can give statements about the relative distance between items. That is, we can issue tasks such as

"Is the distance d(u, x) shorter than the distance d(u, v)?"

Or, in more natural terms:

"Of the items x and v, which is more similar to u?"

We can use this task to compute the uniform kernel as follows. Suppose that for every u_i ∈ D, we have found some other u′_i ∈ D, so that d(u_i, u′_i) is equal to h. Denote this pairing by P(h). Note that several u ∈ D can be paired with the same u′; P(h) is not required to be a proper matching. Given P(h), by the definition of the uniform kernel in Equation 5, we have

K^unif(d(x, u_i)/h) = V(m)^{−1} I{d(u_i, x) ≤ d(u_i, u′_i)}.   (6)
This, together with Equation 4, gives the triplet density estimator

TDE1_n(x) = (1 / (V(m) n h^m)) Σ_{(u,u′) ∈ P(h)} I{d(u, x) < d(u, u′)}.   (7)

This simply counts the number of u ∈ D that cover x with the specified bandwidth h. As we only need an oracle that can provide relative distance judgements to evaluate the indicator function above, Equation 7 provides a kernel density estimator that is suitable for human computation. Moreover, all known results that apply to Equation 1 (e.g. the one of Equation 2) directly carry over to Equation 7.

There are some issues with this result, however. As discussed above, in practice we cannot evaluate the normalisation constant: both m and the exact value of h are unknown if our input consists only of relative distance comparisons. More importantly, it is unlikely that for a given D we would find a pairing P(h) where the distances between all pairs are exactly equal to h. It is more realistic to assume that we can find a pairing P̃(h) where this holds approximately. That is, for (almost) every (u, u′) ∈ P̃(h), we have d(u, u′) ≈ h. This implies that results about unbiasedness and convergence that hold for Equation 1 are no longer obviously true. While variable-bandwidth density estimators have been proposed in the literature, our problem differs from those in the sense that we do not try to select an optimal local bandwidth around each u ∈ D; rather, the bandwidth at u varies randomly around h. In general it is nontrivial even to quantify this variance, or to show how it affects the resulting estimator.
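The following is a minimal sketch of how Equation 7 can be evaluated once a pairing P(h) is available. The oracle below simulates the HIT using true coordinates; in the crowdsourced setting its body would be replaced by an actual worker judgement, and the unknown normalisation constant 1/(V(m) n h^m) is simply dropped, as discussed above.

import numpy as np

def oracle_is_closer(a, b, c, d):
    """Simulated HIT: is d(a, b) shorter than d(c, d)? Here we cheat and use
    true coordinates; a worker would answer from the items alone."""
    return np.linalg.norm(np.asarray(a) - b) < np.linalg.norm(np.asarray(c) - d)

def tde1_score(x, pairing):
    """Unnormalised TDE1 estimate (Equation 7): the number of points u that
    cover x, i.e. for which d(u, x) < d(u, u') for the reference point u'."""
    return sum(oracle_is_closer(u, x, u, u_ref) for (u, u_ref) in pairing)

The returned count is proportional to the density estimate; since one HIT decides each indicator, evaluating the estimate at a single point x costs |P(h)| = |D| tasks, in line with the cost accounting used in the experiments later.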
This raises one important question, however: if we are not able to evaluate the absolute value of d(u, v), how can we find pairs that are (at least approximately) of length h?

Using an embedding to find P̃(h)

As mentioned in the Introduction, by embedding the items into a low-dimensional Euclidean space, we can trivially estimate density using existing techniques. Such embeddings can be computed from relative distances e.g. using the algorithm described in (van der Maaten and Weinberger 2012). This has the downside that to estimate density at a new point x, we must in general recompute the embedding to see where x is located. Using the technique described above, however, it is possible to estimate density directly, provided we have the set of pairs P(h). Therefore, we use the embedding only for constructing the set P̃(h), as follows:

1. Embed all items in D to R^m using some suitable algorithm. We use the method proposed in (van der Maaten and Weinberger 2012). This results in a set of points X. Note that the dimensionality m of the embedding is a parameter that must be specified by the user.

2. Use any existing technique for bandwidth selection to find a good h given the points X ⊂ R^m. We use cross-validation in the experiments later.

3. For every item u ∈ X, find the item v ∈ X so that the distance between u and v is as close to h as possible. This yields the set P̃(h). A sketch of this step is given below.

The pairs (u, v) in P̃(h) all have approximately length h in the embedding. Of course this length is not the same as the unknown distance d(u, v), but we can assume that d(u, v) is approximately the same for every (u, v) ∈ P̃(h).
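A sketch of step 3, assuming steps 1 and 2 have already produced an embedding X and a bandwidth h; brute-force search is used for clarity:

import numpy as np

def pairing_from_embedding(X, h):
    """For every embedded item, pick the item whose embedding distance is as
    close to h as possible. Returns P~(h) as a list of index pairs (i, j)."""
    X = np.asarray(X, dtype=float)
    pairs = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                       # an item cannot be its own reference
        j = int(np.argmin(np.abs(d - h)))   # distance closest to the bandwidth h
        pairs.append((i, j))
    return pairs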
TDE2: A second triplet density estimator

The density estimator proposed above (Equation 7) must construct a low-dimensional embedding of the items to find the pairs P̃(h). While there are a number of techniques for finding such embeddings, they are not necessarily optimal for density estimation. Their objective may be to create visually pleasing scatterplots, which requires introducing additional constraints that favor embeddings with certain properties. For instance, the methods may focus on short distances, so that the local neighbourhood of each data point is preserved better. This emphasises clusters the data may have. As a consequence the resulting distances may become disproportionately skewed towards short distances. However, for the bandwidth selection process to produce good results, we would prefer the resulting distances to be as truthful as possible, possibly at the cost of a less clear visualisation. Moreover, the embedding algorithms tend to be sensitive to initialisation.

Motivated by this, we propose a second triplet density estimator that does not need an embedding. In particular, we show that it is possible to produce good density estimates, provided we can find a sufficient number of short pairs. Unlike in the method proposed above, they do not all have to be approximately of the same length h. Instead, it is sufficient to find pairs that are all at most of a certain length, which we will denote by δmax. We first describe the estimator and prove some of its properties, and continue by discussing the problem of finding short pairs without an embedding.

Basic idea

This variant of the density estimator uses a slightly different human intelligence task:

"Which of the three items x, u, and v is an outlier?"

The worker must identify one of the three items as the one being most dissimilar to the other two. In (Heikinheimo and Ukkonen 2013) this task was used to compute the centroid of a set of items.

We model this HIT by a function Ω that takes as input the item x and the pair (u, v), and returns 1 if the distance between x and u or v is shorter than the distance between u and v, and 0 otherwise. That is, we have Ω(x, (u, v)) = 0 if x is chosen as the outlier, and Ω(x, (u, v)) = 1 if either u or v is chosen as the outlier. More formally, we have

Ω(x, (u, v)) = I{min(d(x, u), d(x, v)) ≤ d(u, v)},

and we say that the pair (u, v) covers the item x whenever Ω(x, (u, v)) = 1.

Our method is based on the following observation. Let P = {(u_i, v_i)}_{i=1}^{l} denote a set of pairs, where every u_i and v_i is drawn from D uniformly at random. Moreover, let P^δmax ⊂ P denote the set of "short pairs", meaning those
for which we know the distance d(u, v) to be below some threshold δmax. Now, fix some x ∈ D, and consider the number of short pairs (u, v) ∈ P^δmax that cover x. If x resides in a high-density region of p, we would expect there to be many short pairs that cover x, while there are fewer such pairs if x is from a low-density region. This is illustrated in the right panel of Fig. 1.

The estimator that we propose, called TDE2, is defined as described above. We first draw an iid. sample of pairs P from p, and then remove from P all pairs that are longer than δmax to obtain P^δmax. As we discuss in the remainder of this section, we can estimate the density p(x) at any given x by computing the number of pairs in P^δmax that cover x and taking a simple transformation of the resulting number.
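A sketch of the oracle Ω and of the resulting count of covering short pairs; as before, the simulated oracle uses true coordinates in place of crowd answers.

import numpy as np

def omega(x, u, v):
    """Outlier HIT: 1 iff min(d(x,u), d(x,v)) <= d(u,v), i.e. x is not the
    outlier among {x, u, v}, so the pair (u, v) covers x."""
    x, u, v = np.atleast_1d(x), np.atleast_1d(u), np.atleast_1d(v)
    return int(min(np.linalg.norm(x - u), np.linalg.norm(x - v))
               <= np.linalg.norm(u - v))

def tde2_score(x, short_pairs):
    """S(x, P^delta_max): the number of short pairs that cover x."""
    return sum(omega(x, u, v) for (u, v) in short_pairs)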
Preliminary definitions and some simple results

As discussed above, the estimate that TDE2 gives for p(x) is based on the number of short pairs that cover x when the pairs are drawn independently from p. Suppose x is fixed, and we obtain the pair (u, v) by first drawing u according to p, and then drawing v independently of u, also from p. Let δmax denote some distance. Consider the events

A = "the item x is covered by the pair (u, v)", and
B = "the distance d(u, v) is at most δmax".

Let q(x, δmax) denote the probability Pr[A ∩ B]. The TDE2 estimator actually estimates q(x, δmax). Observe that we have

q(x, δmax) = Pr[A ∩ B] = Pr[A | B] Pr[B].   (8)

Now, let P be a set of pairs sampled iid. from p, and denote by P^δmax the subset of P that only contains pairs up to length δmax, i.e., P^δmax = {(u, v) ∈ P | d(u, v) ≤ δmax}. Observe that |P^δmax| / |P| is an unbiased and consistent estimate of Pr[B], i.e., we have

lim_{|P|→∞} |P^δmax| / |P| = Pr[B].   (9)

To estimate Pr[A | B], we define the triplet score of the item x given the set of pairs P as

S(x, P^δmax) = Σ_{(u,v) ∈ P^δmax} Ω(x, (u, v)),   (10)

and find that S(x, P^δmax) / |P^δmax| is an estimate of the probability Pr[A | B], i.e.,

lim_{|P|→∞} S(x, P^δmax) / |P^δmax| = Pr[A | B].   (11)

By combining Equations 8, 9 and 11, we obtain the following:

Proposition 1. Given the set P of pairs sampled independently from p, we have

lim_{|P|→∞} S(x, P^δmax) / |P| = q(x, δmax).

This shows that TDE2 is an unbiased and consistent estimate of q(x, δmax).
Results for 1-dimensional data

Next, we discuss the connection between q(x, δmax) and p(x). Our objective is to show that by taking a suitable transformation of our estimate for q(x, δmax) we can obtain a reasonable estimate for p(x).

For one-dimensional data it is fairly easy to see when events A and B happen simultaneously, and to derive an expression for q(x, δmax). Consider all pairs of length δ, starting from δ = 0 up to δ = δmax. For every such pair (u, v) of length δ, item x is covered by (u, v) exactly when u is contained in the interval [x − 2δ, x + δ] and v is at u + δ. (Thus v must be in the interval [x − δ, x + 2δ].) If this happens we have min{d(x, u), d(x, v)} ≤ d(u, v) and Ω(x, (u, v)) = 1. Next, suppose the midpoint of the pair (u, v) is at j. Clearly we must have u = j − δ/2 and v = j + δ/2. Using these we can write q(x, δmax) as

q(x, δmax) = 2 ∫_{δ=0}^{δmax} ∫_{j=x−(3/2)δ}^{x+(3/2)δ} p(j − δ/2) p(j + δ/2) dj dδ.   (12)

The constant 2 is needed because we can obtain (u, v) by first drawing u and then v, or vice versa. The following proposition shows how q(x, δmax) is connected to p(x).

Proposition 2. Let the function q(x, δmax) be defined as in Equation 12. We have

lim_{δmax→0} √( q(x, δmax) / (3 δmax²) ) = p(x).   (13)

Proof. See Appendix.

This shows that by choosing a sufficiently small δmax, we can obtain approximate estimates for p(x) by taking a simple transformation of the estimate for q(x, δmax). We conclude this section by combining this observation with Propositions 1 and 2 into the following result.

Corollary 1. Let p be a univariate density function, and let P be a set of pairs sampled independently from p. Let P^δmax ⊆ P contain those pairs in P having length at most δmax. For a sufficiently small δmax we have

p̂(x, δmax) = √( S(x, P^δmax) / (3 δmax² |P|) ) ≈ p(x).

Observe that p̂(x, δmax) as defined above yields estimates that are appropriately normalised. As |P| grows, S(x, P^δmax)/|P| is guaranteed to converge to q(x, δmax), so p̂(x, δmax) has a bias with respect to p(x) that depends only on δmax. By "sufficiently small" we mean that δmax should be as small as possible without making the set P^δmax too small. We continue with a simple study of the effect of δmax on the bias and variance of p̂(x, δmax).
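As a sketch, Corollary 1 can be turned into the following one-dimensional estimator. Note that filtering P by absolute pair length, as done here, is exactly the step that the percentile-based construction described later replaces with relative comparisons.

import numpy as np

def p_hat(x, pairs, delta_max):
    """Corollary 1: sqrt(S(x, P^delta_max) / (3 delta_max^2 |P|))."""
    short = [(u, v) for (u, v) in pairs if abs(u - v) <= delta_max]
    s = sum(int(min(abs(x - u), abs(x - v)) <= abs(u - v)) for (u, v) in short)
    return np.sqrt(s / (3 * delta_max ** 2 * len(pairs)))

# Example: pairs drawn iid from a standard normal, density estimated at 0
# (true value ~0.399; the estimate carries a bias depending on delta_max, cf. Fig. 2).
rng = np.random.default_rng(1)
pairs = list(zip(rng.standard_normal(5000), rng.standard_normal(5000)))
print(p_hat(0.0, pairs, delta_max=0.5))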
Bias and variance of the 1-dimensional estimator, an empirical study

Figure 2: Top: Examples of the estimator p̂(x, δmax) for different values of δmax (0.5, 1, and 2) when the underlying density is a standard normal distribution. Bottom: Examples of estimating a bimodal distribution from a sample of 500 data points with 500 and 5000 pairs (dashed lines) when δmax = 1.0.

Figure 3: Top: Estimation error (MSE, wrt. the true density, split into total error, bias, and variance) as a function of δmax when the estimate is computed using 1000 pairs drawn from a sample of 200 data points from a standard normal distribution. Bottom: Correlation of the estimate (δmax = 0.5) with the true density (the bimodal distribution shown on the left) as a function of the number of pairs in P. The inset shows the number of short pairs that remain in P^δmax after removing those longer than δmax.
The estimator p̂(x, δmax) (Corollary 1) is biased, with the bias increasing as δmax increases. We studied the effect of the bias empirically. The top panel in Fig. 2 shows the shape of √( q(x, δmax) / (3 δmax²) ) for different values of δmax when p is the standard normal distribution. These are what p̂(x, δmax) will converge to as the size of P increases. On the other hand, for "small" δmax the estimate for q(x, δmax) will have a high variance, as there are only few pairs of length at most δmax.

The top panel of Fig. 3 shows how the errors behave due to bias and variance as a function of δmax. The estimate is computed from 200 points sampled from a standard normal distribution. For small δmax the total error consists mainly of variance due to the small size of P^δmax. As δmax increases the variance component diminishes, and most of the remaining error is due to bias.

In practice it is possible to set δmax so that the estimator produces useful results. The bottom panel of Fig. 2 shows an example of a mixture of two normal distributions (black curve), the limit to which p̂(x, 1.0) would converge (blue curve), as well as two estimates computed with 500 and 5000 pairs in P, respectively (the red dashed curves). We can observe that even 500 pairs are enough to obtain a reasonable estimate. Finally, the bottom panel of Fig. 3 shows how the estimation accuracy (correlation between the estimate and the true density) behaves as the number of pairs in P is increased from 250 to 2500.
Generalisation to higher dimensions

Above we showed that for one-dimensional data the square root of q(x, δmax) is proportional to p(x) as δmax approaches zero. We can show a similar result for inputs of arbitrary dimension.

Proposition 3. Let q(x, δmax) be defined as in Equation 8, and let C(δmax) be a suitably chosen function of δmax that does not depend on x or p. We have

lim_{δmax→0} √( q(x, δmax) / C(δmax) ) = p(x).   (14)

Proof. See Appendix.

This result is important because in general we cannot know the true, underlying dimensionality of the data. Hence, our estimator should be independent of the dimensionality, and Proposition 3 shows this to be the case. Finally, since q(x, δmax) is in practice proportional to the triplet score S(x, P^δmax) (Eq. 10), our estimator can be defined simply as √( S(x, P^δmax) ), because both C(δmax) and |P| are constants that do not depend on x.

Finding short pairs without embeddings

Above we showed how the length of the pairs that are used to compute p̂(x, δmax) affects the quality of the estimate. In particular, we should only use pairs that are "short" in comparison to the average length of the pairs. In practice we do not have the absolute lengths of the pairs, however.
We can only obtain relative distance information, and thus cannot simply filter the pairs based on their absolute length. Instead, we must find short pairs by some other means. Next we discuss a simple method for doing this.

Let P^δmax denote the set of pairs that are at most of length δmax. Since all pairs in P are drawn uniformly at random from p, we know that |P^δmax| / |P| is an unbiased estimate of the probability Pr[d(u, v) ≤ δmax], as was discussed above. Inversely, by fixing some value α ∈ [0, 100], we can consider all pairs in P for which the length falls below the α:th percentile of the length distribution of all pairs in P. By setting α to some small value, e.g. α = 10, we obtain pairs that correspond to some unknown "small" value of δmax, so that α/100 = Pr[d(u, v) ≤ δmax]. Let P^α denote the set of pairs selected for a given α. This is simply an alternative formulation of "short" pairs that does not consider absolute distances.

Finding the α:th percentile of a set of values can be done efficiently using a selection algorithm. In general, these find the i:th largest value in a given list of values without sorting the values. They commonly run in time linear in the size of their input, which is |P| in this case. Note that e.g. the Quickselect algorithm (Hoare 1961) (and its variants, e.g. the Floyd-Rivest algorithm (Floyd and Rivest 1975)) also finds all pairs that are below the percentile as a simple side effect; no separate filtering step is thus needed.

Most selection algorithms are based on pairwise comparisons, and can thus be implemented without explicitly evaluating d(u, v) for any (u, v) ∈ P. We only need an oracle that can evaluate the indicator function I{d(u, v) < d(u′, v′)}. This is easily expressed as the following human intelligence task:

"Which of the two pairs, (u, v) or (u′, v′), is more similar?"

Moreover, by definition of the percentile, we will select (α/100)·|P| pairs. This can be used to control the size of the resulting estimator (the set P^α). The TDE1 estimator proposed above uses by definition n = |D| pairs. To make the TDE2 estimator comparable in complexity, we should use 100n/α pairs in P to have a P^α of the desired size. Note that both P and P^α are in this case linear in n. This is a desirable property, as the number of pairs directly translates to the number of HITs needed to construct as well as to evaluate the estimator. For this it is crucial that we do not need all pairs that are shorter than δmax; a uniform sample of these is sufficient.

Using a standard selection algorithm for finding P^α has the downside that the necessary HITs cannot be solved independently. E.g. Quickselect operates in the same way as Quicksort, by placing items on either side of a pivot item. This creates dependencies among the tasks, where one task must be solved before others can even be submitted to the workers. Developing a selection algorithm without such dependencies is an interesting open question.
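A sketch of finding the shortest pairs with a Quickselect-style recursion in which every comparison is one "which pair is more similar" HIT. The simulated comparison again stands in for workers, and a real deployment would have to handle the task dependencies discussed above.

import numpy as np

def pair_is_shorter(p, q):
    """Simulated HIT: 'Which of the two pairs, p or q, is more similar?'"""
    (u, v), (a, b) = p, q
    return np.linalg.norm(np.subtract(u, v)) < np.linalg.norm(np.subtract(a, b))

def k_shortest_pairs(pairs, k):
    """The k shortest pairs, found using pairwise comparisons only; expected
    linear number of comparisons on randomly ordered input, like Quickselect."""
    if k <= 0:
        return []
    if k >= len(pairs):
        return list(pairs)
    pivot, rest = pairs[0], pairs[1:]
    shorter = [p for p in rest if pair_is_shorter(p, pivot)]
    longer = [p for p in rest if not pair_is_shorter(p, pivot)]
    if len(shorter) >= k:
        return k_shortest_pairs(shorter, k)
    return shorter + [pivot] + k_shortest_pairs(longer, k - len(shorter) - 1)

# P^alpha for alpha = 10 would be: k_shortest_pairs(P, int(0.10 * len(P)))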
Experiments

In this section we compare the proposed density estimators, TDE1 and TDE2, using simulations.

Experimental setup

We measure cost as the number of HITs that are needed to construct the estimator. The methods are implemented as follows:

• TDE1: The estimator is constructed by first embedding the set of items into R² with the method of (van der Maaten and Weinberger 2012). (In practice we have to make some guess about the input data dimensionality when using this method; in these experiments we assume a fixed dimensionality of 2, independently of the true data dimensionality.) Given the embedding, we find an optimal bandwidth h using cross-validation, and select the set P̃(h) by pairing every u ∈ D with the v ∈ D so that the distance between u and v in the embedding is as close to h as possible (see the subsection on finding P̃(h) above). The estimates are computed with Equation 7 (omitting the normalisation).

• TDE2: We use the selection-algorithm-based method described above to compute the set P^α. To make the cost of computing an estimate for a single x equal to the cost of TDE1, we let |P| = (100/α)|D|. This results in |P^α| = |D|. We used α ∈ {0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35}. The estimates are computed as √(S(x, α)), where S(x, α) is defined analogously to S(x, P^δmax).

We use separate training data D_train to learn the estimator, and evaluate it on test data D_test that is drawn from the same distribution as D_train. In all cases we let |D_train| = |D_test| = 200 items sampled iid. from the density in question. We use the following synthetic datasets that are mixtures of multivariate normal distributions:

Gaussian-1D: A single standard normal distribution in a 1-dimensional space.
Gaussian-5D: A single standard multivariate normal distribution in a 5-dimensional space.
2Gaussians-3D: Two standard multivariate normal distributions in a 3-dimensional space, located at (0, 0, 0) and (3, 3, 3).
2Gaussians-5D: As 2Gaussians-3D, but in a 5-dimensional space.
4Gaussians-2D: Four standard multivariate normal distributions in a 2-dimensional space, located at (0, 0), (0, 5), (5, 0) and (5, 5).
4Gaussians-3D: Four standard multivariate normal distributions in a 3-dimensional space, located at (0, 0, 0), (5, 0, 0), (0, 5, 0) and (0, 0, 5).
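A sketch of how such datasets can be generated (the helper below is ours, not from the paper):

import numpy as np

def gaussian_mixture_sample(means, n, rng):
    """n iid points from an equal-weight mixture of unit-variance Gaussians
    centered at the given mean vectors."""
    means = np.asarray(means, dtype=float)
    component = rng.integers(len(means), size=n)   # mixture component per point
    return means[component] + rng.standard_normal((n, means.shape[1]))

rng = np.random.default_rng(2)
D_train = gaussian_mixture_sample([[0, 0, 0], [3, 3, 3]], 200, rng)  # 2Gaussians-3D
D_test = gaussian_mixture_sample([[0, 0, 0], [3, 3, 3]], 200, rng)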
We only consider fairly low-dimensional data with k ≤ 5. This is because density estimation becomes less meaningful for high-dimensional data due to the curse of dimensionality. We think this is a sufficient study of the problem also from a practical point of view: in crowdsourcing applications we do not expect the workers to use more than a few dimensions when giving similarity judgements.

Also, we assume that computation is noise-free. That is, we obtain consistent and correct answers to each HIT. In practice some quality control mechanism such as (Dawid and Skene 1979; Raykar and Yu 2012) can be used to reduce the effect of erroneous solutions to tasks. We assume this can be taken care of by the crowdsourcing platform, and that it may increase the costs by a small constant factor.
Both methods use |D| HITs to compute a single estimate. We first run TDE2 100 times for every value of α. Then, we compute the average number of HITs (pairwise comparisons of two item pairs) that were needed to find P^α, and allow TDE1 the same number of HITs to compute the embedding. This way we can compare the methods given the same amount of work. We report the amount of work as the number of HITs per data point in D_train.

We measure the quality of the resulting estimates using the mean absolute error, defined as

MAE = ( Σ_{x ∈ D_test} |estimate(x) − p(x)| ) / ( |D_test| · max_x p(x) ),

where p(x) is the true density at x. The estimate and p are both normalised so that Σ_{x ∈ D_test} estimate(x) = Σ_{x ∈ D_test} p(x) = 1. We also normalise the error with max_x p(x) so that it can be understood as a fraction of the largest density value in D_test.
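A sketch of this error measure:

import numpy as np

def mae(estimates, true_density):
    """Mean absolute error over the test set. Both vectors are normalised to
    sum to 1, and the error is scaled by the largest (normalised) true value."""
    e = np.asarray(estimates, dtype=float)
    p = np.asarray(true_density, dtype=float)
    e, p = e / e.sum(), p / p.sum()
    return np.abs(e - p).mean() / p.max()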
Figure 4: Experimental results. Estimation error (MAE) as a function of the number of HITs used to construct the estimator (the x-axis in each panel is #HITs/|D|). TDE1 is shown in red, TDE2 is shown in black. Panels: Gaussian-1D, Gaussian-5D, 2Gaussians-3D, 2Gaussians-5D, 4Gaussians-2D, 4Gaussians-3D.
Results

Results are shown in Figure 4. We show how the error behaves as a function of the number of HITs required per data item to build the estimator. (Here the largest amount of work corresponds to using α = 0.05 with TDE2.) In the plot TDE1 is shown in red, and TDE2 in black. We observe that both methods perform better when they are allowed to use more work, although increasing the amount of work has diminishing returns after some point. TDE1 performs slightly better than TDE2 when only a very small amount of work is used. However, as the number of HITs is increased, TDE2 tends to perform better than TDE1. This is mainly because TDE2, unlike TDE1, makes no assumptions about the dimensionality of the input. Notice that with 4Gaussians-2D the two methods perform almost equally well. Of course, in practice the dimensionality of the input data is unknown, and making an educated guess about it can be difficult.
Conclusions

In this work we presented two methods, TDE1 and TDE2, for density estimation using judgements of relative similarity. TDE1 is an implementation of a standard kernel density estimator with relative distances. TDE2 is a novel method that has some practical advantages over TDE1. In particular, our experiments suggest that TDE2 is more robust when the true dimensionality of the input is unknown.

There are multiple avenues for further research. Both TDE1 and TDE2 currently use |D| HITs to estimate the density at a single point x. It seems likely that this could be improved by optimising the pairs against which x is compared. Another interesting question is whether we could learn estimators like TDE1 without embeddings; that is, can we find pairs of approximately the same length directly, with e.g. the kind of pairwise comparisons used for TDE2? One possible solution to this are partial orders with ties (Ukkonen et al. 2009). Also the theoretical results presented in this paper could be improved; in particular, proving bounds on the estimation error as a function of α would be interesting.
References
Amsterdamer, Y.; Grossman, Y.; Milo, T.; and Senellart, P.
2013. Crowd mining. In SIGMOD Conference, 241–252.
Bragg, J.; Mausam; and Weld, D. S. 2013. Crowdsourcing
multi-label classification for taxonomy creation. In HCOMP
2013.
Chilton, L. B.; Little, G.; Edge, D.; Weld, D. S.; and Landay,
J. A. 2013. Cascade: crowdsourcing taxonomy creation. In
CHI 2013, 1999–2008.
Davis, J. V.; Kulis, B.; Jain, P.; Sra, S.; and Dhillon, I. S.
2007. Information-theoretic metric learning. In ICML 2007.
Dawid, A. P., and Skene, A. M. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics 28(1):20–28.
Floyd, R. W., and Rivest, R. L. 1975. Algorithm 489: the algorithm SELECT for finding the i:th smallest of n elements. Communications of the ACM 18(3):173.
Gomes, R. G.; Welinder, P.; Krause, A.; and Perona, P. 2011.
Crowdclustering. In Advances in Neural Information Processing Systems, 558–566.
Hastie, T.; Tibshirani, R.; and Friedman, J. 2009. The elements of statistical learning, volume 2. Springer.
Heikinheimo, H., and Ukkonen, A. 2013. The crowd-median algorithm. In First AAAI Conference on Human Computation and Crowdsourcing.
Hoare, C. A. R. 1961. Algorithm 65: Find. Communications
of the ACM 4(7):321–322.
Lintott, C. J.; Schawinski, K.; Slosar, A.; Land, K.; Bamford, S.; Thomas, D.; Raddick, M. J.; Nichol, R. C.; Szalay,
A.; Andreescu, D.; et al. 2008. Galaxy Zoo: morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Monthly Notices of the Royal Astronomical Society 389(3):1179–1189.
Liu, E. Y.; Guo, Z.; Zhang, X.; Jojic, V.; and Wang, W. 2012.
Metric learning from relative comparisons by minimizing
squared residual. In ICDM 2012.
Parameswaran, A. G.; Sarma, A. D.; Garcia-Molina, H.;
Polyzotis, N.; and Widom, J. 2011. Human-assisted graph
search: it’s okay to ask questions. PVLDB 4(5):267–278.
Parzen, E. 1962. On estimation of a probability density
function and mode. The Annals of Mathematical Statistics
33(3):1065–1076.
Raykar, V. C., and Yu, S. 2012. Eliminating spammers and
ranking annotators for crowdsourced labeling tasks. Journal
of Machine Learning Research 13:491–518.
Rosenblatt, M. 1956. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical
Statistics 27(3):832–837.
Schultz, M., and Joachims, T. 2003. Learning a distance
metric from relative comparisons. In NIPS 2003.
Tamuz, O.; Liu, C.; Belongie, S.; Shamir, O.; and Kalai, A.
2011. Adaptively learning the crowd kernel. In ICML 2011,
673–680.
Trushkowsky, B.; Kraska, T.; Franklin, M. J.; and Sarkar, P.
2013. Crowdsourced enumeration queries. In ICDE 2013,
673–684.
Ukkonen, A.; Puolamäki, K.; Gionis, A.; and Mannila, H.
2009. A randomized approximation algorithm for computing bucket orders. Inf. Process. Lett. 109(7):356–359.
van der Maaten, L., and Weinberger, K. 2012. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, 1–6. IEEE.
Wilber, M. J.; Kwak, I. S.; and Belongie, S. J. 2014. Cost-effective HITs for relative similarity comparisons. In Second AAAI Conference on Human Computation and Crowdsourcing.
Appendix
Proof of Proposition 2
Proof. We first upper bound q(x, δmax) by letting

q(x, δmax) ≤ q′(x, δmax) = 2 ∫_{δ=0}^{δmax} ∫_{j=x−(3/2)δ}^{x+(3/2)δ} p(u*) p(u* + δ*) dj dδ,

where u* and δ* are the solution to argmax_{u,δ} p(u)p(u + δ) s.t. u ∈ [x − 2δmax, x + δmax] and δ ≤ δmax. Since u* and δ* only depend on δmax and not on δ or j, p(u*)p(u* + δ*) can be taken out of the integrals, and we have

q′(x, δmax) = 2 p(u*) p(u* + δ*) ∫_{δ=0}^{δmax} ∫_{j=x−(3/2)δ}^{x+(3/2)δ} 1 dj dδ.

A straightforward calculation reveals that the integrals simplify to (3/2)δmax², and we obtain

q′(x, δmax) = 3 δmax² p(u*) p(u* + δ*).

Now, as δmax → 0, the interval x ± 2δmax becomes tighter and tighter around x. It follows that u* → x and δ* → 0, and thus p(u*)p(u* + δ*) → p(x)². By combining these observations we find that

lim_{δmax→0} q′(x, δmax) / (3δmax²) = lim_{δmax→0} 3δmax² p(u*)p(u* + δ*) / (3δmax²) = lim_{δmax→0} p(u*)p(u* + δ*) = p(x)².

An upper bound of q(x, δmax) thus approaches p(x)² from above as δmax → 0. We can apply similar reasoning in terms of the pair (u*, u* + δ*) that has the lowest probability in the interval x ± 2δmax. This shows that a lower bound on q(x, δmax) approaches p(x)² from below. By combining these we obtain that q(x, δmax)/(3δmax²) must approach p(x)² as δmax → 0.
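As a numerical sanity check of Proposition 2 (ours, not part of the original paper), one can evaluate Equation 12 by quadrature for the standard normal density and watch √(q(x, δmax)/(3δmax²)) approach p(x); scipy is assumed to be available.

import numpy as np
from scipy import integrate
from scipy.stats import norm

def q(x, d_max):
    """q(x, delta_max) of Equation 12 for p = standard normal density."""
    val, _ = integrate.dblquad(
        lambda j, d: 2.0 * norm.pdf(j - d / 2) * norm.pdf(j + d / 2),
        0.0, d_max,                    # outer variable: pair length delta
        lambda d: x - 1.5 * d,         # inner variable: pair midpoint j
        lambda d: x + 1.5 * d)
    return val

for d_max in (2.0, 1.0, 0.5, 0.1):
    print(d_max, np.sqrt(q(0.0, d_max) / (3 * d_max ** 2)))
# the printed values approach p(0) = 0.3989... as delta_max shrinks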
Proof of Proposition 3
Proof. We present the proof for 2-dimensional data, but the
argument generalises to k-dimensions in a fairly straightforward manner. Fix some point x with density p(x), and consider a random pair (u, v) drawn from p given that d(u, v) is
fixed to some δ ≤ δmax . We sample the pair by first drawing u from p, and then draw v given u so that d(u, v) = δ.
The point x is covered by the pair (u, v) when one of the
following events happens:
Uδ :
Vδ :
196
d(x, u) ≤ δ
δ < d(x, u) < 2δ and d(x, v) ≤ δ.
C(
max
max !0
max )function 4
done
for the
one-dimensional
case,
but
will
show
following
done
for
one-dimensional
but s
will
)the
bethe
defined
as proposition:
in case,
Equation
9, show
and the
let following
C( max ) proposition:
be
Proposition
Let
.q(x,
We
have
some
ofmax
max
max
!
q(x,
) max
Vmax
(x, )) Z E(u, )
max
q(x,
setfunction
V (y, ).
let
p as
(x)
denote
. We
have
some
of) Next,
Z
max
q(x,
s the
q(x,
be
defined
Equation
q(x,
9, maximum
and
) be9,
let
defined
C(letmax
as
) in
be
Equation
nfProposition
4the
Let
Proposition
4We
Let
max )9, and let C( max ) be (12)
=
p(x).
(12)
lim
max
max
.in
have
some
function
of)the
max
= p(x).
lim
q(x,
be
defined
as
in
Equation
and
C(
)
be
4
Let
q(x,
)
be
defined
as
in
Equation
9,
and
let C( max ) be (12)
Proposition
4
Let
=
p(x).
lim
of
for
2-dimensional
data,
but
argument
generalises
max
max
max
s
!0
C(
)
Vn(x,
2 max.),We
that
is,
max
!0
max
C(
) C(onmax
q(x,
max
max
max )
du.
]
=
p(u)
p(v)
dv
Pr[U
The
value
of
the
integrals
no
longer
depends
the
density
p,
only
on
the
length
!0
)
max
have
.
We
have
of
some
function
of
s
max
max
function ofmanner.
p(x).
(12)
have
some
function
of max . =We
y some
straightforward
Fixhave
some point
xlim
with
max . We
q(x,
max )density
V with
(x, ) a function
E(u,of)
C(p(x).
) max
max !0
maxcan
max
of s
the
pair
(u,
v).
We
thus
those
and obtain
=
(12)
lim
s p given
s(13)
q(x,
) replace
ppair
(x)v)proof
=drawnfor
max
p(y).
s
dom
(u,
from
that
d(u,
v)
is
fixed
!0
C(
)
present
the
2-dimensional
data,
but
the
argument
generalises
max
max
We proof
presentfor
the2-dimensional
proof
for 2-dimensional
data,
the
generalises
Proof
We
present
the
data, but
the but
argument
generalises
Proof
= p(x).
(12)argument
lim
y2V (x,2 max
)
q(x,
)q(x,
q(x,
)q(x,
)!0
max
max
max
max
mple the
by first
drawing
uthe
from
p, and
then
draw
max
Since

and
d(x,
u)
2)Cgeneralises
, (p(x).
we
have
d(x,
 2x with
, and
we can thus upper
C(
)Pr[U
max
max
We
present
proof
for
2-dimensional
the
Proof
k-dimensions
indata,
amax
fairly
manner.
Fixpoint
somev)
density
to
density
pargument
(x)
).some
nsions
in pair
a fairly
straightforward
manner.
Fix
some
x but
with
=
(12)
= p(x).
(12)
=
p(x).
lim
limpoint
=
(12)
lim]straightforward
1Fix
k-dimensions
in
a p(x).
fairly
straightforward
manner.
xpoint
with(12)
density
tolim
max
!0
C((u,
We present
the
proof
for
2-dimensional
data,
but
thethat
argument
generalises
Proof
!0
C((u,
)with
=
. The
point
xU
is is
covered
by
pair
v))when
one
maxthe
max
!0
!0
C(
)
C(
)
maxreplacing
max
bility
of
event
the
total
density
inside
V
(x,
)
(so
max
max
bound
Pr[U
]
by
both
p(u)
and
p(v)
with
p
(x).
max
max
k-dimensions
in
a
fairly
straightforward
manner.
Fix
some
point
x
density
to
p(x),
and
consider
a
random
pair
v)
drawn
from
p
given
that
d(u,
v)
isThis
fixed yields
consider
a
random
pair
(u,
v)
drawn
from
p
given
that
d(u,
v)
is
fixed
p(x),
and
consider
a
random
pair
(u,
v)
drawn
from
p
given
that
d(u,
v)
is
fixed
Next,
the
event
Vpoint
. pItgiven
when
u fixed
falls
in u
the
“ring”
around
x,
in
a fairly
straightforward
manner.
Fix
some
xhappens
with
density
to k-dimensions
ppens:
!
We
present
the
proof
data,
but
the
argument
generalises
Proof
ough
to. x)
times
the
probability
thatconsider
vfor
is 2-dimensional
drawn
from
the
p(x),
and
consider
aby
random
pair
(u,
v)
drawn
from
that
d(u,
is
from
We
sample
the
pair
byv)
first
p, and
then draw
to
some
Zdrawing
Zfrom
max .p,

We
sample
the
pair
first
drawing
u
and
then
draw
We
present
the
proof
for
2-dimensional
data,
but
the
argument
generalises
Proof
max
We
present
the
proof
for
2-dimensional
data,
but
the
argument
generalises
Proof

.
We
sample
the
pair
by
first
drawing
u
from
p,
and
then
draw
to
some
max
p(x),
and
a
random
pair
(u,
v)
drawn
from
p)2-dimensional
given
that
v)
fixed
=
V (x,
2=d(u,
(x,isp,
),
and
vxcovered
on
segment
of when
E(u,
) the right side of Figure 5. Like above with U , we
defined
as
the
set
R(x,
esent
the
proof
for
2-dimensional
data,
but
proof
the
argument
for
generalises
data,
but
the
argument
Proof
in
apresent
fairly
straightforward
Fix
some
point
with
density
toconsider
2falls
. We
sample
the
pair
by
first
drawing
u)\V
and
draw
tok-dimensions
some
max
ud(u,
so
that
d(u,
v)
.from
The
point
xthen
is
bythe
thegeneralises
pair (u,
v)
one
vthe
given
E(u,
)We
(so
that
we
have
v)
=manner.
as
adius
, d(u,
i.e.,
the
set
max
Also
see
Antti
Ukkonen
et
al.
δ
du.
Pr[U
]

p
(x)
1
dv
:soto
d(x,
u)

k-dimensions
in
a
fairly
straightforward
manner.
Fix
some
point
x
with
density
k-dimensions
in
a
fairly
straightforward
manner.
Fix
some
point
x
with
density
to
that
v)
=
.
The
point
x
is
covered
by
the
pair
(u,
v)
when
one
upair
sointhat
d(u,
v)
=
The
point
xthen
is covered
by point
the ispair
(u, v)
when one
v given

. uWe
sample
the
first
drawing
u.).
p,
and
draw
max
We
have
that
intersects
with
Vpoint
(x,
onstoinsome
a fairly
straightforward
k-dimensions
manner.
aThe
fairly
some
straightforward
xfrom
with
density
manner.
Fix
with
density
toso
that
d(u,
v)
=
.by
point
xv)
is drawn
covered
by
thepthus
pair
(u,
v) some
when
one
v given
ofFix
the
following
events
happens:
p(x),
and
consider
a (u,
random
pair
(u,
from
given
that
d(u,
Vv)
(x, )xfixed
E(u,
) v) upper
p(x),
and
consider
a
random
pair
v)
drawn
from
p
given
that
d(u,
v)
is
fixed
p(x),
and
consider
a
random
pair
(u,
v)
drawn
from
p
given
that
d(u,
is
fixed
bound
Pr[Vδ ] by replacing p(u) and p(v) both with
owing
events
happens:
:nsider
d(x,
u)
2that
and
d(x,

u<
so
d(u,
v)v)
=
. .The
point
x
is
covered
byZthe
pair
(u,
v) Zwhen
one
v<given
!
of
the
following
events
happens:
of
the
following
events
happens:
!
a
random
pair
p(x),
(u,
v)
and
drawn
consider
from
a
p
random
given
that
pair
d(u,
(u,
v)
v)
drawn
is
fixed
from
p
given
that
d(u,
v)
is
fixed
 maxthe
. We to
sample
the
pair
by
first
uby
from
p,drawing
and then
draw
to max
some
Z atfollowing
Z sample

We
by
first
drawing
u from
p,drawing
and
then
draw
to of
some

. We
sample
the
pair
u from
and density
thenδdraw
some
max
the
happens:
:integrals
d(x,
u) 
U
entered
point
y.events
with
radius .pair
That
is,
the
set
The
value
of
the
nofirst
longer
depends
onp,then
the
only This
on theislength
p maxp,(x).
allowed as d(x, u) ≤ 2δ and d(x, v) ≤ δ.
We
sample
the
pair
by
first
dv
drawing
sample
uvalues
from
p,
and
then
by
first
draw
u(u,from
p,
to
some
max
umax
that
d(u,
v)
=
point
x v)
ispair
covered
by drawing
the
pair
v)
one
vof
given
in. terms
U: and
Vso
by
over
all
du.
Pr[V
]by
=
p(u)
p(v)
dvand
u so
d(u,
v)
=
.integrating
The
x. We
is.u)
covered
the
pair
(u,
v)
when
one
vmax
uThe
so
that
d(u,
=
.
The
point
x
is
covered
bywhen
the
pair
(u, v)draw
when one
vthe
d(x,
u)
 denote
Uthat
du.
])given
=
p(u)
p(v)
Uax(y,
: given
d(x,

Upoint
: the
d(x,
u)
Uthe
V
).
Next,
let
p
(x)
maximum
V
:
<
d(x,
u)
<
2
and
d(x,
v)

.
of
pair
(u,
v).
We
can
thus
replace
those
with
a
function
of and obtain
R(x,
)
E(u,
)\V
(x,
)
We
obtain
that
v)
=
.
The
point
u
x
so
is
covered
that
d(u,
by
v)
the
=
pair
.
The
(u,
point
v)
when
x
is
one
covered
by
the
pair
(u,
v)
when
one
v
given
:
d(x,
u)

U
of thed(u,
following
events
happens:
V
(x,
)
E(u,
)
of
the
following
events
happens:
of
the
following
events
happens:
Z maxV :
: <
d(x,
v)u)
 <. 2 and d(x, v)  .
< d(x, u) < 2V and
d(x,
v)u)<
, that
is,
V .2 : and<d(x,
d(x,
2
!
ing
events
of Pr[V
the: following
happens:
V
< .d(x,
u)events
<Now
2weand
upper
. bound
we
can
write
q(x,
terms Pr[U
of U ]and
by(x)
integrating
all with
values
 pV max
C1 ( p(v)
). over
max ) in
U d(x,
, wev)
upper
both
with
Z
Z
) = ,happens:
(Pr[U
d
max
(x,
u)
we
have] +Ud(x,
v)])Like

can
: d(x,
u) 2above
u) Pr[V ] by replacing p(u) and
U : d(x,
: d(x,
u)
ofthus
U, and
max max ) of
Now
we
can
write
q(x,
in
terms
U
and
V
by
integrating
over
all
values
0
up
to
:
max
max
an=write
q(x,
)
in
terms
of
U
and
V
by
integrating
over
all
values
p
(x).
This
is
allowed
as
d(x,
u)

2
and
d(x,
v)

.
We
obtain
x)
max
p(y).
(13)
max
Now
we
can
write
q(x,
)
in
terms
of
U
and
V
by
integrating
over
all
values
Z
2
δ
max
cing
both
p(u)
and
p(v)
with
p
(x).
This
yields
max
Now
write
q(x,
in terms
VNext,
integrating
overu)event
all values
V
: max
u) <
2WeUand
d(x,
v)by
u)
:max
d(x,
u)
:that
d(x,

Ucan
Uand
consider
the
It happens
u falls Pr[V
in theδ“ring”
: d(x,
< d(x,
and. d(x,
v)  . when
y2Vwe
)up
to
: <) d(x,
of
(x)x,
1 dv du.
] ≤ p around
max
armax
argument
as
the
proof
of
Proposition
2.: of
!
V!
<show
d(x,
u)
<
2.V
and
< .2ZmaxV
: up(x,2
Zmaxv)) =
up to
(Pr[U
] + Pr[V ),]) and
d . v falls on the segment
q(x,
max : Z defined
to max
ofand
Z :
ZZof
max
) d(x,
=
V (x,
of E(u, ) R(x,δ)
as
the
set
Z
pper
lower
bounded
so
that
these
upper
and
lower
ZNow
E(u,δ)∩V (x,δ)
V write
Vthat
:2total
<q(x,
d(x,max
u)
<
2terms
and
d(x,
v)
: . by
<integrating
d(x,
u)
<2inover
2R(x,
and
2V )\V
. by (x,
max
0 U v)
max
max
Now
we
can
)
in
of
U
and
V
all
values
max
we
can
write
q(x,
)
terms
of
and
integrating
over
all
values
=
(Pr[U
]
+
Pr[V
])
d
.
q(x,
max
max
ent
U
is
the
density
inside
V
(x,
)
(so
max
Pr[V
]

p
(x)
1
dv
du.
du.
 p2 asto(x)
1 dv
).] as
E(u,
) \])Vdall
(x,values
) We have thus
that
with
V(Pr[U
(x,
Now
write
) Pr[V
in
terms
of
U
V R(x,
by
integrating
over
!we
0 for
some
C( )q(x,
).
)can
=
(Pr[U
] up
+
])
dintersects
. max
q(x,
x])p(x)
max
max
=
(Pr[U
]+
Pr[V
])
.=
+ the
Pr[V
q(x,
0
max
The
ond)aand
similar
argument
proof
of upthe
toproof
ofmax
max : max q(x,
)
E(u,
)\V
(x, of
) . Proposition 2. We show that
max
times
that
vcan
drawn
from
the:relies
(x, terms
) ] and
E(u,
)isZ].and
Zand
! lower
0 V
0we
write
q(x, probability
write
q(x,
in
of
by )integrating
)the
in be
terms
over
all
integrating
overthese
all values
upper
bounds
for)Vup
Pr[U
Let
Vargument
(y,
denote
0 Uvalues
max
max
max
toNow
:Pr[V
of
max
ZVlower
Zso that
max
The
proof
relies
onU
a similar
proof
ofof
Proposition
2. by
Webounded
show that
)ascan
both
upper
and
upper
and
q(x,
As
above
with Uδ , the value of the integrals only depends on
max
Z
d(u,
v)
=can
as
required).
That
that
we
have
)to
=
]proof
+
Pr[V
]) d
. 2.of
q(x,
)2.
=
]of
+depends
Pr[V
]) d on
. C( 2.and
q(x,
max
2show
As
above
with
,both
the
value
the
integrals
only
p. that
We get
proof
relies
on
aof
argument
as
the
of
Proposition
We
that
drelies
at
point
with
radius
.max
Also,
let
E(y,
)and
denote
the
:The
up
: (Pr[U
on
aysimilar
argument
as
the
proof
of
Proposition
We
show
that
als
no
longer
depends
on
the
density
p,
only
on
the
length
max
max
The
proof
relies
onU
ais,
similar
argument
asthese
the
proof
We
show
) Zsimilar
be
both
upper
lower
bounded
so
that
upper
and
lower
q(x,
bounds
go
to
C(max
)p(x)
as(Pr[U
0Proposition
for
some
). not
max
max
max !
max
du. p. We get the upper bound
Pr[V
]
=
p(u)
p(v)
dv
Z
0
0
δ
and
not
!
(Pr[U
]
+
Pr[V
])
d
.
q(x,
2) =
max
)
can
be
both
upper
lower
bounded
so
that
these
upper
and
lower
q(x,
max theand
max
max
upper
bound
Z
bounds
both
to C(
)p(x)be
asboth
!
0tofor
somelower
C( max
).R(x, for
can
be
both
upper
andgo
lower
bounded
so
that
upper
and
lower
We
proceed
derive
upper
bounds
] and
Pr[V
]. Let
) denote
) soPr[U
E(u,
)\V
(x,
) V (y,lower
can
thus
replace
with
amax
of
and
obtain
max
max these
) as
can
upper
and
bounded
that
these
upper
q(x,
2function
The
proof
relies
on)to
athose
similar
proof
Proposition
2.(Pr[U
We show
that
0
2
proof
on
a=similar
argument
as
the
proof
of. Proposition
2. Weand
show that
=
(Pr[U
] upper
+ the
Pr[V
])for
dof
. for
)disc
]max
+
Pr[V
])
ddenote
q(x,
q(x,
bounds
both
go
C(2 maxargument
)p(x)
asThe
!
0relies
C(centered
).
max
max
max
Pr[V
]max
 ppoint
(x)
C
).C( max
We
proceed
to
derive
bounds
Pr[U
]2and
Pr[V
].!
Let
Vwith
(y,
2)(
the
area
ofsome
amax
at
yfor
radius
. Also,
let E(y, ) denote the
th
go max
to )C(
)p(x)
asbounds
!
0 max
for
some
C(
).
du.
p(u)
p(v)
dvupper
max
max
both
go
to
C(
)p(x)
as
0bounded
some
).
max
can
be
both
and
lower
bounded
so
that
these
upper
and
lower
q(x,
max
02 C
0proof
)Pr[U
can
be
both
upper
and
lower
so
that
these
upper and p(u)
lower and p(v) both with
q(x,
max
The
proof
relies
on
a
similar
argument
as
the
of
Proposition
2.
We
show
that
U
,
we
upper
bound
Pr[V
]
by
replacing
Like
above
with
Figure
5:
For
the
proof
of
Proposition
3.
Left:
Example
of
We
proceed
to
derive
upper
bounds
for
]
and
Pr[V
].
Let
V
(y,
)
denote
Pr[U
]

p
(x)
(
).
1
Pr[Vδ ] ≤ pδmax (x)2 C2 (δ).
Next, consider the event $V_\delta$. It happens when $u$ falls in the "ring" around $x$, defined as the set $R(x, \delta) = V(x, 2\delta) \setminus V(x, \delta)$, and $v$ falls on the segment of $E(u, \delta)$ that intersects with $V(x, \delta)$. We have thus
\[
\Pr[V_\delta] = \int_{R(x,\delta)} p(u) \left( \int_{E(u,\delta) \cap V(x,\delta)} p(v) \, dv \right) du.
\]
Also see the right side of Figure 5. Like above with $U_\delta$, we upper bound $\Pr[V_\delta]$ by replacing $p(u)$ and $p(v)$ both with $p^{\delta_{\max}}(x)$. This is allowed as $d(x, u) \le 2\delta$ and $d(x, v) \le \delta$. We obtain
\[
\Pr[V_\delta] \le p^{\delta_{\max}}(x)^2 \int_{R(x,\delta)} \left( \int_{E(u,\delta) \cap V(x,\delta)} 1 \, dv \right) du.
\]
As above with $U_\delta$, the value of the integrals only depends on $\delta$ and not on $p$. We get the upper bound
\[
\Pr[V_\delta] \le p^{\delta_{\max}}(x)^2 \, C_2(\delta).
\]
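Similarly, a crude explicit bound for $C_2(\delta)$ in the $k = 2$ case (again our own aside): the ring $R(x, \delta)$ has area $\pi (2\delta)^2 - \pi \delta^2 = 3\pi\delta^2$, and the admissible segment of $E(u, \delta)$ is at most the full circle of length $2\pi\delta$, so
\[
C_2(\delta) \le 3\pi\delta^2 \cdot 2\pi\delta = 6\pi^2 \delta^3 .
\]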
Next we combine these results to obtain a simple upper bound for $q(x, \delta_{\max})$:
\begin{align*}
q(x, \delta_{\max}) &= \int_0^{\delta_{\max}} \left( \Pr[U_\delta] + \Pr[V_\delta] \right) d\delta \\
&< \int_0^{\delta_{\max}} \left( p^{\delta_{\max}}(x)^2 \, C_1(\delta) + p^{\delta_{\max}}(x)^2 \, C_2(\delta) \right) d\delta \\
&= p^{\delta_{\max}}(x)^2 \int_0^{\delta_{\max}} \left( C_1(\delta) + C_2(\delta) \right) d\delta \\
&= p^{\delta_{\max}}(x)^2 \, C(\delta_{\max}).
\end{align*}
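Combining the two illustrative $k = 2$ constants from above (ours, not the paper's) shows concretely that $C(\delta_{\max})$ is a function of $\delta_{\max}$ alone:
\[
C(\delta_{\max}) = \int_0^{\delta_{\max}} \left( C_1(\delta) + C_2(\delta) \right) d\delta \le \int_0^{\delta_{\max}} 8\pi^2 \delta^3 \, d\delta = 2\pi^2 \delta_{\max}^4 .
\]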
The $C(\delta_{\max})$ that appears above is the one we use in the claim (Eq. 14) of the proposition. Now, we can replace $q(x, \delta_{\max})$ with this upper bound in Equation 14, and observe that
\[
\lim_{\delta_{\max} \to 0} \frac{p^{\delta_{\max}}(x)^2 \, C(\delta_{\max})}{C(\delta_{\max})} = \lim_{\delta_{\max} \to 0} p^{\delta_{\max}}(x)^2 = p(x)^2 .
\]
Here the second equality simply follows from the definition of $p^{\delta_{\max}}(x)$ as given in Equation 15 above. As $\delta_{\max}$ goes to zero, the area from which we take $p^{\delta_{\max}}(x)$ shrinks until only $x$ is left. The upper bound of $q(x, \delta_{\max})/C(\delta_{\max})$ thus approaches $p(x)^2$ from above as $\delta_{\max} \to 0$. We can do the same reasoning in terms of a lower bound of $q(x, \delta_{\max})$, by replacing the $\max$ with a $\min$ in Equation 15, i.e.,
\[
p_{\delta_{\max}}(x) = \min_{y \in V(x, 2\delta_{\max})} p(y).
\]
This lower bound will approach $p(x)^2$ from below as $\delta_{\max} \to 0$ in the same manner.
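As a quick numerical illustration of this squeeze (our own sketch, not part of the paper): for a concrete density $p$, both the upper bound $p^{\delta_{\max}}(x)^2$ and the lower bound $p_{\delta_{\max}}(x)^2$ close in on $p(x)^2$ as $\delta_{\max}$ shrinks. The helper names and the grid-based extrema search below are our own choices, and we assume a standard bivariate Gaussian as $p$.

```python
import numpy as np

def gauss2d(y):
    """Standard bivariate normal density, evaluated row-wise; y has shape (2,) or (n, 2)."""
    y = np.atleast_2d(y)
    return np.exp(-0.5 * np.sum(y ** 2, axis=1)) / (2.0 * np.pi)

def density_extrema_on_disc(x, radius, n_grid=400):
    """Approximate max and min of the density over the disc V(x, radius)
    by evaluating it on a regular grid restricted to the disc (hypothetical helper)."""
    g = np.linspace(-radius, radius, n_grid)
    dx, dy = np.meshgrid(g, g)
    inside = dx ** 2 + dy ** 2 <= radius ** 2
    pts = np.stack([x[0] + dx[inside], x[1] + dy[inside]], axis=1)
    vals = gauss2d(pts)
    return vals.max(), vals.min()

x = np.array([0.5, -0.3])
p_x_sq = gauss2d(x)[0] ** 2  # p(x)^2, the common limit of both bounds

# The upper bound decreases to p(x)^2 and the lower bound increases to it.
for d_max in [0.5, 0.1, 0.02, 0.004]:
    p_hi, p_lo = density_extrema_on_disc(x, 2.0 * d_max)
    print(f"d_max={d_max:<6} upper={p_hi ** 2:.6f} lower={p_lo ** 2:.6f} p(x)^2={p_x_sq:.6f}")
```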
Finally, notice that the argument used in this proof, while described for $k = 2$, generalises to arbitrary $k$. We only redefine the set $V(y, \delta)$ as the volume of a $k$-dimensional sphere with radius $\delta$, and $E(y, \delta)$ as the $(k-1)$-dimensional surface of $V(y, \delta)$. It follows that we can always derive a $C(\delta_{\max})$ that is independent of $p(x)$, which is enough to complete the proof.
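For completeness, and assuming only the standard volume and surface formulas for Euclidean balls (not anything stated in the paper), the general-$k$ quantities are
\[
|V(y, \delta)| = \frac{\pi^{k/2}}{\Gamma(k/2 + 1)} \, \delta^{k}, \qquad
|E(y, \delta)| = \frac{k \, \pi^{k/2}}{\Gamma(k/2 + 1)} \, \delta^{k-1},
\]
so the analogue of $C_1(\delta)$ scales as $\delta^{2k-1}$, and $C(\delta_{\max})$ as $\delta_{\max}^{2k}$, again independent of $p$.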