Proceedings of the Third AAAI Conference on Human Computation and Crowdsourcing (HCOMP-15)

Crowdsourced Nonparametric Density Estimation Using Relative Distances

Antti Ukkonen (Finnish Institute of Occupational Health, Helsinki, Finland, antti.ukkonen@ttl.fi)
Behrouz Derakhshan (Rovio Entertainment, Espoo, Finland, behrouz.derakhshan@gmail.com)
Hannes Heikinheimo (Reaktor, Helsinki, Finland, hannes.heikinheimo@reaktor.com)

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

In this paper we address the following density estimation problem: given a number of relative similarity judgements over a set of items D, assign a density value p(x) to each item x ∈ D. Our work is motivated by human computing applications where density can be interpreted e.g. as a measure of the rarity of an item. While humans are excellent at solving a range of different visual tasks, assessing the absolute similarity (or distance) of two items (e.g. photographs) is difficult. Relative judgements of similarity, such as "A is more similar to B than to C", are on the other hand substantially easier to elicit from people. We provide two novel methods for density estimation that only use relative expressions of similarity. We give both theoretical justifications and empirical evidence that the proposed methods produce good estimates.

Introduction

A common application of crowdsourcing is to collect training labels for machine learning algorithms. However, human computation can also be employed to solve computational problems directly (Amsterdamer et al. 2013; Chilton et al. 2013; Trushkowsky et al. 2013; Parameswaran et al. 2011; Bragg, Mausam, and Weld 2013). Some of this work is concerned with solving machine learning problems with the crowd (Gomes et al. 2011; Tamuz et al. 2011; van der Maaten and Weinberger 2012; Heikinheimo and Ukkonen 2013). In this paper we focus on the problem of nonparametric density estimation given a finite sample from an underlying distribution. This is a fundamental problem in statistics and machine learning with many applications, such as classification, regression, clustering, and outlier detection.

Density can be understood simply as "the number of data points that are close to a given data point". Any item in a high density region should thus be very similar to a fairly large number of other items. We argue that in the context of crowdsourcing, density can be viewed for instance as a measure of "commonality" of the items being studied. That is, all items in a high density region can be thought of as having a large number of common aspects or features. Likewise, items from low density regions are outliers or in some sense unusual.

To give an example, consider a collection of photographs of galaxies (see e.g. (Lintott et al. 2008)). It is reasonable to assume that some types of galaxies are fairly common, while others are relatively rare. Suppose our task is to find the rare galaxies. However, in the absence of prior knowledge this can be tricky. One approach is to let workers label each galaxy as either common or rare according to the workers' expertise. We argue that this has two drawbacks. First, evaluating commonality in absolute terms in a consistent manner may be difficult. Second, the background knowledge of the workers may be inconsistent with the data distribution: perhaps a galaxy that would be considered rare under general circumstances is extremely common in the given data. We argue that in such circumstances density estimation may be a more reliable method for identifying the rare galaxies. A galaxy should be considered rare if there are very few (or no) other galaxies in the studied data that are similar to it.

The textbook approach to nonparametric density estimation is kernel density estimation (Hastie, Tibshirani, and Friedman 2009, p. 208ff). These methods usually consider the absolute distances between data points. That is, the distance between items A and B is given on some (possibly arbitrary) scale. Absolute distances between data points are also used in other elementary machine learning techniques, such as hierarchical clustering or dimensionality reduction. In the context of human computation, however, it is considerably easier to obtain information about relative distances. For example, statements of the form "the distance between items A and B is shorter than the distance between items C and D" are substantially easier to elicit from people than absolute distances. Moreover, such statements can be collected in an efficient manner via crowdsourcing using appropriately formulated HITs (Wilber, Kwak, and Belongie 2014). It is thus interesting to study the expressive power of relative distances, and to ask whether absolute distances are even needed to solve some problems.

A common application of relative distances is algorithms for computing low-dimensional embeddings (representations of the data in $\mathbb{R}^m$), either directly (van der Maaten and Weinberger 2012) or via semi-supervised metric learning (Schultz and Joachims 2003; Davis et al. 2007; Liu et al. 2012) if features are available. While embeddings can be used for various purposes, including density estimation, we are more interested in understanding what can be done using relative distances directly, without first computing a representation of the items in some (arbitrary) feature space. This is because despite embeddings being powerful, they can be tricky to use in practical situations, e.g. if the true dimensionality of the data is unknown. Moreover, an embedding may have to be re-computed whenever new data is added to the model. We hope to avoid this by operating directly with relative distance judgements.

Our contributions

We consider two approaches for density estimation using relative distances. In particular, we provide the following contributions:

1. We describe how a standard kernel density estimator can be expressed using relative distance judgements, and describe how embeddings to $\mathbb{R}^m$ can be used to construct this estimator.

2. We propose a second estimator that is explicitly designed for human computation. The main upside of this estimator is that it is easier to learn and use. While the estimator is biased, the amount of bias can be reduced at the cost of using more work to build the estimator.

3. We conduct experiments on simulated data to compare the two methods, and show that the second estimator is in general easier to learn than the first one, and that it produces better estimates.

High-level overview of the proposed methods

We propose two methods for density estimation using relative distances. The first is a simple variant of a standard kernel density estimator, while the other is a novel method that has some practical advantages over the first one. We call our estimators "triplet density estimators", because the relative distance judgements they use are defined on sets of three items. Our input is a dataset D, assumed to have been sampled iid. from some unknown density p, and we want to estimate the density p(x) at an arbitrary point x. Both of the proposed methods are based on the observation that the density at a given point x is proportional to the number of close neighbours of x that belong to D. If there are many other points in D in close proximity to x, we say that x is from a high-density region. The question is thus how to identify and count the points that are close to x without measuring absolute distances.

The basic idea of the first method, called TDE1, is illustrated in the left panel of Figure 1. Suppose we have placed a circle with a short radius at every observed data point $u_i \in D$. Given x, we can count how many of these circles cover x, i.e., the number of $u_i$ that are close to x. Observe that to do this we only need to know whether the distance between x and some $u_i$ is shorter or longer than the radius of the circle around $u_i$. This is a problem of comparing two distances, and thus does not require knowing the absolute distances. We can thus pair every $u_i \in D$ with a reference point $u_i' \in D$ so that the distance between $u_i$ and $u_i'$ equals the radius of the circle. However, as the circles at every $u_i$ must have at least roughly the same radius, finding an appropriate reference point for every $u_i$ can be difficult. We also need a means to find the reference points without access to absolute distances. This problem can be solved to some degree using a low-dimensional embedding, but it adds to the complexity of the method.

The second method we propose, called TDE2, is simpler in this regard. Rather than having to find a reference point for every $u_i \in D$, it only uses a random set of pairs of points from D. The only constraint is that the pairs should be "short". We say a pair is short if it is short in comparison to all possible pairs of items from D. The intuition is that short pairs occur more frequently in high density regions of the density function p. The second method thus counts the number of short pairs that are close to x. Here the notion of "closeness" also depends on the length of the pair. This is illustrated in the right panel of Figure 1. The pairs are shown in red, with gray circles around their endpoints. The radius of the circles is equal to the length of the respective pair. The pair is considered to be close to x if x is covered by at least one of the circles adjacent to the pair. Very short pairs must be very close to x in order to cover it, while longer pairs also cover points that are further away.

It is easy to see that TDE1 is essentially a standard kernel density estimator that uses the uniform kernel. The TDE2 estimator is designed for relative distances from the ground up and thus has some advantages over TDE1. In particular, TDE2 is simpler to construct.

Figure 1: Left: Example of the TDE1 estimator we propose. Right: Example of the TDE2 estimator we propose.
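Both estimators interact with the data only through relative judgements of this kind. As a minimal sketch of that interface, the Python snippet below models the two judgement types as callables; simulating the answers with a known Euclidean metric is purely for testing, since the point of the methods is that no absolute distances are available from workers. The names and the `points` argument are our illustrative choices, not part of the paper.

```python
import numpy as np

def make_oracles(points):
    """Simulate the two relative-distance HITs used by TDE1 and TDE2.

    In a real deployment each call would be a crowdsourced task; here
    the answers are derived from known coordinates, for testing only.
    `points` maps item ids to coordinate vectors (our assumption).
    """
    def d(a, b):
        return float(np.linalg.norm(np.asarray(points[a]) - np.asarray(points[b])))

    def closer(x, u, v):
        # TDE1 task: "Of the items x and v, which is more similar to u?"
        # True iff d(u, x) < d(u, v).
        return d(u, x) < d(u, v)

    def covers(x, u, v):
        # TDE2 task: "Which of the three items x, u, v is an outlier?"
        # True iff x is NOT the outlier, i.e. the pair (u, v) covers x.
        return min(d(x, u), d(x, v)) <= d(u, v)

    return closer, covers
```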
TDE1: A simple triplet density estimator

Our first method can be viewed as a standard kernel density estimator that does not use absolute distances. First, we recall the basic definition of a kernel density estimator as originally proposed (Rosenblatt 1956; Parzen 1962), and as usually given in textbooks, e.g. (Hastie, Tibshirani, and Friedman 2009, p. 208ff). Then, we describe how a simple instance of this can be implemented using only relative distances.

Basics of kernel density estimation

Let $D = \{u_1, \ldots, u_n\}$ denote a data set of size n drawn from a distribution having the density function p. For the moment, assume that D consists of points in $\mathbb{R}^m$. Given D, the kernel density estimate (KDE) of p at any point x is given by

$$\mathrm{KDE}_n(x) = \frac{1}{nh^m} \sum_{i=1}^{n} K\Big(\frac{1}{h}(x - u_i)\Big), \qquad (1)$$

where K is the kernel function and h is the bandwidth parameter. Fairly simple conditions on the kernel function K and the bandwidth h are sufficient to guarantee that the kernel density estimate is an unbiased and consistent estimate of p. That is, as the size of D increases, we have

$$\lim_{n \to \infty} \mathbb{E}_{D \sim p}[\mathrm{KDE}_n(x)] = p(x), \qquad (2)$$

where the expectation is taken over random samples from p of size n. Most importantly, the kernel function K must satisfy

$$K(y) \ge 0 \;\;\forall y \quad \text{and} \quad \int_{\mathbb{R}^m} K(y)\, dy = 1. \qquad (3)$$

In this paper we consider radial-symmetric kernels. In addition to Equation 3 these satisfy $K(y) = K(y') \Leftrightarrow \|y\| = \|y'\|$. That is, the kernel is a function of only the length of its input vector. Also, we want K(y) to be non-increasing in $\|y\|$. Now, assume that there exists a distance function d between the items in D. For the moment, let $d(x, y) = \|x - y\|$. We can rewrite Equation 1 as follows:

$$\mathrm{KDE}_n(x) = \frac{1}{nh^m} \sum_{i=1}^{n} K\Big(\frac{d(x, u_i)}{h}\Big). \qquad (4)$$

The bandwidth h is a parameter that must be specified by the user. There are a number of methods for choosing h in an appropriate manner. In practice the bandwidth h should decrease as n increases; in fact, we must have $\lim_{n \to \infty} h(n) = 0$ for Equation 2 to hold. In the experiments we use cross validation to select an appropriate value of h when needed.

Normalised densities and relative distances

Before continuing, we point out a property of density functions and relative distances. Recall that the integral of a density function over its entire domain must by definition be equal to 1. This means that the value p(x) is not scale invariant: the precise value of p(x) depends on the "unit" we use to measure distance. If we measure distance in different units (concretely, suppose we use feet instead of meters), the absolute value of p(x) must change accordingly as well. This is reflected in Equation 1 through the bandwidth parameter h, which must have the same unit as the distance d(u, v). However, when dealing with relative distances there is no distance unit at all. We only know that some distance is shorter (or longer) than some other distance; the unit in which the distances are measured is in fact irrelevant.

An important consequence of this is that it is not possible to devise a normalised density estimator based on relative distances. In other words, Equation 2 cannot hold with equality. The best we can hope for is an estimator that is proportional to p(x). Later we show how to derive a normalised estimator for one-dimensional data, but it makes use of absolute distances in the normalisation constant, much in the same way that h appears in the normalisation constant of Equation 1. In practice we must thus consider non-normalised estimates of p(x) when using relative distances.

Uniform kernel with relative distances

The simplest kernel function K that satisfies the requirements given above is the uniform kernel. It takes some constant positive value if x is at most at distance h from the data point $u_i \in D$ at which the kernel is located, and is otherwise equal to zero. Formally, the m-dimensional uniform kernel is defined as

$$K^{\mathrm{unif}}(y) = \begin{cases} V(m)^{-1} & : \|y\| \le 1, \\ 0 & : \|y\| > 1, \end{cases} \qquad (5)$$

where V(m) is the volume of the m-dimensional unit sphere, defined as $V(m) = \pi^{m/2} / \Gamma(\frac{m}{2} + 1)$, where $\Gamma$ is the Gamma function. In this case the density estimate at x is simply proportional to the number of data points that are at most at a distance of h from x. We say the point u covers x with bandwidth h whenever $d(x, u)/h \le 1$. This can be seen as a straightforward generalisation of a histogram that has unspecified bin boundaries. In Figure 1 (left) we can say that the density at x is proportional to the number of "discs" that cover x. It is easy to see that as n increases and h goes to zero (i.e., there are more points covered by smaller and smaller discs), this quantity will indeed approach the underlying density p.

We continue by designing a triplet density estimator that uses the uniform kernel $K^{\mathrm{unif}}$. To compute $K^{\mathrm{unif}}$, we must only determine whether a data point $u_i$ is near enough to x. However, we cannot evaluate d(u, v) explicitly for any $u, v \in D$. Instead, we only have access to an "oracle" that can give statements about the relative distance between items. That is, we can issue tasks such as "Is the distance d(u, x) shorter than the distance d(u, v)?" Or, in more natural terms: "Of the items x and v, which is more similar to u?"

We can use this task to compute the uniform kernel as follows. Suppose that for every $u_i \in D$ we have found some other $u_i' \in D$ so that $d(u_i, u_i')$ is equal to h. Denote this pairing by $\mathcal{P}(h)$. Note that several $u \in D$ can be paired with the same $u'$; $\mathcal{P}(h)$ is not required to be a proper matching. Given $\mathcal{P}(h)$, by the definition of the uniform kernel in Equation 5, we have

$$K^{\mathrm{unif}}\Big(\frac{d(x, u_i)}{h}\Big) = V(m)^{-1}\, \mathbb{I}\{d(u_i, x) \le d(u_i, u_i')\}. \qquad (6)$$

This, together with Equation 4, gives the triplet density estimator

$$\mathrm{TDE1}_n(x) = \frac{1}{V(m)\, n h^m} \sum_{(u, u') \in \mathcal{P}(h)} \mathbb{I}\{d(u, x) < d(u, u')\}. \qquad (7)$$

This simply counts the number of $u \in D$ that cover x with the specified bandwidth h. As we only need an oracle that can provide relative distance judgements to evaluate the indicator function above, Equation 7 provides a kernel density estimator that is suitable for human computation. Moreover, all known results that apply to Equation 1 (e.g. the one of Equation 2) directly carry over to Equation 7.
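Given a pairing $\mathcal{P}(h)$ and the comparison oracle, the sum in Equation 7 reduces to a plain counting loop. A minimal sketch, with the normalisation constant dropped as discussed above (the names `pairing` and `closer` are our illustrative choices):

```python
def tde1(x, pairing, closer):
    """Unnormalised TDE1 estimate at item x (Equation 7).

    pairing: iterable of (u, u_prime) with d(u, u_prime) approximately h.
    closer(x, u, v): oracle for "is d(u, x) shorter than d(u, v)?".
    The factor 1 / (V(m) n h^m) is omitted because m and h are
    unknown when only relative judgements are available.
    """
    return sum(1 for (u, u_prime) in pairing if closer(x, u, u_prime))
```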
There are some issues with this result, however. As discussed above, in practice we cannot evaluate the normalisation constant: both m and the exact value of h are unknown if our input consists only of relative distance comparisons. More importantly, it is unlikely that for a given D we would find a pairing $\mathcal{P}(h)$ where the distances between all pairs are exactly equal to h. It is more realistic to assume that we can find a pairing $\tilde{\mathcal{P}}(h)$ where this holds approximately. That is, for (almost) every $(u, u') \in \tilde{\mathcal{P}}(h)$, we have $d(u, u') \approx h$. This implies that results about unbiasedness and convergence that hold for Equation 1 are no longer obviously true. While variable bandwidth density estimators have been proposed in the literature, our problem differs from those in the sense that we do not try to select an optimal local bandwidth around each $u \in D$; rather, the bandwidth at u randomly varies around h. In general it is nontrivial even to quantify this variance, or to show how it affects the resulting estimator. This raises one important question, however: if we are not able to evaluate the absolute value of d(u, v), how can we find pairs that are (at least approximately) of length h?

Using an embedding to find $\tilde{\mathcal{P}}(h)$

As mentioned in the Introduction, by embedding the items to a low-dimensional Euclidean space we can trivially estimate density using existing techniques. Such embeddings can be computed from relative distances, e.g. using the algorithm described in (van der Maaten and Weinberger 2012). This has the downside that to estimate density at a new point x, we in general must recompute the embedding to see where x is located. Using the technique described above, however, it is possible to estimate density directly, provided we have the set of pairs $\mathcal{P}(h)$. Therefore, we use the embedding only for constructing the set $\tilde{\mathcal{P}}(h)$, as follows:

1. Embed all items in D to $\mathbb{R}^m$ using some suitable algorithm. We use the method proposed in (van der Maaten and Weinberger 2012). This results in a set of points X. Note that the dimensionality m of the embedding is a parameter that must be specified by the user.

2. Use any existing technique for bandwidth selection to find a good h given the points $X \subset \mathbb{R}^m$. We use cross validation in the experiments later.

3. For every item $u \in X$, find the item $v \in X$ so that the distance between u and v is as close to h as possible (see the sketch after this list).

This yields the set $\tilde{\mathcal{P}}(h)$. The pairs (u, v) in $\tilde{\mathcal{P}}(h)$ all have approximately length h in the embedding. Of course this length is not the same as the unknown distance d(u, v), but we can assume that d(u, v) is approximately the same for every $(u, v) \in \tilde{\mathcal{P}}(h)$.
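Step 3 above reduces to a nearest-to-h search in the embedded space. A minimal sketch of this step, assuming the embedding is available as an (n, m) NumPy array:

```python
import numpy as np

def pairing_from_embedding(X, h):
    """Pair every embedded item with the partner whose embedded
    distance is closest to the bandwidth h (step 3 above).

    X: (n, m) array of embedded coordinates, e.g. from t-STE.
    Returns index pairs forming the (approximate) set P~(h).
    """
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)        # never pair an item with itself
    partner = np.argmin(np.abs(dist - h), axis=1)
    return [(u, int(partner[u])) for u in range(len(X))]
```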
TDE2: A second triplet density estimator

The density estimator proposed above (Equation 7) must construct a low-dimensional embedding of the items to find the pairs $\tilde{\mathcal{P}}(h)$. While there are a number of techniques for finding such embeddings, they are not necessarily optimal for density estimation. Their objective may be to create visually pleasing scatterplots, which requires introducing additional constraints that favor embeddings with certain properties. For instance, the methods may aim to focus on short distances so that the local neighbourhood of each data point is preserved better. This emphasises clusters the data may have. As a consequence the resulting distances may become disproportionately skewed towards short distances. However, for the bandwidth selection process to produce good results, we would prefer the resulting distances to be as truthful as possible, possibly at the cost of the embedding resulting in a less clear visualisation. Moreover, the embedding algorithms tend to be sensitive to initialisation.

Motivated by this, we propose a second triplet density estimator that does not need an embedding. In particular, we show that it is possible to produce good density estimates, provided we can find a sufficient number of short pairs. Unlike in the method proposed above, they do not all have to be of approximately the same length h. Instead, it is sufficient to find pairs that are all at most of a certain length that we will denote by $\delta_{\max}$. We first describe the estimator and prove some of its properties, and continue by discussing the problem of finding short pairs without an embedding.

Basic idea

This variant of the density estimator uses a slightly different human intelligence task: "Which of the three items x, u, and v is an outlier?" The worker must identify one of the three items as the one being the most dissimilar to the other two. In (Heikinheimo and Ukkonen 2013) this task was used to compute the centroid of a set of items. We model this HIT by a function $\Omega$ that takes as input the item x and the pair (u, v), and returns 1 if the distance between x and u or v is shorter than the distance between u and v, and 0 otherwise. That is, we have $\Omega(x, (u, v)) = 0$ if x is chosen as the outlier and $\Omega(x, (u, v)) = 1$ if either u or v is chosen as the outlier. More formally, we have

$$\Omega(x, (u, v)) = \mathbb{I}\{\min(d(x, u), d(x, v)) \le d(u, v)\},$$

and we say that the pair (u, v) covers the item x whenever $\Omega(x, (u, v)) = 1$.
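In a deployment, $\Omega$ is read off from the worker's categorical answer rather than computed from distances. A possible wrapper, where `ask_outlier` is a hypothetical stand-in for the crowdsourcing call:

```python
def omega(x, pair, ask_outlier):
    """0/1 value of Omega(x, (u, v)) from one outlier HIT.

    ask_outlier(a, b, c) is assumed to return whichever of the three
    items the worker picked as the outlier.  Omega is 1 exactly when
    x is not the picked outlier, i.e. when the pair covers x."""
    u, v = pair
    return 0 if ask_outlier(x, u, v) == x else 1
```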
Our method is based on the following observation. Let $\mathcal{P} = \{(u_i, v_i)\}_{i=1}^{l}$ denote a set of pairs, where every $u_i$ and $v_i$ is drawn from D uniformly at random. Moreover, let $\mathcal{P}^{\delta_{\max}} \subset \mathcal{P}$ denote the set of "short pairs", meaning those for which we know the distance d(u, v) to be below some threshold $\delta_{\max}$. Now, fix some $x \in D$, and consider the number of short pairs $(u, v) \in \mathcal{P}^{\delta_{\max}}$ that cover x. If x resides in a high density region of p, we would expect there to be many short pairs that cover x, while there are fewer such pairs if x is from a low density region. This is illustrated in the right panel of Fig. 1.

The estimator that we propose, called TDE2, is defined as described above. We first draw an iid. sample of pairs $\mathcal{P}$ from p, and then remove all pairs from $\mathcal{P}$ that are longer than $\delta_{\max}$ to obtain $\mathcal{P}^{\delta_{\max}}$. As we discuss in the remainder of this section, we can estimate the density p(x) at any given x by computing the number of pairs in $\mathcal{P}^{\delta_{\max}}$ that cover x and taking a simple transformation of the resulting number.

Preliminary definitions and some simple results

As discussed above, the estimate that TDE2 gives for p(x) is based on the number of short pairs that cover x when the pairs are drawn independently from p. Suppose x is fixed, and we obtain the pair (u, v) by first drawing u according to p, and then drawing v independently of u, also from p. Let $\delta_{\max}$ denote some distance. Consider the events A = "the item x is covered by the pair (u, v)" and B = "the distance d(u, v) is at most $\delta_{\max}$". Let $q(x, \delta_{\max})$ denote the probability $\Pr[A \cap B]$. The TDE2 estimator actually estimates $q(x, \delta_{\max})$. Observe that we have

$$q(x, \delta_{\max}) = \Pr[A \cap B] = \Pr[A \mid B] \Pr[B]. \qquad (8)$$

Now, let $\mathcal{P}$ be a set of pairs sampled iid. from p, and denote by $\mathcal{P}^{\delta_{\max}}$ the subset of $\mathcal{P}$ that only contains pairs up to length $\delta_{\max}$, i.e., $\mathcal{P}^{\delta_{\max}} = \{(u, v) \in \mathcal{P} \mid d(u, v) \le \delta_{\max}\}$. Observe that $|\mathcal{P}^{\delta_{\max}}| / |\mathcal{P}|$ is an unbiased and consistent estimate of $\Pr[B]$, i.e., we have

$$\lim_{|\mathcal{P}| \to \infty} \frac{|\mathcal{P}^{\delta_{\max}}|}{|\mathcal{P}|} = \Pr[B]. \qquad (9)$$

To estimate $\Pr[A \mid B]$, we define the triplet score of the item x given the set of pairs $\mathcal{P}$ as

$$S(x, \mathcal{P}^{\delta_{\max}}) = \sum_{(u, v) \in \mathcal{P}^{\delta_{\max}}} \Omega(x, (u, v)), \qquad (10)$$

and find that $S(x, \mathcal{P}^{\delta_{\max}}) / |\mathcal{P}^{\delta_{\max}}|$ is an estimate of the probability $\Pr[A \mid B]$, i.e.,

$$\lim_{|\mathcal{P}| \to \infty} \frac{S(x, \mathcal{P}^{\delta_{\max}})}{|\mathcal{P}^{\delta_{\max}}|} = \Pr[A \mid B]. \qquad (11)$$

By combining Equations 8, 9 and 11, we obtain the following:

Proposition 1. Given the set $\mathcal{P}$ of pairs sampled independently from p, we have

$$\lim_{|\mathcal{P}| \to \infty} \frac{S(x, \mathcal{P}^{\delta_{\max}})}{|\mathcal{P}|} = q(x, \delta_{\max}).$$

This shows that TDE2 is an unbiased and consistent estimate of $q(x, \delta_{\max})$.
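The triplet score of Equation 10 and the Proposition 1 estimate of $q(x, \delta_{\max})$ then amount to counting positive $\Omega$ answers; a sketch, with `covers` denoting the $\Omega$ oracle from the earlier snippet:

```python
def triplet_score(x, short_pairs, covers):
    """S(x, P^delta_max) of Equation 10: the number of short pairs
    (u, v) whose outlier HIT reports that they cover x."""
    return sum(1 for (u, v) in short_pairs if covers(x, u, v))

def q_estimate(x, short_pairs, n_pairs_total, covers):
    """Proposition 1: S(x, P^delta_max) / |P| is an unbiased and
    consistent estimate of q(x, delta_max).  Note the denominator
    counts all sampled pairs, not just the short ones."""
    return triplet_score(x, short_pairs, covers) / n_pairs_total
```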
Results for 1-dimensional data

Next, we discuss the connection between $q(x, \delta_{\max})$ and p(x). Our objective is to show that by taking a suitable transformation of our estimate for $q(x, \delta_{\max})$ we can obtain a reasonable estimate for p(x). For one-dimensional data it is fairly easy to see when events A and B happen simultaneously, and to derive an expression for $q(x, \delta_{\max})$. Consider all pairs of length $\delta$, starting from $\delta = 0$ up to $\delta = \delta_{\max}$. For every such pair (u, v) of length $\delta$, item x is covered by (u, v) exactly when u is contained in the interval $[x - 2\delta, x + \delta]$ and v is at $u + \delta$. (Thus v must be in the interval $[x - \delta, x + 2\delta]$.) If this happens, we have $\min\{d(x, u), d(x, v)\} \le d(u, v)$ and $\Omega(x, (u, v)) = 1$. Next, suppose the midpoint of the pair (u, v) is at j. Clearly we must have $u = j - \delta/2$ and $v = j + \delta/2$. Using these we can write $q(x, \delta_{\max})$ as

$$q(x, \delta_{\max}) = 2 \int_{\delta=0}^{\delta_{\max}} \int_{j = x - \frac{3}{2}\delta}^{x + \frac{3}{2}\delta} p\Big(j - \frac{\delta}{2}\Big)\, p\Big(j + \frac{\delta}{2}\Big)\, dj\, d\delta. \qquad (12)$$

The constant 2 is needed because we can obtain (u, v) by first drawing u and then v, or vice versa. The following proposition shows how $q(x, \delta_{\max})$ is connected to p(x).

Proposition 2. Let the function $q(x, \delta_{\max})$ be defined as in Equation 12. We have

$$\lim_{\delta_{\max} \to 0} \sqrt{\frac{q(x, \delta_{\max})}{3 \delta_{\max}^2}} = p(x). \qquad (13)$$

Proof. See Appendix.

This shows that by choosing a sufficiently small $\delta_{\max}$, we can obtain approximate estimates for p(x) by taking a simple transformation of the estimate for $q(x, \delta_{\max})$. We conclude this section by combining this observation with Propositions 1 and 2 into the following result.

Corollary 1. Let p be a univariate density function, and let $\mathcal{P}$ be a set of pairs sampled independently from p. Let $\mathcal{P}^{\delta_{\max}} \subseteq \mathcal{P}$ contain those pairs in $\mathcal{P}$ having length at most $\delta_{\max}$. For a sufficiently small $\delta_{\max}$ we have

$$\hat{p}(x, \delta_{\max}) = \sqrt{\frac{S(x, \mathcal{P}^{\delta_{\max}})}{3 \delta_{\max}^2\, |\mathcal{P}|}} \approx p(x).$$

Observe that $\hat{p}(x, \delta_{\max})$ as defined above yields estimates that are appropriately normalised. The estimates are guaranteed to converge to $q(x, \delta_{\max})$, and have a bias wrt. p(x) that depends on $\delta_{\max}$. By "sufficiently small" we mean that $\delta_{\max}$ should be as small as possible without making the set $\mathcal{P}^{\delta_{\max}}$ too small. We continue with a simple study of the effect of $\delta_{\max}$ on the bias and variance of $\hat{p}(x, \delta_{\max})$.
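A direct transcription of the Corollary 1 estimator. Note that it needs the absolute value of $\delta_{\max}$ in its normalisation constant and is therefore only available when a distance unit is known; otherwise one works with the unnormalised score, as discussed above:

```python
import math

def tde2_univariate(x, short_pairs, n_pairs_total, delta_max, covers):
    """Normalised one-dimensional TDE2 estimate (Corollary 1):
    sqrt(S(x, P^delta_max) / (3 * delta_max**2 * |P|))."""
    s = sum(1 for (u, v) in short_pairs if covers(x, u, v))
    return math.sqrt(s / (3.0 * delta_max ** 2 * n_pairs_total))
```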
Bias and variance of the 1-dimensional estimator, an empirical study

The estimator $\hat{p}(x, \delta_{\max})$ (Corollary 1) is biased, with the bias increasing as $\delta_{\max}$ increases. We studied the effect of the bias empirically. The top panel in Fig. 2 shows the shape of $\sqrt{q(x, \delta_{\max}) / (3 \delta_{\max}^2)}$ for different values of $\delta_{\max}$ when p is the standard normal distribution. These are what $\hat{p}(x, \delta_{\max})$ will converge to as the size of $\mathcal{P}$ increases. On the other hand, for "small" $\delta_{\max}$ the estimate for $q(x, \delta_{\max})$ will have a high variance, as there are only few pairs of length at most $\delta_{\max}$.

Figure 2: Top: Examples of the estimator $\hat{p}(x, \delta_{\max})$ for different values of $\delta_{\max}$ when the underlying density is a standard normal distribution. Bottom: Examples of estimating a bimodal distribution from a sample of 500 data points with 500 and 5000 pairs (dashed lines) when $\delta_{\max} = 1.0$.

Figure 3: Top: Estimation error (MSE, wrt. the true density) as a function of $\delta_{\max}$ when the estimate is computed using 1000 pairs drawn from a data of 200 data points sampled from a standard normal distribution. Bottom: Correlation of the estimate ($\delta_{\max} = 0.5$) with the true density (the bimodal distribution shown on the left) as a function of the number of pairs in $\mathcal{P}$. The inset shows the number of short pairs that remain in $\mathcal{P}^{\delta_{\max}}$ after removing those longer than $\delta_{\max}$.

The top panel of Fig. 3 shows how the errors behave due to bias and variance as a function of $\delta_{\max}$. The estimate is computed from 200 points sampled from a standard normal distribution. For small $\delta_{\max}$ the total error consists mainly of variance due to the small size of $\mathcal{P}^{\delta_{\max}}$. As $\delta_{\max}$ increases the variance component diminishes, and most of the remaining error is due to bias. In practice it is possible to set $\delta_{\max}$ so that the estimator will produce useful results.

The bottom panel of Fig. 2 shows an example of a mixture of two normal distributions (black curve), the limit to which $\hat{p}(x, 1.0)$ would converge (blue curve), as well as two estimates computed with 500 and 5000 pairs in $\mathcal{P}$, respectively (the red dashed curves). We can observe that even 500 pairs are enough to obtain a reasonable estimate. Finally, the bottom panel of Fig. 3 shows how the estimation accuracy (correlation between the estimate and the true density) behaves as the number of pairs in $\mathcal{P}$ is increased from 250 to 2500.

Generalisation to higher dimensions

Above we showed that for one-dimensional data the square root of $q(x, \delta_{\max})$ is proportional to p(x) as $\delta_{\max}$ approaches zero. We can show a similar result also for arbitrary-dimensional inputs.

Proposition 3. Let $q(x, \delta_{\max})$ be defined as in Equation 8, and let $C(\delta_{\max})$ be some function of $\delta_{\max}$. We have

$$\lim_{\delta_{\max} \to 0} \sqrt{\frac{q(x, \delta_{\max})}{C(\delta_{\max})}} = p(x). \qquad (14)$$

Proof. See Appendix.

This result is important because in general we cannot know the true, underlying dimensionality of the data. Hence, our estimator should be independent of the dimensionality, and Proposition 3 shows this to be the case. Finally, since $q(x, \delta_{\max})$ is in practice proportional to the triplet score $S(x, \mathcal{P}^{\delta_{\max}})$ (Eq. 10), our estimator can be defined simply as $\sqrt{S(x, \mathcal{P}^{\delta_{\max}})}$, because both $C(\delta_{\max})$ and $|\mathcal{P}|$ are constants that do not depend on x.
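Because the omitted constants do not depend on x and the square root is monotone, the unnormalised score already suffices for tasks such as ranking items by rarity (cf. the galaxy example in the Introduction). A sketch:

```python
def rank_by_rarity(items, short_pairs, covers):
    """Order items from lowest to highest estimated density using the
    dimension-free, unnormalised estimator sqrt(S(x, P^delta_max)).
    The square root is monotone, so ranking by the raw triplet score
    S gives the same order; the rarest items come first."""
    return sorted(items, key=lambda x: sum(1 for (u, v) in short_pairs
                                           if covers(x, u, v)))
```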
Finding short pairs without embeddings

Above we showed how the length of the pairs that are used to compute $\hat{p}(x, \delta_{\max})$ affects the quality of the estimate. In particular, we should only use pairs that are "short" in comparison to the average length of the pairs. In practice we do not have the absolute lengths of the pairs, however. We can only obtain relative distance information, and thus cannot simply filter the pairs based on their absolute length. Instead, we must find short pairs by some other means. Next we discuss a simple method for doing this.

Let $\mathcal{P}^{\delta_{\max}}$ denote the set of pairs that are at most of length $\delta_{\max}$. Since all pairs in $\mathcal{P}$ are drawn uniformly at random from p, we know that $|\mathcal{P}^{\delta_{\max}}| / |\mathcal{P}|$ is an unbiased estimate of the probability $\Pr[d(u, v) \le \delta_{\max}]$, as was discussed above. Inversely, by fixing some value $\alpha \in [0, 100]$, we can consider all pairs in $\mathcal{P}$ for which the length falls below the $\alpha$:th percentile of the length distribution of all pairs in $\mathcal{P}$. By setting $\alpha$ to some small value, e.g. $\alpha = 10$, we obtain pairs that correspond to some unknown "small" value of $\delta_{\max}$, so that $\frac{\alpha}{100} = \Pr[d(u, v) \le \delta_{\max}]$. Let $\mathcal{P}^{\alpha}$ denote the set of pairs selected for a given $\alpha$. This is simply an alternative formulation for "short" pairs without considering absolute distances.

Finding the $\alpha$:th percentile of a set of values can be done efficiently using a selection algorithm. In general, these find the i:th largest value in a given list of values without sorting the values. They commonly run in linear time in the size of their input, which is $|\mathcal{P}|$ in this case. Note that e.g. the Quickselect algorithm (Hoare 1961) (and its variants, e.g. the Floyd-Rivest algorithm (Floyd and Rivest 1975)) also finds all pairs that are below the percentile as a simple side effect. No separate filtering step is thus needed. Most selection algorithms are based on pairwise comparisons, and can thus be implemented without explicitly evaluating d(u, v) for any $(u, v) \in \mathcal{P}$. We only need an oracle that can evaluate the indicator function $\mathbb{I}\{d(u, v) < d(u', v')\}$. This is easily expressed as the following human intelligence task: "Which of the two pairs, (u, v) or (u', v'), is more similar?"

Moreover, by definition of the percentile, we will select $\frac{\alpha}{100} |\mathcal{P}|$ pairs. This can be used to control the size of the resulting estimator (the set $\mathcal{P}^{\alpha}$). The TDE1 estimator proposed above uses by definition $n = |D|$ pairs. To make the TDE2 estimator comparable in complexity, we should use $\frac{100 n}{\alpha}$ pairs in $\mathcal{P}$ to have a $\mathcal{P}^{\alpha}$ of the desired size. Note that both $\mathcal{P}$ and $\mathcal{P}^{\alpha}$ are in this case linear in n. This is a desirable property, as the number of pairs will directly translate to the number of HITs needed to construct as well as to evaluate the estimator.
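A sketch of such a selection step driven purely by the pair-comparison HIT. Here `longer(p, q)` is our stand-in for the task above, answering whether pair p is longer (less similar) than pair q; the shortest pairs fall out of the Quickselect-style partitioning without any absolute lengths:

```python
import random

def k_shortest_pairs(pairs, k, longer):
    """Return the k shortest of `pairs` using only the comparison HIT
    longer(p, q) := "is pair p longer than pair q?" (Quickselect-style,
    expected O(len(pairs)) comparisons).  For the alpha:th percentile,
    call with k = round(alpha / 100 * len(pairs))."""
    items = list(pairs)
    if k <= 0:
        return []
    if len(items) <= k:
        return items
    shuffled = random.sample(items, len(items))   # random pivot choice
    pivot, rest = shuffled[0], shuffled[1:]
    shorter = [p for p in rest if longer(pivot, p)]        # shorter than pivot
    not_shorter = [p for p in rest if not longer(pivot, p)]
    if len(shorter) >= k:
        return k_shortest_pairs(shorter, k, longer)
    return shorter + [pivot] + k_shortest_pairs(not_shorter,
                                                k - len(shorter) - 1, longer)
```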
For this it is crucial that we do not need all pairs that are shorter than $\delta_{\max}$; a uniform sample of these is sufficient. Using a standard selection algorithm for finding $\mathcal{P}^{\alpha}$ has the downside that the necessary HITs cannot be solved independently. E.g. Quickselect operates in the same way as Quicksort, by placing items to either side of a pivot item. This creates dependencies among the tasks, where one task must be solved before others can even be submitted to the workers. Developing a selection algorithm without such dependencies is an interesting open question.

Experiments

In this section we compare the proposed density estimators, TDE1 and TDE2, using simulations.

Experimental setup

We measure cost as the number of HITs that are needed to construct the estimator. The methods are implemented as follows:

• TDE1: The estimator is constructed by first embedding a set of items into $\mathbb{R}^2$ with the method of (van der Maaten and Weinberger 2012). (In practice we have to make some guess about the input data dimensionality when using this method. In these experiments we assume a fixed dimensionality of 2, independently of the true data dimensionality.) Given the embedding, we find an optimal bandwidth h using cross validation, and select the set $\mathcal{P}(h)$ by pairing every $u \in D$ with the $v \in D$ so that the distance between u and v in the embedding is as close to h as possible (see the embedding-based construction above). The estimates are computed with Equation 7 (omitting the normalisation).

• TDE2: We use the selection-algorithm-based method described above to compute the set $\mathcal{P}^{\alpha}$. To make the cost of computing an estimate for a single x equal to the cost of TDE1, we let $|\mathcal{P}| = \frac{100}{\alpha} |D|$. This results in $|\mathcal{P}^{\alpha}| = |D|$. We used $\alpha \in \{0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35\}$. The estimates are computed as $\sqrt{S(x, \alpha)}$, where $S(x, \alpha)$ is defined analogously to $S(x, \delta_{\max})$.

We use a separate training data $D_{train}$ to learn the estimator, and evaluate it on test data $D_{test}$ that is drawn from the same distribution as $D_{train}$. In all cases we let $|D_{train}| = |D_{test}| = 200$ items sampled iid. from the density in question. We use the following synthetic datasets that are mixtures of multivariate normal distributions:

Gaussian-1D: A single standard multivariate normal distribution in a 1-dimensional space.
Gaussian-5D: A single standard multivariate normal distribution in a 5-dimensional space.
2Gaussians-3D: Two standard multivariate normal distributions in a 3-dimensional space, located at (0, 0, 0) and (3, 3, 3).
2Gaussians-5D: As 2Gaussians-3D but in a 5-dimensional space.
4Gaussians-2D: Four standard multivariate normal distributions in a 2-dimensional space, located at (0, 0), (0, 5), (5, 0) and (5, 5).
4Gaussians-3D: Four standard multivariate normal distributions in a 3-dimensional space, located at (0, 0, 0), (5, 0, 0), (0, 5, 0) and (0, 0, 5).

We only consider fairly low-dimensional data with $k \le 5$. This is because density estimation becomes less meaningful for high-dimensional data due to the curse of dimensionality. We think this is a sufficient study of the problem also from a practical point of view: in crowdsourcing applications we do not expect the workers to use more than a few dimensions when giving similarity judgements. Also, we assume that computation is noise-free. That is, we obtain consistent and correct answers to each HIT. In practice some quality control mechanism such as (Dawid and Skene 1979; Raykar and Yu 2012) can be used to reduce the effect of erroneous solutions to tasks. We assume this can be taken care of by the crowdsourcing platform, and that it may increase the costs by a small constant factor.

Both methods use $|D|$ HITs to compute a single estimate. We first run TDE2 100 times for every value of $\alpha$. Then, we compute the average number of HITs (pairwise comparisons of two item pairs) that were needed to find $\mathcal{P}^{\alpha}$. Then, we allow TDE1 the same number of HITs to compute the embedding. This way we can compare the methods given the same amount of work. We report the amount of work as the number of HITs per data point in $D_{train}$.

We measure the quality of the resulting estimates using mean absolute error, defined as

$$\mathrm{MAE} = \frac{\sum_{x \in D_{test}} |\mathrm{estimate}(x) - p(x)|}{|D_{test}|\, \max_x \{p(x)\}},$$

where p(x) is the true density at x. The estimate and p are both normalised so that $\sum_{x \in D_{test}} \mathrm{estimate}(x) = \sum_{x \in D_{test}} p(x) = 1$. We also normalise the error with $\max_x \{p(x)\}$ so that it can be understood as a fraction of the largest density value in $D_{test}$.
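A direct transcription of this error measure, assuming the true density values are available, as they are for the synthetic data used here:

```python
import numpy as np

def mae(estimates, true_density):
    """Normalised mean absolute error used in the experiments.

    Both vectors are first normalised to sum to one over the test
    set; the mean absolute difference is then divided by the largest
    normalised true-density value, so the error reads as a fraction
    of the peak density."""
    e = np.asarray(estimates, dtype=float)
    p = np.asarray(true_density, dtype=float)
    e, p = e / e.sum(), p / p.sum()
    return float(np.mean(np.abs(e - p)) / p.max())
```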
Results

Results are shown in Figure 4. We show how the error behaves as a function of the number of HITs required per data item to build the estimator. (Here the largest amount of work corresponds to using $\alpha = 0.05$ with TDE2.) In the plots TDE1 is shown in red, and TDE2 in black.

Figure 4: Experimental results. Estimation error (MAE) as a function of the number of HITs used to construct the estimator (#HITs/|D|), for the datasets Gaussian-1D, Gaussian-5D, 2Gaussians-3D, 2Gaussians-5D, 4Gaussians-2D, and 4Gaussians-3D. TDE1 is shown in red, TDE2 is shown in black.

We observe that both methods perform better when they are allowed to use more work. Increasing the amount of work starts to have diminishing effects after some point. TDE1 performs slightly better than TDE2 when only a very small amount of work is used. However, as the number of HITs is increased, TDE2 tends to perform better than TDE1. This is mainly because TDE2 makes no assumptions about the dimensionality of the input, unlike TDE1. Notice that with 4Gaussians-2D the methods perform almost equally well. Of course, in practice the dimensionality of the input data is unknown, and making an educated guess about it can be difficult.

Conclusions

In this work we presented two methods, TDE1 and TDE2, for density estimation using judgements of relative similarity. TDE1 is an implementation of a standard kernel density estimator with relative distances. TDE2 is a novel method that has some practical advantages over TDE1. In particular, our experiments suggest that TDE2 is more robust when the true dimensionality of the input is unknown.

There are multiple avenues for further research. Both TDE1 and TDE2 currently use |D| HITs to estimate density at a single point x. It seems likely that this could be improved by optimising the pairs against which x is compared. Another interesting question is whether we could learn estimators like TDE1 without embeddings. That is, can we find pairs of approximately the same length directly with e.g. the kind of pairwise comparisons used for TDE2? One possible solution to this are partial orders with ties (Ukkonen et al. 2009). Also the theoretical results presented in this paper could be improved. In particular, proving bounds on the estimation error as a function of $\alpha$ would be interesting.

References

Amsterdamer, Y.; Grossman, Y.; Milo, T.; and Senellart, P. 2013. Crowd mining. In SIGMOD Conference, 241–252.
Bragg, J.; Mausam; and Weld, D. S. 2013. Crowdsourcing multi-label classification for taxonomy creation. In HCOMP 2013.
Chilton, L. B.; Little, G.; Edge, D.; Weld, D. S.; and Landay, J. A. 2013. Cascade: crowdsourcing taxonomy creation. In CHI 2013, 1999–2008.
Davis, J. V.; Kulis, B.; Jain, P.; Sra, S.; and Dhillon, I. S. 2007. Information-theoretic metric learning. In ICML 2007.
Dawid, A. P., and Skene, A. M. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics 20–28.
Floyd, R. W., and Rivest, R. L. 1975. Algorithm 489: The algorithm SELECT for finding the i:th smallest of n elements. Communications of the ACM 18(3):173.
Gomes, R. G.; Welinder, P.; Krause, A.; and Perona, P. 2011. Crowdclustering. In Advances in Neural Information Processing Systems, 558–566.
Hastie, T.; Tibshirani, R.; and Friedman, J. 2009. The Elements of Statistical Learning, volume 2. Springer.
Heikinheimo, H., and Ukkonen, A. 2013. The crowd-median algorithm. In First AAAI Conference on Human Computation and Crowdsourcing.
Hoare, C. A. R. 1961. Algorithm 65: Find. Communications of the ACM 4(7):321–322.
Lintott, C. J.; Schawinski, K.; Slosar, A.; Land, K.; Bamford, S.; Thomas, D.; Raddick, M. J.; Nichol, R. C.; Szalay, A.; Andreescu, D.; et al. 2008. Galaxy Zoo: morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Monthly Notices of the Royal Astronomical Society 389(3):1179–1189.
Liu, E. Y.; Guo, Z.; Zhang, X.; Jojic, V.; and Wang, W. 2012. Metric learning from relative comparisons by minimizing squared residual. In ICDM 2012.
Parameswaran, A. G.; Sarma, A. D.; Garcia-Molina, H.; Polyzotis, N.; and Widom, J. 2011. Human-assisted graph search: it's okay to ask questions. PVLDB 4(5):267–278.
Parzen, E. 1962. On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3):1065–1076.
Raykar, V. C., and Yu, S. 2012. Eliminating spammers and ranking annotators for crowdsourced labeling tasks. Journal of Machine Learning Research 13:491–518.
Rosenblatt, M. 1956. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics 27(3):832–837.
Schultz, M., and Joachims, T. 2003. Learning a distance metric from relative comparisons. In NIPS 2003.
Tamuz, O.; Liu, C.; Belongie, S.; Shamir, O.; and Kalai, A. 2011. Adaptively learning the crowd kernel. In ICML 2011, 673–680.
Trushkowsky, B.; Kraska, T.; Franklin, M. J.; and Sarkar, P. 2013. Crowdsourced enumeration queries. In ICDE 2013, 673–684.
Ukkonen, A.; Puolamäki, K.; Gionis, A.; and Mannila, H. 2009. A randomized approximation algorithm for computing bucket orders. Inf. Process. Lett. 109(7):356–359.
van der Maaten, L., and Weinberger, K. 2012. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, 1–6. IEEE.
Wilber, M. J.; Kwak, I. S.; and Belongie, S. J. 2014. Cost-effective HITs for relative similarity comparisons. In Second AAAI Conference on Human Computation and Crowdsourcing.

Appendix

Proof of Proposition 2

Proof. We first upper bound $q(x, \delta_{\max})$ by letting

$$q(x, \delta_{\max}) \le q'(x, \delta_{\max}) = 2 \int_{\delta=0}^{\delta_{\max}} \int_{j = x - \frac{3}{2}\delta}^{x + \frac{3}{2}\delta} p(u^*)\, p(u^* + \delta^*)\, dj\, d\delta,$$

where $u^*$ and $\delta^*$ are the solution to $\operatorname{argmax}_{u, \delta}\; p(u)\, p(u + \delta)$ s.t. $u \in [x - 2\delta_{\max}, x + \delta_{\max}]$ and $\delta \le \delta_{\max}$. Since $u^*$ and $\delta^*$ only depend on $\delta_{\max}$ and not on $\delta$ or j, $p(u^*)\, p(u^* + \delta^*)$ can be taken out from the integrals, and we have

$$q'(x, \delta_{\max}) = 2\, p(u^*)\, p(u^* + \delta^*) \int_{\delta=0}^{\delta_{\max}} \int_{j = x - \frac{3}{2}\delta}^{x + \frac{3}{2}\delta} 1\, dj\, d\delta.$$

A straightforward calculation reveals that the integrals simplify to $\frac{3}{2} \delta_{\max}^2$, and we obtain $q'(x, \delta_{\max}) = 3 \delta_{\max}^2\, p(u^*)\, p(u^* + \delta^*)$. Now, as $\delta_{\max} \to 0$, the $x \pm 2\delta_{\max}$ interval becomes tighter and tighter around x. It follows that $u^* \to x$ and $\delta^* \to 0$, and thus $p(u^*)\, p(u^* + \delta^*) \to p(x)^2$. By combining these observations we find that

$$\lim_{\delta_{\max} \to 0} \frac{q'(x, \delta_{\max})}{3 \delta_{\max}^2} = \lim_{\delta_{\max} \to 0} \frac{3 \delta_{\max}^2\, p(u^*)\, p(u^* + \delta^*)}{3 \delta_{\max}^2} = \lim_{\delta_{\max} \to 0} p(u^*)\, p(u^* + \delta^*) = p(x)^2.$$

An upper bound of $q(x, \delta_{\max})$ thus approaches $p(x)^2$ from above as $\delta_{\max} \to 0$. We can make a similar argument in terms of the pair $(u_*, u_* + \delta_*)$ that has the lowest probability in the interval $x \pm 2\delta_{\max}$. This will show that a lower bound to $q(x, \delta_{\max})$ approaches $p(x)^2$ from below. By combining these we obtain that $q(x, \delta_{\max}) / (3 \delta_{\max}^2)$ must approach $p(x)^2$ as $\delta_{\max} \to 0$, which proves the claim.
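The limit in Proposition 2 is easy to check numerically for the standard normal density; the following Monte-Carlo sketch (our addition) estimates $q(x, \delta_{\max})$ by brute force and compares the transformed value against p(x):

```python
import numpy as np

def check_proposition_2(x=0.0, delta_max=0.1, n_pairs=1_000_000, seed=0):
    """Monte-Carlo check of Proposition 2 for p = N(0, 1):
    sqrt(q(x, delta_max) / (3 delta_max^2)) should approach p(x)
    as delta_max shrinks."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n_pairs)
    v = rng.standard_normal(n_pairs)
    covered = np.minimum(np.abs(x - u), np.abs(x - v)) <= np.abs(u - v)  # event A
    short = np.abs(u - v) <= delta_max                                   # event B
    q = np.mean(covered & short)                  # estimate of q(x, delta_max)
    estimate = np.sqrt(q / (3.0 * delta_max ** 2))
    true_density = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
    # Both values should be close to 0.3989 at x = 0 for small delta_max.
    return estimate, true_density
```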
We thus those and obtain = (12) lim s p given s(13) q(x, ) replace ppair (x)v)proof =drawnfor max p(y). s dom (u, from that d(u, v) is fixed !0 C( ) present the 2-dimensional data, but the argument generalises max max We proof presentfor the2-dimensional proof for 2-dimensional data, the generalises Proof We present the data, but the but argument generalises Proof = p(x). (12)argument lim y2V (x,2 max ) q(x, )q(x, q(x, )q(x, )!0 max max max max mple the by first drawing uthe from p, and then draw max Since and d(x, u) 2)Cgeneralises , (p(x). we have d(x, 2x with , and we can thus upper C( )Pr[U max max We present proof for 2-dimensional the Proof k-dimensions indata, amax fairly manner. Fixpoint somev) density to density pargument (x) ).some nsions in pair a fairly straightforward manner. Fix some x but with = (12) = p(x). (12) = p(x). lim limpoint = (12) lim]straightforward 1Fix k-dimensions in a p(x). fairly straightforward manner. xpoint with(12) density tolim max !0 C((u, We present the proof for 2-dimensional data, but thethat argument generalises Proof !0 C((u, )with = . The point xU is is covered by pair v))when one maxthe max !0 !0 C( ) C( ) maxreplacing max bility of event the total density inside V (x, ) (so max max bound Pr[U ] by both p(u) and p(v) with p (x). max max k-dimensions in a fairly straightforward manner. Fix some point x density to p(x), and consider a random pair v) drawn from p given that d(u, v) isThis fixed yields consider a random pair (u, v) drawn from p given that d(u, v) is fixed p(x), and consider a random pair (u, v) drawn from p given that d(u, v) is fixed Next, the event Vpoint . pItgiven when u fixed falls in u the “ring” around x, in a fairly straightforward manner. Fix some xhappens with density to k-dimensions ppens: ! We present the proof data, but the argument generalises Proof ough to. x) times the probability thatconsider vfor is 2-dimensional drawn from the p(x), and consider aby random pair (u, v) drawn from that d(u, is from We sample the pair byv) first p, and then draw to some Zdrawing Zfrom max .p, We sample the pair first drawing u and then draw We present the proof for 2-dimensional data, but the argument generalises Proof max We present the proof for 2-dimensional data, but the argument generalises Proof . We sample the pair by first drawing u from p, and then draw to some max p(x), and a random pair (u, v) drawn from p)2-dimensional given that v) fixed = V (x, 2=d(u, (x,isp, ), and vxcovered on segment of when E(u, ) the right side of Figure 5. Like above with U , we defined as the set R(x, esent the proof for 2-dimensional data, but proof the argument for generalises data, but the argument Proof in apresent fairly straightforward Fix some point with density toconsider 2falls . We sample the pair by first drawing u)\V and draw tok-dimensions some max ud(u, so that d(u, v) .from The point xthen is bythe thegeneralises pair (u, v) one vthe given E(u, )We (so that we have v) =manner. as adius , d(u, i.e., the set max Also see Antti Ukkonen et al. δ du. Pr[U ] p (x) 1 dv :soto d(x, u) k-dimensions in a fairly straightforward manner. Fix some point x with density k-dimensions in a fairly straightforward manner. Fix some point x with density to that v) = . The point x is covered by the pair (u, v) when one upair sointhat d(u, v) = The point xthen is covered by point the ispair (u, v) when one v given . uWe sample the first drawing u.). 
p, and draw max We have that intersects with Vpoint (x, onstoinsome a fairly straightforward k-dimensions manner. aThe fairly some straightforward xfrom with density manner. Fix with density toso that d(u, v) = .by point xv) is drawn covered by thepthus pair (u, v) some when one v given ofFix the following events happens: p(x), and consider a (u, random pair (u, from given that d(u, Vv) (x, )xfixed E(u, ) v) upper p(x), and consider a random pair v) drawn from p given that d(u, v) is fixed p(x), and consider a random pair (u, v) drawn from p given that d(u, is fixed bound Pr[Vδ ] by replacing p(u) and p(v) both with owing events happens: :nsider d(x, u) 2that and d(x, u< so d(u, v)v) = . .The point x is covered byZthe pair (u, v) Zwhen one v<given ! of the following events happens: of the following events happens: ! a random pair p(x), (u, v) and drawn consider from a p random given that pair d(u, (u, v) v) drawn is fixed from p given that d(u, v) is fixed maxthe . We to sample the pair by first uby from p,drawing and then draw to max some Z atfollowing Z sample We by first drawing u from p,drawing and then draw to of some . We sample the pair u from and density thenδdraw some max the happens: :integrals d(x, u) U entered point y.events with radius .pair That is, the set The value of the nofirst longer depends onp,then the only This on theislength p maxp,(x). allowed as d(x, u) ≤ 2δ and d(x, v) ≤ δ. We sample the pair by first dv drawing sample uvalues from p, and then by first draw u(u,from p, to some max umax that d(u, v) = point x v) ispair covered by drawing the pair v) one vof given in. terms U: and Vso by over all du. Pr[V ]by = p(u) p(v) dvand u so d(u, v) = .integrating The x. We is.u) covered the pair (u, v) when one vmax uThe so that d(u, = . The point x is covered bywhen the pair (u, v)draw when one vthe d(x, u) denote Uthat du. ])given = p(u) p(v) Uax(y, : given d(x, Upoint : the d(x, u) Uthe V ). Next, let p (x) maximum V : < d(x, u) < 2 and d(x, v) . of pair (u, v). We can thus replace those with a function of and obtain R(x, ) E(u, )\V (x, ) We obtain that v) = . The point u x so is covered that d(u, by v) the = pair . The (u, point v) when x is one covered by the pair (u, v) when one v given : d(x, u) U of thed(u, following events happens: V (x, ) E(u, ) of the following events happens: of the following events happens: Z maxV : : < d(x, v)u) <. 2 and d(x, v) . < d(x, u) < 2V and d(x, v)u)< , that is, V .2 : and<d(x, d(x, 2 ! ing events of Pr[V the: following happens: V < .d(x, u)events <Now 2weand upper . bound we can write q(x, terms Pr[U of U ]and by(x) integrating all with values pV max C1 ( p(v) ). over max ) in U d(x, , wev) upper both with Z Z ) = ,happens: (Pr[U d max (x, u) we have] +Ud(x, v)])Like can : d(x, u) 2above u) Pr[V ] by replacing p(u) and U : d(x, : d(x, u) ofthus U, and max max ) of Now we can write q(x, in terms U and V by integrating over all values 0 up to : max max an=write q(x, ) in terms of U and V by integrating over all values p (x). This is allowed as d(x, u) 2 and d(x, v) . We obtain x) max p(y). (13) max Now we can write q(x, ) in terms of U and V by integrating over all values Z 2 δ max cing both p(u) and p(v) with p (x). This yields max Now write q(x, in terms VNext, integrating overu)event all values V : max u) < 2WeUand d(x, v)by u) :max d(x, u) :that d(x, Ucan Uand consider the It happens u falls Pr[V in theδ“ring” : d(x, < d(x, and. d(x, v) . when y2Vwe )up to : <) d(x, of (x)x, 1 dv du. 
] ≤ p around max armax argument as the proof of Proposition 2.: of ! V! <show d(x, u) < 2.V and < .2ZmaxV : up(x,2 Zmaxv)) = up to (Pr[U ] + Pr[V ),]) and d . v falls on the segment q(x, max : Z defined to max ofand Z : ZZof max ) d(x, = V (x, of E(u, ) R(x,δ) as the set Z pper lower bounded so that these upper and lower ZNow E(u,δ)∩V (x,δ) V write Vthat :2total <q(x, d(x,max u) < 2terms and d(x, v) : . by <integrating d(x, u) <2inover 2R(x, and 2V )\V . by (x, max 0 U v) max max Now we can ) in of U and V all values max we can write q(x, ) terms of and integrating over all values = (Pr[U ] + Pr[V ]) d . q(x, max max ent U is the density inside V (x, ) (so max Pr[V ] p (x) 1 dv du. du. p2 asto(x) 1 dv ).] as E(u, ) \])Vdall (x,values ) We have thus that with V(Pr[U (x, Now write ) Pr[V in terms of U V R(x, by integrating over !we 0 for some C( )q(x, ). )can = (Pr[U ] up + ]) dintersects . max q(x, x])p(x) max max = (Pr[U ]+ Pr[V ]) .= + the Pr[V q(x, 0 max The ond)aand similar argument proof of upthe toproof ofmax max : max q(x, ) E(u, )\V (x, of ) . Proposition 2. We show that max times that vcan drawn from the:relies (x, terms ) ] and E(u, )isZ].and Zand ! lower 0 V 0we write q(x, probability write q(x, in of by )integrating )the in be terms over all integrating overthese all values upper bounds for)Vup Pr[U Let Vargument (y, denote 0 Uvalues max max max toNow :Pr[V of max ZVlower Zso that max The proof relies onU a similar proof ofof Proposition 2. by Webounded show that )ascan both upper and upper and q(x, As above with Uδ , the value of the integrals only depends on max Z d(u, v) =can as required). That that we have )to = ]proof + Pr[V ]) d . 2.of q(x, )2. = ]of +depends Pr[V ]) d on . C( 2.and q(x, max 2show As above with ,both the value the integrals only p. that We get proof relies on aof argument as the of Proposition We that drelies at point with radius .max Also, let E(y, )and denote the :The up : (Pr[U on aysimilar argument as the proof of Proposition We show that als no longer depends on the density p, only on the length max max The proof relies onU ais, similar argument asthese the proof We show ) Zsimilar be both upper lower bounded so that upper and lower q(x, bounds go to C(max )p(x) as(Pr[U 0Proposition for some ). not max max max ! max du. p. We get the upper bound Pr[V ] = p(u) p(v) dv Z 0 0 δ and not ! (Pr[U ] + Pr[V ]) d . q(x, 2) = max ) can be both upper lower bounded so that these upper and lower q(x, max theand max max upper bound Z bounds both to C( )p(x)be asboth ! 0tofor somelower C( max ).R(x, for can be both upper andgo lower bounded so that upper and lower We proceed derive upper bounds ] and Pr[V ]. Let ) denote ) soPr[U E(u, )\V (x, ) V (y,lower can thus replace with amax of and obtain max max these ) as can upper and bounded that these upper q(x, 2function The proof relies on)to athose similar proof Proposition 2.(Pr[U We show that 0 2 proof on a=similar argument as the proof of. Proposition 2. Weand show that = (Pr[U ] upper + the Pr[V ])for dof . for )disc ]max + Pr[V ]) ddenote q(x, q(x, bounds both go C(2 maxargument )p(x) asThe ! 0relies C(centered ). max max max Pr[V ]max ppoint (x) C ).C( max We proceed to derive bounds Pr[U ]2and Pr[V ].! Let Vwith (y, 2)( the area ofsome amax at yfor radius . Also, let E(y, ) denote the th go max to )C( )p(x) asbounds ! 0 max for some C( ). du. p(u) p(v) dvupper max max both go to C( )p(x) as 0bounded some ). 
max can be both and lower bounded so that these upper and lower q(x, max 02 C 0proof )Pr[U can be both upper and lower so that these upper and p(u) lower and p(v) both with q(x, max The proof relies on a similar argument as the of Proposition 2. We show that U , we upper bound Pr[V ] by replacing Like above with Figure 5: For the proof of Proposition 3. Left: Example of We proceed to derive upper bounds for ] and Pr[V ]. Let V (y, ) denote Pr[U ] p (x) ( ). 1 Pr[Vδ ] ≤ pδmax (x)2 C2 (δ). the area of a disc centered at point y with radius . Also, let E(y, ) denote the ) E(u, )upper bounds2 for 2to Next we these results obtain athe simple upper bound forthat oceed derive Pr[U ]acombine Pr[V ]. Let V (y, )as)that denote max ): We proceed to derive upper bounds for Pr[U ] for and Pr[V ].max Let Vv) (y, )q(x, denote bounds go to C( as ! 0and for some C( ). bounds both go to C(bounded )p(x) ! 0Proposition some C( ). show max )p(x) max max max max max lies onto a both similar argument The proof as the relies proof on of Proposition similar argument 2. We as show the proof of 2. We the area of a disc centered at point y with radius . Also, let E(y, denote ) can be both upper and lower so that these upper and lower q(x, p (x). This is allowed as d(x, u) 2 and d(x, . We obtain max of deriving Pr[V]. ZE(y, event Vproceed . d(x, Itupper happens when uwith falls inaRight: the x,)and f we abe disc centered at y Pr[U]. radius .“ring” Also, let denote the We toderiving derive for Pr[U ]Example and Pr[V ].point Let V ) denote We proceed toaround derive upper bounds forso Pr[U .] Also, Pr[V ].E(y, Letand V) (y, ) denote the area ofbe disc at y(y, with let denote the 2centered ! have v) 2point , upper and we can upper an both and lower )bounds can so both that upper and upper bounded lower these upper lower q(x, max bounds both go tobounded C( )p(x) asthese ! lower 0max for some C(radius ). Z and Z maxthus maxthat we combine these results to obtain a simple upper area of2 a)\V disc(x, centered atvpoint ythe with radius . max let E(y, ) denote the area of amax disc centered at point radius the )the =and V (x,p(v) ), and falls yields on the segment of )(Pr[U q(x, )Also, =2E(u, ]y+with Pr[V ])2 d . Also, let E(y, ) denoteNext 2 max (u) p (x). This go to C( maxwith )p(x) bounds as both ! 0 go for to some C( C( )p(x) ). as ! 0 for some C( ). max max max max max max We proceed to derive upper bounds for0Pr[U ] and Pr[V ].(x) Let V (y, ) denote Pr[V ] p 1 dv du. q(x, δ x, ). We have thus bound for max ): ! Z ed Z to derivethe proceed forcentered Pr[U to ]! and derive Pr[V upper Let bounds Vradius (y, ⇣for ) denote andE(y, Pr[V Let VE(u, (y, )⌘ denote R(x, ) (x, ) Zupper area bounds of aWe disc at point y]. with .Pr[U Also,] let )].denote the)\V 2 Z ZNow we can write max ) inmax of byE(y, inte2 max δ 1and p max (x)U2 C ) +Vpδlet (x) C)2 (denote ) d the ax)disc centered at point with a disc centered .q(x, Also,δlet at < E(y, point )yterms denote with radius the .( Also, 1the dvyarea du.of radius value of the integrals only depends on and not p. We 0to δU , the du.Asofabove = p(u) E(u,grating Z get overp(v) alldvvalues δ upwith V (x, ) ) δmax max : R(x, ) E(u, )\V (x, ) the upperZ bound max q(x, δ ) = (Pr[Uδ ] + Pr[Vδ ]) dδ 2 max Z max 2 max δmax = p (x) (C1Pr[V ( ) +] C2 (p )) d(x) C2 ( ). ger depends on the density p, only on the length 0 upper bound Pr[V ] by replacing p(u) and p(v) both0with Z δmax q(x, δ ) = (Pr[U ] + Pr[V ]) dδ. 
replace those with a function of and obtain δ Z these δ results to obtain a simple upper bound for q(x, max ): δ 2 δ 2 ed as d(x, u) 2 and d(x, v) . max We obtain Next we combine max < p max (x) C1 (δ) + p max (x) C2 (δ) dδ 0 = p max (x)2 ! Z(C1max ( ) + C2 ( )) d max Z 2 Z 0 ]p (x) C1 ( ). Z δmax q(x, 2max0as ) =the proof (Pr[U + Pr[V ]) d The proof relies on adu. similar argument of ]Propop max (x)2 1 dv δ 2 max 0 = p (x) C( ). = p max (x) (C1 (δ) + C2 (δ)) dδ max It happens when “ring” around R(x, )u falls E(u,in)\V (x, ) sition 2.the We show thatx,q(x, δmax ) can beZ both upper and ⌘ 0 max ⇣ 2value )\V (x, ), and v fallsonly on the segment of E(u, ) p. We get Z δmax of the integrals depends on and not p max (x)2both C1 ( ) + p max (x)2 C2 ( ) d lower bounded so that these upper and<lower bounds 2 δ have thus 0 (C1 (δ) + C2 (δ)) dδ = p max (x) 2 )p(x) as δmax → 0 for someZ C(δ ). ! 0 max to2C(δ max max ZPr[V ] p go (x) C2 ( max ). 2 max δ 2 (C1δ(].) + C2 ( )) d We proceed to derive upper bounds for=Pr[Uδ ]p and(x) Pr[V = p max (x) C(δmax ). p(v) dv du. (u) results to obtain hese a simple upper bound for q(x, max ): 0 E(u, )\V (x,Let ) Z max V (y, δ) denote the area of a disc centered at point y with Z max C(δmax ) that appears above is the one we use in the = p max (x)2 circular (C1 ( ) + C2 ( ))The d = Pr[V (Pr[U ]+ Pr[V p(u) ])δ.d Also, radius letboth E(y, δ) denote the 1-dimensional und ] by replacing and p(v) with 0 0 claim (Eq. 14) of the proposition. Now, we can replace ) 2 and d(x, v) . We obtain Z max ⇣ subspace centered at ⌘point y with radius= pδ.max That is, the set (x)2 C( δmax ). 2 2 ! max q(x, δ ) with this upper bound in Equation 14, and obmax Z Zp max (x) < C ( ) + p (x) C ( ) d max 1 2 E(y, δ) is the “edge” of the set V (y, δ). Next, let p (x) 0 1 dv du. serve that Z max denote the maximum density inside the set V (x, 2δ ), that max R(x, ) E(u, )\V 2 (x, ) = p max (x)is, (C1 ( ) + C2 ( )) d pδmax (x)2 C(δmax ) 0 e integrals only depends on and not p. We get δmax Z max lim = lim pδmax (x)2 = p(x)2 . p (x) = max p(y). (15) δmax →0 δmax →0 C(δmax ) = p maxmax (x)2 2 (C1 ( ) + C2 ( )) d y∈V (x,2δmax ) ]p (x) 0C2 ( ). Observe that the probability of event Uδ is the total density max Here the second equality simply follows from the definition (x)2aC( max ).upper bound for q(x, max ): s=top obtain simple inside V (x, δ) (so that the point u is close enough to x) times of pδmax (x) as given in Equation 15 above. As δmax goes the probability that v is drawn from the circle around u with Pr[U ] + Pr[V ]) d to zero, the area from which we take pδmax (x) shrinks until p max max 2 2 Z radius δ, i.e., the set E(u, δ) (so that we have d(u, v) = δ as is, ! Z Z ) + C2 ( )) dPr[Uδ ] = p(u) p(v) dv du. only x is left. The upper bound of q(x, δmax )/C(δmax ) thus approaches p(x)2 from above as δmax → 0. We can do the same reasoning in terms of a lower bound of q(x, δmax ), by replacing max with a min in Equation 15, i.e., ⌘ (x)2 C1 ( ) + required). p max (x)2 C2 (That ) d (x)2 (C1 ( V (x,δ) max E(u,δ) pδmax (x) = (C1 ( ) + C2 ( )) d 0 C( max ). Also see left side of Figure 5. Since δ ≤ δmax and d(x, u) ≤ δ, we have d(x, v) ≤ 2δ, and we can thus upper bound Pr[Uδ ] by replacing both p(u) and p(v) with pδmax (x). This yields ! Z Z δmax 2 Pr[Uδ ] ≤ p (x) 1 dv du. V (x,δ) p(y). This lower bound will approach p(x)2 from below as δmax → 0 in the same manner. Finally, notice that the argument used in this proof, while described for k = 2, generalises to arbitrary k. 
We only redefine the set V (y, δ) as the volume of a k-dimensional sphere with radius δ, and E(y, δ) as the k − 1 dimensional surface of V (y, δ). It follows that we can always derive a C(δmax ) that is independent of p(x) which is enough to complete the proof. E(u,δ) The value of the integrals no longer depends on the density p, only on the length δ of the pair (u, v). We can thus replace those with a function of δ and obtain Pr[Uδ ] ≤ pδmax (x)2 C1 (δ). Next, consider the event Vδ . It happens when u falls in the “ring” around x, defined as the set R(x, δ) = V (x, 2δ) \ V (x, δ), and v falls on the segment of E(u, δ) that intersects with V (x, δ). We have thus ! Z Z Pr[Vδ ] = p(u) p(v) dv du. R(x,δ) min y∈V (x,2δmax ) E(u,δ)∩V (x,δ) 197
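As a concrete sanity check (our addition, not part of the original proof), the constant $C_1(\delta)$ can be evaluated in closed form for k = 2: the inner integral runs over a circle of circumference $2\pi\delta$ and the outer one over a disc of area $\pi\delta^2$, so

```latex
\Pr[U_\delta] \;\le\; p^{\delta_{\max}}(x)^2
  \int_{V(x,\delta)} \Big( \int_{E(u,\delta)} 1 \, dv \Big) du
  \;=\; p^{\delta_{\max}}(x)^2 \cdot \pi\delta^2 \cdot 2\pi\delta,
\qquad\text{i.e.}\qquad C_1(\delta) = 2\pi^2\delta^3 .
```

Both $C_1(\delta)$ and $C_2(\delta)$ are then of order $\delta^3$, so the resulting $C(\delta_{\max})$ is of order $\delta_{\max}^4$ and, as the proof requires, independent of p(x).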