Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence Propagating Ranking Functions on a Graph: Algorithms and Applications Buyue Qian Xiang Wang Ian Davidson IBM T. J. Watson Research Yorktown Heights, NY 10598 bqian@us.ibm.com IBM T. J. Watson Research Yorktown Heights, NY 10598 wangxi@us.ibm.com University of California Davis, Davis, CA 95616 davidson@cs.ucdavis.edu Abstract methods to estimate the unknown or improve the weak ranking functions by those strong ones. The purpose of this work is to propagate linear ranking functions over the nodes on a graph, where each node corresponds to a retrieval task/user. Problem Setting. There are three key concepts in this work, (i) linear ranking function, (ii) ordinal distribution features, and (iii) ranking function propagation. We shall now explain each briefly. A linear ranking function is a realvalued vector, whose inner product with a data point is usually called a ranking score. In a retrieval task using a linear ranking function, we first calculate ranking scores, the inner product between the ranking function vector and all data points, and then sort the data points based on the ranking scores. Data points with higher ranking scores are more relevant to the retrieval topic. Ordinal distribution features are a powerful way of representing complex objects by the counts of features that are ordinal, that is the features represent an ordered set of values such as small, medium, large. Consider our experiments on creating a ranking function for each type of bird based on their song. The songs are represented using spectrograms which contains the counts of each audio frequency and analogous to histograms. The spectrograms capture not only the counts of each frequency but also that adjacent (in the spectrogram) frequencies are similar. Propagation of ranking functions implies that in our study a graph can be comprised of strong, weak, and unknown ranking functions, and we allow the strong ranking functions being propagated to the weak or unknown functions. In particular we can calculate the ranking functions after an infinite number of propagation steps on the network. Proposal. To explore the graph propagation of linear ranking functions on distribution data, there are three main challenges that we address. (i) Since our ranking function is of ordinal (order sensitive features) regular label propagation methods will fare poorly with these ordinal distribution features. We adopt the Hellinger distance metric to calculate the similarity between two linear ranking functions (distributions). (ii) Propagating distributed based ranking functions requires a specialized objective function. We generalize the objective functions of several classic diffusion methods whose objectives have shown to be useful in solving real problems. (iii) The construction of a similarity graph to propagate the ranking functions will vary between applications. For example, in a personalized movie ranking Learning to rank is an emerging learning task that opens up a diverse set of applications. However, most existing work focuses on learning a single ranking function whilst in many real world applications, there can be many ranking functions to fulfill various retrieval tasks on the same data set. How to train many ranking functions is challenging due to the limited availability of training data which is further compounded when plentiful training data is available for a small subset of the ranking functions. This is particularly true in settings, such as personalized ranking/retrieval, where each person requires a unique ranking function according to their preference, but only the functions of the persons who provide sufficient ratings (of objects, such as movies and music) can be well trained. To address this, we propose to construct a graph where each node corresponds to a retrieval task, and then propagate ranking functions on the graph. We illustrate the usefulness of the idea of propagating ranking functions and our method by exploring two real world applications. Introduction Any system that presents ordered results to users is performing ranking (Burges et al. 2005). This has led to extensive interest in applying machine learning techniques to learn ranking functions. Such techniques are called learning to rank, whose goal is to automatically build a retrieval function from training data, and has been successfully applied to various information retrieval problems. In many real world applications, we often require many ranking functions to fulfill various retrieval tasks on a single dataset. It is often the case that the training data is limited and imbalanced hence, only a few ranking functions can be trained adequately (deemed as strong), while the majority ranking functions are completely unknown or learnt with little training data (deemed as weak). Consider a personalized movie ranking problem, each person has different taste of movies and thus requires a unique ranking function. However, for a particular user, a strong ranking function can be learned only if that user provides sufficient movie ratings, which is often not the case in practice. In this study we consider using graph diffusion c 2015, Association for the Advancement of Artificial Copyright Intelligence (www.aaai.org). All rights reserved. 1833 of features matters to learning algorithms. The latter type of ordinal distribution features is commonly used in learning to rank problems and also covers a large portion of data representations, such as the color histogram used in computer vision, the spectrogram used in natural language processing, and various transformations used in signal processing. While on the data of independent features, ranking functions can be readily propagated using typical label propagation methods, there is no available method, to our knowledge, to perform ranking function propagation on distribution data. Existing graph diffusion methods perform propagation in a bin-by-bin fashion and ignore the interactions between adjacent histogram bins. We shall in the following example show that sometimes shifting and preserving shape are preferred. problem, the graph can be the user co-rating matrix or constructed using side information such as user profiles. Though the goals look similar, our work differs from multi-task learning (MTL) in two aspects. (i) MTL usually assumes just a few tasks whilst our method performs propagation on a large set of ranking functions (2000+ in one of our experiments). (ii) While MTL tries to learn a set of cooperative tasks together, our method views the learning of the initial set of ranking functions as a preprocessing step, and only focuses on the inference of unknown ranking functions. Contribution and Claims. Our work makes the following three technical contributions. (i) We explore a new method to efficiently learn thousands of ranking functions from insufficient and imbalanced training data. (ii) We propose a generic diffusion method to propagate linear ranking functions modeled as distributions on a graph. Our approach allows for interactions between adjacent histogram bins making propagation more useful. (iii) The proposed method can be naturally deployed to network settings, such as personalized ranking and social network advertising. (a) Related Work Our work relates to two topics: label propagation and learning to rank. We briefly review related work in both areas. Graph-based label propagation mainly aims to address the issue of transduction. Blum and Chawla have proposed label propagation methods on graphs as a mincut problem. The Gaussian fields and harmonic function (GFHF) method in (2003) is a continuous relaxation to the difficult discrete Markov random fields. The local and global consistency (LGC) method (Zhou et al. 2003) extends GFHF with the normalized Laplacian. Another manifold regularization framework (2006) employs two regularization terms, i.e., a base kernel and a L2 regularizer. Lawrence and Jordan propose to learn with unlabeled data in the context of Gaussian process. Comprehensive surveys on semi-supervised learning can be found in (Zhu 2005) and (Chapelle et al. 2006). Learning to rank algorithms fall into three categories which differ in the form of training data. (i) Pointwise methods approximate a ranking problem as ordinal regression with large margin formulations (Shashua and Levin 2002) (Crammer and Singer 2001), and subset ranking formulations(Cossock and Zhang 2006). (ii) Pairwise methods learn from pairs of instances ordered by their relevance to the ranking problem (iii) Listwise use a fully ordered rank list as an instance, e.g., ListNet (Cao et al. 2007), AdaRank (Xu and Li 2007) and SVM Map (Yue et al. 2007). Learning to rank approaches have shown to be useful in many information retrieval applications. RankSVM was applied Joachims to document retrieval, RankNet (Burges et al. 2005) has shown to be useful on large scale web search. A brief survey of learning to rank approaches exists (Hang 2011). A graph with each node correspondes to a distribution/ranking 0.6 0.6 0.6 0.5 0.5 0.5 0.4 0.4 0.4 0.3 0.3 0.3 0.2 0.2 0.2 0.1 0.1 0 0.1 0 1 2 3 4 0 5 (b) Distirbution 1 (three bars on the left) 1 2 3 4 5 (c) Ideal propagation (three bars in the mid) 1 0.4 0.4 0.4 0.35 0.35 0.3 0.3 0.3 0.25 0.25 0.25 0.2 0.2 0.2 0.15 0.15 0.15 0.1 0.1 0.1 0.05 0.05 0.05 0 1 (e) 2 3 4 1 (f) 4 5 0 5 GFHF propagation 3 Distribution 2 (three bars on the right) 0.35 0 2 (d) 2 3 4 5 LGC propagation 1 (g) 2 3 4 5 Our propagation 0.3 0.25 0.2 0.15 0.1 0.05 0 1 2 3 4 5 (h) Our propagation minus a constant Figure 1: Distribution propagation using different methods. Note that Figure 1(h) is equivalent to Figure 1(g), since each distribution here is viewed as a linear ranking function To illustrate the importance of distribution propagation, let us consider a toy graph with only three nodes. Figure 1(a) presents a simple graph with three nodes, each of which corresponds to a distribution (linear ranking function). We see that in the given Distribution 1 (Figure 1(b)) there are three bars on the left side of the 5-bin histogram with the middle bar higher than the other two, and the Distribution 2 (Figure 1(d)) follows the same shape as Distribution 1 but locates on the right side of the 5-bin histogram. It is clear that shifting Distribution 1 to the right by two bins is equivalent to Distribution 2, and vice versa. Therefore, we can infer the ideal propagation on distribution data as in Figure 1(c). We next infer the missing distribution (in the middle) using different graph propagation methods. Figure 1(e) and 1(f) show the inferred distributions using two classic graph The Importance of Propagating Distributions Feature representations of data objects fall into two categories. (i) Independent features, where the ordering of features is permutable without affecting the learning result. (ii) Ordinal distribution/histogram features, where the ordering 1834 Sij in the matrix S indicates the similarity between the retrieval tasks ti and tj , Sij = 1 if ti and tj are exactly the same and Sij = 0 if ti and tj are completely irrelevant, and we have Sij = Sji . Let D denote the degree Pm matrix of S, which is a diagonal matrix and Dii = j=1 Sij . Before propagating ranking functions on graph, we are given a initial set (possibly incomplete) of ranking functions obtained from some learning to rank algorithm, which is denoted using Y , Y ∈ Rm×d , and Yi· (i-th row of Y ) is the given ranking function of retrieval task ti . We set Yi· to a vector comprised of zeros if the ranking function for ti is missing. The target value in our formulation is the propagation result of ranking functions, which is denoted by W , W ∈ Rm×d , and Wi· (i-th row of W ) is the resulting (propagated) ranking function for retrieval task ti . We assume the initial set of ranking functions Y consists of a small portion of strong (sufficiently trained) ranking functions and a large portion of weak/unknown (insufficiently/not trained) ranking functions. Our goal is to fill in the unknown or improve the weak ranking functions using the strong functions based on the similarity amongst retrieval tasks, i.e., to propagate the ranking functions in Y via the similarity matrix S of the graph. We follow the classic consistency assumption of graph-based semi-supervised learning, nearby ranking functions are likely to be similar, and generalize the objective functions from previous studies. Formally, the generalized objective is written as follows. diffusion methods, GFHF and LGC, respectively. We observe that traditional methods do not work well on distribution data, since their result is essentially a weighted average of the given distributions. Figure 1(g) shows the result of our method, which captured the key structure of the two given distributions (higher in the middle and lower on both sides). Since our study concerns ranking function propagation, each distribution here can be viewed as a linear ranking function. As adding a constant to a linear ranking function on distribution data (nonnegative and sum-to-one) does not change the ranking order, we can remove the redundant constant and have an equivalent ranking function as shown in Figure 1(h). The result is very close to the ideal propagation. The Method Preliminaries Formally, the problem to address is described as follows. In a retrieval problem, we are given a dataset containing n instances X = {x1 , x2 , · · · , xn }, which are represented as distributions in Rd space. Since ranking functions of non-distribution data can be readily handled by traditional graph propagation methods, this paper only concerns ordinal distribution based feature representations, i.e., we assume the features of data are (i) nonnegative, (ii) sumto-one, and (iii) the order of features is not permutable. Such representations have been shown to be useful for ranking problems of images (Datta et al. 2008). On the set of instances X , there are m ranking tasks/categories T = {t1 , t2 , · · · , tm }, which correspond to a set of ranking functions W = {w1 , w2 , · · · , wm } respectively. Each ranking function is defined with respect to an external ordering (retrieval task) that we are trying to learn from the data. Since in our approach we adopt a linear ranking function, i.e. wi is a real-valued vector with the same dimension as xi , the ranking score of an instance xi w.r.t. a retrieval tasks tj is simply calculated from the inner product between xi and wj . γij = wjT xi arg min W m X i,j=1 Sij d2 (Wi· , Wj· ) + m X µi d2 (Wi· , Yi· ) (2) i=1 where d() denotes some distance metric, and d2 () simply denotes the square of a distance. The first term measures the smoothness of graph, i.e., two similar ranking functions should have small distance. The second term penalizes any changes to the initial ranking functions, i.e., the given strong ranking functions should not be overwritten significantly since these functions are deemed as correct. µ is a tuning parameter that balances the smoothness and penalty. In practice, µ can be a m-length vector, such that we can set a large overwriting penalty to the locations of strong ranking functions, and a zero/small penalty for the overwriting of unknown/weak ranking functions respectively. We shall later in this paper experimentally demonstrate that the performance of our proposed approach is not sensitive to the scale of µ. The generalized objective is closely related to several classic graph-based semi-supervised learning methods. If d(Wi· , Wj· ) = |Wi· − Wj· | the objective reduces to GFHF W (graph Laplacian). If d(Wi· , Wj· ) = | √WDi· − √ j· | the ob- (1) We construct a graph over ranking functions using a measure of similarity that is application dependent and allow the interactions amongst themselves. In particular, we assume that among the m ranking functions, there are only a small portion of ranking functions that are trained with sufficient data (deemed as strong functions), and the majority remaining ones are unknown or learned with little training data (deemed as unknown or weak functions). In this paper we present a general approach to propagate linear ranking functions, and the input of our method is a set of ranking functions. Any learning to rank methods that produce linear ranking functions can be served as a preprocessing component for our approach. ii Djj jective reduces to LGC (normalized graph Laplacian). With the objective function, the problem turns to selecting an appropriate distance metric. Though propagation on nondistribution data can be naturally handled by typical graph diffusion approaches, the propagation of distributions is difficult to solve. Similar problems involving probability interpolation have been encountered in mathematical literatures (McCann 1997; Agueh and Carlier 2011), topic modeling (Cai, Wang, and He 2009), clustering problem (Applegate Objective Function To perform ranking function propagation, we first construct a graph of ranking functions, which is defined by a symmetric similarity/affinity matrix S, S ∈ Rm×m . Note that the graph contains both strong and unknown/weak ranking functions to enable the propagation of functions. An entry 1835 ranking functions to be inferred. To recover W in closed form, we write the objective in Eq.(4) into matrix format. et al. 2011), and also graphics applications (Bonneel et al. 2011). To address this issue, our algorithm was designed with a few criteria in mind. (1) The distance metric is preferred to be within the range from 0 to 1. (2) It allows interactions between adjacent histogram bins. (3) From the distance metric, we should be able to derive a simple or closedform solution for the objective. A comparison of statistical distances found the Hellinger distance metric satisfies all the aforementioned criteria. The Hellinger distance, a type of f divergence, is used to quantify the similarity between two probability distributions. Let a and b denote two distributions, the Hellinger metric in discrete format is defined as s X √ p 2 1 H(a, b) = √ ai − bi (3) 2 i Q(W ) W 1 ∂Q 1 ∂W ◦ 2 1 2 i=1 µi d X √ wij − √ 2 yij 1 1 (6) where I denotes the identity matrix, and ()◦2 denotes the element-wise square of a matrix. The propagation of ranking functions is closely related to random walk on a graph, but here the strong ranking functions are viewed as “absorbing boundary” for the random walk. Our method differs from typical random walk in two main aspects, (i) we penalize the changes on the strong ranking functions, and (ii) our solution is an equilibrium state in terms of hitting time. k=1 + 1 = (D − S)W ◦ 2 + µ(W ◦ 2 − Y ◦ 2 ) = 0 With simple linear algebra we have the global optimal W . !◦2 −1 1 1 (7) (D − S) + I Y ◦2 W = µ m d X 2 √ 1 X √ Sij wik − wjk 2 i,j=1 m X 1T 1 1 tr{ W ◦ 2 (D − S)W ◦ 2 2 1 1 1 1 1 + µ(W ◦ 2 − Y ◦ 2 )T (W ◦ 2 − Y ◦ 2 )} (5) 2 where tr{} denotes the trace operation, and ()◦ 2 the denotes element-wise square root of a matrix. 1 Since W ◦ 2 and W have one-to-one correspondence, the optimization of Q with respect to W is equivalent to min1 1 imizing Q with respect to W ◦ 2 . Since W ◦ 2 is continuous and our objective function is convex, we recover the mini1 mum by simply setting the derivative w.r.t. W ◦ 2 to zero. In addition there are other benefits brought by the Hellinger metric: (4) Unlike other statistical distances, the Hellinger metric is applicable to any distributions without pre-processing corrections, e.g., KL divergence prohibits zeros on the same location of both distribution which is corrected using the Laplace correction, (5) The Hellinger distance has shown to be robust to noise and skew-insensitive to imbalanced data (Cieslak et al. 2012; Goldberg et al. 2009). Substituting the Hellinger metric into the Eq.(2), our objective function is written as arg min = (4) j=1 Empirical Evaluation From the definition of the Hellinger distance metric, we see that it is required that all entries in a linear ranking function are (i) non-negative (wij ≥ 0, ∀i, j) in order to keep the metric valid, and the entries in a function are (ii) sum-to-one Pd ( j=1 wij = 1, ∀i) so as to preserve the nice properties of the Hellinger metric. However, in practice these two requirements rarely hold in a linear ranking function learned by some learn to rank algorithm. To address this issue, we adopt a simple scheme to preprocess linear ranking functions: (1) adding/subtracting a constant to make the entries range from zero to a positive number, and then (2) multiplying/dividing a constants so that the entries sum to one. Since our work only deals with distribution data, the above two manipulations do not affect the ranking orders. Recall that a ranking score is calculated using rij = wjT xi . For adding/subtracting a constant we have rij = (wj + c)T xi = wjT xi + c, since Pd j=1 wij = 1. As for multiplying/dividing a constant, we have rij = cwjT xi . We see that both the shifting and normalization make an uniform change to all ranking scores, therefore, the ranking orders remain exactly the same. We in this section attempt to understand the strengths and relative performance of our distribution based ranking function propagation method, which we refer to in this section as RFP (Ranking Function Propagation). It is important to note that, to the best of our knowledge, our work is the first on ranking function propagation on distribution data. Therefore, we in our experiment can only compare with non-distribution propagation methods. In particular we wish to answer how well our method compares to the following state-of-the-art baseline methods: 1. GFHF (2003): A state-of-the-art label propagation method based on the harmonic function and the graph Laplacian. 2. LGC (2003): A label propagation method based on the graph smoothness and the normalized graph Laplacian. The result shows that our propagation method significantly outperforms the two baseline methods. Given this, the next question naturally raised would be “How good are the inferred ranking functions comparing to the ranking functions that are trained with sufficient data?”. To investigate this we explore the following two extreme scenarios: 3. Lower Bound: performance of the initial ranking functions without propagation. For unknown ranking functions, we simply use a random vector. An inferred ranking function should be at least better than a random guess. Closed-form Solution In the proposed learning objective, the cost function involves only one variable W to be optimized, which contains the 1836 4. Upper Bound: performance of all the ranking functions fully trained, which can be deemed as 1−training errors, and is the upper bound for any propagation methods. recordings, such as Brown Creeper, Pacific Wren. Each audio recording is paired with a set of species that are present. These species labels were obtained by listening to the audio and looking at spectrograms. Our goal is to learn 19 ranking functions to retrieve the 19 species of bird. We first construct a graph of the 19 nodes (species). The affinity matrix S is constructed by calculating the Hellinger similarity (i.e., one minus the Hellinger distance) amongst the prototypes (averaged audio spectrogram of a species) of the 19 bird species. We then propagate ranking functions on the graph. Learning to rank model. We use RankSVM (Chapelle 2007) to provide the initial ranking functions for the propagation methods. The pairwise constraints used for training in RankSVM are generated using labels (in the bird species dataset) or the ratings (in the movie poster dataset). The parameters in RankSVM are set as follows: the penalty constant C is set to 1, and the two options for Newton’s method, i.e., the maximum number of linear conjugate gradients is 20, the stopping criterion for conjugate gradients is 10−3 . Parameters. GFHF is nonparametric. For LGC and RFP, there is a parameter µ that balances the graph smoothness and overwriting penalty. In LGC, µ is set using cross validation. In RFP, we set the vector µ with the following values, 10 at the locations corresponding to strong (trained with abundant data) ranking functions, 0 at the locations corresponding to unknown (no training data) ranking functions, and 1 at the locations corresponding to weak (trained with little data) functions. We will also experimentally show that our propagation method is not sensitive to the scale of µ. Evaluation measure. In our evaluation, ranking accuracy was assessed using a standard measure - normalized discounted cumulative gain (NDCG). We use binary ratings to calculate NDCG, i.e., rating is 1 if relevant, and 0 otherwise. To further investigate the properties of inferred ranking functions, since a linear ranking function can be viewed as a vector in the space, we calculate the cosine similarity between an inferred ranking function and the corresponding ranking functions that is fully trained using all data (upper bound). (a) Average NDCG at top 20 (c) Experiment 1. Acoustic Retrieval of Bird Species (b) Cosine similarity between inferred ranking functions and the fully trained (deemed as the ground truth) functions Average NDCG w.r.t. different µ Figure 3: Evaluation on birds dataset (standard deviation is denoted by shade). Result and discussion. In each trial we randomly select 5 species of birds, and train a strong ranking function for each of them using adequate training data. The 5 strong ranking functions are used as the initial set of ranking functions, and then additional 150 (randomly selected) labeled audio recordings are gradually added to the training set, which are used to build new ranking functions, or improve the existing weak ones. At each step, we infer the unknown or weak ranking functions using the five graph propagation methods. The experiment is repeated for 100 times. The mean (averaged over both the 19 retrieval tasks and 100 random trials) NDCG scores (at rank 20) are reported in Figure 3(a). We see that both the GFHF and LGC achieve significantly higher accuracy than Lower bound. This confirms the motivation and usefulness of ranking function propagation since it improves the retrieval performance with no additional training data. It can be observed that our distribution propagation method significantly outperforms the competing techniques, and approaches the Upper bound as the training set increases. Figure 3(b) shows the cosine similarity between the inferred ranking functions and the ranking functions that are fully trained with all data (deemed as the ground truth). The result also verifies the superior performance of our method. This demonstrates the effective- Figure 2: Song meter data collection locations in the H. J. Andrews Forest. Datasets and Experimental Settings. The birds dataset (Briggs et al. 2013) consists of 645 ten-second audio recordings collected from the Cascade mountain range of Oregon (as shown in Figure 2). The raw audio signal is converted into a spectrogram (an image representation of the sound), by dividing it into frames, and applying the FFT to each frame. To extract features, the spectrogram is divided into 24 frequency bands, and we then summarize each band with statistics, including the mean, variance, skewness, kurtosis, min, max, and median. Finally, each audio recoding is represented using a 168-dimensional statistical spectrogram descriptor (SSD), which can be viewed as distribution data. There are 19 species of bird presented in the audio 1837 ness and necessity of the proposed graph diffusion method since the propagation of distributions differs from traditional propagation methods that were derived to propagate vectors of independent labels. By observing the performance comparison of our method w.r.t different values of µ shown in Figure 3(c), we can conclude that our method is not sensitive to the scale of parameter µ, since the ranking accuracy is about the same even when µ is significantly different (setting Our-mu=1 denotes µ = 1 for strong ranking functions and µ = 0.1 for weak ones, Our-mu=10 is ten times bigger). each step, we perform ranking functions propagation on the user graph, and the experiment is repeated for 50 times. Experiment 2. Personalized Movie Ranking (a) Average NDCG at top 200 (b) Cosine similarity between inferred ranking functions and the fully trained (deemed as the ground truth) functions Figure 4: Movie posters crawled from the web Datasets and Experimental Settings. The dataset used in this experiment is a collection of movie posters, which are crawled from the web using the links provided in HetRec2011-MovieLens-2K dataset (Cantador, Brusilovsky, and Kuflik 2011). Note that the goal of this experiment is not to evaluate our method as a recommender system, rather just to show the capability of our method in content-based personalized ranking. The HetRec-2K dataset contains 855,598 personal ratings on 10,197 movies from 2,113 users. On average, there are 404.921 ratings per user, and 84.637 ratings per movie. We explore an interesting scenario of personalized movie ranking – rank movies for each user based on the movie posters. This is inspired by a recent online article1 , which suggests that movie posters, as people’s first impression of a movie, are closely correlated to people’s taste of movies. Using the picture URLs provided in the dataset, we collected 9,893 movie posters from IMDb website (we will make the movie poster dataset publicly available soon). We calculate the color histogram (4 × 4 × 4 bins), such that each movie poster is represented by a 64-dimensional vector. According to the definition of color histogram, a representation of the distribution of colors in an image, the data can be viewed as distribution data. The 9,893 movie posters we crawled are associated with 835,189 ratings from 2,112 users. Our goal is to build a distinct ranking function for each user. The user affinity matrix S is constructed by simply normalizing the co-rating matrix of users, which consists of the counts that both users rated the same movies. In our evaluation, the user ratings (ranged from 0.5 to 5) are used to calculate the NDCG scores. (c) The mean NDCG scores at rank 200, averaged over the 2,112 users and 50 random trials, are reported in Figure 5(a). It can be observed that the three propagation methods (i.e., RFP, GFHF, and LGC) significantly outperform the Lower bound baseline. This confirms our motivation of ranking function propagation, as it does increase the ranking accuracy without extra training data. Among the three propagation methods, we see that our distribution propagation achieves the highest ranking accuracy. The cosine similarity between the inferred ranking functions and the ground truth is presented in Figure 5(b), which shows that the ranking functions learned from our method are closer to the ground truth than the two competitors. It implies that distribution propagation differs from typical label propagation, which in turn demonstrates the necessity of our method when counts or frequencies are used as features. Figure 5(c) shows that the ranking performance only changes slightly though µ differs significantly. This verifies one of our claims that the proposed method is not sensitive to the scale of parameter µ. Conclusion In this paper we describe a method to propagate ranking functions on a graph and present an algorithm to perform distribution propagation. The method is used to address the issue caused by limited and imbalanced training data to propagate thousands of ranking functions on the same set of data. The proposed propagation approach is centered around distribution data and allows for the interactions between neighboring histogram bins. We derive a closed-form solution for the optimization, which makes our algorithm easy to implement. The experimental result on real world problems demonstrates the usefulness of our propagation method. Result and discussion. We select 100 users in each random trial, and train a strong ranking function for each of them using all available ratings. These strong functions are used as the initial ranking functions to propagate, and then additional 10,000 randomly selected user ratings are added to the training data step by step, which are used either to train unknown ranking functions, or to improve the weak ones. At 1 Average NDCG w.r.t. different µ Figure 5: Evaluation on movie posters (standard deviation is denoted by shade). http://www.boredpanda.com/movie-poster-cliches/ 1838 Acknowledgments Cossock, D., and Zhang, T. 2006. Subset ranking using regression. In COLT, 605–619. Crammer, K., and Singer, Y. 2001. Pranking with ranking. In NIPS, 641–647. Datta, R.; Joshi, D.; Li, J.; and Wang, J. Z. 2008. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv. 40(2):5:1–5:60. Goldberg, A. B.; Zhu, X.; Singh, A.; Xu, Z.; and Nowak, R. 2009. Multi-manifold semi-supervised learning. In AISTATS, 169–176. Hang, L. 2011. A short introduction to learning to rank. IEICE TRANSACTIONS on Information and Systems 94(10):1854–1862. Joachims, T. 2002. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’02, 133–142. Lawrence, N. D., and Jordan, M. I. 2005. Semi-supervised learning via gaussian processes. In Saul, L. K.; Weiss, Y.; and Bottou, L., eds., Advances in Neural Information Processing Systems 17. Cambridge, MA: MIT Press. 753–760. McCann, R. J. 1997. A convexity principle for interacting gases. advances in mathematics 128(1):153–179. Shashua, A., and Levin, A. 2002. Ranking with large margin principle: Two approaches. In NIPS, 937–944. Xu, J., and Li, H. 2007. Adarank: a boosting algorithm for information retrieval. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 391–398. Yue, Y.; Finley, T.; Radlinski, F.; and Joachims, T. 2007. A support vector method for optimizing average precision. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 271–278. Zhou, D.; Bousquet, O.; Lal, T. N.; Weston, J.; and Schölkopf, B. 2003. Learning with local and global consistency. In NIPS. Zhu, X.; Ghahramani, Z.; and Lafferty, J. D. 2003. Semisupervised learning using gaussian fields and harmonic functions. In ICML, 912–919. Zhu, X. 2005. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison. The authors gratefully acknowledge support of this research from ONR grants N00014-09-1-0712, N00014-11-1-0108 and NSF Grant NSF IIS-0801528. References Agueh, M., and Carlier, G. 2011. Barycenters in the wasserstein space. SIAM Journal on Mathematical Analysis 43(2):904–924. Applegate, D.; Dasu, T.; Krishnan, S.; and Urbanek, S. 2011. Unsupervised clustering of multidimensional distributions using earth mover distance. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, 636–644. Belkin, M.; Niyogi, P.; and Sindhwani, V. 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7:2399–2434. Blum, A., and Chawla, S. 2001. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, 19–26. Bonneel, N.; van de Panne, M.; Paris, S.; and Heidrich, W. 2011. Displacement interpolation using lagrangian mass transport. ACM Trans. Graph. 30(6):158:1–158:12. Briggs, F.; Huang, Y.; Raich, R.; Eftaxias, K.; Lei, Z.; Cukierski, W.; Hadley, S. F.; Hadley, A.; Betts, M.; Fern, X. Z.; et al. 2013. The 9th annual mlsp competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment. In Machine Learning for Signal Processing (MLSP), 2013 IEEE International Workshop on, 1–8. IEEE. Burges, C. J. C.; Shaked, T.; Renshaw, E.; Lazier, A.; Deeds, M.; Hamilton, N.; and Hullender, G. N. 2005. Learning to rank using gradient descent. In ICML, 89–96. Cai, D.; Wang, X.; and He, X. 2009. Probabilistic dyadic data analysis with local and global consistency. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, 105–112. Cantador, I.; Brusilovsky, P.; and Kuflik, T. 2011. 2nd workshop on information heterogeneity and fusion in recommender systems (hetrec 2011). In Proceedings of the 5th ACM conference on Recommender systems, RecSys 2011. Cao, Z.; Qin, T.; Liu, T.-Y.; Tsai, M.-F.; and Li, H. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, ICML ’07, 129–136. Chapelle, O.; Schölkopf, B.; Zien, A.; et al. 2006. Semisupervised learning, volume 2. MIT press Cambridge. Chapelle, O. 2007. Training a support vector machine in the primal. Neural Computation 19(5):1155–1178. Cieslak, D. A.; Hoens, T. R.; Chawla, N. V.; and Kegelmeyer, W. P. 2012. Hellinger distance decision trees are robust and skew-insensitive. Data Min. Knowl. Discov. 24(1):136–158. 1839