Ranked Personalized Recommendations Using Discrete Choice Models

by

Ammar Ammar

B.Sc., Massachusetts Institute of Technology (2009)
M.Eng., Massachusetts Institute of Technology (2010)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology, September 2015.

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, August 28, 2015
Certified by: Devavrat Shah, Associate Professor, Thesis Supervisor
Accepted by: Professor Leslie A. Kolodziejski, Chair, Department Committee on Graduate Theses

Abstract

Personalized recommendation modules have become an integral part of most consumer information systems. Whether you are looking for a movie to watch, a restaurant to dine at, or a news article to read, the number of available options has grown explosively. Furthermore, the commensurate growth in data collection and processing has created a unique opportunity: the successful identification of a relevant or desired item in a timely and efficient manner can have serious ramifications for the underlying business in terms of consumer satisfaction, operational efficiency, or both. Taken together, these developments create a need for a principled, scalable, and efficient approach for distilling the available consumer data into compact and accurate representations that can be utilized for making inferences about future behavior and preferences.

In this work, we address the problem of providing such recommendations using ranked data, both as system input and output. In particular, we consider two concrete, and interrelated, scenarios that capture a large number of applications in a variety of domains. In the first scenario, we consider a setup where the desired goal is to identify a single global ranking, as we would in a tournament. This setup is analogous to the problem of rank aggregation, historically studied in political science and economics, and more recently in computer science and operations research. In the second scenario, we extend the setup to include multiple 'prominent' rankings. Taken together, these rankings reflect the intrinsic heterogeneity of the population, where each ranking can be viewed as a profile for a subset of said population. In both scenarios, the goal is to (i) devise a model to explain and compress the data, (ii) provide efficient algorithms to identify the relevant ranking for a given user, and (iii) provide a theoretical characterization of the difficulty of this task, together with conditions under which this difficulty can be avoided.

To that end, and drawing on ideas from econometrics and computer science, we propose a model for the single ranking problem where the data is assumed to be generated from a Multinomial Logit (MNL) model, a parametric probability distribution over permutations used in applications ranging from the ranking of players on online gaming platforms to the pricing of airline tickets.
We then devise a simple algorithm for learning the underlying ranking directly from data, and show that this algorithm is consistent for a large subset of the so-called Random Utility Models (RUMs). Building on the insight from the single ranking case, we handle the multiple ranking scenario using a mixture of Multinomial Logit models. We then provide a theoretical illustration of the difficulty of learning models from this class, which is not surprising given the richness of the model class and the notorious difficulties inherent in dealing with ranked data. Finally, we devise a simple algorithm for estimating the model under plausible, realistic conditions, together with theoretical guarantees on its performance and an experimental evaluation.

Thesis Supervisor: Devavrat Shah
Title: Associate Professor

Acknowledgments

This thesis is the outcome of my doctoral research at the Laboratory for Information and Decision Systems (LIDS) under the superb supervision of my advisor, Dr. Devavrat Shah. I had the fortune of meeting Devavrat during an introductory probability class that he was teaching during my freshman year at MIT, and his charm, energy, and enthusiasm got me instantly interested in the use of probability in modeling and algorithm design. His patience and guidance over the past few years have been instrumental to the completion of this thesis.

I am also grateful to Dr. Sanjoy Mitter and Dr. Munther Dahleh for their support as part of my thesis committee. I am indebted to them for their generous feedback, continuous encouragement, and invaluable advice during my time here at MIT.

I am also greatly appreciative of all my teachers and mentors at MIT and elsewhere. Special thanks to Dr. Boris Katz, Dr. Franz-Josef Ulm, Dr. Nasser Rabbat, Dr. Leila Farsakh, Dr. Nancy Murray, Hubert Murray (FAIA, RIBA), Christine Lane, and Dr. George Verghese. Special thanks also go to Lynne Dell, Jennifer Donovan, Brian Jones, Debbie Wright, Petra Aliberti, Alina Man, and the rest of the administrative team at LIDS, past and present, for their kind help and support.

I am also lucky and grateful to have made a number of wonderful and kind friends during my time here. I am especially grateful to my officemates, colleagues, and friends Srikanth Jagabathula, Tauhid Zaman, Yuan Zhong, Seewong Oh, Sahand Negahban, Guy Bresler, Luis Voloch, Christina Lee, George Chen, Jehangir Amjad, Hajir Roozbehani, Kimon Drakopoulos, Yola Katsargyri, Ali Faghih, and Noele Norris for numerous fun and inspiring conversations.

Last, but definitely not least, I am lucky and extremely grateful to have my family. I would not be here, or anywhere, without their infinite love and unwavering support.

Contents

1 Introduction
  1.1 Problem Statement
      1.1.1 A Systematic Resolution
      1.1.2 Mathematical Model
  1.2 Contributions
  1.3 Related Work

2 The Single Ranking Problem
  2.1 Model and Problem Statement
  2.2 Main Results
      2.2.1 Aggregate Ranking
      2.2.2 The Mode
      2.2.3 Top-K Ranking
  2.3 Learning the Max-Ent Model
  2.4 Evaluation and Scalability
  2.5 Proofs
      2.5.1 Proof of Lemma 1
      2.5.2 Proof of Theorem 1
      2.5.3 Proof of Theorem 2
      2.5.4 Proof of Theorem 3
      2.5.5 Proof of Theorem 4: subgradient algorithm

3 The Multiple Ranking Problem
  3.1 Difficulty of Learning the MMNL Model: a Lower-bound
  3.2 An Algorithm
      3.2.1 Preprocessing
      3.2.2 Clustering
      3.2.3 Learning Weights within Clusters
  3.3 Algorithmic Guarantee
      3.3.1 Illustration
  3.4 Proofs
      3.4.1 Proof of Theorem 7
      3.4.2 Proof of Theorem 8
      3.4.3 Proof of Lemma 3
      3.4.4 Proof of Proposition 2
      3.4.5 Proof of Theorem 10
      3.4.6 Proof of Theorem 9

4 Conclusion and Future Work

Chapter 1

Introduction

Personalized rankings of objects, from web pages and movies to consumer products, have become an integral part of most consumer information systems. In the context of web search, for instance, ranking provides a quality filter that prioritizes high-quality pages pertaining to a certain query, making it feasible to retrieve useful information from the overwhelming corpus that is the World Wide Web. In other contexts, such as movie recommendation, the resulting ranking also provides a filter on the relevance of the recommended objects to the consumer. In both types of settings, ranking plays an important role in eliminating inefficiencies arising from the mismatch between what the consumer prefers and what the consumer is being offered.

Generally speaking, the ranking procedure takes information about past consumer 'preferences' and produces a ranking that ideally captures future ones. The input to this procedure typically consists of partial, noisy, or incomplete expressions of consumer preference over the available options, and the purpose of the procedure is to stitch this input into a meaningful ranking as an output. A principled approach towards that end is to assume an underlying model describing the ground truth, and to treat the available observations as incomplete, noisy realizations of this ground truth. From a system designer's perspective, this procedure can be decoupled into two interrelated steps: (a) devise a meaningful model of this underlying truth, and (b) develop a scalable inference algorithm that leverages this model to produce a ranking extracting most of the information available and relevant to the consumer.

Naturally, the quality of this procedure, and its ability to provide meaningful rankings, hinges on three factors. First, since our knowledge about the consumer population comes from the data we observe, this data should be rich enough to accurately capture consumer preference, at both the individual and population levels. Consequently, the devised model should have sufficient resolution to accommodate and leverage the richness of this data.
At the individual level, the model should be robust to noise in the form of varying or mood-dependent behavior. At the population level, the model should be able to capture the diversity, or heterogeneity, inherent in most modern datasets.

In this work, we develop a framework for providing personalized ranked recommendations using ordinal, or comparison, data: a fine-grained representation of consumer preference that occurs naturally in most applications. At the heart of our framework is a high-resolution model that enables us to accommodate noise while capturing and leveraging consumer diversity. As part of this framework, we also provide tractable algorithms for learning the model from data and for providing recommendations, together with an analysis and an evaluation of the framework's performance.

1.1 Problem Statement

For a more concrete definition of the problem, consider a collection of n objects (e.g., movies or restaurants), and a population of N consumers who provide feedback about these objects. Given this, we would like to design a procedure that takes this feedback as its input and produces a 'meaningful' ranking as its output. While the meaningfulness of the resultant ranking is, of course, a matter of application, we can broadly identify two types of ranking problems, exemplified by the following scenarios:

Scenario 1. Given the scores/outcomes of games in a tournament, e.g. the NFL, we would like to rank order sports teams. In such a setting, the eventual goal is to come up with a single ranking, and subsequently to decide which teams go to the next round in a season and, eventually, who wins the tournament (the Super Bowl).

Scenario 2. Given consumer ratings of choices, e.g. movies on Netflix, we would like to learn the different prevalent types of movie watchers, i.e. different prevalent rankings or preferences of movies, and subsequently to use this to provide recommendations of movies to a given consumer based on her/his previously identified type.

While these two scenarios might seem different from a practical point of view, at a high level they share a similar design structure. In both scenarios, we have a partial, or incomplete, expression of the ordering of the available objects or choices of interest, and we are interested in producing a single, global or personalized, ranking. This problem gives rise to a few fundamental questions. First, in what format and at what level of granularity should the data in both of these scenarios be collected? What properties should the resultant ranking(s) have? And how do we compute such ranking(s)?

For the first question, one popular solution relies on representing each piece of consumer feedback as a numerical score on a fixed scale. Typically this score reflects the number of 'stars' the user would give the choice in question (e.g. 4 out of 5 stars). This is the approach followed by a large number of online and offline retailers, such as Amazon and Netflix, among numerous others. Other approaches adopt a special, and rather minimal, case of this by restricting the scale to two options in the form of 'thumbs up' or 'thumbs down'. Despite being simple and intuitive, these representations suffer from a number of shortcomings. At the individual consumer level, for instance, the scale of choice is often arbitrary and does not provide a guarantee against mood-dependent behavior: a consumer might give an item 3 out of 5 stars on a rainy day, but give the same item 4 stars if he/she happens to be in a better mood.
At the population level, it is not clear whether the scale of such ratings is the same for all consumers: consumer A, an avid movie watcher, is less likely to give a movie 5 stars than an average consumer. These objections suggest that much can be gained from adopting a richer, mood-independent representation of consumer preference.

A natural solution to this problem, inspired by the seminal work of Samuelson [40], is to adopt the axiom of revealed preferences. According to this axiom, the innate preference of the consumer is revealed as an ordering of the available choices. For example, the preference of a consumer facing a choice between items A, B, and C is fully captured through the order of preference over these items, mathematically represented as a permutation of the items. Note that the use of permutations simultaneously eliminates the issue of scale while providing a fine-grained representation of what the consumer prefers. On the flip side, assuming that each consumer's permutation is deterministic or fixed hinders our ability to deal with scenarios where the data contains inconsistent consumer behavior (e.g., alternating between preferring A to B and vice versa). We can deal with this issue by bringing randomness into the picture. In this stochastic view, each consumer's behavior is captured by a stochastic model that fully specifies the probability of observing such changes, and the population as a whole can similarly be viewed as a 'larger' probability model. Here, it is interesting to note that the richness of this representation dramatically increases the number of possible observations (n items can be ordered in n! ways), and the very nature of ordinal data can make the problem significantly more complex.

For a simple illustration of this complexity, consider the setting with 3 items and 3 consumers with preferences given in Table 1.1. In this example, commonly known as the voting, or Condorcet, paradox, no single item can be chosen as the clear winner, due to the 'cyclical' nature of the preferences taken together. As it turns out, these types of problems persist if we consider settings with more choices and users, as shown by Arrow in his famous impossibility theorem [5]. In this result, Arrow demonstrated the impossibility of combining individual rankings to obtain a single ranking with certain 'desirable properties'. Putting the question of these properties aside for a moment, it is important to note that this is precisely the problem captured by Scenario 1 outlined above.

                    Consumer 1   Consumer 2   Consumer 3
    1st Preference      A            B            C
    2nd Preference      B            C            A
    3rd Preference      C            A            B

               Table 1.1: The Voting Paradox

So, what properties should the resultant ranking have? In Arrow's setup, this ranking is required to be 'fair' and representative of all preferences. Further, the ordering of any two choices A and B in the resultant ranking should be solely determined by the individual preferences between these two choices, and cannot be affected by preferences pertaining to any other choice C (a property commonly known as Independence of Irrelevant Alternatives, or IIA). Putting the significance and necessity of these assumptions aside for a moment, it is important to note that this is precisely the setup outlined in Scenario 1, and by extension in Scenario 2.

1.1.1 A Systematic Resolution

A systematic way out of this maze is to explain the data using a probability distribution over permutations. By doing this, we absolve ourselves of the need to seek one consistent ranking at the outset.
This kind of decoupling is not new, and has been fruitfully applied in econometric forecasting, most notably in the theory of discrete choice models (cf. [32], [10], and [30]). For the tournament scenario, this distribution over permutations can be viewed as a 'noisy' observation of an unknown underlying permutation. And if the noise model is chosen appropriately, then we would expect the distribution to have one 'prominent' ranking that reflects the result of the tournament. Thus, the question of ranking boils down to finding this prominent ranking, for a given setting of the noise. For the second scenario, the probability distribution should also allow for multiple prominent rankings.

For the main results in this work, we adopt this distribution-over-permutations view using two popular models that satisfy our requirements: the Multinomial Logit (MNL) model and the mixed Multinomial Logit (MMNL) model, both commonly referred to as choice models. In addition to satisfying the requirements, these models allow us to develop algorithms with analyzable performance.

1.1.2 Mathematical Model

Before diving into the details, it is useful to provide a quick mathematical definition of both the MNL and the MMNL models. By doing so, we hope to provide a useful outline as well as a clear idea of the models at the core of our contributions. These models are included here together with a statement of the learning problems central to our framework.

The Multinomial Logit, or MNL, model is a probability distribution over permutations. The probabilities provided by the model are fully specified by a set of $n$ parameters $(w_1, \ldots, w_n) \in \mathbb{R}_+^n$, one for each of the $n$ available items, or choices. For a given set of choices $C$, the probability of choosing an item $i \in C$ is given by the choice probability

$$\Pr_C(i; w) = \frac{w_i}{\sum_{j \in C} w_j}.$$

Sampling a permutation $\sigma$ from the model can be done using the following simple sequential procedure: choose the item $i$ in the first position with probability proportional to $w_i$ (as per the previous definition); then choose an item $j$ for the second position, with probability proportional to $w_j$, from the remaining $n-1$ items; and so on.

Similarly, the Mixed Multinomial Logit, or MMNL, model is a mixture of several MNL models (or mixing components). When the number of components in the mixture is $K$, the probability of a given permutation $\sigma$ takes the form

$$\Pr(\sigma) = \sum_{k=1}^{K} \alpha_k \Pr(\sigma; w^k),$$

where each $\alpha_k \geq 0$ denotes the mixing probability of component $k$ (with $\sum_k \alpha_k = 1$), the weights $w^k \in \mathbb{R}_+^n$ are the parameters of the $k$th component MNL, and $\Pr(\sigma; w^k)$ is the probability of sampling $\sigma$ from the MNL with parameter $w^k$. Sampling a permutation $\sigma$ from the model can be done by first choosing a component $k \in \{1, \ldots, K\}$ with probability $\alpha_k$, then sampling the permutation from the chosen component using the sequential procedure mentioned above.
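To make the two sampling procedures concrete, here is a minimal sketch in Python; the function names and the use of NumPy are our own illustration, not part of the thesis.

```python
import numpy as np

def sample_mnl(w, rng):
    """Sample one permutation from an MNL model with weights w (w[i] > 0).

    Sequential procedure: fill positions 1, 2, ... by picking one of the
    remaining items with probability proportional to its weight.
    Returns sigma with sigma[i] = position of item i (1-indexed).
    """
    n = len(w)
    remaining = list(range(n))
    sigma = np.empty(n, dtype=int)
    for pos in range(1, n + 1):
        probs = w[remaining] / w[remaining].sum()
        pick = rng.choice(len(remaining), p=probs)
        sigma[remaining.pop(pick)] = pos
    return sigma

def sample_mmnl(alphas, W, rng):
    """Sample from an MMNL mixture: pick component k with probability
    alphas[k], then sample from the k-th component MNL (rows of W)."""
    k = rng.choice(len(alphas), p=alphas)
    return sample_mnl(W[k], rng)

rng = np.random.default_rng(0)
print(sample_mnl(np.array([4.0, 2.0, 1.0]), rng))          # single MNL
print(sample_mmnl(np.array([0.7, 0.3]),
                  np.array([[4.0, 2.0, 1.0],
                            [1.0, 2.0, 4.0]]), rng))       # 2-component MMNL
```

Note that the sequential procedure makes the strongest item likely, but not certain, to appear first, which is exactly the noise-tolerance the stochastic view is meant to provide.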
In this work, we consider two (related) learning problems: (a) the problem of learning a single prominent ranking model from pairwise comparison and first-order marginal data, and (b) the problem of learning the MMNL model from ordered tuples of length $\ell$. In the case of comparison data, each data point consists of a comparison between two of the $n$ items. For first-order marginal data, users express the position $k$ in which they believe an item $i$ should fall. Furthermore, both types of data are assumed to come from a single underlying distribution. We refer to this problem as the ranking problem.

In the second learning problem, referred to as the multiple ranking, or recommendation, problem, each of $N$ users expresses his/her preference in the form of tuples of some length $\ell \leq n$, where $\ell$ is not necessarily the same for all users. Further, this data is assumed to come from an underlying MMNL distribution with $K$ components. Here, we would like to learn the number of components, their underlying weights, and the order of the top elements in each of the components. The emphasis on the top elements is simultaneously motivated by practical applications as well as by considerations of tractability in the model, as we show in the corresponding chapter.

1.2 Contributions

The main contribution of this thesis is a framework for end-to-end personalized ranked recommendation. For ease of exposition, we present the problem in two chapters, corresponding to the single ranking and multiple ranking problems. The first deals with the problem of learning a single ranking distribution from data, to ultimately provide this ranking. The second deals with the problem of learning the MMNL distribution, using the solution of the first problem as a sub-module. These chapters provide concise formulations, with algorithms and guarantees for both problems.

Our contribution to the single ranking problem consists of multiple algorithms for using pairwise and first-order marginal data to obtain a single ranking in a variety of ways. Most notably, we provide an algorithm for correctly identifying the prominent ranking of the Thurstone model, and more generally of RUM models, directly from data. We also provide a formulation and a solution of the problem as an entropy maximization problem, together with algorithms and guarantees.

Our contribution to the multiple ranking problem is mainly a contribution to the problem of learning the MMNL model. To that end, we first provide a construction of a mixture choice model that demonstrates the difficulty of the problem. In particular, we identify a fundamental limitation for any algorithm attempting to learn a mixture choice model in general. Specifically, we show that for $K = \Theta(n)$ and $\ell = \Theta(\log n)$, there exists a pair of distinct MMNL models such that the partial preferences generated from both are identical in distribution. That is, even with an infinite number of samples, it is not possible to distinguish said models. This difficulty suggests that one needs to impose further conditions on the model to guarantee learnability from data. Guided by common consumer behavior, we provide sufficient conditions under which this is possible, together with algorithms and error bounds for learning.

1.3 Related Work

Here we provide an overview of some of the work that has been done on ranking and recommendation, respectively. Given the pervasive presence of these problems in a variety of applications, the references included here consist mainly of works from computer science and econometrics that we believe are relevant to the problem of personalized ranked recommendation. This overview is far from complete, and we highly recommend that the reader refer to some of the works mentioned here for further pointers.
In the context of discrete choice, we build on the influential line of research exemplified by the work of McFadden [32], where the data is assumed to come from an underlying family of parametric distributions, such as those proposed by Bradley and Terry [10], Plackett [37], and Luce [30], commonly known as the Multinomial Logit model family (cf. McFadden [32]). In these models, the problem of ranking is equivalent to learning the model parameters; this task can be done using the maximum likelihood criterion, as in McFadden [32], or via an iterative procedure on the data, as in Ammar and Shah [4] and Negahban et al. [35], where the latter uses a Markov chain style iterative algorithm. We also take note of works utilizing the Random Utility Models (RUM) originally proposed by Marschak [31], such as the recent work of Azari et al. [42].

For multiple rankings and recommendation, we note the recently popularized matrix completion methods (cf. [36], [29], and [11]), where the incomplete consumer-movie rating matrix is presumed to be of low rank and its missing entries are filled in by finding the best low-rank factorization of said matrix. In the context of discrete choice, we take note of, and build on, the work pertaining to the MMNL model by Boyd and Mellman [9] and Cardell and Dunbar [12]. We also draw inspiration from the seminal work of McFadden and Train [33], which presents a compelling case for MMNL models by demonstrating the ability of such models to approximate, to any level of precision, any distribution in the RUM family. In [26], the question of learning sparse choice models from exact marginals is introduced, and precise conditions for learnability for all sets of partial preferences are characterized. This is done by connecting learnability to the dimensionality of partial preferences, determined via the spectral representation of the permutation group. We also take note of the recent work of Farias et al. [22][20][21], which assumes that the underlying model is a sparse distribution over permutations and proposes algorithms for fitting the model to data in an optimization framework.

In other contexts, the task of ranking objects or assigning scores has been of great interest over the past decade or so, with similar concerns. There is a long list of works, primarily in the context of bipartite ranking, including RankBoost by Freund et al. [23], label ranking by Dekel et al. [15], Crammer and Singer [14], and Shalev-Shwartz and Singer [41], as well as analytic learning results on bipartite ranking, including those of Agarwal et al. [2], Usunier et al. [44], and Rudin and Schapire [39]. The algorithm closest to our proposal is the p-norm push algorithm by Rudin [38], which uses the $\ell_p$ norm of information to achieve ranking.

The question of learning a single ranking distribution over permutations from partial or limited information has been well studied in the recent literature. Notably, in the work of Huang, Guestrin and Guibas [25], the task of interest is to infer the most likely permutation of identities of objects that are being tracked through noisy sensing, by maintaining a distribution over permutations. To deal with the 'factorial blowup', the authors propose to maintain only the first-order marginal information of the distribution (essentially corresponding to certain Fourier coefficients), then use the Fourier inversion formula to recover the distribution and subsequently predict its mode as the likely assignment.
The algorithmic view on rank aggregation was revived in the work of Dwork et al. [17], who consider the design of approximation algorithms to find an 'optimal' ranking with respect to a specific metric on permutations. Very recently, a high-dimensional statistical inference view for learning a distribution over permutations based on comparison data has been introduced by Mitliagkas et al. [34].

The maximum entropy approach for learning distributions is a classical one, dating back to the work of Boltzmann. The maximum entropy (max-ent) distribution, a member of an appropriate exponential family, is the maximum likelihood estimate of the parameters in that family (cf. [47]). Indeed, the use of exponential family distributions over rankings has been around for more than a few decades now (cf. [16, Chapter 9]). We provide a careful analysis of a stochastic sub-gradient algorithm for learning the parameters of this max-ent distribution. This algorithm is distributed and iterative, and directly builds upon the algorithm used in [28] for distributed wireless scheduling.

Chapter 2

The Single Ranking Problem

For a quick refresher, we start this chapter with a more detailed overview of our contributions to the single ranking problem. The input data for this problem comes in two different flavors: (a) pairwise comparison data (e.g. item i is preferred to item j), and (b) first-order marginal data (e.g. item i is ranked in position k). Given data in either form, we focus our attention on three aggregation problems: (1) finding an aggregate ranking over a collection of items (e.g. Netflix movies), (2) finding the most likely ordering of the items (e.g. object tracking a la [25]), and (3) identifying the top-k items in a collection.

We solve (1) by introducing a general method which gives each item a score that reflects its importance according to the distribution. We then present a specific instance of this method which allows us to compute the desired scores from the data (comparison or first-order marginal) directly, without the need to learn the distribution. More importantly, we show that the ranking induced by this scoring method is equivalent to the ranking obtained from the family of Thurstone models (1927, [43]; also see [16, Ch 9]), a popular family of parametric distributions used in a wide range of applications (e.g. online gaming and airline ticket pricing).

For (2), we use the principle of maximum entropy to derive a concise parameterization of an underlying distribution (most) consistent with the data. Given the form of the max-ent distribution (an exponential family), computing the mode reduces to solving a maximum-weight matching problem on a bipartite graph with weights induced from the parameters. For the case of first-order marginals, this is an easy instance of the network-flow problem (it can be solved, for example, using belief propagation [6]). Furthermore, we propose a heuristic for mode computation that bypasses the step of learning the max-ent parameters and instead uses the available partial preference data directly. Such a heuristic, for example, can speed up the computation of [25] drastically. Somewhat curiously, we show that this heuristic is a first-order approximation of the mode computation for the max-ent distribution. For the pairwise comparison representation, the problem is not known to be solvable in polynomial time; we propose a simple randomized scheme that is a 2-approximation of it.
We solve problem (3) using another distribution-based scoring scheme, where the scores can be computed directly from the data for first-order marginals, or by learning a max-ent distribution in the case of comparisons.

We present a stochastic gradient algorithm for learning the max-ent distribution needed for some of the aforementioned problems. This algorithm is derived from [28]; however, the proof is different (and simpler). It provides an explicit rate of convergence for both data types (comparisons and first-order marginals). In both cases, the algorithm uses an oracle to compute intermediate marginal expectations of the max-ent distribution. We prove that the exact computation of such marginal expectations is #P-hard. Using standard MCMC methods and their known mixing time bounds, our analysis suggests that for a collection of n items, the computation time scales exponentially in n for pairwise comparisons and polynomially in n for first-order marginals. Two remarks are in order: first, the result for first-order marginals also suggests a distributed scheduling algorithm for input-queued switches with polynomial time learning complexity (unlike the exponential complexity for the wireless network model). Second, the standard stochastic approximation based approaches, cf. [8], do not apply as is (due to issues related to compactness of the domain).

2.1 Model and Problem Statement

Model: We consider a universe of $n$ available items, $K = \{1, 2, \ldots, n\}$. Each user has a preference order, represented as a permutation, over these $n$ items. Specifically, if $\sigma$ is the permutation, the user prefers item $i$ over $j$ if $\sigma(i) < \sigma(j)$. We assume that there is a distribution, say $\mu$, over the space of permutations of $n$ items, $S_n$, that defines the collective preferences of the entire user population.

Data: We consider scenarios where we have access to partial or limited information about $\mu$. Specifically, we shall restrict our attention to two popular types of data: first-order rankings and comparisons. Each of these two types corresponds to a marginal distribution of $\mu$, as follows:

First-order marginals: For any $1 \leq i, k \leq n$, the fraction of the population that ranks item $i$ as its $k$th choice is the first-order marginal information for distribution $\mu$. Specifically,

$$m_{ik} = \mathbb{P}_\mu[\sigma(i) = k] = \sum_{\sigma \in S_n} \mu(\sigma)\, \mathbb{1}\{\sigma(i) = k\}, \qquad (2.1)$$

where $\mathbb{1}\{E\}$ denotes the indicator variable for event $E$. Collectively, we have the $n \times n$ matrix $[m_{ik}]$ of first-order marginals, which we shall denote by $M$. This is the type of information that was maintained for tracking agents in the framework introduced by Huang, Guestrin and Guibas [25].

Comparison data: For any $1 \leq i, j \leq n$, the fraction of the population that prefers item $i$ over item $j$ is the comparison marginal information. Specifically,

$$c_{ij} = \mathbb{P}_\mu[\sigma(i) < \sigma(j)] = \sum_{\sigma \in S_n} \mu(\sigma)\, \mathbb{1}\{\sigma(i) < \sigma(j)\}. \qquad (2.2)$$

Collectively, we have access to the $n \times n$ matrix $[c_{ij}]$ of comparison marginals, denoted by $C$. Such data is available through customer transactions in many businesses, cf. [19].

Remarks. First, while we assume $m_{ik}$ (resp. $c_{ij}$) is available for all $i, k$ (resp. $i, j$), if only a subset is available, the algorithm works equally well with that information, with the obvious caveat that the quality of the output depends on the richness of our data. Second, we shall assume that $m_{ik} \in (0,1)$ for all $i, k$ (resp. $c_{ij} \in (0,1)$ for all $i, j$). Finally, in practice one may have a noisy version of the $M$ or $C$ data. However, the procedures we describe are inherently robust with respect to small noise in the data (as they are simple continuous functions of the observed data). Therefore, for the purpose of conceptual development, such an idealized assumption is reasonable.
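As an illustration of how these two data types arise from ranked data, the following sketch (our own construction, assuming fully ranked samples are observed) computes the empirical versions of $M$ and $C$ from a sample of permutations:

```python
import numpy as np

def empirical_marginals(perms):
    """Empirical first-order marginals M and comparison marginals C.

    perms: integer array of shape (N, n), where perms[u, i] is the rank
    (1..n) that user u assigns to item i, so sigma(i) < sigma(j) means
    user u prefers item i over item j.
    """
    N, n = perms.shape
    M = np.zeros((n, n))
    C = np.zeros((n, n))
    for sigma in perms:
        M[np.arange(n), sigma - 1] += 1.0        # item i placed at rank sigma(i)
        C += sigma[:, None] < sigma[None, :]     # i preferred over j
    return M / N, C / N
```

In practice each user would contribute only a few comparisons or a single positional vote, in which case the same counting is applied to whatever entries are observed.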
Goal: Roughly speaking, the goal is to utilize data of type $M$ or $C$ to obtain various useful rankings of the objects of interest. Specifically, we are interested in (a) finding an aggregate or representative ranking over the items in question, (b) finding the 'most likely' (or mode) ranking, and (c) finding a ranking that emphasizes the top $k$ objects. To address these questions, we propose the following approach: (a) assume that the data originates from some underlying distribution over permutations; (b) use the data to answer the question directly, without learning the distribution, whenever possible; (c) otherwise, learn a distribution that is consistent with the data ($M$ or $C$), and use said distribution to answer the question. In principle, there could be multiple, possibly infinitely many, distributions consistent with the observed data ($M$ or $C$, assuming it is generated by a consistent underlying unknown distribution). As mentioned earlier, we shall choose the max-ent distribution that is consistent with the observed data.

Result Outline: Here we provide a somewhat more detailed explanation of the results summarized earlier. Specifically, we solve problems (a), (b), and (c) as follows. To find an aggregate ranking (Section 2.2.1), we assign each item a score derived from the distribution over permutations. We then propose an efficient algorithm to compute said score directly from the data, without learning the distribution. We show that the ranking induced by the computed scores is equivalent to the ranking induced by the parametric family of Thurstone models [43][16, Ch 9] (Section 2.2.1), a popular family of distributions used in applications ranging from online gaming to airline ticket pricing. Effectively, our result implies that if one learns any of the distributions in this family from the data and uses the learned parameters to obtain a ranking, then this ranking is identical to the ranking we obtain directly from the data (i.e. there is no need to learn the distribution!).

As for the mode of the distribution, we assume a maximum entropy underlying model, and derive a concise parameterization of the model using $O(n^2)$ parameters (Section 2.2.2). We also show that finding the mode of the distribution is equivalent to solving an optimization problem over these parameters. In the case of first-order marginals (see [25]), this problem is easy and can be solved using max-weight matching on a bipartite graph. We also provide an efficient heuristic that finds the mode directly from first-order marginal data, without learning the max-ent distribution. In the case of comparison data, the problem is more challenging; for this case, we provide a 2-approximation algorithm for finding the mode.

In Section 2.2.3, for the top-k ranking problem, we propose a score that emphasizes the top k items. We show that this score can be computed exactly and directly from first-order marginal data, and approximated using the max-ent distribution in the case of comparison data.

2.2 Main Results

Before we get into the details of estimating the distribution, let us consider the problem of ranking given said distribution.
More precisely, let us assume that we are given a distribution $\mu$ over permutations, and are asked to obtain an ordered list of the items of interest that reflects the collective preference implied by the distribution. A classical approach in this setting is the axiomatic one: one comes up with a set of axioms that the ranked list should satisfy, and then tries to come up with a ranking function or algorithm that satisfies these axioms. Unfortunately, seemingly natural axioms cannot be satisfied by any algorithm [5]. In this section, we opt for a non-axiomatic approach to aggregating preferences. We address the problems of finding: (a) an overall aggregate ranking of all items, (b) the mode of the distribution, and (c) a top-k ranking. These problems demonstrate the utility of having, or assuming, an underlying distribution, and give rise to situations where one can bypass the learning step and use the data directly for ranking. In the latter situations, one gets the conceptual benefit of assuming a distribution without performing complicated computations to obtain a ranking.

2.2.1 Aggregate Ranking

Here we propose a method to obtain an entire ranking of all objects. Building on the intuition behind popular voting rules, the basic premise is that objects which are ranked higher more frequently should receive a higher ranking. This can be formalized as follows: for any monotonically strictly increasing non-negative function $f: \mathbb{N} \to [0, \infty)$, define the score $S_f(i)$ for object $i$ as

$$S_f(i) = \sum_{k=1}^{n} f(n-k)\, \mathbb{P}(\sigma(i) = k). \qquad (2.3)$$

The choice $f(x) = x^p$ assigns, as the score of object $i$, the $p$th moment of the distribution of $n - \sigma(i)$. One can take this line of reasoning further by noting that the exponential function, $f_\epsilon(x) = \exp(\epsilon x)$ for a given $\epsilon > 0$, effectively captures the combined effect of all $p$-norms. Therefore we propose what we call the $\epsilon$-ranking, with scores defined (up to the constant factor $e^{\epsilon n}$) as:

$$S_\epsilon(i) = \sum_{k=1}^{n} \exp(-\epsilon k)\, \mathbb{P}(\sigma(i) = k). \qquad (2.4)$$

By selecting $\epsilon \sim \ln k$, the scores effectively capture the occurrence of objects in the top $k$ positions only; for $\epsilon$ near 0, they capture the effect of the lower $p$ moments more prominently. Furthermore, intermediate choices of $\epsilon$ give effective rankings for various scenarios. We focus our attention on the case $p = 1$, and present a score that can take us directly from the data to the ranking, without the intermediate step of learning the distribution. We refer to this ranking as the $\ell_1$ ranking.

$\ell_1$ Ranking

The $\ell_1$ score is given by:

$$S_1(i) = \frac{1}{n-1} \sum_{k=1}^{n} (n-k)\, \mathbb{P}[\sigma(i) = k],$$

where the normalization by $1/(n-1)$ does not affect the induced ranking, but makes the equivalence below an exact equality. In the case of first-order marginal data, this score can be computed in a straightforward way. For comparison data, however, the marginals $\mathbb{P}[\sigma(i) = k]$ are not available without having the distribution. Fortunately, an equivalent score can be computed from the data directly in the following form:

$$S(i) = \frac{1}{n-1} \sum_{j \neq i} \mathbb{P}[\sigma(i) < \sigma(j)] = \frac{1}{n-1} \sum_{j \neq i} c_{ij},$$

using the following lemma:

Lemma 1. Given the definitions of $S(i)$ and $S_1(i)$ above, we have $S_1(i) = S(i)$.

A proof of this lemma is provided in Section 2.5. One interesting aspect of this shortcut is that the equivalence between the two scores does not assume any particular distribution; it only assumes that the underlying distribution is consistent with the data. This suggests that the produced ranking should work with different distributions. One family of such distributions is the one based on the celebrated model of Thurstone [43][16, Ch 9], as we shall see in the next section.
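A minimal sketch of the resulting procedure (the function names are ours): given the comparison matrix $C$, the $\ell_1$ scores and the induced ranking are computed in a couple of lines, with no model fitting involved.

```python
import numpy as np

def ell1_scores(C):
    """S(i) = (1/(n-1)) * sum_{j != i} c_ij, computed for all items at once.

    The diagonal of C is zero by definition (an item is never preferred
    to itself), so a plain row sum suffices.
    """
    n = C.shape[0]
    return C.sum(axis=1) / (n - 1)

def ell1_ranking(C):
    """Items ordered from most preferred to least preferred."""
    return np.argsort(-ell1_scores(C))
```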
Why $\ell_1$ Ranking?

Here we demonstrate the utility of our $\ell_1$ ranking by showing its equivalence to the ranking obtained using a Thurstone model. In a Thurstone model, preferences over $n$ items come from a "hidden" process, as follows: the "favorability" of each item $i$ is a random variable $X_i = u_i + Z_i$, where $u_i$ is an unknown parameter (also known as the skill parameter), and $Z_i$ is a random variable with some distribution. Furthermore, the random variables $Z_1, \ldots, Z_n$ are independent and identically distributed. If we take the tuple $(X_1, X_2, \ldots, X_n)$ to be the outcome of some trial, then item $j$ is ranked in position $k$ if $x_j$ is ranked $k$th among the values $x_1, \ldots, x_n$; equivalently, item $i$ is preferred to item $j$ if $x_i > x_j$. In a typical application of such models, one observes these comparisons or positional rankings, and uses the observations to infer the values of the unknown parameters $u_1, \ldots, u_n$. These values are then used to find a ranking over all items: more precisely, items are ranked $i \succ j \succ \cdots \succ k$ if $u_i > u_j > \cdots > u_k$.

As it turns out, the ranking obtained by the algorithm based on $\ell_1$ scores is equivalent to the ranking one would get by fitting a Thurstone model. The formal statement is as follows:

Theorem 1. Let $u_i$ and $u_j$ be the (skill) parameters assigned to items $i$ and $j$ (respectively) in a Thurstone model, and let $S(i)$ and $S(j)$ be the scores assigned to the same items by our method ($\ell_1$ scores). Then:

$$u_j \leq u_i \iff S(j) \leq S(i), \qquad \forall\, i, j. \qquad (2.5)$$

A proof of this theorem is provided in Section 2.5. Thurstone models have been used in a wide range of applications, such as revenue management in airline ticket sales and player ranking on online gaming platforms (e.g. a variant of this model is used in Microsoft's TrueSkill [24]).
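A quick numerical illustration of Theorem 1 (our own sketch, not part of the thesis): we simulate a Thurstone model with Gumbel noise, a choice we make for the illustration because it makes the model coincide with an MNL instance, and check that ranking by the $\ell_1$ scores of the simulated comparison marginals recovers the ordering of the skill parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 5, 20000
u = np.array([2.0, 1.5, 1.0, 0.5, 0.0])            # skills: u_0 > u_1 > ... > u_4

# Thurstone model: favorability X_i = u_i + Z_i with i.i.d. Gumbel noise.
X = u + rng.gumbel(size=(trials, n))
C = (X[:, :, None] > X[:, None, :]).mean(axis=0)   # empirical c_ij ~ P[X_i > X_j]

S = C.sum(axis=1) / (n - 1)                        # ell_1 scores, as in Lemma 1
print(np.argsort(-S))                              # expected: [0 1 2 3 4], same as argsort(-u)
```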
The max-ent principle suggests that we choose the one that has maximal entropy in the class M (resp. C). Philosophically, 27 we follow this approach since we wish to utilize the information provided by the data and nothing else, i.e. we do not wish to impose any additional structure beyond what data suggests. It is also well known that such a distribution provides maximum likelihood estimation over certain class of exponential family distributions (cf. [471). In effect, the goal is to find the distribution that solves the following optimization: max V-(a) log V(a) HER (V) OESN (2.9) v E M or C. It can be checked that the Lagrangian dual of this problem is as follows (since all entries of M, C in (0, 1)): let Aik be the dual variables associated with marginal consistency constraint for M in (2.6). Then, the dual takes the following form: max E Aikmik - log (Z exp (Z Aikli{f(j)=k} o i,k (2.10) ik It can be shown that this is a strictly concave optimization and has a unique optimal solution. Let it be A* = [A*j]. Then the corresponding primal optimal solution of (2.9) (with M) is given by p(o-) oc exp ( A - i{0o(i)=k}). (2.11) i,keK Similarly, for the comparison data, the dual optimization takes the form Ai<jcij - log max i,j Ai<j{O(i)<OU(j)})), exp a ij and the optimal primal of (2.9) given optimal dual A* = [A Aj- ff o(j)<o(j)} . p(a-) oc exp (2.12) ] is (2.13) iAjEma As can be seen, in either case the maximum entropy distribution is parameterized by 28 at most n 2 parameters, which is the same as the degrees of freedom of the received data. For future purposes and with a slight abuse of notation, we shall use F(A) to represent the objective of both Lagrangian dual optimization problems (2.10) and (2.12). Computing the Mode Having restricted our attention to the maximum entropy distribution, we now proceed to compute the mode. We begin by providing an algorithm for computing the mode exactly in the case of First-Order Marginal data. We then present a more efficient algorithm for approximating the same mode directly from the data without the need to learn the max-ent distribution. Finally, we present an algorithm that uses the max-ent distribution to compute a 2-approximation of the mode in the general case. Recall that under the maximum-entropy distribution, the logarithm of the probability of a permutation - is proportional to Eik Aikli{(i)=k} for first-order marginal data, and Eij Aij<Ji(,ji)<.(J)} for comparison data. Since the log function is monotone, finding the mode, in both cases, boils down to finding: -* E argmax UESn o-* E arg max (S Aiki{o(i)=k} (2.14) EAicjff 1(i)<0'U)}) (2.15) i,k Solving the problem in (2.14) exactly is equivalent to the following maximum weight matching problem: consider an n x n complete bipartite graph with edge between node i on left and node k on right having weight Aik. A matching is a subset (of size n) edges so that no two edges are incident on same vertext. Let the weight of the matching be the summation of the weights of the edge chosen by it. Then the maximum weight matching in this graph is precisely solving (2.14). This is a well known instance of the classical network flow problem and has strongly polynomial time algorithms [181. It also allows for distributed iterative algorithm for finding it including the auction algorithm of Bertsekas [7] and the recently popular (max29 product) belief propagation [6]. Thus, overall finding the mode of the distribution for the case of first-order marginal is easy and admits distributed algorithmic solution. 
Next, we describe a (heuristic) method for finding the mode in the case of first-order marginal data without the intermediate step of finding the max-ent parameters $\lambda$. Declare the solution of the following optimization as the mode:

$$\max_{\sigma \in S_n} \; \sum_{i,k} m_{ik}\, \mathbb{1}\{\sigma(i) = k\}.$$

That is, in place of $\lambda_{ik}$, use $m_{ik}$. The intuition is that $\lambda_{ik}$ is higher when $m_{ik}$ is, and vice versa. While there is no direct relation between this heuristic and the mode of the max-ent approximation, we state the following result, which establishes the heuristic to be a 'first-order' approximation. A proof is provided in Section 2.5.

Theorem 2. For $\lambda = [\lambda_{ik}]$ in a small enough neighborhood of $0 = [0]$,

$$m_{ik}(\lambda) \approx \frac{1}{n} + \frac{\lambda_{ik}}{n-1},$$

up to first order in $\lambda$.

For comparison data, the problem in (2.15) is also equivalent to a combinatorial problem, with the space of objects being the matchings; however, it does not admit as nice a representation as above. One way to represent permutations in comparison form is as $n \times n$ matrices $B = [B_{ij}]$ with (a) each entry $B_{ij}$ being $+1$ or $-1$ for all $1 \leq i, j \leq n$, (b) $B_{ij} + B_{ji} = 0$ (anti-symmetry) for all $1 \leq i, j \leq n$, and (c) transitivity: if $B_{ij} = B_{jk} = 1$, then $B_{ik} = 1$, for all $1 \leq i, j, k \leq n$. The goal is then to find a valid $B$ maximizing $\sum_{i,j} \mathbb{1}\{B_{ij} = 1\}\, \lambda_{i<j}$. It is not clear whether this is an easy problem.

To address this problem, we have the following 2-approximation algorithm for computing the mode using the parameters of the max-ent distribution: choose $L$ permutations uniformly at random, compute their weights (defined as per (2.15)), and select the one with maximal weight among these $L$ permutations. For $L$ large enough, the selected permutation essentially achieves half the maximum weight. This requires $\lambda$ to have all non-negative components. This is not an issue: every permutation renders exactly half of the ordered-pair comparisons $\sigma(i) < \sigma(j)$ correct, so an affine shift of $\lambda$ by a vector with all components equal to the same constant does not change the distribution. Therefore, in principle, we could require the subgradient algorithm to be restricted to the non-negative domain (projected version). The formal statement about this algorithm is as follows.

Theorem 3. Let $\lambda = [\lambda_{i<j}]$ be a non-negative vector, and let OPT be the maximum of $\sum_{i,j} \lambda_{i<j}\, \mathbb{1}\{\sigma(i) < \sigma(j)\}$ over all permutations $\sigma \in S_n$. Then, in the above-described randomized algorithm, if we choose $L \geq \frac{2}{\delta} \ln \frac{1}{\epsilon}$, then

$$\mathbb{P}\Big[ W(\hat\sigma) < \tfrac{1}{2}(1-\delta)\, \mathrm{OPT} \Big] \leq \epsilon.$$

A proof of this theorem is included in Section 2.5. To complete the solution, we only need to estimate the parameters of the max-ent distribution; an algorithm is provided in Section 2.3.
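A minimal sketch of this randomized scheme (the names are ours; `lam_pair[i, j]` stands for $\lambda_{i<j}$, assumed non-negative):

```python
import numpy as np

def approx_mode_comparisons(lam_pair, L, rng):
    """Sample L uniformly random permutations, score each by
    W(sigma) = sum_{i,j} lam_pair[i, j] * 1{sigma(i) < sigma(j)},
    and return the best one (a 2-approximation of the mode, w.h.p.)."""
    n = lam_pair.shape[0]
    best_sigma, best_w = None, -np.inf
    for _ in range(L):
        order = rng.permutation(n)                 # order[p] = item in position p+1
        sigma = np.empty(n, dtype=int)
        sigma[order] = np.arange(1, n + 1)         # sigma[i] = position of item i
        w = (lam_pair * (sigma[:, None] < sigma[None, :])).sum()
        if w > best_w:
            best_sigma, best_w = sigma, w
    return best_sigma, best_w
```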
2.2.3 Top-K Ranking

Here the interest is in finding a ranking that emphasizes the top $k$ objects (the favorites). One could compute the aggregate ranking, or the mode, and then declare the top $k$ objects of the resulting list. Instead, we propose a natural way to emphasize the favorites directly. Intuitively, if an object is ranked among the top $k$ positions by a large fraction (probability-wise) of the permutations in the distribution, then it ought to be among the favorites. This suggests that, for a distribution with parameters $\lambda$, each object $i$ can be given a score $S_k(i)$, defined as

$$S_k(i) = \mathbb{P}_\lambda[\sigma(i) \leq k].$$

In the case of first-order marginal data, this score is nothing but $\sum_{l \leq k} m_{il}$, and can be computed directly from the data. In the case of comparison data, this score can be inferred from the max-ent distribution, which can be learned by the procedure outlined in Section 2.3. Finally, once the score is computed, we declare the $k$ objects with the highest scores as per $S_k(\cdot)$ to be the result of the top-$k$ ranking.

2.3 Learning the Max-Ent Model

Here we describe an iterative, distributed sub-gradient algorithm that solves the dual optimization problems (2.10) and (2.12). First, we describe an idealized procedure that calls a certain oracle that estimates the marginals of a distribution from the exponential family. We can, in general, only hope to estimate these marginals approximately, because exact estimation, as we show later, is #P-hard. Therefore, the main result that we state is for a sub-gradient algorithm based on such an approximate oracle. In a later section, we shall describe how to design such an approximate oracle in a distributed manner, along with its associated computational cost.

MaxEnt Estimation: Using an Ideal Oracle
Input: ranking data $m_{ik}$ for all $i, k$.
1: Initialize $\lambda^0_{ik} = 0$ for all $i, k$.
2: for $t = 1, \ldots, T$ do
3:   $\lambda^{t+1}_{ik} \leftarrow \lambda^t_{ik} + \frac{1}{\sqrt{t}}\big( m_{ik} - E_{\lambda^t}[\mathbb{1}\{\sigma(i)=k\}] \big)$  ($E_{\lambda^t}[\mathbb{1}\{\sigma(i)=k\}]$ is provided by an oracle)
4: end for
5: Choose $T' \in \{1, \ldots, T\}$ at random so that $\mathbb{P}(T' = t) \propto 1/\sqrt{t}$.
6: return $\lambda^{T'}$

Here, $E_{\lambda^t}[\mathbb{1}\{\sigma(i)=k\}] = \sum_{\sigma \in S_n} \mathbb{P}_{\lambda^t}(\sigma)\, \mathbb{1}\{\sigma(i)=k\}$, where $\mathbb{P}_{\lambda^t}(\sigma) = \frac{1}{Z(\lambda^t)}\exp\big( \sum_{i,k} \lambda^t_{ik}\, \mathbb{1}\{\sigma(i)=k\} \big)$, with the normalizing constant (partition function) defined as

$$Z(\lambda^t) = \sum_{\sigma \in S_n} \exp\Big( \sum_{i,k} \lambda^t_{ik}\, \mathbb{1}\{\sigma(i)=k\} \Big).$$

Instead of $E_{\lambda^t}[\mathbb{1}\{\sigma(i)=k\}]$, we will use a randomized estimate $\hat{E}_{\lambda^t}(i,k)$, such that the error vector $e(t) = [e_{ik}(t)]$, where each component is $e_{ik}(t) = E_{\lambda^t}[\mathbb{1}\{\sigma(i)=k\}] - \hat{E}_{\lambda^t}(i,k)$, is sufficiently small. We state the following result about the convergence of this algorithm.

Theorem 4. Suppose that each iteration of the sub-gradient algorithm uses an approximate estimate $\hat{E}(\cdot,\cdot)$ such that $\|e(t)\|_1 \leq \epsilon / \big( \Lambda(t) + \|\lambda^*\|_\infty + \|\lambda^*\|_2 \big)$, where $\Lambda(t) = \sum_{s \leq t} 1/\sqrt{s}$ and $\lambda^*$ is a solution of the optimization problem. Then, for any $\gamma > 0$, for a choice of $T = \Theta\big( \epsilon^{-(2+\gamma)} (1 + \|\lambda^*\|_\infty + \|\lambda^*\|_2)^2 \big)$, we have

$$\mathbb{E}\big[ F(\lambda^{T'}) \big] \geq F(\lambda^*) - \epsilon,$$

where $F(\cdot)$ is the objective of the dual optimization (2.10). The identical result holds for the comparison information (2.12).

A proof of this theorem is included in Section 2.5.
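The following sketch (ours, with a step size of $1/\sqrt{t}$ as in the procedure above) mirrors the idealized algorithm. The `oracle` argument abstracts the marginal computation; the included `exact_oracle` enumerates $S_n$, which is feasible only for very small $n$, and stands in for the approximate oracles discussed next.

```python
import numpy as np
from itertools import permutations

def exact_oracle(lam):
    """E_lam[1{sigma(i)=k}] for all i, k, by brute-force enumeration of S_n."""
    n = lam.shape[0]
    perms = np.array(list(permutations(range(n))))     # perms[s, i] = 0-indexed position of item i
    logw = lam[np.arange(n), perms].sum(axis=1)        # log-weight of each permutation
    p = np.exp(logw - logw.max())
    p /= p.sum()
    marg = np.zeros((n, n))
    for prob, sigma in zip(p, perms):
        marg[np.arange(n), sigma] += prob
    return marg

def learn_maxent_first_order(M, oracle, T, rng):
    """Stochastic sub-gradient ascent on the dual (2.10): step size 1/sqrt(t),
    returning lambda at a random iterate T' with P(T' = t) ~ 1/sqrt(t)."""
    lam = np.zeros_like(M)
    iterates, weights = [], []
    for t in range(1, T + 1):
        lam = lam + (M - oracle(lam)) / np.sqrt(t)     # sub-gradient step
        iterates.append(lam)
        weights.append(1.0 / np.sqrt(t))
    probs = np.array(weights) / np.sum(weights)
    return iterates[rng.choice(T, p=probs)]
```

The oracle abstraction is the point of the design: swapping `exact_oracle` for an MCMC estimate of the marginals yields the practical version of the algorithm.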
An Approximate Oracle

Theorem 4 relies on the existence of an oracle that can estimate the marginals approximately, with appropriate accuracy, at each time step $t$. Computing the marginals exactly is computationally hard: for first-order marginal data, this follows from [3], and we prove a similar result for comparison data. Both results are summarized by the following theorem:

Theorem 5. Given a max-ent distribution $\lambda$, computing $E_\lambda[\mathbb{1}\{\sigma(i)=k\}]$ and $E_\lambda[\mathbb{1}\{\sigma(i)<\sigma(j)\}]$ is #P-hard.

A proof is included in Section 2.5. We now describe an approximate oracle. We shall restrict our description to a Markov Chain Monte Carlo (MCMC) based oracle. In principle, one may use heuristics like belief propagation to estimate these marginals instead of MCMC (of course, this may lead to the loss of the performance guarantee).

Now, the computation of the marginals requires computing $\mathbb{P}_{\lambda^t}(\sigma)$ for any $\sigma \in S_n$. From its form, the basic challenge is in computing the partition function $Z(\lambda^t)$. For first-order marginals, computing $Z(\lambda^t)$ is the same as computing the permanent of the non-negative matrix $A = [A_{ik}]$ with $A_{ik} = e^{\lambda_{ik}}$. In an amazing work, Jerrum, Sinclair and Vigoda [27] designed a Fully Polynomial Time Randomized Approximation Scheme (FPRAS) for computing the permanent of any non-negative matrix. That is, $Z(\lambda^t)$ (and hence $\mathbb{P}_{\lambda^t}(\sigma)$) can be computed within multiplicative accuracy $(1 \pm \epsilon)$, in time polynomial in $1/\epsilon$ and $n$, with probability at least $1-\delta$, at an additional cost polynomial in $\log(1/\delta)$. Therefore, the desired guarantee in Theorem 4 can be provided for all time steps (using the union bound) with probability at least $1 - 1/n$, in time polynomial in $n$, building upon the algorithm of [27].

For the case of comparison information, however, no such FPRAS for computing the partition function is known. Therefore, we suggest a simple MCMC based algorithm and provide the obvious (exponential) bound for it. To that end, define $W_\lambda(\sigma) = \sum_{i \neq j} \lambda_{i<j}\, \mathbb{1}\{\sigma(i) < \sigma(j)\}$, and construct a Markov chain whose state space is the set of all permutations, $S_n$, and whose transitions from a given state $\sigma$ to a new state $\sigma'$ are given as follows:

1: With probability $\frac{1}{2}$, let $\sigma' = \sigma$.
2: Otherwise, construct $\sigma'$ as follows:
   - Choose two elements $i$ and $j$ uniformly at random; set $\hat\sigma(i) = \sigma(j)$, $\hat\sigma(j) = \sigma(i)$, and $\hat\sigma(k) = \sigma(k)$ for all $k \neq i, j$.
   - Set $\sigma' = \hat\sigma$ with probability $\min\{1, \exp(W_\lambda(\hat\sigma) - W_\lambda(\sigma))\}$; else set $\sigma' = \sigma$.

Using this Markov chain, we estimate $E_\lambda[\mathbb{1}\{\sigma(i)<\sigma(j)\}]$ as follows: starting from any initial state, run the Markov chain for $T_m$ steps and then record its state, say $\sigma_{T_m}$: if $\sigma_{T_m}(i) < \sigma_{T_m}(j)$, record 1, else record 0. Repeat this $S$ times and take the empirical average of the recorded 0/1 values; declare this to be the estimate of $E_\lambda[\mathbb{1}\{\sigma(i)<\sigma(j)\}]$. Indeed, one simultaneously obtains such estimates for all $i, j$. We have the following bound on the mixing time $T_c$, which we establish in Section 2.5:

Theorem 6. The above-stated Markov chain has stationary distribution $\mu^*$ with $\mu^*(\sigma) \propto \exp(W_\lambda(\sigma))$. Let $\mu(t)$ be the distribution of the Markov chain after $t$ steps, starting from any initial condition. Then, for any given $\delta > 0$, there exists

$$T_c = \Theta\Big( \exp\big( n \|\lambda\|_\infty + n \log n \big) \log \frac{1}{\delta} \Big)$$

such that, for $t \geq T_c$,

$$\Big\| \frac{\mu(t)}{\mu^*} - 1 \Big\|_{2,\mu^*} \leq \delta,$$

where $\|\cdot\|_{2,\mu^*}$ is the $\chi^2$ distance.

Now, the total variation distance between $\mu(t)$ and $\mu^*$ is smaller than the $\chi^2$ distance between them. Therefore, by Theorem 6, the error in estimating $\mathbb{P}_\lambda(\cdot)$ using $\mu(t)$ will be at most $\delta$. From Chernoff's bound, by selecting $S$ (mentioned above) to be $O(\delta^{-2} \log n)$ (with a large enough constant), the estimated empirical marginals for all $i, j$ will be within error $O(\delta)$ with probability $1 - 1/\mathrm{poly}(n)$. Given that the increment in each component of $\lambda$ as part of the sub-gradient algorithm is $O(\sqrt{T})$ by time $T$, it follows from Theorem 4 that $\|\lambda\|_\infty = O\big( (n + \|\lambda^*\|_\infty + \|\lambda^*\|_2)^{1+\gamma} \big)$ (for any choice of $\gamma > 0$ in Theorem 4). Finally, since the smallest $\delta$ required in Theorem 4 is an inverse polynomial in $n$ and $\epsilon$, it follows from the above discussion that the overall cost of the approximate oracle required for comparisons effectively scales exponentially in $n^{3+\gamma}$ (ignoring other smaller-order terms).
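A compact sketch of the chain just described (our implementation; in practice the mixing parameter `T_mix` would be set per Theorem 6, and `lam_pair[i, j]` stands for $\lambda_{i<j}$):

```python
import numpy as np

def mcmc_comparison_marginals(lam_pair, T_mix, S, rng):
    """Estimate E_lam[1{sigma(i) < sigma(j)}] for all pairs i, j.

    Runs the lazy transposition chain for T_mix steps, records the final
    state, and averages the pairwise indicators over S independent runs.
    """
    n = lam_pair.shape[0]

    def W(sigma):  # W_lambda(sigma) = sum_{i,j} lam_pair[i, j] * 1{sigma(i) < sigma(j)}
        return (lam_pair * (sigma[:, None] < sigma[None, :])).sum()

    est = np.zeros((n, n))
    for _ in range(S):
        sigma = rng.permutation(np.arange(1, n + 1))   # sigma[i] = rank of item i
        w = W(sigma)
        for _ in range(T_mix):
            if rng.random() < 0.5:                     # lazy step: stay put
                continue
            i, j = rng.choice(n, size=2, replace=False)
            prop = sigma.copy()
            prop[i], prop[j] = sigma[j], sigma[i]      # transpose two ranks
            w_prop = W(prop)
            if w_prop >= w or rng.random() < np.exp(w_prop - w):
                sigma, w = prop, w_prop                # Metropolis acceptance
        est += sigma[:, None] < sigma[None, :]
    return est / S
```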
We have the following bound on $T_c$, which we establish in Section 2.5.

Theorem 6. The Markov chain stated above has stationary distribution $p^*$ with $p^*(\sigma) \propto \exp(W_\lambda(\sigma))$. Let $p(t)$ be the distribution of the Markov chain after $t$ steps, starting from any initial condition. Then, for any given $\delta > 0$, there exists
$$T_c = \Theta\big(\exp\big(\Theta(n^2\|\lambda\|_\infty + n\log n)\big)\,\log(1/\delta)\big)$$
such that for all $t \ge T_c$,
$$\Big\|\frac{p(t)}{p^*} - 1\Big\|_{2,p^*} \le \delta,$$
where $\|\cdot\|_{2,p^*}$ denotes the $\chi^2$ distance.

Now, the total variation distance between $p(t)$ and $p^*$ is smaller than the $\chi^2$ distance between them. Therefore, by Theorem 6, the error in estimating $\mathbb{P}_\lambda(\cdot)$ using $p(t)$ is at most $\delta$. From Chernoff's bound, by selecting $S$ (mentioned above) to be $O(\delta^{-2}\log n)$ (with a large enough constant), the estimated empirical marginals for all $i, j$ are within error $O(\delta)$ with probability $1 - 1/\mathrm{poly}(n)$. Given that the increment in each component of $\lambda$ under the sub-gradient algorithm is $O(\sqrt{T})$ by time $T$, it follows from Theorem 4 that $\|\lambda\|_\infty = O\big((n^2 + \|\lambda^*\|_\infty + \|\lambda^*\|_2)^{1+\gamma}\big)$ (for any choice of $\gamma > 0$ in Theorem 4). Finally, since the smallest $\delta$ required by Theorem 4 is inverse polynomial in $n$ and $\epsilon$, it follows from the discussion above that the overall cost of the approximate oracle required for the comparison data effectively scales exponentially in $n^{3+\gamma}$ (ignoring smaller-order terms).

2.4 Evaluation and Scalability

Here we provide results from a simple experiment demonstrating that the ranking produced by our algorithm converges to the right ranking for the Multinomial Logit (MNL) model, an instance of Thurstone's model (obtained by choosing the $Z_i$ to be i.i.d. with the logistic distribution). Specifically, we repeatedly sample distinct items $i$ and $j$ from $\{1,\dots,n\}$ uniformly at random and consult an MNL model, defined using $n$ parameters, for the value of $\mathbb{1}_{\{\sigma(i)<\sigma(j)\}}$. All the samples are then combined into a matrix $[c_{ij}]$, which is used to compute the ranking. In Figure 2-1, we plot the error, measured as the normalized number of discordant pairs, versus the number of samples used, for $n = 10$. As we can see, beyond 500 samples the induced error is extremely small. A simulation in the same spirit is sketched below.

[Figure 2-1: Error of the ranking algorithm for the MNL model, a specific instance of Thurstone's model, as a function of the number of samples (0 to 5,000).]
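For readers who wish to reproduce the flavor of this experiment, here is a small self-contained simulation. It scores items by their empirical win rates, in the spirit of Theorem 1; the parameter values are ours and need not match those used for Figure 2-1.

import numpy as np

def simulate(n=10, num_samples=5000, rng=None):
    rng = np.random.default_rng(rng)
    u = np.sort(rng.normal(size=n))[::-1]       # skill parameters, descending
    w = np.exp(u)                               # MNL weights
    wins = np.zeros((n, n))                     # wins[i, j]: times i beat j
    for _ in range(num_samples):
        i, j = rng.choice(n, size=2, replace=False)
        if rng.random() < w[i] / (w[i] + w[j]):
            wins[i, j] += 1
        else:
            wins[j, i] += 1
    trials = wins + wins.T
    c = np.where(trials > 0, wins / np.maximum(trials, 1), 0.5)
    ranking = np.argsort(-c.sum(axis=1))        # rank by comparison score S(i)
    # Normalized number of discordant pairs vs. the true order 0, 1, ..., n-1.
    disc = sum(1 for a in range(n) for b in range(a + 1, n)
               if ranking[a] > ranking[b])
    return disc / (n * (n - 1) / 2)

print([simulate(num_samples=m, rng=0) for m in (100, 500, 5000)])  # decreasing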
To test the scalability of our method, we implemented a voting/survey tool that enables a large number of participants to vote on any number of items in real time. In doing so, we had two questions in mind: in addition to being theoretically interesting, can our algorithm be applied in real time? And is comparison-based voting practical and simple enough for adoption? Through this experiment, we believe the answer to both questions is affirmative. Our tool was installed in voting booths made available to visitors of the MIT150 event [1], a university-wide public open house. Voting categories included movies, actors, musicians, and athletes, among others, and the results were continuously displayed on a large screen. The participation was impressive and the feedback mostly positive, which leads us to believe that adopting comparisons as a form of voting is worth serious consideration.

2.5 Proofs

This section provides detailed proofs of the results stated earlier in the chapter.

2.5.1 Proof of Lemma 1

With some arithmetic manipulation, we get
$$S_2(i) = \frac{1}{n-1}\sum_{j\neq i} \mathbb{P}[\sigma(i) < \sigma(j)] = \frac{1}{n-1}\sum_{j\neq i}\sum_{k=1}^{n} \mathbb{P}[\sigma(i) < \sigma(j) \mid \sigma(i)=k]\,\mathbb{P}[\sigma(i)=k] = \sum_{k=1}^{n} \frac{n-k}{n-1}\,\mathbb{P}[\sigma(i)=k] = S_1(i),$$
where the third equality holds because, conditioned on $\sigma(i)=k$, exactly $n-k$ of the remaining items are ranked below position $k$.

2.5.2 Proof of Theorem 1

Recall that, under Thurstone's model, each item $i$ has a "skill" parameter $u_i$ associated with it, and the random "favorability" of item $i$ is $X_i = u_i + Z_i$, where the $Z_i$ are i.i.d. random variables with some distribution. Our algorithm, with access to exact partial marginal data (first-order or comparison), computes a score for each item $i$: $S_1(i)$ using first-order data and $S_2(i)$ using comparison data. As proved in Lemma 1, these two scores are equivalent; therefore it suffices to establish that $u_i > u_j$ if and only if $S_2(i) > S_2(j)$. We establish this statement in two parts: (a) $u_i > u_j$, and (b) $u_i = u_j$.

Let us start with the first case, $u_i > u_j$. Recall that the score of item $i$ is $S_2(i) = \frac{1}{n-1}\sum_{k\neq i} \mathbb{P}[X_i > X_k]$. Therefore, for $i \neq j$,
$$S_2(i) - S_2(j) \propto \sum_{k\neq i} \mathbb{P}[X_i > X_k] - \sum_{k\neq j} \mathbb{P}[X_j > X_k] = \big(\mathbb{P}[X_j < X_i] - \mathbb{P}[X_i < X_j]\big) + \sum_{\ell\neq i,j}\big(\mathbb{P}[X_j < X_\ell] - \mathbb{P}[X_i < X_\ell]\big). \quad (2.16)$$
Recall that $X_i = u_i + Z_i$ and $X_j = u_j + Z_j$, where $u_i, u_j$ are the "skill" parameters of $i$ and $j$, while $Z_i, Z_j$ are i.i.d. random variables with some distribution. Define $W_{ij} = Z_i - Z_j$. Then the $W_{ij}$ are identically distributed for all $i, j$, with distribution equal to that of a random variable $W$ with CDF $F_W$, i.e., $F_W(x) = \mathbb{P}[W \le x]$. Since $W$ is a difference of independent and identically distributed random variables, it is by definition symmetric around 0; that is, for any $x > 0$,
$$\mathbb{P}[W \le -x] = \mathbb{P}[W \ge x]. \quad (2.17)$$
Given this notation, it follows that
$$\mathbb{P}[X_i < X_j] = \mathbb{P}[W_{ij} \le u_j - u_i] = F_W(u_j - u_i). \quad (2.18)$$
Similarly,
$$\mathbb{P}[X_j < X_i] = F_W(u_i - u_j), \qquad \mathbb{P}[X_i < X_\ell] = F_W(u_\ell - u_i), \qquad \mathbb{P}[X_j < X_\ell] = F_W(u_\ell - u_j). \quad (2.19)$$
Since $u_i > u_j$, we have $u_\ell - u_j > u_\ell - u_i$ for any $\ell \neq i, j$. Since $F_W$ is a CDF, and hence monotonically non-decreasing ($F_W(x) \le F_W(y)$ for all $x \le y$),
$$F_W(u_\ell - u_j) - F_W(u_\ell - u_i) \ge 0 \quad \text{for all } \ell. \quad (2.20)$$
Also, let $\delta = u_i - u_j > 0$. Then, from the discussion above, (2.16) becomes
$$S_2(i) - S_2(j) \propto \big(F_W(\delta) - F_W(-\delta)\big) + \sum_{\ell\neq i,j}\big(F_W(u_\ell - u_j) - F_W(u_\ell - u_i)\big). \quad (2.21)$$
Now,
$$F_W(\delta) - F_W(-\delta) = \mathbb{P}[W \in (-\delta, \delta]] \ge \mathbb{P}[|W| \le \delta/2]. \quad (2.22)$$
As we show next, for any distribution of the $Z$'s, $W$ is such that, for any $\gamma > 0$,
$$\mathbb{P}[|W| \le \gamma] > 0. \quad (2.23)$$
From (2.20)-(2.23) (taking $\gamma = \delta/2$ in the last equation), it follows that if $u_i > u_j$, then
$$S_2(i) - S_2(j) > 0. \quad (2.24)$$
Now we establish (2.23). Since $Z$ (distributed as $Z_i, Z_j$) is a random variable, by tightness there exists an interval $[-a, a] \subset \mathbb{R}$, for some $a > 0$, such that $\mathbb{P}[Z \in [-a,a]] \ge 1/2$. Given any $\gamma > 0$, partition this interval into at most $N = \lceil 4a/\gamma \rceil$ disjoint contiguous intervals, each of length $\gamma/2$. One of these intervals must have probability at least $1/(2N)$; call this interval $I$, so that $\mathbb{P}[Z \in I] \ge 1/(2N)$. Since $Z_i$ and $Z_j$ are distributed independently and identically with the same distribution as $Z$, we have
$$\mathbb{P}[Z_i \in I, Z_j \in I] \ge \frac{1}{4N^2} > 0. \quad (2.25)$$
But when both $Z_i$ and $Z_j$ are in $I$, their difference $W = Z_i - Z_j$ must be within $[-\gamma/2, \gamma/2]$. This completes the justification of (2.23). For case (b), $u_i = u_j$, identical arguments show that $S_2(i) = S_2(j)$. This completes the proof of Theorem 1.

2.5.3 Proof of Theorem 2

Let $\lambda$ be in a neighborhood of $0 = [0]$. We establish the claim by means of a Taylor expansion of $m$ as a function of $\lambda$ around 0. For simplicity, denote $\sigma_{ij} = \mathbb{1}_{\{\sigma(i)=j\}}$, so that
$$m_{ij}(\lambda) = \frac{1}{Z(\lambda)}\sum_{\sigma\in S_n} \sigma_{ij}\exp\Big(\sum_{k,l}\lambda_{kl}\sigma_{kl}\Big), \qquad Z(\lambda) = \sum_{\sigma\in S_n}\exp\Big(\sum_{k,l}\lambda_{kl}\sigma_{kl}\Big).$$
For $\lambda = 0$, we have $m_{ij}(0) = 1/n$ for all $i, j$. By the first-order Taylor expansion, for $\lambda$ near 0,
$$m_{ij}(\lambda) \approx m_{ij}(0) + \sum_{k,l}\lambda_{kl}\,\frac{\partial m_{ij}(\lambda)}{\partial \lambda_{kl}}\Big|_{\lambda=0}. \quad (2.26)$$
By the properties of the exponential family (see [47], for example),
$$\frac{\partial m_{ij}(\lambda)}{\partial \lambda_{kl}} = \mathbb{E}_\lambda[\sigma_{ij}\sigma_{kl}] - \mathbb{E}_\lambda[\sigma_{ij}]\,\mathbb{E}_\lambda[\sigma_{kl}] = \mathbb{E}_\lambda[\sigma_{ij}\sigma_{kl}] - m_{ij}(\lambda)m_{kl}(\lambda). \quad (2.27)$$
From (2.26) and (2.27), it follows that for $\lambda$ near 0,
$$m_{ij}(\lambda) \approx \frac{1}{n} + \sum_{k,l}\lambda_{kl}\,\mathbb{E}_0[\sigma_{ij}\sigma_{kl}] - \frac{1}{n^2}\sum_{k,l}\lambda_{kl}. \quad (2.28)$$
We state the following proposition.

Proposition 1. All distributions can be represented by a $\lambda$ such that
$$\sum_k \lambda_{ik} = \sum_k \lambda_{kj} = 0, \quad \text{for } 1 \le i, j \le n. \quad (2.29)$$

Proof. Consider a $\lambda$ for which (2.29) is not satisfied. We transform $\lambda$ into a $\nu$ that satisfies (2.29) but induces exactly the same distribution. To that end, define
$$\nu_{ij} = \lambda_{ij} - \frac{1}{n}\lambda_{i\cdot} - \frac{1}{n}\lambda_{\cdot j} + \frac{1}{n^2}\lambda_{\cdot\cdot}, \quad (2.30)$$
where $\lambda_{i\cdot} = \sum_k \lambda_{ik}$, $\lambda_{\cdot j} = \sum_k \lambda_{kj}$, and $\lambda_{\cdot\cdot} = \sum_{k,l}\lambda_{kl}$. Then
$$\sum_j \nu_{ij} = \lambda_{i\cdot} - \lambda_{i\cdot} - \frac{1}{n}\lambda_{\cdot\cdot} + \frac{1}{n}\lambda_{\cdot\cdot} = 0 \quad \text{for all } i,$$
and similarly $\sum_i \nu_{ij} = 0$ for all $j$, so $\nu$ satisfies (2.29). Moreover, for any $\sigma \in S_n$, using $\sum_l \sigma_{kl} = \sum_k \sigma_{kl} = 1$,
$$\sum_{k,l}\nu_{kl}\sigma_{kl} = \sum_{k,l}\lambda_{kl}\sigma_{kl} - \frac{1}{n}\lambda_{\cdot\cdot} - \frac{1}{n}\lambda_{\cdot\cdot} + \frac{1}{n}\lambda_{\cdot\cdot} = \sum_{k,l}\lambda_{kl}\sigma_{kl} - \frac{1}{n}\lambda_{\cdot\cdot}.$$
That is, the two exponents differ by a constant that does not depend on $\sigma$; this constant is absorbed by the partition function, and the distributions induced by $\lambda$ and $\nu$ are identical.

Given Proposition 1, we may assume that $\lambda$ satisfies (2.29) without loss of generality; in particular $\lambda_{\cdot\cdot} = 0$, and (2.28) becomes
$$m_{ij}(\lambda) \approx \frac{1}{n} + \sum_{k,l}\lambda_{kl}\,\mathbb{E}_0[\sigma_{ij}\sigma_{kl}]. \quad (2.31)$$
Now, under the uniform distribution,
$$\mathbb{E}_0[\sigma_{ij}\sigma_{kl}] = \begin{cases} 1/n & k = i,\ l = j \\ 1/(n(n-1)) & k \neq i,\ l \neq j \\ 0 & \text{otherwise.} \end{cases}$$
Then, from (2.31),
$$m_{ij}(\lambda) \approx \frac{1}{n} + \frac{1}{n}\lambda_{ij} + \frac{1}{n(n-1)}\sum_{k\neq i,\,l\neq j}\lambda_{kl}. \quad (2.32)$$
Focusing on the last term, and using (2.29),
$$\sum_{k\neq i,\,l\neq j}\lambda_{kl} = \lambda_{\cdot\cdot} - \lambda_{i\cdot} - \lambda_{\cdot j} + \lambda_{ij} = \lambda_{ij}.$$
Combining this with (2.32), we get
$$m_{ij}(\lambda) - \frac{1}{n} \approx \Big(\frac{1}{n} + \frac{1}{n(n-1)}\Big)\lambda_{ij} = \frac{\lambda_{ij}}{n-1},$$
as desired.
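The approximation just derived is easy to sanity-check numerically for a small $n$ by enumerating $S_n$. The sketch below is ours; the projection step enforces (2.29) by double centering.

import itertools
import numpy as np

def exact_marginals(lam):
    """m_ij(lambda) = E_lambda[1{sigma(i) = j}] by enumerating S_n."""
    n = lam.shape[0]
    m = np.zeros((n, n))
    Z = 0.0
    for sigma in itertools.permutations(range(n)):
        w = np.exp(sum(lam[i, sigma[i]] for i in range(n)))
        Z += w
        for i in range(n):
            m[i, sigma[i]] += w
    return m / Z

n = 5
lam = 1e-3 * np.random.default_rng(0).normal(size=(n, n))
lam -= lam.mean(axis=0, keepdims=True)   # zero column sums ...
lam -= lam.mean(axis=1, keepdims=True)   # ... then zero row sums, i.e. (2.29)
approx = 1.0 / n + lam / (n - 1)         # the linearization above
print(np.abs(exact_marginals(lam) - approx).max())   # O(||lambda||^2), tiny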
2.5.4 Proof of Theorem 3

Denote by $\Delta_i$ the difference between the weight of the mode $\sigma^*$ and that of permutation $\sigma_i$:
$$\Delta_i \triangleq W(\sigma^*) - W(\sigma_i).$$
Since the permutations $\sigma_1,\dots,\sigma_k$ are drawn uniformly at random, and since the weights are non-negative (so that $W(\sigma^*) \le \sum_{u<v} A_{uv}$), for each permutation $\sigma_i$ we have
$$\mathbb{E}[W(\sigma_i)] = \sum_{u<v} A_{uv}\,\mathbb{E}[\mathbb{1}_{\{\sigma(u)<\sigma(v)\}}] = \frac{1}{2}\sum_{u<v} A_{uv} \ge \frac{1}{2}W(\sigma^*),$$
and therefore $\mathbb{E}[\Delta_i] = W(\sigma^*) - \mathbb{E}[W(\sigma_i)] \le \frac{1}{2}W(\sigma^*)$. Since $\hat\sigma$ is chosen to have the maximum weight $W(\cdot)$ among these permutations, and since the permutations are drawn independently, we have
$$\mathbb{P}\big[W(\hat\sigma) < (\tfrac{1}{2}-\delta)W(\sigma^*)\big] = \prod_{i=1}^{k}\mathbb{P}\big[W(\sigma_i) < (\tfrac{1}{2}-\delta)W(\sigma^*)\big] = \prod_{i=1}^{k}\mathbb{P}\big[\Delta_i > (\tfrac{1}{2}+\delta)W(\sigma^*)\big].$$
Using Markov's inequality, we get
$$\mathbb{P}\big[W(\hat\sigma) < (\tfrac{1}{2}-\delta)W(\sigma^*)\big] \le \prod_{i=1}^{k}\frac{\mathbb{E}[\Delta_i]}{(\tfrac{1}{2}+\delta)W(\sigma^*)} \le \Big(\frac{1}{1+2\delta}\Big)^k \approx (1-2\delta)^k \le e^{-2\delta k},$$
where the approximation is valid for sufficiently small $\delta$. Setting $k \ge \frac{1}{2\delta}\log\frac{1}{\epsilon}$, we have
$$\mathbb{P}\big[W(\hat\sigma) < (\tfrac{1}{2}-\delta)W(\sigma^*)\big] \le \epsilon.$$

2.5.5 Proof of Theorem 4: Subgradient Algorithm

We establish the result for the first-order marginals; the proof for comparisons is identical. Recall that the optimization problem of interest is
$$\max_\lambda F(\lambda), \qquad F(\lambda) = \sum_{i,k}\lambda_{ik}m_{ik} - \log\Big(\sum_{\sigma\in S_n}\exp\Big(\sum_{i,k}\lambda_{ik}\mathbb{1}_{\{\sigma(i)=k\}}\Big)\Big). \quad (2.33)$$
Let $\lambda^*$ be an optimizer with optimal value $F(\lambda^*)$; $F(\cdot)$ is a concave function. As before, let $t$ index the algorithm's iterations, $\lambda^t$ be the parameter value at iteration $t$, $g^t = m - \mathbb{E}_{\lambda^t}[\mathbb{1}_{\{\sigma(\cdot)=\cdot\}}]$ the subgradient of $F$ at $\lambda^t$, and $e(t)$ the error in this subgradient. Then
$$\|\lambda^{t+1} - \lambda^*\|^2 = \|\lambda^t + \alpha_t(g^t + e(t)) - \lambda^*\|^2 = \|\lambda^t - \lambda^*\|^2 + \alpha_t^2\|g^t + e(t)\|^2 + 2\alpha_t\langle g^t, \lambda^t - \lambda^*\rangle + 2\alpha_t\langle e(t), \lambda^t - \lambda^*\rangle$$
$$\le \|\lambda^t - \lambda^*\|^2 + \alpha_t^2\|g^t + e(t)\|^2 + 2\alpha_t\big(F(\lambda^t) - F(\lambda^*)\big) + 2\alpha_t\langle e(t), \lambda^t - \lambda^*\rangle,$$
where the last inequality follows from the fact that $g^t$ is a subgradient of $F$ at $\lambda^t$. Applying this inequality recursively, and keeping in mind that $\|\lambda^{t+1} - \lambda^*\|^2 \ge 0$, we get
$$0 \le \|\lambda^0 - \lambda^*\|^2 + 2\sum_{s=0}^{t}\alpha_s\big(F(\lambda^s) - F(\lambda^*)\big) + \sum_{s=0}^{t}\alpha_s^2\|g^s + e(s)\|^2 + 2\sum_{s=0}^{t}\alpha_s\langle e(s), \lambda^s - \lambda^*\rangle.$$
Therefore
$$2\sum_{s=0}^{t}\alpha_s\big(F(\lambda^*) - F(\lambda^s)\big) \le \|\lambda^0 - \lambda^*\|^2 + \sum_{s=0}^{t}\alpha_s^2\|g^s + e(s)\|^2 + 2\sum_{s=0}^{t}\alpha_s\langle e(s), \lambda^s - \lambda^*\rangle.$$
Let $\hat\lambda$ be chosen to equal $\lambda^s$ with probability $p_s = \alpha_s/\sum_{q=0}^{t}\alpha_q$. Then, on average,
$$\mathbb{E}[F(\lambda^*) - F(\hat\lambda)] \le \frac{\|\lambda^0-\lambda^*\|^2 + \sum_{s=0}^{t}\alpha_s^2\|g^s+e(s)\|^2}{2\sum_{s=0}^{t}\alpha_s} + \frac{2\sum_{s=0}^{t}\alpha_s\langle e(s), \lambda^s - \lambda^*\rangle}{2\sum_{s=0}^{t}\alpha_s}. \quad (*)$$
To simplify (*), note that $g^s + e(s)$ is a vector whose elements are in $[-1,1]$; therefore $\|g^s + e(s)\|^2 \le n^2$, where $n^2$ is the dimension of the vector. Furthermore,
$$\langle e(s), \lambda^s - \lambda^*\rangle \le |\langle e(s), \lambda^s - \lambda^*\rangle| \le \|e(s)\|_1\|\lambda^s - \lambda^*\|_\infty \le \|e(s)\|_1\big(\|\lambda^s - \lambda^0\|_\infty + \|\lambda^0 - \lambda^*\|_\infty\big),$$
and $\|\lambda^s - \lambda^0\|_\infty \le \sum_{q=0}^{s}\alpha_q\|\Delta^q\|_\infty \le \sum_{q=0}^{s}\alpha_q$, where $\Delta^q$ is the change in the value of $\lambda$ at step $q$. Combining this with the previous inequality,
$$\langle e(s), \lambda^s - \lambda^*\rangle \le \|e(s)\|_1\Big(\sum_{q=0}^{s}\alpha_q + \|\lambda^0 - \lambda^*\|_\infty\Big).$$
Combining this with (*), and letting $B = \max\{\|\lambda^0-\lambda^*\|_\infty, \|\lambda^0-\lambda^*\|_2^2\}$, we get
$$\mathbb{E}[F(\lambda^*) - F(\hat\lambda)] \le \frac{B + n^2\sum_{s=0}^{t}\alpha_s^2 + 2\sum_{s=0}^{t}\alpha_s\|e(s)\|_1(\sum_{q=0}^{s}\alpha_q + B)}{2\sum_{s=0}^{t}\alpha_s}.$$
Using our approximation oracle, we can choose $\|e(s)\|_1 \le \epsilon\alpha_s/(\sum_{q=0}^{s}\alpha_q + B)$, which yields
$$\mathbb{E}[F(\lambda^*) - F(\hat\lambda)] \le \frac{B + (n^2 + 2\epsilon)\sum_{s=0}^{t}\alpha_s^2}{2\sum_{s=0}^{t}\alpha_s}. \quad (2.34)$$
Recall that $\alpha_s = 1/\sqrt{s}$. Therefore $\sum_s \alpha_s = \Theta(\sqrt{t})$ while the numerator scales as $\log t$, so the quantity above converges to zero and $F(\hat\lambda)$ converges to $F(\lambda^*)$. Ignoring constants, the bound in (2.34) scales like $(B + n^2\log t)/\sqrt{t}$. Therefore, for any $\gamma > 0$ and $t \ge T$ with
$$T = \Theta\big(\epsilon^{-2-\gamma}(\|\lambda^*\|_\infty + \|\lambda^*\|_2 + n^2)^{2+\gamma}\big),$$
we have $\mathbb{E}[F(\lambda^*) - F(\hat\lambda)] \le \epsilon$.

2.5.6 Proof of Theorem 5

As mentioned earlier, this result follows by a direct adaptation of known results; we present it here for completeness.
For the case of first-order marginal data, the partition function $Z(\lambda) = \sum_{\sigma\in S_n}\exp(\sum_{i,k}\lambda_{ik}\mathbb{1}_{\{\sigma(i)=k\}})$ can be rewritten as
$$Z(\lambda) = \sum_{\sigma\in S_n}\prod_{i=1}^{n} e^{\lambda_{i\sigma(i)}},$$
which can be recognized as the permanent of the matrix $A = [e^{\lambda_{ik}}]$, denoted $\mathrm{Perm}(A)$. To prove that computing $\mathrm{Perm}(A)$ is #P-hard, we provide a reduction from the problem of computing the permanent of a (0,1)-matrix, which is known to be #P-hard [45], as follows: given a (0,1)-matrix $\hat A$, we construct the matrix $A$ by setting $A_{ik} = n!+1$ (i.e., $\lambda_{ik} = \ln(n!+1)$) when $\hat A_{ik} = 0$, and $A_{ik} = n!+2$ (i.e., $\lambda_{ik} = \ln(n!+2)$) when $\hat A_{ik} = 1$. Combining the facts that $\mathrm{Perm}(\hat A) \le n!$ and that $M(n!+1) \equiv 0 \pmod{n!+1}$ for any integer $M$, it is easy to see that
$$\mathrm{Perm}(\hat A) = \mathrm{Perm}(\hat A) \bmod (n!+1) = \mathrm{Perm}(A) \bmod (n!+1),$$
which concludes the first half of the proof.

For the case of comparison data, we prove that computing the partition function $Z(\lambda) = \sum_{\sigma\in S_n}\exp(\sum_{i<j}\lambda_{i<j}\mathbb{1}_{\{\sigma(i)<\sigma(j)\}})$ is at least as hard as counting the number of Hamiltonian paths in a directed graph, which is #P-hard [48]. To that end, we provide the following reduction: given a directed graph $G = (V,E)$, we set $\lambda_{i<j} = -\ln(n!+1)$ for all $(i,j) \in E$, and $\lambda_{i<j} = 0$ for $(i,j) \notin E$. If we rewrite the partition function as
$$Z(\lambda) = \sum_{\sigma\in S_n}\prod_{(i,j):\,\sigma(i)<\sigma(j)} e^{\lambda_{i<j}},$$
it becomes clear that the product $\prod_{(i,j):\sigma(i)<\sigma(j)} e^{\lambda_{i<j}}$ is equal to 1 when $\sigma$ corresponds to a Hamiltonian path, and is at most $1/(n!+1)$ otherwise. Since we have a total of $n!$ terms inside the summation, computing the floor of the partition function, $\lfloor Z(\lambda)\rfloor$, gives us the number of Hamiltonian paths in the graph. This concludes the second half of the proof.

Note that in order for the two reductions above to be valid, we need to make sure that any parameters used can be represented in a number of bits that is polynomial in the size of the problem $n$. This is indeed the case, since we only need $O(n\log n)$ bits to represent $n!$.
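The modular-arithmetic trick in the first reduction can be checked directly for small $n$ with exact integer arithmetic; the brute-force permanent below is, of course, for illustration only, since brute force defeats the purpose of the reduction.

import itertools
from math import factorial

def permanent(M):
    """Brute-force permanent over all permutations (exact, for tiny n)."""
    n = len(M)
    total = 0
    for sigma in itertools.permutations(range(n)):
        p = 1
        for i in range(n):
            p *= M[i][sigma[i]]
        total += p
    return total

n = 4
A_hat = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 1], [0, 1, 1, 1]]
# A_ik = n!+1 when A_hat_ik = 0 and A_ik = n!+2 when A_hat_ik = 1.
A = [[factorial(n) + 1 + A_hat[i][k] for k in range(n)] for i in range(n)]
assert permanent(A_hat) == permanent(A) % (factorial(n) + 1)
print(permanent(A_hat))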
2.5.7 Proof of Theorem 6

Definition 1. The $\chi^2$ distance between distributions $p$ and $q$, denoted $\|p/q - 1\|_{2,q}$, is defined by
$$\Big\|\frac{p}{q} - 1\Big\|_{2,q}^2 = \sum_i q_i\Big(\frac{p_i}{q_i} - 1\Big)^2.$$

Definition 2. Consider an $|\Omega|\times|\Omega|$ nonnegative valued matrix $A \in \mathbb{R}^{|\Omega|\times|\Omega|}$ and a vector $u \in \mathbb{R}^{|\Omega|}$. Then the matrix norm of $A$ with respect to $u$ is defined as
$$\|A\|_u = \sup_{v:\,\mathbb{E}_u[v]=0} \frac{\|Av\|_{2,u}}{\|v\|_{2,u}}, \qquad \text{where } \mathbb{E}_u[v] = \sum_i u_i v_i.$$

Recall that the discrete-time Markov chain we are using is reversible and aperiodic, and therefore ergodic. Let $p^*: S_n \to [0,1]$ be its stationary distribution, and let $p(t)$ be the distribution at time step $t$. The dynamics of $p(\cdot)$ are given by $p(t) = p(t-1)P = p(0)P^t$, where $P = [P_{\sigma\rho}]$ is the transition matrix specified previously, with off-diagonal entries $P_{\sigma\rho} = \frac{1}{2}\binom{n}{2}^{-1}\min\{1, \exp(W_\lambda(\rho) - W_\lambda(\sigma))\}$ for $\rho$ obtained from $\sigma$ by swapping two elements. Using the properties of the matrix norm, we obtain
$$\Big\|\frac{p(t)}{p^*} - 1\Big\|_{2,p^*} \le \|P\|_{p^*}^t\,\Big\|\frac{p(0)}{p^*} - 1\Big\|_{2,p^*}. \quad (2.35)$$
Therefore, to bound the distance between $p(t)$ and $p^*$, we need a bound on $\|P\|_{p^*}$.

Lemma 2. The matrix norm of $P$ is bounded as $\|P\|_{p^*} \le 1 - \exp\big(-\Theta(n^2\|\lambda\|_\infty + n\log n)\big)$.

Proof. Recall that the partition function, or normalization constant, of $p^*(\cdot)$ is $Z(\lambda) = \sum_{\sigma\in S_n}\exp(W_\lambda(\sigma))$. It follows that
$$Z(\lambda) \le n!\,\exp(n^2\|\lambda\|_\infty) \le \exp\big(\Theta(n^2\|\lambda\|_\infty + n\log n)\big).$$
Therefore, for any $\sigma \in S_n$,
$$p^*(\sigma) = \frac{\exp(W_\lambda(\sigma))}{Z(\lambda)} \ge \exp\big(-\Theta(n^2\|\lambda\|_\infty + n\log n)\big).$$
For any two permutations $\sigma, \rho \in S_n$ that differ by a swap of two elements (so that the chain can transition between them in one step), we have
$$P_{\sigma\rho} \ge \exp\big(-\Theta(n^2\|\lambda\|_\infty + n\log n)\big).$$
Given this, we can bound the conductance $\Phi$ of $P$ as follows:
$$\Phi = \min_{S\subset S_n} \frac{Q(S, S_n\setminus S)}{p^*(S)\,p^*(S_n\setminus S)} \ge \min_{\sigma,\rho\in S_n} p^*(\sigma)P_{\sigma\rho} \ge \exp\big(-\Theta(n^2\|\lambda\|_\infty + n\log n)\big),$$
where $Q(A,B) = \sum_{\sigma\in A,\rho\in B} p^*(\sigma)P_{\sigma\rho}$. By Cheeger's inequality, the second largest eigenvalue $\lambda_{\max}$ of $P$ is bounded as
$$\lambda_{\max} \le 1 - \frac{\Phi^2}{2} \le 1 - \exp\big(-\Theta(n^2\|\lambda\|_\infty + n\log n)\big).$$

From Lemma 2, and using the fact that $\|p(0)/p^* - 1\|_{2,p^*} \le 1/\sqrt{\min_\sigma p^*(\sigma)} \le \exp(\Theta(n^2\|\lambda\|_\infty + n\log n))$, we have $\|p(t)/p^* - 1\|_{2,p^*} \le \delta$ for all $t \ge T_c$, where
$$T_c = \Theta\big(\exp\big(\Theta(n^2\|\lambda\|_\infty + n\log n)\big)\,\log(1/\delta)\big).$$

Chapter 3

The Multiple Ranking Problem

Our contribution to the problem of learning the MMNL model, as stated previously, can be outlined as follows: first, we provide a construction of a mixture choice model that demonstrates the difficulty of the problem. This difficulty suggests that one needs to impose further conditions on the model to guarantee learnability from data. Guided by common consumer behavior, we provide sufficient conditions under which this is possible, together with algorithms and error bounds for learning.

To demonstrate the difficulty of the problem, we identify a fundamental limitation for any algorithm attempting to learn a mixture choice model in general. Specifically, we show that for $K = \Theta(n)$ and $\ell = \Theta(\log n)$, there exists a pair of distinct MMNL models such that the partial preferences generated from both are identical in distribution. That is, even with an infinite number of samples, it is not possible to distinguish said models. This naturally suggests that one needs to impose further conditions for the learnability of MMNL models. Guided by consumer behavior, we provide sufficient conditions under which this is possible. Concretely, we account for the following behavioral patterns: (a) consumers have a natural tendency to provide information about their extreme preferences (e.g., movies they love and hate); (b) the 'amount' of liking (disliking) decays (grows) 'quickly' as we go down a consumer's preference list. This intuition can be captured through conditions on the structure of the parameters of the model. Under a sufficient set of such conditions, we establish that it is possible to identify the MMNL components with high probability (probability going to 1 as the parameters $n$ and $N$ scale to $\infty$) as long as $\ell = O(\log n)$ and $N = \Omega(n)$. Other works have made similar assumptions; for instance, in the context of collaborative filtering, [46] considers positive-only feedback.

We establish this result using a simple clustering algorithm together with a subroutine for learning the parameters of the top items in each mixing component. The clustering algorithm (the Common-Neighbors algorithm below) is simple: it places two consumers in the same cluster if and only if they share more than half of their neighbors. As such, it does not require any knowledge of the number of components in the underlying model, or of any other parameters for that matter. Within each cluster, we then learn the weights of the top $\ell$ elements. The Learn-Weights-of-Top-Elements algorithm, which accomplishes that, works by simulating a Markov chain whose states are items and whose stationary distribution is proportional to the weights of the elements. This approach is similar to that of Negahban et al. [35].

3.1 Difficulty of Learning the MMNL Model: A Lower Bound

Recall that we are interested in learning the MMNL model from samples of length $\ell$, with the possibility that $\ell < n$, the total number of items. At a high level, one is justified in asking whether this task is always possible. Intuitively, given the richness of the MMNL family, one would suspect that there are instances of the problem where this task is difficult or impossible. Here we present a result that illustrates this difficulty.
This result makes use of a construction that involves two mixture choice models that cannot be distinguished when the sample length $\ell$ is not large enough. For ease of exposition, and without loss of generality, we assume that the learning task requires computing an assignment of each data point to the correct mixture component from which it was sampled. Thus, the inability to correctly assign samples to the correct mixture component translates into the inability to learn. This intuition is captured in the following result.

Theorem 7. Let $S \in \mathbb{N}^+$ and $\delta > 0$. Then, for any $i > 0$, there exists a pair of MMNL models ($M_1$ and $M_2$) with $n = 2^{i+1}$ items and $K = 2^i$ mixture components, such that no algorithm with access to $S$ random samples of length $\ell = 2i+1$ can decide with probability higher than $\frac{1}{2} + \delta$ whether the samples are from $M_1$ or $M_2$.

This theorem establishes the existence of mixtures with $K = \Theta(n)$ mixing components where it is impossible to assign samples of length $\Theta(\log n)$ to their correct components, thus rendering the learning task impossible as well. This is all the more interesting when compared with the positive results provided later.

3.2 An Algorithm

The lower bound presented in the previous section suggests that the task of learning an MMNL model becomes difficult when the model in question is 'close' enough to another distribution to make the two indistinguishable using partial data. This leads us to the question of when learning such models becomes possible. To provide an answer to this question, we start by presenting an algorithm for learning the MMNL model. The algorithm consists of a number of steps or subroutines: a preprocessing step, a clustering step, and a component learning subroutine. Taken together, these subroutines provide a tractable algorithm for learning the MMNL model. They also provide an outline for establishing the results in Section 3.3. The details of the clustering and component estimation subroutines are presented as the Common-Neighbors algorithm and the Learn-Weights-of-Top-Elements algorithm. The Common-Neighbors algorithm partitions the data into clusters that correspond to the MNL components in the mixture; Learn-Weights-of-Top-Elements is intended for learning the weights of the top $\ell$ elements in each permutation. In Section 3.3 we provide strong probabilistic guarantees on their correctness.

3.2.1 Preprocessing

Given a data set with samples $\{\hat\sigma_1,\dots,\hat\sigma_N\}$, we construct an undirected sample graph which is used as the input for our algorithm. Each node in the sample graph corresponds to one of the $N$ samples. Since we are interested in clustering the nodes, and therefore the corresponding samples, the edges of the graph are constructed to reflect the 'similarity' between these samples with respect to the underlying model. Roughly speaking, if two samples come from the same MNL component, we would like to have an edge between them, and if they come from different components, we would like to have no edge. More concretely, let $W$ be an $N\times N$ matrix denoting the symmetric binary adjacency matrix of the sample graph. We would like to specify the edge entries $W_{ij}$ consistently with this intuition. To that end, we have the following definition:
$$W_{ij} = \begin{cases} 0, & \text{if } \mathrm{overlap}(\hat\sigma_i, \hat\sigma_j) < \frac{1}{2s}, \\ 1, & \text{otherwise,} \end{cases}$$
where $s$ is a constant and $\mathrm{overlap}(\hat\sigma_i, \hat\sigma_j)$ denotes the fraction of items shared among the top $\ell$ items of $\hat\sigma_i$ and $\hat\sigma_j$. For instance, if $\ell = 5$, $\hat\sigma_i = (A, B, C, D, E)$ and $\hat\sigma_j = (A, C, B, E, F)$, then $\mathrm{overlap}(\hat\sigma_i, \hat\sigma_j) = 0.8$. A minimal sketch of this construction follows.
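The sketch below assumes the samples are given as top-$\ell$ lists; the helper names are ours.

import numpy as np

def overlap(a, b):
    """Fraction of items shared among the top-l items of two samples."""
    return len(set(a) & set(b)) / len(a)

def sample_graph(samples, s):
    """Binary adjacency matrix W: an edge when overlap >= 1/(2s)."""
    N = len(samples)
    W = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(i + 1, N):
            if overlap(samples[i], samples[j]) >= 1.0 / (2.0 * s):
                W[i, j] = W[j, i] = 1
    return W

# Usage, with the example from the text: overlap is 4/5 = 0.8.
print(overlap(("A", "B", "C", "D", "E"), ("A", "C", "B", "E", "F")))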
The parameter $s$ in this construction is given, and depends on the underlying generative model: $s = \frac{0.75}{1 - 0.75^{\beta-1}}(1.01)$. Whereas this may seem cryptic for the moment, it follows easily from the Stretching Factor Lemma, which is proved in the appendix. As we show in the results section, this definition of $W$, in conjunction with some assumptions on the underlying model, has the desired properties regarding the similarity and dissimilarity of sample nodes. Intuitively, since each sample 'emphasizes' the top items of its MNL component, and considering MNL components that are sufficiently 'far' apart, and where this emphasis is significant, we would expect the overlap between any two samples to reflect the relationship between their origins.

3.2.2 Clustering

In this section we describe the algorithm used to group samples into their original clusters. We first define precisely what a correct clustering is.

Definition 3 (Clustering of samples, correct clustering). A clustering $C = \{C_j\}$ of the samples is a finite partition of the set of all samples $\Sigma = \{\hat\sigma_i\}_{i\in\{1,\dots,N\}}$. That is, it is a finite collection of subsets $C_j \subseteq \Sigma$ such that $C_i \cap C_j = \emptyset$ for $i \neq j$ and $\cup_j C_j = \Sigma$. Furthermore, we call a clustering a correct clustering if the nodes associated with samples $\hat\sigma_i$ and $\hat\sigma_j$ are clustered together if and only if they are samples of the same underlying component in the mixture distribution.

The algorithm below uses the union-find data structure for partitions. Union-find allows one to easily merge two sets into one (via union) and to retrieve the block of the partition to which a particular element belongs (via find); see [13] for more details. We begin with each node in its own set, and apply the union operation (which merges the sets containing the two nodes) whenever the nodes' neighborhoods overlap by more than half. Below, $N(v)$ denotes the set of neighbors of $v$, and $E$ denotes the set of edges.

Common-Neighbors
Input: Edge list $E$
1: Initialize clustering $C = \{\{\hat\sigma_i\}\}$ (each sample in its own set).
2: for $(u,v) \in E$ do
3:   if $|N(u) \cap N(v)| > \frac{1}{2}\min\{|N(u)|, |N(v)|\}$ then
4:     perform union($u$, $v$)
5:   end if
6: end for

Upon termination, the algorithm produces a clustering of the samples; in particular, if any two samples share over half of their neighbors, they will be clustered together. We would like to guarantee that, with high probability, nodes are clustered together if and only if they come from the same component of the mixture distribution of the MMNL model. In Section 3.3, Theorem 8, we provide one such guarantee under assumptions on the underlying MMNL model and the length $\ell$ of the truncated top-$\ell$ samples. A compact implementation sketch follows.
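The sketch uses a simple union-find with path halving; the toy input (two disjoint 4-cliques) is ours.

from itertools import combinations

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def common_neighbors(num_nodes, edges):
    """Merge u and v when they share more than half of the neighbors of the
    less-connected of the two, as in the pseudocode above."""
    nbrs = [set() for _ in range(num_nodes)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    uf = UnionFind(num_nodes)
    for u, v in edges:
        if len(nbrs[u] & nbrs[v]) > 0.5 * min(len(nbrs[u]), len(nbrs[v])):
            uf.union(u, v)
    clusters = {}
    for x in range(num_nodes):
        clusters.setdefault(uf.find(x), []).append(x)
    return list(clusters.values())

edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2))
print(common_neighbors(8, edges))   # [[0, 1, 2, 3], [4, 5, 6, 7]]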
3.2.3 Learning Weights within Clusters

Above we saw that, with high probability, we can distinguish between samples that come from different underlying permutations. Our original goal, however, was not just clustering, but also learning the weights of the top items of each permutation. Once the weights are learned, a prediction system can cater top recommendations according to the cluster to which a particular user belongs. Our algorithm for learning the weights of the top $\ell$ elements within each cluster works as follows: we construct a Markov chain in which each state is an item, and in which the transition probabilities are based on the history of pairwise comparisons between the items. Theorem 9 then tells us that this Markov chain's stationary probabilities are proportional to the weights of the items. This approach is inspired by, and similar to, that of Negahban et al. [35]. There are two main differences between what we do here and what is done in [35]. The first is that we have a full transition probability matrix, rather than a sparse one. The second is that the transition probabilities have an extra source of noise, due to possible misclustering.

Learn-Weights-of-Top-Elements
Input: $\{\hat\sigma_1,\dots,\hat\sigma_{N_C}\}$, samples of length $\ell$ that were placed in a cluster $C$ together.
1: Collect the $\ell/r^*$ most frequently seen items in $\{\hat\sigma_1,\dots,\hat\sigma_{N_C}\}$, for a constant $r^*$ defined below.
2: Construct a transition matrix on those $\ell/r^*$ elements, with transition probabilities $P_{ij} = \frac{r^*}{\ell}\cdot\frac{A_{ij}}{N_{ij}}$ for $i \neq j$ and $P_{ii} = 1 - \sum_{j\neq i} P_{ij}$, where $N_{ij}$ is the number of times that $i$ and $j$ are seen together, and $A_{ij}$ is the number of those occurrences in which $j$ is ranked above $i$.
3: Compute the stationary distribution of the Markov chain.
4: Return the stationary probabilities of the $\ell/(s^*r^*)$ most likely items, where $s^*$ is defined below.

Learn-Weights-of-Top-Elements works by constructing a Markov chain whose stationary distribution is proportional to the weights of the top $\ell/(s^*r^*)$ items of the MNL model. The constants $s^*$ and $r^*$ are given by $s^* = \frac{0.99}{1 - 0.99^{\beta-1}}(1.01)$ and $r^* = (1-0.99)^{-1/(\beta-1)}(1.01)$. Whereas the presence of the constants $s^*$ and $r^*$ may seem unclear for now, the intuition behind them is the following: we would like the top $\ell/s^*$ ranked elements to come from the top $\ell$ elements of $\pi$, and we would like all of the top $\ell/(s^*r^*)$ items to be present within those top $\ell/s^*$ elements. A sketch of this subroutine follows.
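Two caveats of ours in the sketch below: consistent with the proof of Theorem 9, $A_{ij}$ counts the occurrences in which $j$ is ranked above $i$ (so the chain drifts toward preferred items), and the stationary distribution is obtained by an eigenvector computation rather than by simulating the chain.

from collections import Counter
import numpy as np

def learn_top_weights(samples, num_keep, num_return):
    # Step 1: the num_keep (= l/r*) most frequently seen items in the cluster.
    freq = Counter(item for s in samples for item in s)
    items = [it for it, _ in freq.most_common(num_keep)]
    idx = {it: a for a, it in enumerate(items)}
    k = len(items)
    N = np.zeros((k, k))   # N[i, j]: times i and j seen together
    A = np.zeros((k, k))   # A[i, j]: of those, times j ranked above i
    for s in samples:
        ranks = {it: r for r, it in enumerate(s) if it in idx}
        seen = list(ranks)
        for a in range(len(seen)):
            for b in range(a + 1, len(seen)):
                i, j = idx[seen[a]], idx[seen[b]]
                N[i, j] += 1; N[j, i] += 1
                if ranks[seen[a]] < ranks[seen[b]]:   # seen[a] is preferred
                    A[j, i] += 1
                else:
                    A[i, j] += 1
    # Step 2: P_ij = (1/k) * A_ij / N_ij off the diagonal, lazy self-loops.
    P = np.where(N > 0, A / np.maximum(N, 1), 0.0) / k
    np.fill_diagonal(P, 0.0)
    P += np.diag(1.0 - P.sum(axis=1))
    # Steps 3-4: stationary distribution and the most likely items.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    pi /= pi.sum()
    return [(items[o], pi[o]) for o in np.argsort(-pi)[:num_return]]

# Usage: noisy top-5 lists that consistently prefer lower-numbered items.
samples = [[0, 1, 2, 3, 4], [0, 2, 1, 3, 5], [1, 0, 2, 4, 3]] * 50
print(learn_top_weights(samples, num_keep=6, num_return=3))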
Before we proceed to establish a guarantee for this algorithm, a few remarks are in order. First, the division of the algorithm into clustering and component estimation is inspired by similar approaches to learning mixture models. One such approach, which has become popular for learning the Gaussian Mixture Model (GMM), is the Expectation-Maximization (EM) algorithm: given data points assumed to originate from an underlying Gaussian mixture, one is interested in learning the mixing distribution and the parameters of the Gaussian components (i.e., the means and the variances) that constitute a Maximum Likelihood (ML) estimate. Second, we view the assignment problem as that of clustering the data points in a way consistent with their component 'membership'. Finally, we note that while the clustering above is done using the Common-Neighbors algorithm, this need not be the case in applications; one can use any off-the-shelf graph clustering algorithm (e.g., spectral clustering) in a heuristic fashion.

3.3 Algorithmic Guarantee

In this section, we present our results regarding the correctness of the Common-Neighbors and Learn-Weights-of-Top-Elements algorithms. Recall that Theorem 7 suggests that one needs to make additional assumptions about the class of underlying MMNL models. As before, we consider an MMNL model over $n$ items with $K$ mixing components, and we assume that our data points take the form of partial orders of length $\ell = \Theta(\log n)$ sampled from the mixture using the standard sampling procedure. Next, we restrict the MMNL distributions of interest to a family of models that we call power-law MMNL models, defined as follows:

Definition 4 (Power-law MNL Model). Let $\Pr(\cdot\,; w)$ be an MNL model over $n$ items, let $\pi$ be the underlying permutation, and let $w_{\pi_1},\dots,w_{\pi_n}$ be the corresponding weights. We call this model a power-law MNL model with parameter $\alpha$ if it is an MNL model and the weights satisfy $w_{\pi_i} \propto i^{-\alpha}$. As a shorthand, we denote such a power-law MNL model by MNL($\pi$, $\alpha$). Similarly, we denote a mixture of such power-law MNL models by MIX-MNL($\Pi(n)$, $\alpha$), where for each $n$, $\Pi(n) = \{\pi_1,\dots,\pi_K\}$ is a set of $K$ permutations over $\{1,\dots,n\}$.

Sampling from a mixture of power-law MNL models MIX-MNL($\Pi(n)$, $\alpha$) means that one of the $K$ underlying permutations, say $\pi_i$, is chosen at random, and that we then sample according to MNL($\pi_i$, $\alpha$), as sketched below.
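The sketch performs sequential, Plackett-Luce-style sampling without replacement; the parameters in the usage lines are ours.

import numpy as np

def sample_top_l(pi, beta, l, rng):
    """Draw a top-l partial ranking from MNL(pi, beta), w_{pi_i} ~ i^{-beta}."""
    w = np.arange(1, len(pi) + 1.0) ** (-beta)  # weight of position i of pi
    items = np.array(pi)
    out = []
    for _ in range(l):
        choice = rng.choice(len(items), p=w / w.sum())
        out.append(int(items[choice]))
        items = np.delete(items, choice)        # sample without replacement
        w = np.delete(w, choice)
    return out

def sample_mixture(perms, beta, l, N, rng=None):
    """Pick a component uniformly at random, then sample a top-l list."""
    rng = np.random.default_rng(rng)
    return [sample_top_l(perms[rng.integers(len(perms))], beta, l, rng)
            for _ in range(N)]

# Usage: K = 2 components over n = 100 items, l of order log n.
perms = [list(range(100)), list(range(100))[::-1]]
print(sample_mixture(perms, beta=2.0, l=5, N=4, rng=0))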
Then for a number of samples N = Q(n), we get that Algorithm 0 asymptotically learns the weights of the top O(log(n)) items of each of the K clusters with probability 1 - 0 (log2 n/n). Illustration 3.3.1 Theorem 10 below tells us that the asymptotic distinctiveness criteria is not a very stringent one. In particular, we can create asymptotically distinct sequences by a reasonable random procedure: simply generate Wr, ..., 7rK according to Power-law MNL Theorem 10. Let {H1 (n)}Ez+ be a sequence of underlyingpermutations, where n(n) {71, ... , 7K} have 7r1 , ... , is a set of K permutationsover {1, ... 7rK ~ MNL(fIIJ, a). Letul, Then, for 0 < a < 1 < /, *- , ON~ - model with appropriate parameter. , n}, and where for each n E Z+ we MIX-MNL(H(n), 0) be the N samples. with probability 1 - 0(1/N), for any n > no (for a 61 constant no large enough) the algorithm provides a (1 - onN)-correct clustering asymptotically as N -- oc. Note that as a special case (where we take a = 0) of Theorem 10 we have the case in which the underlying permutations rj, ..., 7rK are uniformly random permutations of {1, ... , n}. More importantly, the important message that Theorem 10 tells us is that even if the underlying sequences are moderately correlated, we can still cluster nodes almost correctly. Whereas the requirement of Asymptotic Distinctiveness provides a technical condition on the underlying sequences, sequences 10, via the MNL generation with its explicit weights, gives us a more intuitive view of how similar the underlying sequences can be. Proofs 3.4 3.4.1 Proof of Theorem 7 Proof. We establish this result constructively. In particular, we will do the following steps: " Base Case: construct a simple example of a pair of MNL models that cannot be distinguished by noiseless samples of length 1 = 3 " Iterative Procedure: provide an iterative procedure to construct pairs of more complex MMNL M1 and M2 with the same property. " Setting weights: we choose the item weights so that, with probability 1 - 6, none of the S samples will contain an incorrect ordering (hence being effectively noiseless), where by a sample o containing incorrect ordering we mean ai < o-, whereas iri > ry " Conclusion: From this we conclude that no algorithm with access to S samples can tell Mi and M2 apart with probability at most 1 + 6. 62 Mixture 2 Mixture 1 A B A B B A B A C D D C D C C D 0.5 0.5 0.5 0.5 Figure 3-1: Example (n=4, k=2, 1=3) Base Case: consider the following MMNL model (the weights will be set later) over no = 4 items, with ko = 2 components, with a uniform mixing distribution, illustrated in Figure 3-1. For the deliberately constructed mixture above, any correct partial ordering of length l < 3 can come from either of the two mixtures; on the other hand, when l = 4, we get a full permutation, and since the two mixtures have different permutations, this is sufficient to distinguish between them. For convenience, we refer to one such pair as an (n, k, 1)-indistinguishable pair; thus, the pair in this example is (4, 2, 3)-indistinguishable. Iterative Procedure: using an (ni_1, ki_ 1 , li- 1 )-indistinguishable pair of mixtures similar to the ones in the aforementioned example, we can now construct a (2ni_1, 2ki_ 1 , li_ 1 + 2)-indistinguishable pair as follows: (a) Create 2 copies of the mixture with 2ki_ 1 components created by combin- ing all of the components present in both mixtures of the (ni_ 1 , ki_ 1 , l_1)indistinguishable pair in one mixture. 
Iterative procedure: using an $(n_{i-1}, k_{i-1}, \ell_{i-1})$-indistinguishable pair of mixtures similar to the one in the example above, we can construct a $(2n_{i-1}, 2k_{i-1}, \ell_{i-1}+2)$-indistinguishable pair as follows:
(a) Create two copies of the mixture with $2k_{i-1}$ components obtained by combining all of the components present in both mixtures of the $(n_{i-1}, k_{i-1}, \ell_{i-1})$-indistinguishable pair into one mixture. Further, assign all the components in both copies uniform mixing probabilities. We refer to these new mixtures as copy 1 and copy 2, and to the components originating from $M_1$ and $M_2$ as base set 1 and base set 2, respectively.
(b) In copy 1, append a permutation $p$ over $n_{i-1}$ new items to every component from base set 1, and its reversal $p^{-1}$ to every component from base set 2, where $p = [p_1,\dots,p_{n_{i-1}}]$ and $p^{-1} = [p_{n_{i-1}},\dots,p_1]$.
(c) In copy 2, append $p^{-1}$ to every component originating from base set 1, and $p$ to every component originating from base set 2.
This iteration, starting from our previous example, is illustrated in Figure 3-2.

[Figure 3-2: Iteration step of the construction. Both copies combine base sets 1 and 2 (each component with mixing probability 0.25), with $p$ and $p^{-1}$ attached in opposite ways in the two copies.]

Now, we claim that the resulting two mixtures can only be distinguished with samples of length at least $\ell_i = \ell_{i-1} + 3$ (i.e., the pair is $(2n_{i-1}, 2k_{i-1}, \ell_{i-1}+2)$-indistinguishable). To see why this is the case, note that both mixtures contain components from base set 1 and base set 2, as well as the permutations $p$ and $p^{-1}$. The only difference between the pair is which base set, 1 or 2, $p$ and $p^{-1}$ are attached to. Therefore, to distinguish between the two mixtures, we need the samples to identify a combination of the base set (1 or 2) and the permutation attached to it ($p$ or $p^{-1}$). To identify a base-set component we need length at least $\ell_{i-1} + 1$, and to identify the attached permutation we need at least 2 additional items. Thus, we need length at least $\ell_i = \ell_{i-1} + 3$ to distinguish between the two mixtures, as desired.

Setting weights: we now set the weights of each underlying MNL component to decay fast enough that any single sample exhibits an incorrect ordering with probability at most $\delta/S$.

Conclusion: by the construction above, whenever we get samples with correct orderings, they are always equally likely to come from $M_1$ as from $M_2$. Hence, as long as all the samples have correct orderings, even an optimal algorithm (one that calculates the
This in turns is demonstrated by showing that for any set of data points, if these points are "similar" to their underlying permutation, 7, the corresponding graph will contain edges between these "simiar" nodes, and no edges to other nodes that are not "similar". Finally, we show that data points that do not satisfy this condition cannot affect the overall clustering. To translate this outline into a concrete proof, we start with a precise definition of our notion of "similarity" as Asymptotic Similarity as follows: Definition 7 (Asymptotic Similarity). Let { (,(n), -(n) ) }nCZ+, be a sequence of pairs of permutations over n items. We say that this sequence is asymptotically similar if there exists a large enough n such that * 1 of the top log(n) items of 7r(n) are present in the top s(B, 3/4) log(n) items of the -(n), and 65 * a fraction p' , for of the top log(n) items of 9(l) come from the top r(3, p') log(n) of the +r(n),for 0 < p' < 1. To establish the relationship between Asymptotic Similarity and the presence/absence of edges in the sample graph, we will need the following two results, whose proof is included later in the appendix: Lemma 3 (Stretching Factor Lemma). Let o- MNL(w, 3) for some underlying permutation 7r. Then, for 0 < p,p' < 1, / > 1, T, T' > 0 we get: " a proportionp of the top Tl items of 7r are in the top T1s( 3 , p) of a with probability 1 - 0(1/n), and " a proportionp' of the top T'l items of o are from the top T'lr(3, p') of 7w with probability 1 - 0(1/n), where s(/, p) = pft 1 1 (1.01) and r (, p) = (1 - p) (1.01). Proposition 2. Let {E(n)}nEz+ be a sequence of samples, where each E(n) = {U-i, contains N samples of permutations over {1, ... , UN} ... , n}. Also, say that each set E(n) con- tains independent samples each generated from MIX-MNL(fl(n), a) from asymptotically distinct {H(n) }, and such that the samples are each asymptotically similar to their respective underlying sequence. Then, asymptotically, for each i -fj, our algorithm will place an edge between o-i and oj if and only if they were generatedfrom the same underlying permutation. With this in mind, we can proceed to prove Theorem 8. 3.4.2 Proof of Theorem 8 Proof. For clarity, we break the proof into the following steps: 1. For any given mixing component, and for any 6 > 0, we have N(1-6) points sampled from that component w.h.p. 66 2. For any given a generated from a Power-Law MNL, MNL(7r, /), where # > 1, a and 7r are asymptotically similar with probability 1 - 0(1/n). 3. For any given mixing component, and for any E > 0, there exists an index no such that for any n > no at least 1 - E fraction of the samples satisfy the asymptotic similarity w.h.p. This result, in conjunction with Proposition 2, allows us to conclude that at least 1- E of the samples will be clustered correctly by the algorithm. 4. For any E < 1 , the fraction of nodes that fail to satisfy this condition cannot affect the clustering of the remaining nodes. 5. Finally, letting E scale as lgN and using the probability bound in (3), we obtain the desired result. Note that (1) can be established through a direct application of the Chernoff Bound to the mixing distribution to yield a probability bound of 1 - exp(-O(N)). To show (2), we need to show that the sequence in question satisfies both parts of Definition 7. The first, and second, part of the definition follows from the Stretching Factor Lemma by picking p = 3/4, and T = 1, and T' = s(o, 3/4), respectively, with probability 1 - 0(1/n) as desired. 
To show (3), let no be the smallest index such that for any n > no, any sample a and the corresponding 7r are asymptotically similar with probability at least 1 - 2 where the existence of no follows from (2). Now let Ai be the indicator random variable that sample ai and its parent permutation are asymptotically similar. Further, let A = 1 Ai. Then, since E[A] > (1 - e/2)N, we get by the Chernoff bound that P (A < (1 - E)N)) P (A < (1 - c/2)N < E[A])) < exp (-E(NE 2 )). Hence, by Proposition 2, we see that (1 - c)N of the samples are clustered properly with probability at least 1 - exp (-e(NE 2 )). Now, we can establish (4) by using (3) 67 with E < 1j-6-1 we have at most Nincorrect = = -jN nodes that are potentially EN disconnected from the nodes in their correct cluster. Using (1) and (3), we know that each of the correctly clustered (1 - E)N nodes is connected to at least Nk = (1-3)(1-E)-Ng nodes in its (correct cluster) w.h.p. Now, and without loss of generality, let us assume that the mis-clustered nodes have been connected to a single (wrong) cluster. Since Nincorrect !Nk, it is easy to see that these erroneous connections cannot effect the correctly clustered nodes. Further, since the overall clustering can be fixed by reassigning at most cN nodes, this also results in an E-correct clustering. Finally, letting EN get a (1 - - 1oKyN)-correct oN and plugging it in the probability bound in (3), we clustering with probability exp (-0 (NE 2 )) = -0 exp =1 N - 0 (1/N). Proof of Lemma 3 3.4.3 , Proof. For notational convenience, partition the item set into 3 sets: A A {iri, ... B A {7rp + 1 1 ,...wr}, and R A L (1pl) - '-) , and MR C Ef Then we have mB x Oc ,1, x-4dx (11-,) . Now consider the number of items I(1) that must be sampled until we see pl of the top 1 items. Then I(1) Si, = where si is the random variable representing how many items were sampled after seeing the (i-1)th item from the top 1 items of 7r and until we see the ith item from top 1 items of 7. We want to show that the quantity I(l) is with high probability linear in 1 (as used in the algorithm). To that end, define q(4, p) A 'B Then for i < pl we clearly have that si is (first-order stochastic) dominated by Geom(q(3, p)), and we also have that I(l) is (fist-order) stochastic dominated by E11 Geom(q(3, p)), which is just a negative binomial distribution (since it is a sum of independent geometric random variables of the same parameter). We can now use a concentration result on 68 the negative binomial and, for each c > 0, we get P1 P1 Since s(Op) = 1++6 21 we get that Hence, since E[Geomn(q(O))]j P (I(l) ;> P [ 2 s= e n H = Geom(q(, A) 2Geom(q(, p)) > (1 + E)E P 1 1 q (0, p) (1 + 0) < exp (-E(l)) = 0(1/n). = p -0+ 1 (1 + e), the first bullet point follows by picking q('' 0.01. Let us now prove that the second bullet point holds. Consider the event Ai that -i E 7r. Let A(l) be the number of the top 1 samples that come from the top rl of ir. Then A(l) = E1= f (Ai). B A Then we have mB O E'1x-dx {I+1, ... ,WrI}, and R A {7rl+1, .,n} ((l)1-4 (rl)1-3) - = - For notational convenience, partition the item set into 3 sets: A A {7, . . 71} r mR Xx 8dx ,- " (r13). (1 - r1--), and Hence we get mB mB = 1 - ((1 -1 _ ( 1 /r + MR = p')rN(1.01)) 1 - (1 - p')(1.01) 1-0 > p'. We now see that each Ai (first-order) stochastically dominates independent Bernoulli random variables, each with parameter q'(3, p') A mB+mR > p'. Hence A(l) is stochastically dominated by those random variables: E1= 1 Bern (q' (#, p')). 
Hence, by Chernoff bound, we get IP A Ai 69 < exp (0 (-l)) = 0(1/n). Now, since E [ Ai > p'l, we get P (A (l) < (1 - c) p'l) = 0(1/n), and hence we get that with the desired probability that a p' proportion of the top 1 items sampled are from the top rl of the underlying sequence 7r. Furthermore, the inclusion of the El factors T and r' in the statement of the Lemma follows easily. 3.4.4 Proof of Proposition 2 Proof. Let us first consider two samples coming from the same underlying permutation. We shall show that samples coming from the same underlying permutation will have an edge between them. Note that by the property of the definition of Asymptotic Similarity, if all nodes coming from the same underlying permutation have more than a 3/4 proportion of the top l/s(, 3/4) items of 7r within the top 1 sampled, then they must share at least l/(2s(3, 3/4)) of those items of 7r. Since our algorithm places an edge between nodes corresponding to samples that have at least l/(2s) items in common, we see that samples coming from the same underlying permutation will indeed have an edge between them. Let us now consider samples a1 and 02 coming from different underlying permutations 7r, and r 2 . According to the second bullet point of definition of Asymptotic Similarity, a p' proportion of the top 1 items of c-i come from the top ri/s of 7ri for i = 1, 2, and for r = r(3, p'). In addition, since 7r 1 and r2 are asymptotically distinct, by definition of Asymptotic Distinctiveness we have that for any c > 0 and -y > 0, we have 17r" n wr < c'l holds asymptotically. We will now show that for the following choices of p, y and E, the Lemma follows. 1, y = r, and c = 15 Take p' such that 3(1 - p') = 5s' 3 3 41 s(/ , / )r(O ,p'> 3 Now note the following relation holds: Jo n u4l < 3(1 - p')l + crl/s. This so because if -1 and c- 2 intersect, then they must intersect either where 7ri and r 2 might intersect (which is less than crl, by Asymptotic Distinctiveness), or where 7r, and -1 (or 7 2 and U2 ) intersect (which is less than 3(1 - p)). Now if we plug in our choice of the parameters, then we we get 1 1 1 n a' I1 253(1 - p')l + Erl =- -l/s + 5 l/s < -l/s. |ou 2 70 Hence there will be no edge between a- and o 2 if they come from different underlying E permutations. 3.4.5 Proof of Theorem 10 Proof. It is sufficient to show that, for 0 < a < 1, we have that {nf"z+ is asymptotically distinct with probability 1- 0(1/n). To that end, let 7i ~ MNL(I[, a) and 7rj ~ MNL(In, a), for 0 < a < 1, be sampled independently. Then we want to show that for any c > 0 and -y > 0, we have n 7r I < emi) = 1 - 0(1/n) -n P (r First note that it is clearly worst case when ri = 1. I. We shall now show that, as- suming that 7i = fI, that with high probability the criteria above is satisfied. Consider the event B that the wrj comes from the top -yl elements of the underlying permutation: {1, ... yl}. Then what we want to show is that asymptotically Eji Bi < cyl for any c > 0. Note that each Bi is (first-order) stochastic dominated by the independent random variables B5, where each A3 is a Bernoulli random variable with parameter ) where the weights wi are the weights of the underlying permu- 1 tation, and wi = i-'. Now note that, for 0 K a < 1 we have Ej i ( - 2> l (Y0)-a+1 n-a+1 -((2-yl)-0a+1 which is in turn = e ((1(") a) , _ (- 1)-a+l)' and it goes to zero as n increases. Hence for all n large enough we get EBi < c, which in turn implies that 71 EY1 < El. 
Now we can use Chernoff and we get Bi$ > E- l P =P < exp (-E(l)) L Bi > E [L Bi] = 0(1/n). The same proof works for the case a = 1, except that to show that the EBi < E we get that the sum of the wi is of the order of log n, and the same argument follows. 3.4.6 L Proof of Theorem 9 Proof. In the algorithm, we estimate a transition matrix P based on the samples, and then run the Markov Chain in order to get a stationary distribution of the top l/(s*r*) items of each underlying permutation. This proof contains two main steps: 1) we show that there is a matrix P whose unique stationary distribution is proportional to w, 2) we show that our matrix P is not too far from P, and that the small error does not significantly disturb the stationary distribution. Let us first show 1). Let P i that Thnsicewiii thesince Markov Chain defined byw%+w, = ='"3 =wjPji, we see Then, lr*-3 for i = j, and P =1 -i Z P is reversible, and that w is a stationary distribution. To see that w is the unique stationary distribution, it suffices to note that P is aperiodic and irreducible, since all transitions, including self-transitions, are always of positive probability in our case. Let us note show 2). As described in the algorithm, Pij j 1, where Nij be the are seen together in samples that are clustered as from wT, and let Aij be the number of times that i is ranked higher than j among those Ni3 . number of times that i and = A Intuitively, we expect P to converge to P as N increases. For the remainder of this proof, we shall just prove this precisely. Note also that all of the top l/(s*r*) items of 7 are being considered among those top l/s* most seen with high probability, due to an application of the Stretching Factor Lemma. Lemma 5 shows that the transition matrix we created is indeed full with high probability, which in turn implies we have the Markov Chain specified by P also a 72 unique stationary distribution. We will now use Lemma 4 (due to Negahban et al, 2012), where we can bound how far away the stationary distribution P is from that of P. After we have shown this we will done, since the stationary distribution of P determines the weights of the top items. Lemma 4 specifies two parameters that we must control: the error matrix A = irmax/rmin, which is related to the spectral gap of P P - P and p = Amax(P)+ IA/2 and to the error matrix. From Lemma 6 we get that with probability 1- 09(log 2 N/N) we have ||A11 2 , and in Lemma 8 that 1 - Amax(P) > = ( 1 =-(n). This gives us p = Amax(P) + IA 112 /Wrmax/'kmin 1 j 0 logo/ Also, since /irmax/*rmin probability 1 - ( 0 (logp/2(n)) and IIA112 = 0 = log2N) ( , we have with that 1 logn2 N + 0 log1- P iog/2) We wish to show that both terms in the right-hand-side of the equation in Lemma 4 go to zero. The second term we see that 1 1 p min- max ( log N N 3/2\ (n N0, / as we wished. It suffices now to show that the first term is no greater than the first one. That is, we must assert that t P -rPO IIMHI V I 1- a mx= 0 _ min 73 I JA11 Vmax /*umin) ( = Since /rmax/rmin (logp/2 (n) = 0 pt log 3/ 2 (n) log Alog2(n)) we must just guarantee that log N log 3 0 N /2 (n) which we do by just taking t to be large enough. Lemma 4 ([Negahban et al 2012). For any Markov chain P = P+A with a reversible Markov chain P, let pt be the distribution of the Markov chain P when started wit an initial distributionpo. 
Then, lipt - *1|| HIr ~ ||po - MH irmax 1 |+i rmin 1 - where -k is the stationary distribution of P, irmin and p = Amax () + I A|12 = lrmax 7rmin' mini *(i), irmax = max ii(i), irmax /irmin- Lemma 5. Let i and j be items among the top l/r* items seen, and let Nij be the number of times that i and j are seen together among the top 1 in samples that are clustered as from an underlying permutation 7r. Then Nij = O(N) with probability 1 - 0 (1/N). Proof. Let i be one of the top 1 items of one of the K underlying permutations 7r. There is an obvious cluster C in which the noiseless sample - = 7F would have been clustered (where, by Theorem 8, most of its samples are clustered). We shall call the number of samples clustered in C by NC. By the Stretching Factor Lemma, i is present among the top 1 elements in 0.99N, samples with probability 1 - 0(1/n). Furthermore, let top 1 items of 7r. Then i and j j be a different item from the are seen together among the top 1 elements in at least 0.98N, samples with probability 1 - 0(1/n). Now it is only left to show that Nc = E(N). By an application of the Chernoff bound, for any 0 < 6 < 1, at least N(1 - 6)N samples will come from -r with proba74 bility 1- exp (-E(N)). By Theorem 8, with probability 1 - 0(1/N), we have that at most 0 ( loj) samples from 7r will be misclustered outside C. Putting these ob- servations together, we get that with probability (1 - exp (-0 1 -6 1-0(1/N), at least o - ( (N))) (1 - 0 (1/N)) = = 0(N) samples from 7 will be mis- N)) clustered in C, as we wished. l Lemma 6. Let A = P - P be the error in the transition probabilities. Then for N = Q(n) we get ||A||2 = 0 log 2 N ( N with probability 1 - O(log 2 N/N). Proof. Consider an (one of the K) underlying permutation r, and an arbitrary pair - of items (i, j) from the top l items of 7r. Then, by Lemma 5, with probability 1 exp(- (N)), we have that Nij> i = 0 (N). There are two sources of contribution to Aij: 1) the misclusted samples: in Theorem 8 we learned that with probability 1- 0(1/N), the proportion items isO ( V N) , 6 N of misclustered and 2) the natural noise from the properly clustered samples: from the Chernoff bound we know that a sample mean (based on Ni samples) of Bernoulli random variable is at most 0 probability 1 - O(1/MN3 ). ( lN away from the true mean with 3 Putting the two together, we get that with probability 1 - 0 (1/N) the error AXj satisfies log N log N Aij = |P13 - PyI < EN += NJ We can now use the Union bound over all 0 (12) pairs, and we get that with probability 1 -0 (l 2 /N) that JAsj = 0 ( Vo for each pair i, j from the top l. rN) This allows us upper bound the spectral radius of A with its Frobenius norm, which gives us that, with probability 1 - 0 (12 /N), we have ) JJAI1 2 < HL\HF = 75 j i 1/2 1/2 log log N N) (120 Since N = Q(n) and 1 = e (n), we get [|All 2 0 = ( with probability N 1 - 0 (log 2 N/N), as desired. Lemma 7 (Comparison Lemma, Negahban et al 2012). Let Q, p and P, r be reversible Markov chains on a finite set [n] representing random walks on a graph G = ([n], E), i.e. P(i, j) = 0 and Q(i, j) = 0 if (i,j) and E. For a min(ij)EE{7R(i)Pij/I(i)QijI -_maxjir(i)/p(j)}, Amax(P > a - Amrax(Q) - # l Lemma 8. Let Amax(P) be the second largest eigenvalue (in absolute value) of P. Then 1 0_1 1 - Amax(P) Proof. Let us use the Comparison Lemma (Lemma 7 above), with Q being the tran- sition matrix on the complete graph with l/r* items. That is, Qjj = 1/(l/r*) and = 1(i) 1/# for all i. 
Then, for b = maxij !i) = E(1,), we get Wi (i)P > wi ;(wi + wj) b 1 = 1 b1 2 This allows us to bound a as a' = min(i,j) r(i)Pij/u(i)Qij > - b 1- 1 E)(). Also, i Hence, since 1 - Amax 1- (Q) = -r W~) max i fr(i) 7 1/y r - (1)_ - (l). , we get Amax(P) > (1Amax (Q)) > E) - #=Max ,,a(i) b() =e 1 E 76 Chapter 4 Conclusion and Future Work Taken together, the results provided in the previous three papers serve a dual purpose: they propose a model, or framework, for the efficient provision of personalized ranked recommendation. The hope is that the model and the analysis would provide a convincing case for using the corresponding algorithms, or the broader framework. These results also contribute to the existing literature on learning choice models and learning mixtures. Naturally, this leaves a lot to be done in many directions. This section briefly highlights some potential directions for further research. With respect to the model and the available data, and since the different components of the mixtures considered so far correspond to different user types; it would be interested to look at the 'hedonic pricing' setting where the item weights are a function of some item attributes. It would also be interesting to see if there could add anything to the analysis. With respect to the analysis, it would be interesting to look for a tighter characterization of the class of learning or unlearnable models. First, it would be interesting to seek a set of conditions where the number of available clusters is allowed to grow in some fashion. To that end, it might be useful to consider alternative algorithms that lend themselves to analysis when the number of clusters grow. Similarly, it would be interesting to reconsider these results where the mixing distribution is not uniform, but somehow maintains a bound on the size of the cluster. For instance, it might be useful consider the bayesian setting where this distribution comes from a Dirichlet 77 prior (with a > 1?). Finally, it would be interesting to apply some of the techniques included here, and in the broader literature, to more real data sets to seek further understanding of the systems that underly and produce this data. For example, it would be interesting to see if the groups of movie watchers, or restaurant goers, can consistently have some interpretation. 78 Bibliography [1] Mit150 celebrations. http://mit150.mit.edu. [21 S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds for the area under the roc curve. Journal of Machine Learning Research, 6(1):393, 2006. [31 S. Agrawal, Z. Wang, and Y. Ye. Parimutuel betting on permutations. Internet and Network Economics, pages 126-137, 2008. 14] Ammar Ammar and Devavrat Shah. Efficient rank aggregation using partial data. A CM SIGME TRICS Performance EvaluationReview, 40(1):355-366, 2012. [5] K.J. Arrow. Social choice and individual values. Number 12. Yale Univ Pr, 1963. [6] M. Bayati, D. Shah, and M. Sharma. Max-product for maximum weight matching: Convergence, correctness, and lp duality. Information Theory, IEEE Transactions on, 54(3):1241-1251, 2008. [71 D.P. Bertsekas. The auction algorithm: A distributed relaxation method for the assignment problem. Annals of Operations Research, 14(1):105-123, 1988. [8] V.S. Borkar. Stochastic approximation: a dynamical systems viewpoint. Cambridge Univ Pr, 2008. [9] J Hayden Boyd and Robert E Mellman. The effect of fuel economy standards on the us automotive market: an hedonic demand analysis. 
[10] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324-345, 1952.

[11] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.

[12] N. Scott Cardell and Frederick C. Dunbar. Measuring the societal impacts of automobile downsizing. Transportation Research Part A: General, 14(5):423-434, 1980.

[13] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001.

[14] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265-292, 2002.

[15] O. Dekel, C. Manning, and Y. Singer. Log-linear models for label ranking. Advances in Neural Information Processing Systems, 16, 2003.

[16] P. Diaconis. Group representations in probability and statistics, volume 11. Institute of Mathematical Statistics, 1988.

[17] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proceedings of the 10th International Conference on World Wide Web, pages 613-622. ACM, 2001.

[18] J. Edmonds and R. M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM (JACM), 19(2):248-264, 1972.

[19] V. Farias, S. Jagabathula, and D. Shah. A data-driven approach to modeling choice. Advances in Neural Information Processing Systems, 22:504-512, 2009.

[20] Vivek Farias, Srikanth Jagabathula, and Devavrat Shah. A data-driven approach to modeling choice. In Advances in Neural Information Processing Systems, pages 504-512, 2009.

[21] Vivek F. Farias, Srikanth Jagabathula, and Devavrat Shah. A nonparametric approach to modeling choice with limited data. arXiv preprint arXiv:0910.0063, 2009.

[22] Vivek F. Farias, Srikanth Jagabathula, and Devavrat Shah. Sparse choice models. In Information Sciences and Systems (CISS), 2012 46th Annual Conference on, pages 1-28. IEEE, 2012.

[23] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4:933-969, 2003.

[24] R. Herbrich, T. Minka, and T. Graepel. TrueSkill™: A Bayesian skill rating system. Advances in Neural Information Processing Systems, 20:569-576, 2007.

[25] J. Huang, C. Guestrin, and L. Guibas. Efficient inference for distributions on permutations. Advances in Neural Information Processing Systems, 20:697-704, 2008.

[26] S. Jagabathula and D. Shah. Inferring rankings under constrained sensing. Advances in Neural Information Processing Systems (NIPS), 2008.

[27] M. Jerrum, A. Sinclair, and E. Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. Journal of the ACM (JACM), 51(4):671-697, 2004.

[28] L. Jiang, D. Shah, J. Shin, and J. Walrand. Distributed random access algorithm: scheduling and congestion control. Information Theory, IEEE Transactions on, 56(12):6182-6207, 2010.

[29] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. Information Theory, IEEE Transactions on, 56(6):2980-2998, 2010.

[30] R. Duncan Luce. Individual choice behavior: A theoretical analysis. Dover Publications, 2012.

[31] Jacob Marschak. Binary-choice constraints and random utility indicators. In Proceedings of a Symposium on Mathematical Methods in the Social Sciences, 1960.
[32] Daniel McFadden. Conditional logit analysis of qualitative choice behavior. 1973.

[33] Daniel McFadden and Kenneth Train. Mixed MNL models for discrete response. Journal of Applied Econometrics, 15(5):447-470, 2000.

[34] I. Mitliagkas, A. Gopalan, C. Caramanis, and S. Vishwanath. User rankings from comparisons: Learning permutations in high dimensions. In Proceedings of the Allerton Conference, 2011.

[35] Sahand Negahban, Sewoong Oh, and Devavrat Shah. Iterative ranking from pair-wise comparisons. arXiv preprint arXiv:1209.1688, 2012.

[36] Sahand Negahban and Martin J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069-1097, 2011.

[37] Robin L. Plackett. The analysis of permutations. Applied Statistics, pages 193-202, 1975.

[38] C. Rudin. The p-norm push: A simple convex ranking algorithm that concentrates at the top of the list. The Journal of Machine Learning Research, 10:2233-2271, 2009.

[39] C. Rudin and R. E. Schapire. Margin-based ranking and an equivalence between AdaBoost and RankBoost. The Journal of Machine Learning Research, 10:2193-2232, 2009.

[40] Paul A. Samuelson. A note on the pure theory of consumer's behaviour. Economica, 5(17):61-71, 1938.

[41] S. Shalev-Shwartz and Y. Singer. Efficient learning of label ranking by soft projections onto polyhedra. The Journal of Machine Learning Research, 7:1567-1599, 2006.

[42] Hossein Azari Soufiani, David C. Parkes, and Lirong Xia. Random utility theory for social choice. In NIPS, pages 126-134, 2012.

[43] L. L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273, 1927.

[44] N. Usunier, M. R. Amini, and P. Gallinari. A data-dependent generalisation error bound for the AUC. In Proceedings of the ICML 2005 Workshop on ROC Analysis in Machine Learning. Citeseer, 2005.

[45] L. G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2):189-201, 1979.

[46] Koen Verstrepen and Bart Goethals. Unifying nearest neighbors collaborative filtering. In Proceedings of the 8th ACM Conference on Recommender Systems, RecSys '14, 2014.

[47] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

[48] D. J. A. Welsh. Complexity: knots, colourings and counting. Number 186. Cambridge University Press, 1993.