Ranked Personalized Recommendations Using
Discrete Choice Models

by

Ammar Ammar

B.Sc., Massachusetts Institute of Technology (2009)
M.Eng., Massachusetts Institute of Technology (2010)

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2015

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, August 28, 2015

Certified by: Devavrat Shah, Associate Professor, Thesis Supervisor

Accepted by: Professor Leslie A. Kolodziejski, Chair, Department Committee on Graduate Theses
Ranked Personalized Recommendations Using Discrete
Choice Models
by
Ammar Ammar
Submitted to the Department of Electrical Engineering and Computer Science
on August 28, 2015, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Abstract
Personalized recommendation modules have become an integral part of most consumer information systems. Whether you are looking for a movie to watch, a restaurant to dine at, or a news article to read, the number of available options has exploded. Furthermore, the commensurate growth in data collection and processing has created a unique opportunity, where the successful identification of a relevant/desired item in a timely and efficient manner can have serious ramifications for the underlying business in terms of consumer satisfaction, operational efficiency, or both. Taken together, these developments create a need for a principled, scalable, and efficient approach for distilling the available consumer data into compact and accurate representations that can be utilized for making inferences about future behavior and preference.
In this work, we address the problem of providing such recommendations using ranked data, both as system input and output. In particular, we consider two concrete, and interrelated, scenarios that capture a large number of applications in a variety of domains. In the first scenario, we consider a setup where the desired goal is to identify a single global ranking, as we would in a tournament. This setup is analogous to the problem of rank aggregation, historically studied in political science and economics, and more recently in computer science and operations research. In the second scenario, we extend the setup to include multiple 'prominent' rankings. Taken together, these rankings reflect the intrinsic heterogeneity of the population, where each ranking can be viewed as a profile for a subset of said population. In both scenarios, the goal is to (i) devise a model to explain and compress the data, (ii) provide efficient algorithms to identify the relevant ranking for a given user, and (iii) provide a theoretical characterization of the difficulty of this task together with conditions under which this difficulty can be avoided.
To that end, and drawing on ideas from econometrics and computer science, we propose a model for the single ranking problem where the data is assumed to be generated from a Multi-Nomial Logit (MNL) model, a parametric probability distribution over permutations used in applications ranging from the ranking of players on online gaming platforms to the pricing of airline tickets. We then devise a simple algorithm for learning the underlying ranking directly from data, and show that this algorithm is consistent for a large subset of the so-called Random Utility Models (RUM).

Building on the insight from the single ranking case, we handle the multiple ranking scenario using a mixture of Multi-Nomial Logit models. We then provide a theoretical illustration of the difficulty of learning models from this class, which is not surprising given the richness of the model class and the notorious difficulties inherent in dealing with ranked data. Finally, we devise a simple algorithm for estimating the model under plausible, realistic conditions, together with theoretical guarantees on its performance and an experimental evaluation.
Thesis Supervisor: Devavrat Shah
Title: Associate Professor
Acknowledgments
This thesis is the outcome of my doctoral research at the Laboratory for Information and Decision Systems (LIDS) under the superb supervision of my advisor,
Dr. Devavrat Shah. I had the fortune of meeting Devavrat during an introductory
probability class that he was teaching during my freshman year at MIT, and his
charm, energy, and enthusiasm got me instantly interested in the use of probability
in modeling and algorithm design. His patience and guidance over the past few years
have been instrumental to the completion of this thesis.
I am also grateful to Dr. Sanjoy Mitter and Dr. Munther Dahleh for their support
as a part of my thesis committee. I am indebted to them for their generous feedback,
continuous encouragement, and invaluable advice during my time here at MIT. I
am also greatly appreciative of all my teachers and mentors at MIT and elsewhere.
Special thanks to Dr. Boris Katz, Dr. Franz-Josef Ulm, Dr. Nasser Rabbat, Dr.
Leila Farsakh, Dr. Nancy Murray, Hubert Murray (FAIA, RIBA), Christine Lane,
and Dr. George Verghese.
Special thanks also go to Lynne Dell, Jennifer Donovan, Brian Jones, Debbie
Wright, Petra Aliberti, Alina Man, and the rest of the administrative team at LIDS,
past and present, for their kind help and support.
I am also lucky and grateful to have made a number of wonderful and kind friends during my time here. I am especially grateful to my officemates, colleagues, and friends Srikanth Jagabathula, Tauhid Zaman, Yuan Zhong, Sewoong Oh, Sahand
Negahban, Guy Bresler, Luis Voloch, Christina Lee, George Chen, Jehangir Amjad,
Hajir Roozbehani, Kimon Drakopoulos, Yola Katsargyri, Ali Faghih, and Noele Norris
for numerous fun and inspiring conversations.
Last, but definitely not least, I am lucky and extremely grateful to have my family.
I would not be here, or anywhere, without their infinite love and unwavering support.
Contents

1 Introduction
  1.1 Problem Statement
      1.1.1 A Systematic Resolution
      1.1.2 Mathematical Model
  1.2 Contributions
  1.3 Related Work

2 The Single Ranking Problem
  2.1 Model and Problem Statement
  2.2 Main Results
      2.2.1 Aggregate Ranking
      2.2.2 The Mode
      2.2.3 Top-K Ranking
  2.3 Learning the Max-Ent Model
  2.4 Evaluation and Scalability
  2.5 Proofs
      2.5.1 Proof of Lemma 1
      2.5.2 Proof of Theorem 1
      2.5.3 Proof of Theorem 2
      2.5.4 Proof of Theorem 3
      2.5.5 Proof of Theorem 4: subgradient algorithm

3 The Multiple Ranking Problem
  3.1 Difficulty of Learning the MMNL Model: a Lower-bound
  3.2 An Algorithm
      3.2.1 Preprocessing
      3.2.2 Clustering
      3.2.3 Learning Weights within Clusters
  3.3 Algorithmic Guarantee
      3.3.1 Illustration
  3.4 Proofs
      3.4.1 Proof of Theorem 7
      3.4.2 Proof of Theorem 8
      3.4.3 Proof of Lemma 3
      3.4.4 Proof of Proposition 2
      3.4.5 Proof of Theorem 10
      3.4.6 Proof of Theorem 9

4 Conclusion and Future Work
Chapter 1
Introduction
Personalized rankings of objects, from web pages and movies to consumer products, have become an integral part of most consumer information systems. In the context of web search, for instance, ranking provides a quality filter that prioritizes high-quality pages pertaining to a certain query, making it feasible to retrieve useful information from the overwhelming corpus that is the World Wide Web. In other contexts, such as movie recommendation, the resulting ranking also provides a filter on the relevance of the recommended objects to the consumer. In both types of settings, ranking plays an important role in eliminating inefficiencies arising from the mismatch between what the consumer prefers and what the consumer is being offered.
Generally speaking, the ranking procedure takes information about past consumer 'preferences' and produces a ranking that ideally captures future ones. The input to this procedure typically consists of partial, noisy, or incomplete expressions of consumer preference over the available options, and the purpose of the procedure is to stitch this input into a meaningful ranking as an output. A principled approach towards that end is to assume an underlying model describing the ground truth, and to assume that the available observations are incomplete, noisy realizations of this ground truth. From a system designer's perspective, this procedure can be decoupled into two interrelated steps: (a) devise a meaningful model of this underlying truth, and (b) develop a scalable inference algorithm that leverages this model to produce a ranking that extracts most of the information available and relevant to the consumer.
Naturally, the quality of this procedure, and its ability to provide meaningful rankings, hinge on three factors. First, since our knowledge about the consumer population comes from the data we observe, this data should be rich enough to accurately capture consumer preference, at the individual and population levels. Consequently, the devised model should have sufficient resolution to accommodate and leverage the richness of this data. At the individual level, the model should be robust to noise in the form of varying or mood-dependent behavior. At the population level, the model should be able to capture the diversity or heterogeneity inherent to most modern datasets.
In this work, we develop a framework for providing personalized ranked recommendations using ordinal, or comparison, data: a fine-grained representation of consumer preference that occurs naturally in most applications. At the heart of our framework is a high-resolution model that enables us to accommodate noise while capturing and leveraging consumer diversity. As a part of this framework, we also provide tractable algorithms for learning the model from data and for providing recommendations, together with an analysis and an evaluation of the performance of the framework.
1.1 Problem Statement
For a more concrete definition of the problem, consider a collection of n objects (e.g., movies or restaurants), and a population of N consumers who provide feedback about these objects. Given this, we would like to design a procedure that takes this feedback as its input and produces a 'meaningful' ranking as its output. While the meaningfulness of the resultant ranking is, of course, a matter of application, we can broadly identify two types of ranking problems exemplified by the following scenarios:
Scenario 1. Given the scores/outcomes of games in a tournament, e.g. the NFL, we would like to rank-order sports teams. In such a setting, the eventual goal is to come up with a single ranking, and subsequently decide which teams go to the next round in a season and, eventually, who wins the tournament (the Super Bowl).
Scenario 2. Given consumer ratings of choices, e.g. movies on Netflix, we would like to learn the different prevalent types of movie watchers, i.e. different prevalent rankings or preferences of movies, and subsequently use this to provide recommendations of movies to a given consumer based on her/his previously identified type.
While these two scenarios might seem different from a practical point of view, at a high level they share a similar design structure. In both of these scenarios, we have a partial, or incomplete, expression of the ordering of the available objects or choices of interest, and we are interested in producing a single, global or personalized, ranking. This problem gives rise to a few fundamental questions. First, in what format and at what level of granularity should the data in both of these scenarios be collected? What properties should the resultant ranking(s) have? And how do we compute such ranking(s)?
For the first question, one popular solution relies on representing each piece of consumer feedback as a numerical score on a fixed scale. Typically this score reflects the number of 'stars' the user would give the choice in question (e.g. 4 out of 5 stars). This is the approach followed by a large number of online and offline retailers such as Amazon and Netflix, and numerous others. Other approaches adopt a special, and rather minimal, case of this by restricting the scale to two options in the form of 'thumbs up' or 'thumbs down'. Despite being simple and intuitive, these representations suffer from a number of shortcomings. At the individual consumer level, for instance, the scale of choice is often arbitrary and does not provide a guarantee against mood-dependent behavior: a consumer might give an item 3 out of 5 stars on a rainy day, but give the same item 4 stars if he/she happens to be in a better mood. At the population level, it is not clear whether the scale of such ratings is the same for all consumers: consumer A, an avid movie watcher, is less likely to give a movie 5 stars than an average consumer.
These objections suggest that much can be gained from adopting a richer and mood-independent representation of the consumer's preference. A natural solution to this problem, inspired by the seminal work of Samuelson [40], is to adopt the axiom of revealed preferences. According to this axiom, the innate preference of the consumer is revealed as an ordering of the available choices. For example, the preference of a consumer facing a choice between items A, B, and C is fully captured through the order of preference over these items, mathematically represented as a permutation of the items.
Note that the use of permutations simultaneously eliminates the issue of scale while providing a fine-grained representation of what the consumer prefers. On the flip side, assuming that each consumer's permutation is deterministic or fixed hinders our ability to deal with scenarios where the data contains inconsistent consumer behavior (e.g., alternating between preferring A to B and vice versa). We can deal with this issue by bringing randomness into the picture. In this stochastic view, consumer behavior is captured by a stochastic model that fully specifies the probability of observing such changes. Thus, the population as a whole can similarly be viewed as a 'larger' probability model. Here, it is interesting to note that the richness of this representation dramatically increases the number of possible observations¹, and the very nature of ordinal data can make the problem significantly more complex.
For a simple illustration of this complexity, consider the setting with 3 items and 3 consumers with preferences given in Table 1.1. In this example², no single item can be chosen as the clear winner, due to the 'cyclical' nature of the preferences taken together. As it turns out, these types of problems can persist if we consider a setting with more choices and users, as shown by Arrow in his famous impossibility theorem [5]. In this result, Arrow demonstrated the impossibility of combining individual rankings to obtain a single ranking with certain 'desirable properties'. Putting the question of these properties aside for a moment, it is important to note that this is precisely the problem captured by Scenario 1 outlined above.
¹Note that n items can be ordered in n! ways.
²This example is commonly known as the voting, or Condorcet, paradox.
Consumer      1st Preference   2nd Preference   3rd Preference
Consumer 1    A                B                C
Consumer 2    B                C                A
Consumer 3    C                A                B

Table 1.1: The Voting Paradox
So, what properties should the resultant ranking have? In Arrow's setup, this ranking is required to be 'fair' and representative of all preferences. Further, the ordering of any two choices A and B in the resultant ranking should be solely determined by the individual preferences between these two choices, and cannot be affected by preferences pertaining to any other choice C³. Putting the significance and necessity of these assumptions aside for a moment, it is important to note that this is precisely the setup outlined in Scenario 1, and by extension in Scenario 2.
1.1.1 A Systematic Resolution
A systematic way out of this maze is to explain the data using a probability distribution over permutations. By doing this, we absolve ourselves of the need to seek one consistent ranking at the outset. This kind of decoupling is not new, and has been fruitfully applied in econometric forecasting, most notably in the theory of discrete choice models (cf. [32], [10] and [30]).
For the tournament scenario, this distribution over permutations can be viewed as a 'noisy' observation of an unknown underlying permutation. And if the noise model is chosen appropriately, then we would expect the distribution to have one 'prominent' ranking that reflects the result of the tournament. Thus, the question of ranking boils down to finding this prominent ranking for a given setting of the noise. For the second scenario, the probability distribution should also allow for multiple prominent rankings.
For the main results in this work, we adopt this distribution-over-permutations view using two popular models that satisfy our requirements: the Multi-Nomial Logit (MNL) model, and the mixed Multi-Nomial Logit (MMNL) model. In addition to satisfying the requirements, these models also allow us to develop algorithms with analyzable performance.

³This property is commonly known as the Independence of Irrelevant Alternatives (IIA) property.
1.1.2 Mathematical Model
Before diving into the details, it might be useful to provide a quick mathematical definition of both the MNL and the MMNL models. By doing so, we hope to provide a useful outline as well as a clear idea of the models that are at the core of our contributions. These models are included here together with a statement of the learning problems central to our framework.
The Multi-Nomial Logit, or MNL, model is a probability distribution over permutations⁴. The probabilities provided by the model are fully specified using a set of n parameters (w_1, ..., w_n) ∈ ℝ^n_+, each corresponding to one of the n available items, or choices. For a given set of choices C, the probability of choosing an item i ∈ C is given by the choice probability Pr_C(i; w) = w_i / Σ_{j ∈ C} w_j. Sampling a permutation σ from the model can be done using the following simple sequential procedure: choose an item i for the first position with probability proportional to w_i (as per the previous definition); then choose an item j for the second position with probability proportional to w_j from the remaining n − 1 items, and so on.

⁴These models are commonly known as choice models.
Similarly, the Mixed Multi-Nomial Logit, or MMNL, model is a mixture of several MNL models (or mixing components). When the number of components in the mixture is K, the probability of a given permutation σ takes the form Pr(σ) = Σ_{k=1}^{K} α_k Pr(σ; w^k), where each α_k > 0 denotes the mixing probability of component k, the weights w^k ∈ ℝ^n_+ are the parameters of the kth component MNL, and Pr(σ; w^k) is the probability of sampling σ under the MNL with parameter w^k. Sampling a permutation σ from the model can be done by first choosing a component k ∈ {1, ..., K}, then sampling the permutation from the chosen component using the sequential sampling procedure mentioned above.
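To make the two sampling procedures concrete, here is a minimal Python sketch; the weights and mixing probabilities are illustrative, not from the thesis.

```python
import numpy as np

def sample_mnl(w, rng):
    """Sequential sampling from an MNL with weights w.

    Returns sigma, where sigma[i] is the (0-indexed) position of item i.
    """
    n = len(w)
    remaining = list(range(n))
    sigma = np.empty(n, dtype=int)
    for pos in range(n):
        p = np.array([w[i] for i in remaining], dtype=float)
        p /= p.sum()                                 # choice probabilities over remaining items
        i = rng.choice(remaining, p=p)               # item placed at position `pos`
        sigma[i] = pos
        remaining.remove(i)
    return sigma

def sample_mmnl(alphas, weights, rng):
    """MMNL sampling: pick component k with probability alphas[k], then sample its MNL."""
    k = rng.choice(len(alphas), p=alphas)
    return sample_mnl(weights[k], rng)

rng = np.random.default_rng(0)
w = [3.0, 1.0, 0.5]                                  # illustrative MNL weights
print(sample_mnl(w, rng))                            # high-weight items tend to rank first
print(sample_mmnl([0.7, 0.3], [w, w[::-1]], rng))    # a two-component mixture
```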
In this work, we consider two (related) learning problems: (a) the problem of learning a single prominent ranking model from pairwise comparison and first-order marginal data, and (b) the problem of learning the MMNL model from ordered tuples of length ℓ. In the case of comparison data, each data point consists of a comparison between two of the n items. For first-order marginal data, users express the position, k, that they believe an item i should fall in. Furthermore, both types of data are assumed to come from a single underlying distribution. We refer to this problem as the ranking problem.
In the second learning problem, referred to as the multiple ranking, or recommendation, problem, each of N users expresses his/her preference in the form of tuples of some length ℓ < n, where ℓ is not necessarily the same for all users. Further, this data is assumed to come from an underlying MMNL distribution with K components. Here, we would like to learn the number of components, their underlying weights, and the order of the top elements in each of the components. The emphasis on the top elements is simultaneously motivated by practical applications as well as by considerations of tractability in the model, as we show in the corresponding chapter.
1.2 Contributions
The main contribution of this thesis is to provide a framework for end-to-end personalized ranked recommendation. For ease of exposition, we present the problem in two parts, corresponding to the single ranking and multiple ranking problems. The first deals with the problem of learning a single ranking distribution from data to ultimately provide this ranking. The second deals with the problem of learning the MMNL distribution using the solution of the first problem as a sub-module. These two problems are presented in the following two chapters, respectively. These chapters provide concise formulations with algorithms and guarantees for these problems.
in a variety of ways. Most notably, we provide an algorithm for correctly identifying
the prominent ranking of the Thurstone model, and more generally RUM models,
directly from data. We also provide a formulation and a solution of the problem as
an Entropy Maximization problem, together with algorithms and guarantees.
Our contribution to the multiple ranking problem is mainly a contribution to the
problem of learning the MMNL model. To that, we first, we provide a construction for
a mixture choice model that demonstrates the difficulty of the problem. In particular
we identify a fundamental limitation for any learning algorithm to learn a mixture
choice model in general. Specifically, we show that for K = 6(n) and 1 = e(log n),
there exists a pair of distinct MMNL models so that the partial preferences generated
from both are identical in distribution. That is, even with an infinite number of
samples, it is not possible to distinguish said models. This difficulty suggests that
one needs to impose further conditions on the model to guarantee learnability from
data. Guided by common consumer behavior, we provide sufficient conditions under
which this is possible, together with algorithms and error bounds for learning.
1.3 Related Work
Here, we provide an overview of some of the work that has been done on ranking and recommendation, respectively. Given the pervasive presence of these problems in a variety of applications, the references included here consist mainly of works from computer science and econometrics that we believe are relevant to the problem of personalized ranked recommendation. This overview is far from complete, and we highly recommend that the reader refer to some of the works mentioned here for further pointers.
In the context of discrete choice, we build on the influential line of research exemplified by the work of McFadden [32], where the data is assumed to come from an underlying family of parametric distributions such as the ones proposed by Bradley and Terry [10], Plackett [37], and Luce [30], commonly known as the Multinomial Logit Model family (cf. McFadden [32]). In these models, the problem of ranking is equivalent to learning the model parameters; this task can be done using the maximum likelihood criterion, as in McFadden [32], or via an iterative procedure on the data, as in Ammar and Shah [4] and Negahban et al. [35], where the latter uses a Markov chain style iterative algorithm. We also take note of works utilizing the Random Utility Models (RUM) originally proposed by Marschak [31], such as the recent work of Azari et al. [42].
For multiple rankings and recommendation, we note the recently popularized matrix completion methods (cf. [36], [29], and [11]), where the incomplete consumer-movie rating matrix is presumed to be of low rank and its missing entries are filled in by finding the best low-rank factorization of said matrix. In the context of discrete choice, we take note of, and build on, the work pertaining to the MMNL model by Boyd and Mellman [9] and Cardell and Dunbar [12]. We also draw inspiration from the seminal work of McFadden and Train [33], which presents a compelling case for MMNL models by demonstrating the ability of such models to approximate, to any level of precision, any distribution in the RUM family.
In [26], the question of learning sparse choice models from exact marginals is introduced, and precise conditions for learnability for all sets of partial preferences are characterized. This is done by connecting learnability to the dimensionality of partial preferences, determined via the spectral representation of the permutation group. We also take note of the recent work of Farias et al. [22][20][21], which assumes that the underlying model is a sparse distribution over permutations, and proposes algorithms for fitting the model to data within an optimization framework.
In other contexts, the task of ranking objects or assigning scores has been of great interest over the past decade or so, with similar concerns. There is a long list of works, primarily in the context of bipartite ranking, including RankBoost by Freund et al. [23], label ranking by Dekel et al. [15], Crammer and Singer [14], and Shalev-Shwartz and Singer [41], as well as analytic learning results on bipartite ranking, including those of Agarwal et al. [2], Usunier et al. [44], and Rudin and Schapire [39]. The algorithm closest to our proposal is the p-norm push algorithm by Rudin [38], which uses the ℓ_p norm of information to achieve ranking.
The question of learning a single ranking distribution over permutations from partial or limited information has been well studied in the recent literature. Notably, in the work of Huang, Guestrin and Guibas [25], the task of interest is to infer the most likely permutation of identities of objects that are being tracked through noisy sensing by maintaining a distribution over permutations. To deal with the 'factorial blowup', the authors propose to maintain only the first-order marginal information of the distribution (essentially corresponding to certain Fourier coefficients), then use the Fourier inversion formula to recover the distribution and subsequently predict its mode as the likely assignment.
The algorithmic view on rank aggregation was revived in work by Dwork et al. [17], who consider the design of approximation algorithms to find an 'optimal' ranking with respect to a specific metric on permutations. Very recently, a high-dimensional statistical inference view for learning a distribution over permutations based on comparison data has been introduced by Mitliagkas et al. [34].
The maximum entropy approach for learning distributions is a classical one, dating back to the work of Boltzmann. The maximum entropy (max-ent) distribution, a member of an appropriate exponential distribution family, is the maximum likelihood estimate of the parameters in that family (cf. [47]). Indeed, the use of exponential family distributions over rankings has been around for more than a few decades now (cf. [16, Chapter 9]). We provide a careful analysis of a stochastic sub-gradient algorithm for learning the parameters of this max-ent distribution. This algorithm is distributed and iterative, and directly builds upon the algorithm used in [28] for distributed wireless scheduling.
Chapter 2
The Single Ranking Problem
For a quick refresher, we start this chapter with a more detailed overview of our contributions to the single ranking problem. The input data for this problem comes in two different flavors: (a) pair-wise comparison data (e.g. item i is preferred to item j), and (b) first-order marginal data (e.g. item i is ranked in position k). Given data in either form, we focus our attention on three aggregation problems: (1) finding an aggregate ranking over a collection of items (e.g. Netflix movies), (2) finding the most likely ordering of the items (e.g. object tracking à la [25]), and (3) identifying the top-k items in a collection. We solve (1) by introducing a general method which gives each item a score that reflects its importance according to the distribution. We then present a specific instance of this method which allows us to compute the desired scores from the data (comparison or first-order marginal) directly, without the need to learn the distribution. More importantly, we show that the ranking induced by this scoring method is equivalent to the ranking obtained from the family of Thurstone (1927, [43]; also see [16, Ch 9]) models, a popular family of parametric distributions used in a wide range of applications (e.g. online gaming and airline ticket pricing). For (2), we use the principle of maximum entropy to derive a concise parameterization of an underlying distribution (most) consistent with the data. Given the form of the max-ent distribution (an exponential family), computing the mode reduces to solving a maximum-weight matching problem on a bipartite graph with weights induced from the parameters. For the case of first-order marginals, this is an easy instance of the network-flow problem (which can be solved, for example, using belief propagation [6]). Furthermore, we propose a heuristic for mode computation that bypasses the step of learning the max-ent parameters and uses the available partial preference data directly. Such a heuristic, for example, can speed up the computation of [25] drastically. Somewhat curiously, we show that this heuristic is a first-order approximation of the mode finding of the max-ent distribution.
For the pair-wise comparison representation, the problem is not known to be solvable in polynomial time. We propose a simple randomized scheme that is a 2-approximation of it. We solve problem (3) using another distribution-based scoring scheme where the scores can be computed directly from the data for first-order marginals, or by learning a max-ent distribution in the case of comparisons.

We present a stochastic gradient algorithm for learning the max-ent distribution needed for some of the aforementioned problems. This algorithm is derived from [28]; however, the proof is different (and simpler). It provides explicit rates of convergence for both data types (comparisons and first-order marginals). In both cases, the algorithm uses an oracle to compute intermediate marginal expectations of the max-ent distribution. We prove that the exact computation of such marginal expectations is #P-hard. Using standard MCMC methods and their known mixing time bounds, our analysis suggests that for a collection of n items, the computation time scales exponentially in n for pair-wise comparisons and polynomially in n for first-order marginals. Two remarks are in order: first, the result for first-order marginals also suggests a distributed scheduling algorithm for input-queued switches with polynomial time learning complexity (unlike the exponential complexity for the wireless network model). Second, standard stochastic approximation based approaches (cf. [8]) do not apply as is (due to issues related to compactness of the domain).
2.1 Model and Problem Statement
Model: We consider a universe of n available items, K = {1, 2, ..., n}. Each user has a preference order, represented as a permutation, over these n items. Specifically, if σ is the permutation, the user prefers item i over j if σ(i) < σ(j). We assume that there is a distribution, say μ, over the space of permutations of n items, S_n, that defines the collective preferences of the entire user population.
Data: We consider scenarios where we have access to partial or limited information about μ. Specifically, we shall restrict our attention to two popular types of data: first-order rankings and comparisons. Each of these two types corresponds to some sort of marginal distribution of μ, as follows:

First-order marginals: For any 1 ≤ i, k ≤ n, the fraction of the population that ranks item i as their kth choice is the first-order marginal information for distribution μ. Specifically,

    m_{ik} = Pr_μ[σ(i) = k] = Σ_{σ ∈ S_n} μ(σ) · 1{σ(i) = k},        (2.1)

where 1{E} denotes the indicator variable for event E. Collectively, we have the n × n matrix [m_{ik}] of first-order marginals, which we shall denote by M. This is the type of information that was maintained for tracking agents in the framework introduced by Huang, Guestrin and Guibas [25].
Comparison data: For any 1 ≤ i, j ≤ n, the fraction of the population that prefers item i over item j is the comparison marginal information. Specifically,

    c_{ij} = Pr_μ[σ(i) < σ(j)] = Σ_{σ ∈ S_n} μ(σ) · 1{σ(i) < σ(j)}.        (2.2)

Collectively, we have access to the n × n matrix [c_{ij}] of comparison marginals, denoted by C. Such data is available through customer transactions in many businesses, cf. [19].
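As an illustration of these two data types, the following sketch (with made-up input) computes empirical estimates of M and C from a set of observed permutations, using the convention that σ(i) < σ(j) means item i is preferred to item j.

```python
import numpy as np

def empirical_marginals(perms, n):
    """Estimate the first-order marginal matrix M and comparison matrix C.

    Each element of `perms` is a permutation with perm[i] = position of item i.
    """
    M = np.zeros((n, n))
    C = np.zeros((n, n))
    for sigma in np.asarray(perms):
        M[np.arange(n), sigma] += 1                # item i observed at position sigma[i]
        C += sigma[:, None] < sigma[None, :]       # C[i, j] counts "i ranked above j"
    return M / len(perms), C / len(perms)

perms = [[0, 1, 2], [0, 2, 1], [1, 0, 2]]          # three observed rankings of n = 3 items
M, C = empirical_marginals(perms, 3)
print(M)   # M[i, k]: fraction ranking item i at position k + 1
print(C)   # C[i, j]: fraction preferring item i to item j
```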
Remarks. First, while we assume m_{ik} (resp. c_{ij}) is available for all i, k (resp. i, j), if only a subset is available, the algorithm works equally well with that information, with the obvious caveat that the quality of the output depends on the richness of the data. Second, we shall assume that m_{ik} ∈ (0, 1) for all i, k (resp. c_{ij} ∈ (0, 1) for all i, j). Finally, in practice one may have a noisy version of the M or C data. However, the procedures we describe are inherently robust (as they are simple continuous functions of the observed data) with respect to small noise in the data. Therefore, for the purpose of conceptual development, such an idealized assumption is reasonable.
Goal: Roughly speaking, the goal is to utilize data of type M or C to obtain various useful rankings of the objects of interest. Specifically, we are interested in (a) finding an aggregate or representative ranking over the items in question, (b) finding the 'most likely' (or mode) ranking, and (c) finding a ranking that emphasizes the top k objects. To address these questions, we propose the following approach: (a) assume that the data originates from some underlying distribution over permutations; (b) use the data to answer the question directly, without learning the distribution, whenever possible; (c) otherwise, learn a distribution that is consistent with the data (M or C), and use said distribution to answer the question. In principle, there could be multiple, possibly infinitely many, distributions that are consistent with the observed data (M or C, assuming it is generated by a consistent underlying unknown distribution). As mentioned earlier, we shall choose the max-ent distribution that is consistent with the observed data.
Result Outline: Here we provide a somewhat detailed explanation of the results summarized in the table presented earlier. Specifically, we solve problems (a), (b), and (c) as follows. To find an aggregate ranking (Section 2.2.1), we assign each item a score derived from the distribution over permutations. We then propose an efficient algorithm to compute said score directly from the data, without learning the distribution. We show that the ranking induced by the computed scores is equivalent to the ranking induced by the parametric family of Thurstone models [43][16, Ch 9] (Section 2.2.1), a popular family of distributions used in applications ranging from online gaming to airline ticket pricing. Effectively, our result implies that if one learns any of the distributions in this family from the data, and uses the learned parameters to obtain a ranking, then this ranking is identical to the ranking we obtain directly from the data (i.e. no need to learn the distribution)!

As for the mode of the distribution, we assume a maximum entropy underlying model, and derive a concise parameterization of the model using O(n²) parameters (Section 2.2.2). We also show that finding the mode of the distribution is equivalent to solving an optimization problem on these parameters. In the case of first-order marginals (see [25]), this problem is easy and can be solved using max-weight matching on a bipartite graph. We also provide an efficient heuristic for finding the mode in the case of first-order marginal data directly from the data, without learning the max-ent distribution. In the case of comparison data, the problem is more challenging; we provide a 2-approximation algorithm for finding the mode. In Section 2.2.3, for the top-k ranking problem, we propose a score that emphasizes the top-k items. We show that this score can be computed exactly and directly from first-order marginal data, and approximated using the max-ent distribution in the case of comparison data.
2.2 Main Results
Before we get into the details of estimating the distribution, let's consider the problem of ranking given said distribution. More precisely, let's assume that we are given a distribution over permutations μ, and asked to obtain an ordered list of the items of interest that reflects the collective preference implied by the distribution. A classical approach in this setting is the axiomatic one: one comes up with a set of axioms that the ranked list should satisfy, and then tries to come up with a ranking function or algorithm that satisfies these axioms. Unfortunately, seemingly natural axioms cannot all be simultaneously satisfied by any algorithm [5].

In this section, we opt for a non-axiomatic approach to aggregating preferences. We address the problems of finding: (a) an overall aggregate ranking of all items, (b) the mode of the distribution, and (c) a top-k ranking. These problems demonstrate the utility of having or assuming an underlying distribution, and give rise to situations where one can bypass the learning step and use the data directly for ranking. In the latter situation, one gets the conceptual benefit of assuming a distribution without performing complicated computations to obtain a ranking.
2.2.1 Aggregate Ranking
Here we propose a method to obtain an entire ranking of all objects. Building on the intuition behind popular voting rules, the basic premise is that objects that are ranked higher more frequently should get a higher ranking. This can be formalized as follows: for any monotonically strictly increasing non-negative function f : ℕ → [0, ∞), define the score S_f(i) for object i as

    S_f(i) = Σ_{k=1}^{n} f(n − k) · Pr(σ(i) = k).        (2.3)

The choice of f(x) = x^p assigns the pth norm of the distribution of σ(i) as the score of object i.

One can take this line of reasoning further by noting that the exponential function, f_ε(x) = exp(εx) for a given ε > 0, effectively captures the combined effect of all p-norms. Therefore we propose what we call the ε-ranking, with scores defined as:

    S_ε(i) = Σ_{k=1}^{n} exp(−εk) · Pr(σ(i) = k).        (2.4)

By selecting ε ~ ln k, the scores effectively capture the occurrence of objects in the top k positions only; for ε near 0, they capture the effect of lower p moments more prominently. Furthermore, intermediate choices of ε give effective rankings for various scenarios. A small numerical sketch of this family of scores follows.
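Given the first-order marginal matrix M, any such score is a single matrix-vector product. A minimal sketch, with an illustrative (doubly stochastic) M and illustrative choices of f:

```python
import numpy as np

M = np.array([[0.6, 0.3, 0.1],     # illustrative first-order marginals: row i is the
              [0.3, 0.4, 0.3],     # distribution of item i's position k = 1, 2, 3
              [0.1, 0.3, 0.6]])

def score(M, f):
    """S_f(i) = sum_k f(n - k) * P(sigma(i) = k), for positions k = 1, ..., n."""
    n = M.shape[0]
    return M @ f(n - np.arange(1, n + 1, dtype=float))

S1 = score(M, lambda x: x)                 # f(x) = x: the p = 1 score used below
Se = score(M, lambda x: np.exp(0.5 * x))   # f(x) = exp(eps * x) with eps = 0.5
print(S1, Se)                              # both rank item 0 first here
```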
We focus our attention on the case where p = 1, and present a score that can take us directly from the data to the ranking, without the intermediate step of learning the distribution. We refer to this ranking as the ℓ₁ ranking.
ℓ₁ Ranking
The ℓ₁ score is given by:

    S₁(i) = Σ_{k=1}^{n} (n − k) · Pr[σ(i) = k].

In the case of first-order marginal data, this score can be computed in a straightforward way. For comparison data, however, the marginals Pr[σ(i) = k] are not available without having the distribution. Fortunately, the score above can be computed from the data directly in the following form:

    S(i) = (1/(n−1)) Σ_{j ≠ i} Pr[σ(i) < σ(j)] = (1/(n−1)) Σ_{j ≠ i} c_{ij},

using the following lemma:

Lemma 1. Given the definitions of S(i) and S₁(i) above, we have S₁(i) = (n − 1) · S(i); in particular, the two scores induce the same ranking.
A proof of this lemma is provided in Section 2.5. One interesting aspect of this shortcut is that the equivalence between the different scores does not assume any particular distribution; it only assumes that the underlying distribution is consistent with the data. This suggests that the produced ranking should work with different distributions. One family of such distributions is the one based on the celebrated model of Thurstone [43][16, Ch 9], as we shall see in the next section.
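A minimal sketch of this shortcut, assuming the comparison matrix C is given: the score is a (diagonal-excluded) row average, so the ℓ₁ ranking costs O(n²) with no model fitting at all.

```python
import numpy as np

def l1_ranking(C):
    """Rank items by S(i) = (1/(n-1)) * sum_{j != i} c_ij, highest score first."""
    n = C.shape[0]
    S = (C.sum(axis=1) - np.diag(C)) / (n - 1)   # exclude the (meaningless) diagonal
    return np.argsort(-S), S                      # descending by score

C = np.array([[0.0, 0.8, 0.9],    # illustrative comparison marginals: C[i, j] is the
              [0.2, 0.0, 0.7],    # fraction of the population preferring i to j
              [0.1, 0.3, 0.0]])
order, S = l1_ranking(C)
print(order, S)                   # [0 1 2]: item 0 is ranked first
```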
Why ℓ₁ Ranking?
Here we demonstrate the utility of our ℓ₁ ranking by showing its equivalence to the ranking obtained using a Thurstone model. In a Thurstone model, preferences over n items come from a 'hidden' process as follows: the 'favorability' of each item i is a random variable X_i = u_i + Z_i, where u_i is an unknown parameter (also known as the skill parameter), and Z_i is a random variable with some distribution. Furthermore, the random variables Z₁, ..., Z_n are identically distributed. If we take the tuple (X₁, X₂, ..., X_n) to be the outcome of some trial, then item j is ranked in position k if x_j is ranked kth among the values x₁, ..., x_n. Equivalently, item i is preferred to item j if x_i > x_j. In a typical application of such models, one observes these comparisons or positional rankings, and uses these observations to infer the values of the unknown parameters u₁, ..., u_n. These values are then used to find a ranking over all items. More precisely, items are ranked i ≻ j ≻ ... ≻ k if u_i > u_j > ... > u_k.

As it turns out, the ranking obtained by following the algorithm based on ℓ₁ scores is equivalent to the ranking one would get by fitting a Thurstone model. The formal statement is as follows:
Theorem 1. Let u_i and u_j be the (skill) parameters assigned to items i and j (respectively) in a Thurstone model, and let S(i) and S(j) be the scores assigned to the same items using our method (ℓ₁ scores). Then:

    u_j ≤ u_i  ⟺  S(j) ≤ S(i),  ∀ i, j.        (2.5)
A proof of this theorem is provided in Section 2.5. Thurstone models have been
used in a wide range of applications such as revenue management in airline ticket
sales, and player ranking in online gaming platforms (e.g. a variant of this model is
used in Microsoft's TrueSkill [24]).
2.2.2 The Mode
Given a distribution over permutations that is consistent with the data, in the context of object tracking (à la Huang et al. [25]), one would like to find the most likely permutation under said distribution, i.e. the mode. It is easy to see that the mode of a distribution over permutations is hard to compute in general. To address this difficulty, one might want to follow some criteria for selecting a tractable class of distributions to deal with. Ideally, we would like distributions from this class to obey the constraints given by the data, without imposing any additional structure. This intuitive requirement is captured by the Maximum Entropy criterion, whereby we choose a distribution that maximizes the information entropy while satisfying the data constraints. In the following section, we provide a formal derivation of the maximum entropy distribution along those lines.
The Maximum Entropy Model
Formally, the observations M or C impose the constraint that the distribution μ should belong to the class M:

    Σ_{σ ∈ S_n} μ(σ) · 1{σ(i) = k} = m_{ik},  ∀ i, k ∈ K,        (2.6)

or to the class C:

    Σ_{σ ∈ S_n} μ(σ) · 1{σ(i) < σ(j)} = c_{ij},  ∀ i, j ∈ K,        (2.7)

together with the normalization and non-negativity constraints in both cases:

    Σ_{σ ∈ S_n} μ(σ) = 1,  μ(σ) ≥ 0,  ∀ σ ∈ S_n.        (2.8)

M (resp. C) is non-empty only if M (resp. C) is generated by a distribution over S_n to begin with. For clarity of exposition, we will assume that this is the case. When this is not the case, the algorithm that we shall present is based on solving the Lagrangian dual of an appropriate optimization problem in which the constraints imposed by M (resp. C) are "dualized"; therefore, by construction, such an algorithm is robust.

Now |S_n| = n! and data of type M (resp. C) imposes O(n²) constraints. Therefore, there could be multiple solutions. The max-ent principle suggests that we choose the one that has maximal entropy in the class M (resp. C). Philosophically, we follow this approach since we wish to utilize the information provided by the data and nothing else, i.e. we do not wish to impose any additional structure beyond what the data suggests. It is also well known that such a distribution provides the maximum likelihood estimate over a certain class of exponential family distributions (cf. [47]).
In effect, the goal is to find the distribution that solves the following optimization:

    max_ν  H(ν) = −Σ_{σ ∈ S_n} ν(σ) log ν(σ)
    subject to  ν ∈ M or C.        (2.9)

It can be checked that the Lagrangian dual of this problem is as follows (since all entries of M, C are in (0, 1)): let λ_{ik} be the dual variables associated with the marginal consistency constraints for M in (2.6). Then, the dual takes the following form:

    max_λ  Σ_{i,k} λ_{ik} m_{ik} − log( Σ_σ exp( Σ_{i,k} λ_{ik} · 1{σ(i) = k} ) ).        (2.10)

It can be shown that this is a strictly concave optimization problem with a unique optimal solution; let it be λ* = [λ*_{ik}]. Then the corresponding primal optimal solution of (2.9) (with M) is given by

    μ(σ) ∝ exp( Σ_{i,k ∈ K} λ*_{ik} · 1{σ(i) = k} ).        (2.11)

Similarly, for the comparison data, the dual optimization takes the form

    max_λ  Σ_{i,j} λ_{i<j} c_{ij} − log( Σ_σ exp( Σ_{i,j} λ_{i<j} · 1{σ(i) < σ(j)} ) ),        (2.12)

and the optimal primal of (2.9), given the optimal dual λ* = [λ*_{i<j}], is

    μ(σ) ∝ exp( Σ_{i ≠ j} λ*_{i<j} · 1{σ(i) < σ(j)} ).        (2.13)

As can be seen, in either case the maximum entropy distribution is parameterized by at most n² parameters, which is the same as the degrees of freedom of the received data. For future purposes, and with a slight abuse of notation, we shall use F(λ) to represent the objective of both Lagrangian dual optimization problems (2.10) and (2.12).
Computing the Mode
Having restricted our attention to the maximum entropy distribution, we now proceed to compute the mode. We begin by providing an algorithm for computing the mode exactly in the case of first-order marginal data. We then present a more efficient algorithm for approximating the same mode directly from the data, without the need to learn the max-ent distribution. Finally, we present an algorithm that uses the max-ent distribution to compute a 2-approximation of the mode in the general case.

Recall that under the maximum-entropy distribution, the logarithm of the probability of a permutation σ is proportional to Σ_{i,k} λ_{ik} · 1{σ(i) = k} for first-order marginal data, and to Σ_{i,j} λ_{i<j} · 1{σ(i) < σ(j)} for comparison data. Since the log function is monotone, finding the mode, in both cases, boils down to finding:

    σ* ∈ argmax_{σ ∈ S_n} Σ_{i,k} λ_{ik} · 1{σ(i) = k}        (2.14)

    σ* ∈ argmax_{σ ∈ S_n} Σ_{i,j} λ_{i<j} · 1{σ(i) < σ(j)}        (2.15)

Solving the problem in (2.14) exactly is equivalent to the following maximum weight matching problem: consider an n × n complete bipartite graph in which the edge between node i on the left and node k on the right has weight λ_{ik}. A matching is a subset (of size n) of edges such that no two edges are incident on the same vertex. Let the weight of a matching be the sum of the weights of its edges. Then the maximum weight matching in this graph precisely solves (2.14). This is a well-known instance of the classical network flow problem and has strongly polynomial time algorithms [18]. It also admits distributed iterative algorithms, including the auction algorithm of Bertsekas [7] and the recently popular (max-product) belief propagation [6]. Thus, overall, finding the mode of the distribution in the case of first-order marginals is easy and admits a distributed algorithmic solution.
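Given (learned) parameters λ = [λ_{ik}], the assignment problem (2.14) can be solved with any off-the-shelf matching routine. A sketch using SciPy's linear assignment solver, with illustrative weights:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mode_first_order(lam):
    """sigma* = argmax_sigma sum_{i,k} lam[i, k] * 1{sigma(i) = k}, solved as a
    maximum-weight bipartite matching (items on the left, positions on the right)."""
    rows, cols = linear_sum_assignment(lam, maximize=True)
    sigma = np.empty(lam.shape[0], dtype=int)
    sigma[rows] = cols                           # sigma[i] = position matched to item i
    return sigma

lam = np.array([[0.6, 0.3, 0.1],                 # illustrative weights; the heuristic
                [0.3, 0.4, 0.3],                 # described next simply plugs in
                [0.1, 0.3, 0.6]])                # lam[i, k] = m_ik
print(mode_first_order(lam))                     # [0 1 2]: the identity permutation
```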
Next, we describe a (heuristic) method for finding the mode in the case of first-order marginal data without requiring the intermediate step of finding the max-ent parameters λ. Declare the solution of the following optimization as the mode:

    max_σ  Σ_{i,k} m_{ik} · 1{σ(i) = k}.

That is, in place of λ_{ik}, use m_{ik}. The intuition is that λ_{ik} is higher if m_{ik} is, and vice versa. While there is no direct relation between this heuristic and the mode of the max-ent approximation, we state the following result, which establishes the heuristic as a 'first-order' approximation. A proof is provided in Section 2.5.
Theorem 2. For λ = [λ_{ik}] in a small enough neighborhood of 0 = [0], and with λ normalized so that its row and column sums are zero (a re-centering that leaves the distribution unchanged),

    m_{ik} = 1/n + (1/(n−1)) λ_{ik} + O(‖λ‖²).
For comparison data, the problem in (2.15) is also equivalent to a combinatorial problem over the space of matchings. However, it does not admit as nice a representation as above. One way to represent permutations in comparison form is via n × n matrices, say B = [B_{ij}], with (a) each entry B_{ij} being +1 or −1 for all 1 ≤ i, j ≤ n, (b) B_{ij} + B_{ji} = 0 for all 1 ≤ i, j ≤ n (anti-symmetry), and (c) if B_{ij} = B_{jk} = 1, then B_{ik} = 1, for all 1 ≤ i, j, k ≤ n (transitivity). The goal is to find B so that Σ_{i,j} B_{ij} λ_{i<j} is maximized. It is not clear if this is an easy problem.

To address this problem, we have the following 2-approximation algorithm to compute the mode using the parameters of the max-ent distribution: choose L permutations uniformly at random, compute their weights (defined as per (2.15)), and select the one with maximal weight among these L permutations. For L large enough, the selected permutation has essentially half the weight of the maximum weight permutation. This requires λ to have all non-negative components. This is not an issue: given the structure of permutations (each gets exactly the same number of comparisons σ(i) < σ(j) correct), an affine shift of λ by a vector with all components equal to the same constant does not change the distribution. Therefore, in principle, we could require the subgradient algorithm to be restricted to the non-negative domain (projected version). The formal statement about this algorithm is stated below.
Theorem 3. Let λ = [λ_{i<j}] be a non-negative vector, and let OPT be the maximum of Σ_{i,j} λ_{i<j} · 1{σ(i) < σ(j)} over all permutations σ ∈ S_n. Then, in the above described randomized algorithm, if we choose L ≥ (2/δ) ln(1/ε), we have

    Pr[ W(σ̂) < ((1 − δ)/2) · OPT ] ≤ ε,

where σ̂ is the selected permutation and W(·) is its weight as per (2.15).
A proof of this theorem is included in Section 2.5. To complete the solution, we only need to estimate the parameters of the max-ent distribution; an algorithm is provided in Section 2.3. A sketch of the randomized scheme itself is shown below.
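A minimal sketch of the randomized scheme, assuming non-negative λ as the theorem requires (the parameter values are illustrative):

```python
import numpy as np

def weight(sigma, lam):
    """W(sigma) = sum_{i,j} lam[i, j] * 1{sigma(i) < sigma(j)}, as in (2.15)."""
    return float(lam[sigma[:, None] < sigma[None, :]].sum())

def approx_mode(lam, L, rng):
    """Keep the best of L uniformly random permutations (randomized 2-approximation)."""
    n = lam.shape[0]
    best, best_w = None, -np.inf
    for _ in range(L):
        sigma = rng.permutation(n)                 # sigma[i] = position of item i
        w = weight(sigma, lam)
        if w > best_w:
            best, best_w = sigma, w
    return best, best_w

rng = np.random.default_rng(0)
lam = rng.uniform(0.0, 1.0, size=(5, 5))           # illustrative non-negative parameters
np.fill_diagonal(lam, 0.0)
print(approx_mode(lam, L=200, rng=rng))
```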
2.2.3 Top-K Ranking
Here the interest is in finding a ranking that emphasizes the top k objects (the favorites). To do this, we could compute the aggregate ranking, or the mode, and then declare the top k ranked objects in the resulting list. Instead, we propose a natural way to emphasize the favorites directly. Intuitively, if an object is ranked among the top k positions by a large fraction (probability-wise) of the permutations in the distribution, then it ought to be among the favorites. This suggests that, for a distribution with parameters λ, each object i can be given a score S_k(i), defined as

    S_k(i) = Pr_λ[σ(i) ≤ k].

In the case of first-order marginal data, this score is nothing but Σ_{l ≤ k} m_{il}, and can be computed directly from the data. In the case of comparison data, this score can be inferred from the max-ent distribution, which can be learned by the procedure outlined in Section 2.3. Finally, once the score is computed, we declare the k objects with the highest scores as per S_k(·) as the top k. A sketch for the first-order marginal case follows.
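For first-order marginal data the computation is a partial row sum of M; a minimal sketch with illustrative marginals:

```python
import numpy as np

def top_k(M, k):
    """S_k(i) = P[sigma(i) <= k]: sum of the first k first-order marginals of item i."""
    S = M[:, :k].sum(axis=1)            # columns 0..k-1 are positions 1..k
    return np.argsort(-S)[:k], S        # the k items with the highest scores

M = np.array([[0.6, 0.3, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])         # illustrative first-order marginals
print(top_k(M, 2))                      # items 0 and 1 occupy the top-2 most often
```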
2.3 Learning the Max-Ent Model
Here we describe an iterative, distributed sub-gradient algorithm that solves the dual optimization problems (2.10) and (2.12). First, we describe an idealized procedure that calls a certain oracle that computes marginals of a distribution from the exponential family. In general, we can only hope to estimate these marginals approximately, because exact computation, as we show later, is #P-hard. Therefore, the main result that we state is for a sub-gradient algorithm based on such an approximate oracle. In a later section, we shall describe how to design such an approximate oracle in a distributed manner, along with its associated computational cost.
MaxEnt Estimation: Using an Ideal Oracle
Input: ranking data m_{ik}, ∀ i, k.
1: Initialize: λ⁰_{ik} = 0, ∀ i, k.
2: for t = 1, ..., T do
3:    λ^{t+1} ← λ^t + (1/√t) (m − E_{λ^t}[1{σ(i) = k}])    (E_{λ^t}[1{σ(i) = k}] is provided by an oracle)
4: end for
5: Choose τ ∈ {1, ..., T} at random so that Pr(τ = t) ∝ 1/√t.
6: return λ^τ
Here, E_{λ^t}[1{σ(i) = k}] = Σ_{σ ∈ S_n} P_{λ^t}(σ) · 1{σ(i) = k}, where

    P_{λ^t}(σ) = exp( Σ_{i,k} λ^t_{ik} · 1{σ(i) = k} ) / Z(λ^t),

with the normalizing constant (partition function) defined as:

    Z(λ^t) = Σ_{σ ∈ S_n} exp( Σ_{i,k} λ^t_{ik} · 1{σ(i) = k} ).

Instead of E_{λ^t}[1{σ(i) = k}], we will use a randomized estimate Ê_{λ^t}(i, k), such that the error vector e(t) = [e_{ik}(t)], where each component

    e_{ik}(t) = E_{λ^t}[1{σ(i) = k}] − Ê_{λ^t}(i, k),

is sufficiently small.
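For intuition, the following sketch instantiates the procedure for a tiny n, playing the role of the ideal oracle by exact enumeration of S_n; the 1/√t step size is an assumption consistent with the sampling rule for τ. At realistic scales, this oracle must be replaced by the approximate one discussed below.

```python
import itertools
import numpy as np

def exact_marginals(lam):
    """Ideal oracle: E_lambda[1{sigma(i) = k}] by enumerating S_n (tiny n only)."""
    n = lam.shape[0]
    perms = np.array(list(itertools.permutations(range(n))))    # sigma[i] = position of item i
    logw = np.array([lam[np.arange(n), s].sum() for s in perms])
    p = np.exp(logw - logw.max())
    p /= p.sum()                                                # P_lambda(sigma)
    E = np.zeros((n, n))
    for prob, s in zip(p, perms):
        E[np.arange(n), s] += prob
    return E

def maxent_subgradient(M, T, rng):
    """Sub-gradient ascent on the dual (2.10): the gradient at lambda is M - E_lambda."""
    lam = np.zeros_like(M)
    iterates = []
    for t in range(1, T + 1):
        lam = lam + (M - exact_marginals(lam)) / np.sqrt(t)     # assumed 1/sqrt(t) step
        iterates.append(lam)
    probs = 1.0 / np.sqrt(np.arange(1, T + 1))
    tau = rng.choice(T, p=probs / probs.sum())                  # P(tau = t) proportional to 1/sqrt(t)
    return iterates[tau]

M = np.array([[0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.1, 0.3, 0.6]])
lam_hat = maxent_subgradient(M, T=200, rng=np.random.default_rng(0))
print(np.round(exact_marginals(lam_hat), 2))                    # close to M
```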
We state the following result about the convergence of this
algorithm.
Theorem 4. Suppose that each iteration of the sub-gradient algorithm uses an approximate estimate Ê(·, ·) such that

    ‖e(t)‖₁ ≤ ε / ( Δ(t) + ‖λ*‖_∞ + ‖λ*‖₂² ),  where Δ(t) = Σ_{s=1}^{t} 1/√s,

and λ* is a solution of the optimization problem. Then, for any γ > 0, for a choice of T = Θ( ε^{−(2+γ)} (1 + ‖λ*‖_∞ + ‖λ*‖₂²)² ), we have

    E[F(λ_τ)] ≥ F(λ*) − ε,

where F(·) is the objective of the dual optimization (2.10). The identical result holds for the comparison information (2.12).
A proof of this theorem is included in Section 2.5.
An Approximate Oracle

Theorem 4 relies on the existence of an oracle that can estimate the marginals approximately, with the appropriate accuracy, at each time step t. Computing the marginals exactly is computationally hard. For first-order marginal data, this follows from [3]. We prove a similar result for comparison data. Both results are summarized by the following theorem:
Theorem 5. Given a max-ent distribution with parameter λ, computing E_λ[1{σ(i) = k}] and E_λ[1{σ(i) < σ(j)}] is #P-hard.
A proof is included in Section 2.5. We now describe an approximate oracle. We shall restrict our description to a Markov Chain Monte Carlo (MCMC) based oracle. In principle, one may use heuristics like Belief Propagation to estimate these marginals instead of MCMC (of course, this may lead to the loss of the performance guarantee).

Now the computation of marginals requires computing P_{λ^t}(σ) for any σ ∈ S_n. From its form, the basic challenge is in computing the partition function Z(λ^t). The computation of Z(λ^t) is the same as the computation of the permanent of a non-negative valued matrix A = [A_{ik}] with A_{ik} = e^{λ_{ik}}. In an amazing work, Jerrum, Sinclair and Vigoda [27] designed a Fully Polynomial Time Randomized Approximation Scheme (FPRAS) for computing the permanent of any non-negative valued matrix. That is, Z(λ^t) (and hence P_{λ^t}(σ)) can be computed within multiplicative accuracy (1 ± ε) in time polynomial in 1/ε, n, and log(1/δ), with probability at least 1 − δ. Therefore, it follows that the desired guarantee in Theorem 4 can be provided for all time steps (using a union bound) with probability at least 1 − 1/n in time polynomial in n, building upon the algorithm of [27].
For the case of comparison information, however, no such FPRAS for computing the partition function is known. Therefore, we suggest a simple MCMC based algorithm and provide the obvious (exponential) bound for it. To that end, define W_λ(σ) = Σ_{i,j} λ_{i<j} · 1{σ(i) < σ(j)}, and construct a Markov chain, M(λ), whose state space is the set of all permutations, S_n, and whose transitions from a given state σ to a new state σ' are given as follows:
MaxEnt Model Estimation
1: With probability 1/2, let σ' = σ.
2: Otherwise, construct σ̃ as follows:
   ◦ Choose two elements i and j uniformly at random; set σ̃(i) = σ(j), σ̃(j) = σ(i), and σ̃(k) = σ(k) for all k ≠ i, j.
   ◦ Set σ' = σ̃ with probability min{1, exp(W_λ(σ̃) − W_λ(σ))}; else set σ' = σ.
Using this Markov chain, we estimate E_λ[1{σ(i) < σ(j)}] as follows: starting from any initial state, run the Markov chain for T_C steps and then record its state, say σ̄; if σ̄(i) < σ̄(j), record 1, else record 0. Repeat this S times and take the empirical average of the recorded 0/1 values. Declare this as the estimate of E_λ[1{σ(i) < σ(j)}]. Indeed, one simultaneously obtains such estimates for all i, j. We have the following bound on T_C, which we establish in Section 2.5:
34
Theorem 6. The above stated Markov chain has stationary distribution p* such that p*(σ) ∝ exp(W_λ(σ)). Let p(t) be the distribution of the Markov chain after t steps starting from any initial condition. Then for any given δ > 0, there exists

T_c = Θ( exp( Θ( n² ‖λ‖_∞ + n log n ) ) log(1/δ) )

such that for t ≥ T_c,

‖ p(t)/p* − 1 ‖_{2,p*} ≤ δ,

where ‖·‖_{2,p*} is the χ² distance.
Now, the total variation distance between p(t) and p* is smaller than the χ² distance between them. Therefore, by Theorem 6, it follows that the estimation error of P_λ(·) using p(t) will be at most δ. From Chernoff's bound, by selecting S (mentioned above) to be O(δ^{−2} log n) (with a large enough constant), it will follow that the estimated empirical marginals for all i, j components must be within error O(δ) with probability 1 − 1/poly(n). Given that the total increment in each component of λ under the sub-gradient algorithm is O(√T) by time T, it follows from Theorem 4 that ‖λ‖_∞ = O((n + ‖λ*‖_∞ + ‖λ*‖₂)^{1+γ}) (for any choice of γ > 0 in Theorem 4). Finally, since the smallest δ required in Theorem 4 is an inverse polynomial in n and ε, it follows from the above discussion that the overall cost of the approximate oracle required for the comparisons effectively scales as exp(Θ(n^{3+γ})) (ignoring other smaller order terms).
2.4 Evaluation and Scalability
Here we provide results from a simple experiment to demonstrate that the ranking produced by our algorithm converges to the right ranking for the Multinomial Logit (MNL) model, an instance of the Thurstone model (choose the Z_i's to be i.i.d. with the logistic distribution). Specifically, we sample distinct items i and j from {1, . . . , n} uniformly at random, and consult an MNL model, defined using n parameters, for the value of 1{σ(i)<σ(j)}. All the samples are then combined into a matrix [c_ij], which is used to find the ranking. In Figure 2-1, we show a plot of the error, measured using the normalized number of discordant pairs, versus the number of samples used, for n = 10. As we can see, beyond 500 samples, the induced error is extremely small.
[Plot: error (normalized number of discordant pairs, ranging from 0 to 0.9) versus number of samples (500 to 5000).]
Figure 2-1: Plot of the error induced by the ranking algorithm for the MNL model, a specific instance of Thurstone's model.
To test the scalability of our method, we implemented a voting/survey tool that enables a large number of participants to vote on any number of items in real time. In doing so, we had the following two questions in mind: in addition to being theoretically interesting, can our algorithm be applied in real time? And is comparison-based voting practical and simple enough for adoption? Through this experiment, we believe the answer to both questions to be affirmative.
Our tool was installed in voting booths that were made available to the visitors of the MIT150 event [1], a university-wide public open house. Voting categories included movies, actors, musicians, and athletes, among others, and the results at any time were continuously displayed on a large screen. The participation was impressive, and the feedback was mostly positive, which makes us believe that adopting comparison as a form of voting is worth serious consideration.
2.5 Proofs
This section provides detailed proofs of the results stated earlier in the chapter.
2.5.1 Proof of Lemma 1
With some arithmetic manipulation, we get

S₂(i) = (1/(n−1)) Σ_{j≠i} E[1{σ(i) < σ(j)}]
      = (1/(n−1)) E[ Σ_{j≠i} 1{σ(i) < σ(j)} ]
      = (1/(n−1)) E[ n − σ(i) ]
      = (1/(n−1)) Σ_{k=1}^{n} (n − k) E[1{σ(i) = k}]
      = S₁(i).
2.5.2 Proof of Theorem 1
Recall that, under Thurstone's model, each item i has a "skill" parameter u_i associated with it. The random "favorability" is X_i = u_i + Z_i, where the Z_i are i.i.d. random variables with some distribution. Our algorithm, with access to exact partial marginal data (first-order or comparison), computes scores for each item i: S₁(i) using first-order data and S₂(i) using comparison data. As proved in Lemma 1, these two scores are equivalent. Therefore, if we establish that u_i > u_j if and only if S₂(i) > S₂(j), the equivalent statement holds for S₁ as well. We shall establish this statement in two parts: (a) u_i > u_j, and (b) u_i = u_j.
Let us start with the first case, u_i > u_j. Recall that the score for an item i is

S(i) = (1/(n−1)) Σ_{k≠i} P[X_i > X_k].

Therefore, for i ≠ j,

S(i) − S(j) ∝ ( Σ_{k≠i} P[X_i > X_k] ) − ( Σ_{ℓ≠j} P[X_j > X_ℓ] )
            = ( P[X_j < X_i] − P[X_i < X_j] ) + Σ_{ℓ≠i,j} ( P[X_j < X_ℓ] − P[X_i < X_ℓ] ).    (2.16)
Recall that X_i = u_i + Z_i and X_j = u_j + Z_j, where u_i, u_j are the "skill" parameters of i and j respectively, while Z_i, Z_j are i.i.d. random variables with some distribution. Define W_ij = Z_i − Z_j. Then for all i, j, the W_ij are identically distributed, say with the distribution of a random variable W with CDF F_W, i.e. F_W(x) = P[W ≤ x]. Since W is the difference of two independent and identically distributed random variables, it is by definition 'symmetric' around 0. That is, for any x > 0,

P[W ≤ −x] = P[W ≥ x].    (2.17)
Given these notations, it follows that

P[X_i < X_j] = P[W_ij < u_j − u_i] = F_W(u_j − u_i).    (2.18)

Similarly,

P[X_j < X_i] = F_W(u_i − u_j),
P[X_i < X_ℓ] = F_W(u_ℓ − u_i),
P[X_j < X_ℓ] = F_W(u_ℓ − u_j).    (2.19)

Since u_i > u_j, we have u_ℓ − u_j > u_ℓ − u_i for any ℓ ≠ i, j. Since F_W is a CDF and hence monotonically non-decreasing, i.e. F_W(x) ≤ F_W(y) for all x ≤ y,

F_W(u_ℓ − u_j) − F_W(u_ℓ − u_i) ≥ 0,    (2.20)
for all ℓ ≠ i, j. Also, let δ = u_i − u_j > 0. Then, from the above discussion, (2.16) becomes

S(i) − S(j) ∝ ( F_W(δ) − F_W(−δ) ) + Σ_{ℓ≠i,j} ( F_W(u_ℓ − u_j) − F_W(u_ℓ − u_i) ).    (2.21)
Now,

F_W(δ) − F_W(−δ) = P[W ∈ (−δ, δ]] ≥ P[|W| ≤ δ/2].    (2.22)

As we shall show next, for any distribution of the Z's, W is such that for any γ > 0,

P[|W| ≤ γ] > 0.    (2.23)

From (2.20)–(2.23) (with γ = δ/2 in the last equation), it follows that if u_i > u_j, then

S(i) − S(j) > 0.    (2.24)
Now we establish (2.23). Note that since Z (distributed as Z_i, Z_j) has a proper distribution, by tightness there exists [−a, a] ⊂ ℝ for some a > 0 such that P[Z ∈ [−a, a]] ≥ 1/2. Given any γ > 0, partition this interval into at most N = ⌈4a/γ⌉ disjoint contiguous intervals, each of length γ/2. One of these intervals must have probability at least 1/(2N); call this interval I. That is, P[Z ∈ I] ≥ 1/(2N). Since Z_i, Z_j are distributed independently with the same distribution as Z, we have that

P[Z_i ∈ I, Z_j ∈ I] ≥ 1/(4N²) > 0.    (2.25)

But when both Z_i and Z_j are in I, their difference W = Z_i − Z_j must be within [−γ/2, γ/2]. This completes the justification of (2.23).

For case (b), u_i = u_j, using identical arguments as above, one can argue that S(i) = S(j). This completes the proof of Theorem 1. □
2.5.3 Proof of Theorem 2
Let λ be in a neighborhood of 0 = [0]. We shall establish the claim by means of a Taylor expansion of m as a function of λ around 0. For simplicity, let us denote σ_ij = 1{σ(i)=j}. Then

m_ij(λ) = Σ_{σ∈S_n} σ_ij (1/Z(λ)) exp( Σ_{k,l} λ_kl σ_kl ),

where the partition function Z(λ) = Σ_{σ∈S_n} exp( Σ_{k,l} λ_kl σ_kl ). For λ = 0, we have m_ij(0) = 1/n for all i, j. By the first-order Taylor expansion, for λ near 0,

m_ij(λ) ≈ m_ij(0) + Σ_{k,l} λ_kl ( ∂m_ij(λ)/∂λ_kl )|_{λ=0}.    (2.26)
By the property of the exponential family (see [47], for example), it follows that

∂m_ij(λ)/∂λ_kl = E_λ[σ_ij σ_kl] − E_λ[σ_ij] E_λ[σ_kl] = E_λ[σ_ij σ_kl] − m_ij(λ) m_kl(λ).    (2.27)

From (2.26) and (2.27), it follows that for λ near 0,

m_ij(λ) ≈ 1/n + Σ_{k,l} λ_kl ( E_0[σ_ij σ_kl] − 1/n² ).    (2.28)
We state the following proposition.

Proposition 1. All distributions can be represented by a λ such that

Σ_k λ_ik = 0,  Σ_k λ_kj = 0,  for 1 ≤ i, j ≤ n.    (2.29)
Proof. Consider a λ such that (2.29) is not satisfied. We will transform λ into a ν which satisfies (2.29) but induces exactly the same distribution. Specifically, we shall prove that, for each σ ∈ S_n, the weight Σ_{k,l} ν_kl σ_kl differs from Σ_{k,l} λ_kl σ_kl by a constant that does not depend on σ. To that end, define

ν_ij = λ_ij − (1/n) λ_i· − (1/n) λ_·j + (1/n²) λ_·· ,

where

λ_i· = Σ_{k=1}^n λ_ik,  λ_·j = Σ_{k=1}^n λ_kj,  λ_·· = Σ_{k,l=1}^n λ_kl.    (2.30)
Then,

ν_i· = Σ_{k=1}^n ν_ik = λ_i· − λ_i· − (1/n) λ_·· + (1/n) λ_·· = 0.

Similarly, we can check that ν_·j = 0 for all j, so ν satisfies (2.29). Now, for any σ ∈ S_n, using the fact that [σ_kl] is a permutation matrix (so that Σ_l σ_kl = 1 for each k and Σ_k σ_kl = 1 for each l), we have:

Σ_{k,l} ν_kl σ_kl = Σ_{k,l} λ_kl σ_kl − (1/n) Σ_k λ_k· − (1/n) Σ_l λ_·l + (1/n²) λ_·· · n
                  = Σ_{k,l} λ_kl σ_kl − (1/n) λ_·· .

Therefore, the weights induced by ν and λ differ by the constant (1/n) λ_·· , independent of σ, and this constant is absorbed into the partition function. This implies that the distributions induced by λ and ν are identical. □
Given Proposition 1, we shall assume a λ satisfying (2.29) without loss of generality. Then, from (2.28),

m_ij(λ) ≈ 1/n + Σ_{k,l} λ_kl E_0[σ_ij σ_kl] − (1/n²) λ_·· = 1/n + Σ_{k,l} λ_kl E_0[σ_ij σ_kl].    (2.31)
Now,

E_0[σ_ij σ_kl] = 1/n if k = i and l = j;  1/(n(n−1)) if k ≠ i and l ≠ j;  0 otherwise.

Then from (2.31),

m_ij(λ) = 1/n + (1/n) λ_ij + (1/(n(n−1))) Σ_{k≠i, l≠j} λ_kl.    (2.32)
Focusing on the last term, and using (2.29),

Σ_{k≠i, l≠j} λ_kl = λ_·· − λ_i· − λ_·j + λ_ij = λ_ij.

Combining this with (2.32), we get

m_ij(λ) − 1/n ≈ ( 1/n + 1/(n(n−1)) ) λ_ij = (1/(n−1)) λ_ij,

as desired. □
2.5.4 Proof of Theorem 3
Denote the difference between the weight of a permutation σ_i and that of the mode σ* by Δ_i, defined as

Δ_i = W(σ*) − W(σ_i).

Since the permutations σ_1, . . . , σ_k are drawn uniformly at random, for each permutation σ_i we have

E[W(σ_i)] = Σ_{i<j} λ_{i<j} E[1{σ(i)<σ(j)}] = (1/2) Σ_{i<j} λ_{i<j} ≥ (1/2) W(σ*).

And therefore,

E[Δ_i] = W(σ*) − E[W(σ_i)] ≤ (1/2) W(σ*).
Since σ̂ is chosen to have the maximum weight W(·) among all k permutations, and since these permutations are drawn independently, we have:

P[ W(σ̂) < (1/2 − δ) W(σ*) ] = Π_{i=1}^k P[ W(σ_i) < (1/2 − δ) W(σ*) ] = Π_{i=1}^k P[ Δ_i > (1/2 + δ) W(σ*) ].

Using the Markov inequality, we get:

P[ W(σ̂) < (1/2 − δ) W(σ*) ] ≤ Π_{i=1}^k (1/2)/(1/2 + δ) = Π_{i=1}^k 1/(1 + 2δ) ≈ Π_{i=1}^k (1 − 2δ) ≤ e^{−2δk},

where the approximation is valid for sufficiently small δ. Setting k ≥ (1/(2δ)) log(1/ε), we have:

P[ W(σ̂) < (1/2 − δ) W(σ*) ] ≤ ε. □
2.5.5 Proof of Theorem 4: Subgradient Algorithm
We shall establish the result for the first-order marginal; the proof for comparison data is identical. To that end, recall that the optimization problem of interest is

max_λ F(λ) = Σ_{i,k} λ_ik m_ik − log( Σ_{σ∈S_n} exp( Σ_{i,k} λ_ik 1{σ(i)=k} ) ).    (2.33)
Let λ* be an optimizer of the objective function, with optimal value F(λ*). Now F(·) is a concave function. As before, we shall use t as the index of the algorithm's iterations, λ^t for the parameter value in iteration t, g^t for the subgradient of F at λ^t (with components g^t_{ik} = m_ik − E_{λ^t}[1{σ(i)=k}]), and e(t) for the error in this subgradient. Then
‖λ^{t+1} − λ*‖² = ‖λ^t + α_t (g^t + e(t)) − λ*‖²
               = ‖λ^t − λ*‖² + α_t² ‖g^t + e(t)‖² + 2α_t ⟨g^t, λ^t − λ*⟩ + 2α_t ⟨e(t), λ^t − λ*⟩
               ≤ ‖λ^t − λ*‖² + α_t² ‖g^t + e(t)‖² + 2α_t (F(λ^t) − F(λ*)) + 2α_t ⟨e(t), λ^t − λ*⟩,

where the last inequality follows from the fact that g^t is a subgradient of F at λ^t.
Applying this inequality recursively, and keeping in mind that ‖λ^{t+1} − λ*‖² ≥ 0, we get:

0 ≤ ‖λ⁰ − λ*‖² + 2 Σ_{s=0}^t α_s (F(λ^s) − F(λ*)) + Σ_{s=0}^t α_s² ‖g^s + e(s)‖² + 2 Σ_{s=0}^t α_s ⟨e(s), λ^s − λ*⟩.
Therefore,

2 Σ_{s=0}^t α_s (F(λ*) − F(λ^s)) ≤ ‖λ⁰ − λ*‖² + Σ_{s=0}^t α_s² ‖g^s + e(s)‖² + 2 Σ_{s=0}^t α_s ⟨e(s), λ^s − λ*⟩.

Let λ̄ be chosen to be λ^s with probability p_s = α_s / Σ_{q=0}^t α_q. Then, on average, we have

E[F(λ*) − F(λ̄)] ≤ ( ‖λ⁰ − λ*‖² + Σ_{s=0}^t α_s² ‖g^s + e(s)‖² ) / ( 2 Σ_{s=0}^t α_s ) + ( 2 Σ_{s=0}^t α_s ⟨e(s), λ^s − λ*⟩ ) / ( 2 Σ_{s=0}^t α_s ).    (*)
To simplify the terms in (*), note that g^s + e(s) is a vector whose elements are in [−1, 1]. Therefore ‖g^s + e(s)‖² ≤ n², where n² is the dimension of the vector. Furthermore, the term ⟨e(s), λ^s − λ*⟩ can be bounded as follows:

⟨e(s), λ^s − λ*⟩ ≤ |⟨e(s), λ^s − λ*⟩| ≤ ‖e(s)‖₁ ‖λ^s − λ*‖_∞ ≤ ‖e(s)‖₁ ( ‖λ^s − λ⁰‖_∞ + ‖λ⁰ − λ*‖_∞ ).
And,

‖λ^s − λ⁰‖_∞ ≤ Σ_{q=0}^s α_q ‖Δ^q‖_∞ ≤ Σ_{q=0}^s α_q,

where Δ^q is the change in the value of λ at step q (each of whose entries is bounded by 1 in magnitude). Combining this with the previous inequality, we get:

⟨e(s), λ^s − λ*⟩ ≤ ‖e(s)‖₁ ( Σ_{q=0}^s α_q + ‖λ⁰ − λ*‖_∞ ).
Combining this with (*), and letting B = max{ ‖λ⁰ − λ*‖_∞ , ‖λ⁰ − λ*‖² }, we get:

E[F(λ*) − F(λ̄)] ≤ ( B + n² Σ_{s=0}^t α_s² ) / ( 2 Σ_{s=0}^t α_s ) + ( 2 Σ_{s=0}^t α_s ‖e(s)‖₁ ( Σ_{q=0}^s α_q + B ) ) / ( 2 Σ_{s=0}^t α_s ).
Using our approximation oracle, we can make ‖e(s)‖₁ small enough that α_s ‖e(s)‖₁ ( Σ_{q=0}^s α_q + B ) ≤ α_s², which yields:

E[F(λ*) − F(λ̄)] ≤ ( B + n² Σ_{s=0}^t α_s² + 2 Σ_{s=0}^t α_s² ) / ( 2 Σ_{s=0}^t α_s ) = ( B + (n² + 2) Σ_{s=0}^t α_s² ) / ( 2 Σ_{s=0}^t α_s ).    (2.34)
Recall that α_s = 1/√s. Therefore, Σ_{s=0}^t α_s = Θ(√t), while Σ_{s=0}^t α_s² scales as log t. Therefore, the quantity above converges to zero, and F(λ^t) converges to F(λ*). Now (ignoring constants), the bound in (2.34) scales like (B + n² log t)/√t. Therefore, E[F(λ*) − F(λ̄)] ≤ ε for t ≥ T, for any γ > 0, with

T = Θ( ε^{−(2+γ)} ( ‖λ*‖_∞ + ‖λ*‖₂ + n² )^{2+γ} ). □
2.5.6 Proof of Theorem 5
As mentioned earlier, this result follows by a direct adaptation of known results; we present it here for completeness. For the case of first-order marginal data, the partition function Z(λ) = Σ_{σ∈S_n} exp( Σ_{i,k} λ_ik 1{σ(i)=k} ) can be rewritten as Z(λ) = Σ_{σ∈S_n} Π_{i=1}^n e^{λ_{iσ(i)}}, which can be recognized as the permanent of the matrix Ā = [e^{λ_ik}], denoted Perm(Ā). To prove that computing Perm(Ā) is #P-hard, we provide a reduction from the problem of computing the permanent of a (0,1)-matrix, which is known to be #P-hard [45], as follows: given a (0,1)-matrix A, we construct the matrix Ā by setting λ_ik = ln(n! + 1) when A_ik = 0 and λ_ik = ln(n! + 2) when A_ik = 1. Combining the facts that Perm(A) ≤ n! and M(n! + 1) mod (n! + 1) = 0 for any integer M, it is easy to see that:

Perm(A) = Perm(A) mod (n! + 1) = Perm(Ā) mod (n! + 1).
This concludes the first half of the proof. For the case of comparison data, we prove that computing the partition function, given by Z(λ) = Σ_{σ∈S_n} exp( Σ_{i,j} λ_{i<j} 1{σ(i)<σ(j)} ), is at least as hard as counting the number of Hamiltonian paths in a directed graph, which is #P-hard [48]. To that end, we provide the following reduction: given a directed graph G = (V, E), we set λ_{i<j} = −ln(n! + 1) for all (i, j) ∉ E, and λ_{i<j} = 0 for all (i, j) ∈ E. If we rewrite the partition function as Z(λ) = Σ_{σ∈S_n} Π_{(i,j): σ(i)<σ(j)} e^{λ_ij}, it becomes clear that the product Π_{(i,j): σ(i)<σ(j)} e^{λ_ij} is equal to 1 when σ corresponds to a Hamiltonian path, and is at most 1/(n! + 1) otherwise. Since we have a total of n! terms inside the summation, computing the floor of the partition function, ⌊Z(λ)⌋, gives us the number of Hamiltonian paths in the graph. This concludes the second half of the proof. Note that in order for the two reductions above to be valid, we need to make sure that any parameters used can be represented in a number of bits that is polynomial in the size of the problem n. This is indeed the case, since we only need O(n log n) bits to represent n!.
2.5.7 Proof of Theorem 6
Definition 1. The χ² distance between a distribution p and a distribution μ, denoted ‖p/μ − 1‖_{2,μ}, is defined by

‖p/μ − 1‖²_{2,μ} = Σ_i μ_i ( p_i/μ_i − 1 )².
Definition 2. Consider an |Ω| × |Ω| nonnegative valued matrix A ∈ ℝ^{|Ω|×|Ω|} and a vector u ∈ ℝ^{|Ω|}. Then the matrix norm of A with respect to u is defined as follows:

‖A‖_u = sup_{v: E_u[v]=0} ‖Av‖_{2,u} / ‖v‖_{2,u},

where E_u[v] = Σ_i u_i v_i.
Recall that the discrete-time Markov chain we are using is reversible and aperiodic, and therefore ergodic. Let p*: S_n → [0, 1] be the stationary distribution of the chain. Furthermore, let p(t) be the distribution at time step t. Then the dynamics of p(·) are given by:

p(t) = p(t − 1) P = p(0) P^t,

where P = [P_{σρ}], with P_{σρ} ∝ min{1, exp(W_λ(ρ) − W_λ(σ))} for states ρ reachable from σ in one step, is the transition matrix specified previously. Using the properties of the matrix norm, we obtain:

‖ p(t)/p* − 1 ‖_{2,p*} ≤ ‖P‖_{p*}^t ‖ p(0)/p* − 1 ‖_{2,p*}.    (2.35)
Therefore, to bound the distance between p(t) and p*, we need a bound on ‖P‖_{p*}.

Lemma 2. The matrix norm of P is bounded as

‖P‖_{p*} ≤ 1 − exp( −Θ( n² ‖λ‖_∞ + n log n ) ).
Proof: Recall that the partition function, or normalization constant, of p*(·) is defined as

Z(λ) = Σ_{σ∈S_n} exp( W_λ(σ) ).

It follows that

Z(λ) ≤ n! exp( n² ‖λ‖_∞ ) ≤ exp( Θ( n² ‖λ‖_∞ + n log n ) ).
Therefore, for any σ ∈ S_n,

p*(σ) = exp( W_λ(σ) ) / Z(λ) ≥ exp( −Θ( n² ‖λ‖_∞ + n log n ) ).

For any two permutations σ, ρ ∈ S_n that differ in two swapped elements, i.e. such that we can transition from σ to ρ and vice versa in one step, we have:

P_{σρ} ≥ exp( −Θ( n² ‖λ‖_∞ + n log n ) ).
Given this, we can bound the conductance Φ of P as follows:

Φ = min_{S⊂S_n} Q(S, S_n − S) / ( p*(S) p*(S_n − S) ) ≥ min_{σ,ρ∈S_n} p*(σ) P_{σρ} ≥ exp( −Θ( n² ‖λ‖_∞ + n log n ) ),

where Q(A, B) = Σ_{σ∈A, ρ∈B} p*(σ) P_{σρ}.
By Cheeger's inequality, it is well known that the second largest eigenvalue of P, λ_max, is bounded as

λ_max ≤ 1 − Φ²/2 ≤ 1 − exp( −Θ( n² ‖λ‖_∞ + n log n ) ).
From Lemma 2, and using the fact that the initial χ² distance is at most 1/p*_min ≤ exp( Θ( n² ‖λ‖_∞ + n log n ) ), we have

‖ p(t)/p* − 1 ‖_{2,p*} ≤ δ

for t ≥ T_c, where T_c = Θ( exp( n² ‖λ‖_∞ + n log n ) log(1/δ) ). □
Chapter 3
The Multiple Ranking Problem
Our contribution to the problem of learning the MMNL model, as stated previously,
can be outlined as follows: first, we provide a construction for a mixture choice model
that demonstrates the difficulty of the problem. This difficulty suggests that one
needs to impose further conditions on the model to guarantee learnability from data.
Guided by common consumer behavior, we provide sufficient conditions under which
this is possible, together with algorithms and error bounds for learning.
To demonstrate the difficulty of the problem, we identify a fundamental limitation of any algorithm attempting to learn a mixture choice model in general. Specifically, we show that for K = Θ(n) and ℓ = Θ(log n), there exists a pair of distinct MMNL models such that the partial preferences generated from both are identical in distribution. That is, even with an infinite number of samples, it is not possible to distinguish said models.
This naturally suggests that one needs to impose further conditions for the learnability of MMNL models. Guided by consumer behavior, we provide sufficient conditions under which this is possible. Concretely, we account for the following behavioral patterns: (a) consumers have a natural tendency to provide information about their extreme preferences (e.g., movies they love and hate); (b) the "amount" of liking (disliking) decays (grows) "quickly" as we go down a consumer's preference list. This intuition can be captured through conditions on the structure of the parameters of the model. Under a sufficient set of such conditions, we establish that it is possible to identify the MMNL components with high probability (i.e., probability going to 1 as the parameters n and N scale to ∞) as long as ℓ = Θ(log n) and N = Ω(n). Other works have made similar assumptions; for instance, in the context of collaborative filtering, [46] considers positive-only feedback.
We establish this result using a simple clustering algorithm as well as a subroutine for learning the parameters of the top items in each mixing component. The clustering algorithm (see the Common-Neighbors algorithm) is simple; it places two consumers in the same cluster iff they share more than half of their neighbors. As such, it does not require any knowledge of the number of components in the underlying model, or any other parameters for that matter. We then, within each cluster, learn the weights of the top ℓ elements. The Learn-Weights-of-Top-Elements algorithm, which accomplishes that, works by simulating a Markov chain whose states are items and whose stationary distribution is proportional to the weights of the elements. This approach is similar to that of Negahban et al. [35].
3.1 Difficulty of Learning the MMNL Model: A Lower Bound
Recall that we are interested in learning the MMNL model from samples with ℓ items observed, with the possibility that ℓ < n, the total number of items. At a high level, one is justified in asking whether this task is always possible. Intuitively, given the richness of the MMNL family, one would suspect that there are instances of the problem where this task is difficult or impossible. Here we present a result that illustrates this difficulty. The result makes use of a construction involving two mixture choice models that cannot be distinguished when the sample length ℓ is not large enough.
For ease of exposition, and without loss of generality, we assume that the learning task requires computing an assignment of each data point to the correct mixture component from which it was sampled. Thus, the inability to correctly assign samples to the correct mixture component translates into the inability to learn. This intuition is captured in the following result.
Theorem 7. Let S ∈ ℕ₊ and δ > 0. Then, for any i > 0, there exists a pair of MMNL models (M₁ and M₂) with n = 2^{i+1} items and K = 2^i mixture components, such that no algorithm that has access to S random samples of length ℓ = 2i + 1 can decide with probability higher than 1/2 + δ whether the samples are from M₁ or M₂.
This theorem establishes the existence of mixtures with K = Θ(n) mixing components where it is impossible to assign samples of length Θ(log n) to their correct component, thus rendering the learning task impossible as well. This is all the more interesting when compared with the positive results provided later.
3.2 An Algorithm
The lower bound presented in the previous section suggests that the task of learning an MMNL model becomes difficult when the model in question is "close" enough to another distribution to make the two indistinguishable using partial data. This leads us to the question of when learning such models becomes possible. To provide an answer to this question, we start by presenting an algorithm for learning the MMNL model. The algorithm consists of a number of steps or sub-routines: a preprocessing step, a clustering step, and a component-learning subroutine. Taken together, these sub-routines provide a tractable algorithm for learning the MMNL model. They also provide an outline for establishing the results in Section 3.3. The details of the clustering and component estimation sub-routines are presented as the Common-Neighbors algorithm and the Learn-Weights-of-Top-Elements algorithm. The Common-Neighbors algorithm partitions the data into clusters that correspond to the MNL components in the mixture. Learn-Weights-of-Top-Elements is intended for learning the weights of the top ℓ elements in each permutation. In Section 3.3 we provide strong probabilistic guarantees on their correctness.
3.2.1 Preprocessing
Given a data set with samples {σ̂₁, . . . , σ̂_N}, we construct an undirected sample graph which is used as the input for our algorithm. Each node in the sample graph corresponds to one of the N samples. Since we are interested in clustering the nodes, and therefore the corresponding samples, the edges of the graph are constructed to reflect the 'similarity' between these samples with respect to the underlying model. Roughly speaking, if two samples come from the same MNL component, we would like to have an edge between them, and if they come from different components, we would like to have no edge.
More concretely, let W be an N × N matrix denoting the symmetric binary adjacency matrix of the sample graph. We would like to specify the edge entries W_ij in a manner consistent with the intuition above. To that end, we have the following definition:

W_ij = 0 if overlap(σ̂_i, σ̂_j) < 1/(2s), and W_ij = 1 otherwise,

where s is a constant and overlap(σ̂_i, σ̂_j) denotes the fraction of items that is shared among the top ℓ items of σ̂_i and σ̂_j. For instance, if ℓ = 5, σ̂_i = (A, B, C, D, E), and σ̂_j = (A, C, B, E, F), then overlap(σ̂_i, σ̂_j) = 0.8. The parameter s in this construction is given and depends on the underlying generative model: s = s(β, 3/4), the stretching factor of Lemma 3. Whereas this may seem cryptic for the moment, it follows easily from the Stretching Factor Lemma, which is proved in the appendix.
As we show in the results section, this definition of W, in conjunction with some assumptions on the underlying model, has the desired properties regarding the similarity and dissimilarity of sample nodes. Intuitively, since each sample 'emphasizes' the top items of its MNL component, and considering MNL components that are sufficiently 'far' apart, where this emphasis is significant, we would expect the overlap between any two samples to reflect the relationship between their origins.
3.2.2 Clustering
In this section we describe the algorithm used to group samples into their original clusters. We first define precisely what a correct clustering is.

Definition 3 (Clustering of samples, Correct clustering). A clustering C = {C_j} of the samples is a finite partition of the set of all samples Σ = {σ̂_i}_{i∈{1,...,N}}. That is, it is a finite collection of subsets C_j ⊆ Σ such that C_i ∩ C_j = ∅ for i ≠ j and ∪_j C_j = Σ. Furthermore, we call a clustering a correct clustering if the nodes associated with samples σ̂_i and σ̂_j are clustered together if and only if they are samples of the same underlying component in the mixture distribution.
The algorithm below uses the union-find data structure for partitions. The union-find data structure allows one to easily create partitions by merging two sets into one (via union) and to retrieve the block of the partition that a particular element is part of (via find); see [13] for more details. We begin with each node in its own set, and apply the union operation (which merges the sets containing both nodes) whenever the nodes' neighborhoods overlap by more than half. Below, N(v) denotes the set of neighbors of v, and E denotes the set of edges.
Common-Neighbors
Input: Edge list E
1: Initialize clustering C = {{σ̂_i}}.
2: for (u, v) ∈ E do
3:   if |N(u) ∩ N(v)| > (1/2) min{|N(u)|, |N(v)|} then
4:     perform union(u, v)
5:   end if
6: end for
Upon termination, the algorithm creates a clustering of the samples. In particular, if any two samples share over half of their neighbors, then they will be clustered together. We would like to guarantee that, with high probability, nodes will be clustered together if and only if they come from the same component in the mixture distribution of the MMNL model. In Section 3.3, Theorem 8, we provide one such guarantee under assumptions on the underlying MMNL model and the length ℓ of the truncated top-ℓ samples.
3.2.3 Learning Weights within Clusters
Above, we saw that with high probability we can distinguish between samples that come from different underlying permutations. Our initial goal, however, was not just clustering, but also learning the weights of the top items of each permutation. Once the weights are learned, a prediction system can tailor top recommendations to the cluster to which a particular user belongs.
Our algorithm for learning the weights of the top ℓ elements within each cluster works as follows: we construct a Markov chain in which each state is an item and whose transition probabilities are based on the history of pairwise comparisons between the items. Theorem 9 then tells us that this Markov chain has stationary probabilities proportional to the weights of the items. This approach is inspired by and similar to that of Negahban et al. [35]. There are two main differences between what we do here and what is done there: first, we have a full transition probability matrix, rather than a sparse one; second, the transition probabilities have an extra source of noise, due to the possible misclustering.
Learn-Weights-of-Top-Elements
Input: {σ̂₁, . . . , σ̂_{N_C}}, samples of length ℓ that were placed in a cluster C together.
1: Collect the ℓ/r* most seen items in {σ̂₁, . . . , σ̂_{N_C}}, for a constant r* defined below.
2: Construct a transition matrix on those ℓ/r* elements, with transition probabilities P_ij = (A_ij/N_ij)(r*/ℓ) and P_ii = 1 − Σ_{j≠i} P_ij, where N_ij is the number of times that i and j are seen together, and A_ij is the number of times that i is ranked higher than j among those N_ij.
3: Compute the stationary distribution of the Markov chain.
4: Return the stationary probability of the ℓ/(s*r*) most likely items, where s* is defined below.
Algorithm Learn-Weights-of-Top-Elements works by constructing a Markov chain whose stationary distribution is proportional to the weights of the top ℓ/(s*r*) items in the MNL model. The constants s* and r* are the stretching factors of the Stretching Factor Lemma (Lemma 3) with p = 0.99: s* = s(β, 0.99) and r* = r(β, 0.99). Whereas the presence of the constants s* and r* may seem unclear for now, the intuition behind them is the following: we would like the top ℓ/s* ranked elements to come from the top ℓ elements of π, and we would like all of the top ℓ/(s*r*) elements to be present within those top ℓ/s* elements.
Before we proceed to establish a guarantee for this algorithm, a few remarks are in order. First, the division of the algorithm into clustering and component estimation is inspired by similar approaches to learning mixture models. One such approach, which has become popular for learning the Gaussian Mixture Model (GMM), is the Expectation Maximization (EM) algorithm. In that problem, we are given data points that are assumed to originate from an underlying Gaussian mixture, and we are interested in learning the mixing distribution, and the parameters of the Gaussian components (i.e., the means and the variances), that constitute a Maximum Likelihood (ML) estimate. Second, we view the assignment problem as that of clustering the data points in a way consistent with their component 'membership'. Finally, we note that while the clustering above is done using the Common-Neighbors algorithm, this need not be the case in applications: one can use any off-the-shelf graph clustering algorithm (e.g., Spectral Clustering) in a heuristic fashion.
3.3 Algorithmic Guarantee
In this section, we present our results regarding the correctness of Algorithms Common-Neighbors and Learn-Weights-of-Top-Elements. Recall that Theorem 7 suggests that one needs to make additional assumptions about the class of underlying MMNL models. As before, we consider an MMNL model over n items, with K mixing components, and we assume that our data points take the form of partial orders of length ℓ = Θ(log n) sampled from the mixture using the standard sampling procedure. Next, we restrict the MMNL distributions of interest to a family of models that we call the Power-law MMNL models, defined as follows:
Definition 4 (Power-law MNL Model). Let Pr(·; w) be an MNL model over n items, let π be the underlying permutation, and let w_{π_1}, . . . , w_{π_n} be the item weights. We call this model a power-law MNL model with parameter α if it is an MNL model and the weights satisfy w_{π_i} ∝ i^{−α}.
As a shorthand, we will denote such a power-law MNL model by MNL(π, α). Similarly, we denote a mixture of such power-law MNL models by MIX-MNL(Π(n), α), where for each n we have Π(n) = {π₁, . . . , π_K}, a set of K permutations over {1, . . . , n}. Sampling from a mixture of power-law MNL models MIX-MNL(Π(n), α) means that one of the K underlying permutations, say π_i, is chosen at random, and then we sample according to MNL(π_i, α).
Intuitively, we expect learning to be possible when the underlying sequences {π₁, . . . , π_K}, and the corresponding mixing components, are not too similar to each other. This intuition is made more precise using the condition of Asymptotic Distinctiveness, defined below.
Definition 5 (Asymptotic Distinctiveness). We say that a sequence {Π(n)}_{n∈ℤ₊}, where Π(n) = {π₁, . . . , π_K} is a set of K permutations over {1, . . . , n}, satisfies the asymptotic separation criterion if the following holds for each pair π_i ≠ π_j of underlying permutations: for each ε > 0 and γ > 0 we have lim sup_n |π_i^{γℓ} ∩ π_j^{γℓ}| < εγℓ, where ℓ = log n as usual, and where π^{γℓ} denotes the set containing the top γℓ items of permutation π.
Finally, we state the notion of correct clustering that we will use in Theorem 8.

Definition 6 (ε-Correct Clustering). We call a clustering of σ₁, . . . , σ_N a (1 − ε)-correct clustering if moving εN samples from their current clusters to other clusters yields a correct clustering.
We are now ready to state Theorem 8, which is the correctness guarantee of Algorithm Common-Neighbors. It tells us that as long as the underlying sequences are
asymptotically distinct and the samples are generated from a relatively concentrated
MNL model, then we are able to cluster most samples correctly with high probability.
Theorem 8. Let {Π(n)}_{n∈ℤ₊} be a sequence of asymptotically distinct underlying sequences, and let σ₁, . . . , σ_N ~ MIX-MNL(Π(n), β) be the N samples. Then, for β > 1, with probability 1 − O(1/N), for any n > n₀ (for a constant n₀ large enough), Algorithm Common-Neighbors provides a (1 − √(log N / N))-correct clustering asymptotically as N → ∞.
It is worth pointing out that the proof of Theorem 8 works by analyzing properties of samples independently of each other. This is in contrast to what one may expect from Algorithm Common-Neighbors, which works by placing edges based on events on pairs of samples. Proposition 2, which is proven in the appendix, provides the key insight for Theorem 8 that allows this more elegant analysis.
Algorithm Learn-Weights-of-Top-Elements takes in samples, ideally from the same underlying permutation, and returns the weights of the top ℓ items. Theorem 9 shows that, despite the possibly imperfect clustering (and hence possibly seeing samples from different underlying permutations), we can still learn the weights of the top items of each underlying permutation.
Theorem 9 (Learning Top Items). Let {Π(n)}_{n∈ℤ₊} be a sequence of asymptotically distinct underlying sequences, and let σ₁, . . . , σ_N ~ MIX-MNL(Π(n), β), for β > 1, be the N samples. Let Algorithm Common-Neighbors cluster the samples, and input each cluster to Algorithm Learn-Weights-of-Top-Elements. Then, for a number of samples N = Ω(n), Algorithm Learn-Weights-of-Top-Elements asymptotically learns the weights of the top Θ(log n) items of each of the K clusters with probability 1 − O(log² n / n).
3.3.1 Illustration
Theorem 10 below tells us that the asymptotic distinctiveness criterion is not a very stringent one. In particular, we can create asymptotically distinct sequences by a reasonable random procedure: simply generate π₁, . . . , π_K according to a power-law MNL model with appropriate parameter.

Theorem 10. Let {Π(n)}_{n∈ℤ₊} be a sequence of underlying permutations, where Π(n) = {π₁, . . . , π_K} is a set of K permutations over {1, . . . , n}, and where for each n ∈ ℤ₊ we have π₁, . . . , π_K ~ MNL(I, α) for the identity permutation I. Let σ₁, . . . , σ_N ~ MIX-MNL(Π(n), β) be the N samples. Then, for 0 ≤ α ≤ 1 < β, with probability 1 − O(1/N), for any n > n₀ (for a constant n₀ large enough), the algorithm provides a (1 − √(log N / N))-correct clustering asymptotically as N → ∞.
Note that as a special case of Theorem 10 (where we take α = 0) we have the case in which the underlying permutations π₁, . . . , π_K are uniformly random permutations of {1, . . . , n}. More importantly, the message of Theorem 10 is that even if the underlying sequences are moderately correlated, we can still cluster nodes almost correctly. Whereas the requirement of Asymptotic Distinctiveness provides a technical condition on the underlying sequences, Theorem 10, via the MNL generation with its explicit weights, gives us a more intuitive view of how similar the underlying sequences can be.
3.4 Proofs
3.4.1 Proof of Theorem 7
Proof. We establish this result constructively. In particular, we will perform the following steps:

• Base Case: construct a simple example of a pair of MNL models that cannot be distinguished by noiseless samples of length ℓ = 3.

• Iterative Procedure: provide an iterative procedure to construct pairs of more complex MMNL models M₁ and M₂ with the same property.

• Setting Weights: choose the item weights so that, with probability 1 − δ, none of the S samples will contain an incorrect ordering (hence being effectively noiseless), where by a sample σ containing an incorrect ordering we mean σ_i < σ_j whereas π_i > π_j.

• Conclusion: from this we conclude that no algorithm with access to S samples can tell M₁ and M₂ apart with probability higher than 1/2 + δ.
[Figure 3-1: the base-case pair of mixtures over items {A, B, C, D}. Each mixture has two components, each with mixing weight 0.5; the components of one mixture rank the items as (A, B, C, D) and (B, A, D, C), while those of the other rank them as (A, B, D, C) and (B, A, C, D).]

Figure 3-1: Example (n = 4, k = 2, ℓ = 3)
Base Case: consider the following MMNL model (the weights will be set later) over n₀ = 4 items, with k₀ = 2 components and a uniform mixing distribution, illustrated in Figure 3-1. For the deliberately constructed pair above, any correct partial ordering of length ℓ ≤ 3 can come from either of the two mixtures; on the other hand, when ℓ = 4, we get a full permutation, and since the two mixtures comprise different permutations, this is sufficient to distinguish between them. For convenience, we refer to one such pair as an (n, k, ℓ)-indistinguishable pair; thus, the pair in this example is (4, 2, 3)-indistinguishable.
Iterative Procedure: using an (n_{i−1}, k_{i−1}, ℓ_{i−1})-indistinguishable pair of mixtures similar to the one in the aforementioned example, we can now construct a (2n_{i−1}, 2k_{i−1}, ℓ_{i−1} + 2)-indistinguishable pair as follows:

(a) Create 2 copies of the mixture with 2k_{i−1} components created by combining all of the components present in both mixtures of the (n_{i−1}, k_{i−1}, ℓ_{i−1})-indistinguishable pair into one mixture. Further, assign all the components in both copies uniform mixing probabilities. We refer to these new mixtures as copy 1 and copy 2, and to the components originating from M₁ and M₂ as base set 1 and base set 2, respectively.

(b) In copy 1, append a permutation p over n_{i−1} new items to each component from base set 1, and the reverse of this permutation, p⁻¹, to each component in base set 2.²

² p = [p₁, . . . , p_{n_{i−1}}] and p⁻¹ = [p_{n_{i−1}}, . . . , p₁].
(c) In copy 2, append p⁻¹ to every component originating from base set 1, and p to every component originating from base set 2.

This iteration, starting with our previous example, is illustrated in Figure 3-2.

[Figure 3-2: copy 1 and copy 2 each contain four components with mixing weights 0.25; each component of Figure 3-1 is extended by p or p⁻¹, with the assignment of p and p⁻¹ to base sets 1 and 2 swapped between the two copies.]

Figure 3-2: Iteration Step of the Construction
Now, we claim that the resulting two mixtures can only be distinguished with samples of size at least ℓ_i = ℓ_{i−1} + 3 (i.e., the pair is (2n_{i−1}, 2k_{i−1}, ℓ_{i−1} + 2)-indistinguishable). To see why this is the case, note that both mixtures contain components from base set 1 and base set 2, as well as the permutations p and p⁻¹. The only difference between the pair is which base set, 1 or 2, p and p⁻¹ are appended to. Therefore, to distinguish between the two mixtures, we need the samples to identify a combination of the base set (1 or 2) and the permutation attached to it (p or p⁻¹). To identify a mixture, we need length at least ℓ = ℓ_{i−1} + 1, and to identify the base set we need at least 2 additional items. Thus, we need at least ℓ_i = ℓ_{i−1} + 3 to distinguish between the two mixtures, as desired.
Setting Weights: We now set the weights of each of the underlying MNL models so that the weights decay quickly enough (w_{i+1}/w_i sufficiently small for each i < n) that the probability of any single sample containing an incorrect ordering is at most δ/S.
Conclusion: By the construction above, whenever we get samples with correct orderings, each is equally likely to be from M₁ as from M₂. Hence, as long as all the samples have correct orderings, even an optimal algorithm (one that calculates the exact likelihoods) cannot distinguish between M₁ and M₂. We now show that, with probability at least 1 − δ, no samples have incorrect orderings.

Let A_i be the event that sample σ_i has an incorrect ordering. Then, by our choice of weights, P(A_i) ≤ δ/S for i = 1, . . . , S. Let A be the event that at least one sample has an incorrect ordering. Then, by the union bound, the probability that we have an incorrect ordering in any of the samples is at most Σ_i P(A_i) ≤ S · (δ/S) = δ. We conclude that since, with probability at least 1 − δ, the algorithm does not even see samples with incorrect orderings (and since it can only distinguish M₁ from M₂ if it sees samples with incorrect orderings), it cannot distinguish M₁ from M₂ with probability better than 1/2 + δ. □
For the proof of Theorem 8, we will proceed as follows: we will show that a majority of data points will be clustered correctly (with high probability). At a high level, we show this by first showing that (i) a majority of the data points are "similar" to the underlying permutation π of the mixing component from which they originate, and (ii) the fraction of points that do not satisfy this condition have a negligible effect on the overall clustering computed by the algorithm. This in turn is demonstrated by showing that, for any set of data points, if these points are "similar" to their underlying permutation π, the corresponding graph will contain edges between these "similar" nodes, and no edges to other nodes that are not "similar". Finally, we show that data points that do not satisfy this condition cannot affect the overall clustering.
To translate this outline into a concrete proof, we start with a precise definition of our notion of "similarity", Asymptotic Similarity, as follows:

Definition 7 (Asymptotic Similarity). Let {(π(n), σ(n))}_{n∈ℤ₊} be a sequence of pairs of permutations over n items. We say that this sequence is asymptotically similar if there exists a large enough n such that:

• a fraction 3/4 of the top log(n) items of π(n) are present in the top s(β, 3/4) log(n) items of σ(n), and

• a fraction p′ of the top log(n) items of σ(n) come from the top r(β, p′) log(n) items of π(n), for 0 < p′ < 1.
To establish the relationship between Asymptotic Similarity and the presence or absence of edges in the sample graph, we will need the following two results, whose proofs are included later in the appendix:

Lemma 3 (Stretching Factor Lemma). Let σ ~ MNL(π, β) for some underlying permutation π. Then, for 0 < p, p′ < 1, β > 1, and T, T′ > 0:

• a proportion p of the top Tℓ items of π are in the top Tℓ s(β, p) items of σ with probability 1 − O(1/n), and

• a proportion p′ of the top T′ℓ items of σ are from the top T′ℓ r(β, p′) items of π with probability 1 − O(1/n),

where s(β, p) = p (1.01) / (1 − p^{β−1}) and r(β, p) = (1 − p)^{1/(1−β)} (1.01).
Proposition 2. Let {Σ(n)}_{n∈ℤ₊} be a sequence of sample sets, where each Σ(n) = {σ₁, . . . , σ_N} contains N samples of permutations over {1, . . . , n}. Also, say that each set Σ(n) contains independent samples, each generated from MIX-MNL(Π(n), α) with asymptotically distinct {Π(n)}, and such that the samples are each asymptotically similar to their respective underlying sequence. Then, asymptotically, for each i ≠ j, our algorithm will place an edge between σ_i and σ_j if and only if they were generated from the same underlying permutation.
With this in mind, we can proceed to prove Theorem 8.
3.4.2 Proof of Theorem 8
Proof. For clarity, we break the proof into the following steps:

1. For any given mixing component, and for any δ > 0, we have at least (1 − δ)N/K points sampled from that component w.h.p.

2. For any given σ generated from a power-law MNL model, MNL(π, β), where β > 1, σ and π are asymptotically similar with probability 1 − O(1/n).

3. For any given mixing component, and for any ε > 0, there exists an index n₀ such that for any n > n₀, at least a 1 − ε fraction of the samples satisfies asymptotic similarity w.h.p. This result, in conjunction with Proposition 2, allows us to conclude that at least a 1 − ε fraction of the samples will be clustered correctly by the algorithm.

4. For any ε small enough, the fraction of nodes that fail to satisfy this condition cannot affect the clustering of the remaining nodes.

5. Finally, letting ε scale as √(log N / N) and using the probability bound in (3), we obtain the desired result.
Note that (1) can be established through a direct application of the Chernoff bound to the mixing distribution, yielding a probability bound of 1 − exp(−Θ(N)). To show (2), we need to show that the sequence in question satisfies both parts of Definition 7. The first and second parts of the definition follow from the Stretching Factor Lemma by picking p = 3/4, T = 1, and T′ = s(β, 3/4), respectively, with probability 1 − O(1/n), as desired.
To show (3), let n₀ be the smallest index such that for any n > n₀, any sample σ and the corresponding π are asymptotically similar with probability at least 1 − ε/2, where the existence of n₀ follows from (2). Now let A_i be the indicator random variable of the event that sample σ_i and its parent permutation are asymptotically similar. Further, let A = Σ_{i=1}^N A_i. Then, since E[A] ≥ (1 − ε/2)N, we get by the Chernoff bound that

P( A < (1 − ε)N ) ≤ P( A < (1 − ε/2)N ≤ E[A] ) ≤ exp( −Θ(N ε²) ).

Hence, by Proposition 2, we see that (1 − ε)N of the samples are clustered properly with probability at least 1 − exp(−Θ(N ε²)). Now, we can establish (4) by using (3) with ε small enough: we have at most N_incorrect = εN nodes that are potentially disconnected from the nodes in their correct cluster. Using (1) and (3), we know that each of the correctly clustered (1 − ε)N nodes is connected to at least N_k = (1 − δ)(1 − ε)N/K nodes in its correct cluster w.h.p. Now, and without loss of generality, let us assume that the mis-clustered nodes have all been connected to a single (wrong) cluster. Since N_incorrect is much smaller than N_k, it is easy to see that these erroneous connections cannot affect the correctly clustered nodes. Further, since the overall clustering can be fixed by reassigning at most εN nodes, this results in an ε-correct clustering.
Finally, letting ε = √(log N / N) and plugging it into the probability bound in (3), we get a (1 − √(log N / N))-correct clustering with probability

1 − exp( −Θ(N ε²) ) = 1 − exp( −Θ(log N) ) = 1 − O(1/N). □
3.4.3 Proof of Lemma 3
Proof. For notational convenience, partition the item set into 3 sets: A ≜ {π₁, . . . , π_{pℓ}}, B ≜ {π_{pℓ+1}, . . . , π_ℓ}, and R ≜ {π_{ℓ+1}, . . . , π_n}, and let m_B and m_R denote the total weights of B and R. Then we have m_B ∝ ∫_{pℓ}^{ℓ} x^{−β} dx ∝ ( (pℓ)^{1−β} − ℓ^{1−β} ) and m_R ∝ ∫_{ℓ}^{n} x^{−β} dx ∝ ( ℓ^{1−β} − n^{1−β} ). Now consider the number of items I(ℓ) that must be sampled until we see pℓ of the top ℓ items. Then I(ℓ) = Σ_{i=1}^{pℓ} s_i, where s_i is the random variable representing how many items were sampled after seeing the (i−1)th item from the top ℓ items of π and until we see the ith item from the top ℓ items of π. We want to show that the quantity I(ℓ) is with high probability linear in ℓ (as used in the algorithm). To that end, define q(β, p) ≜ m_B/(m_B + m_R). Then for i ≤ pℓ we clearly have that s_i is (first-order) stochastically dominated by Geom(q(β, p)), and hence I(ℓ) is (first-order) stochastically dominated by Σ_{i=1}^{pℓ} Geom(q(β, p)), which is just a negative binomial random variable (since it is a sum of independent geometric random variables with the same parameter). We can now use a concentration result on
the negative binomial and, for each ε > 0, we get

P( I(ℓ) ≥ (pℓ/q(β, p))(1 + ε) ) = P( Σ_{i=1}^{pℓ} Geom(q(β, p)) ≥ (1 + ε) E[ Σ_{i=1}^{pℓ} Geom(q(β, p)) ] ) ≤ exp( −Θ(ℓ) ) = O(1/n).

Hence, since E[Geom(q(β, p))] = 1/q(β, p) and s(β, p) = (p/q(β, p))(1 + ε), the first bullet point follows by picking ε = 0.01.
Let us now prove that the second bullet point holds. This time, partition the item set into 3 sets: A ≜ {π₁, . . . , π_{rℓ}}, B ≜ {π_{rℓ+1}, . . . , π_ℓ}, and R ≜ {π_{ℓ+1}, . . . , π_n}, so that m_B ∝ ∫_{rℓ}^{ℓ} x^{−β} dx ∝ ( (rℓ)^{1−β} − ℓ^{1−β} ) and m_R ∝ ∫_{ℓ}^{n} x^{−β} dx ∝ ( ℓ^{1−β} − n^{1−β} ). Consider the event A_i that the ith of the top ℓ sampled items comes from the top rℓ items of π, and let A(ℓ) be the number of the top ℓ samples that come from the top rℓ of π, so that A(ℓ) = Σ_{i=1}^{ℓ} 1(A_i). Hence, for r = r(β, p′) = (1 − p′)^{1/(1−β)} (1.01), we get

m_B/(m_B + m_R) ≥ 1 − r^{1−β} = 1 − (1 − p′)(1.01)^{1−β} > p′.

We now see that each indicator 1(A_i) (first-order) stochastically dominates an independent Bernoulli random variable with parameter q′(β, p′) ≜ m_B/(m_B + m_R) > p′; hence A(ℓ) stochastically dominates Σ_{i=1}^{ℓ} Bern(q′(β, p′)). Hence, by the Chernoff bound, we get

P( A(ℓ) ≤ (1 − ε) p′ℓ ) ≤ exp( −Θ(ℓ) ) = O(1/n).

Now, since E[ Σ_i 1(A_i) ] ≥ p′ℓ, we get that, with the desired probability, a p′ proportion of the top ℓ items sampled are from the top rℓ of the underlying sequence π. Furthermore, the inclusion of the factors T and T′ in the statement of the Lemma follows easily. □
3.4.4 Proof of Proposition 2
Proof. Let us first consider two samples coming from the same underlying permutation. We shall show that such samples will have an edge between them. Note that, by the definition of Asymptotic Similarity, if all nodes coming from the same underlying permutation have more than a 3/4 proportion of the top ℓ/s(β, 3/4) items of π within their top ℓ sampled items, then any two of them must share at least ℓ/(2 s(β, 3/4)) of those items of π. Since our algorithm places an edge between nodes corresponding to samples that have at least ℓ/(2s) items in common, we see that samples coming from the same underlying permutation will indeed have an edge between them.
Let us now consider samples σ₁ and σ₂ coming from different underlying permutations π₁ and π₂. According to the second bullet point of the definition of Asymptotic Similarity, a p′ proportion of the top ℓ items of σ_i comes from the top rℓ items of π_i, for i = 1, 2 and r = r(β, p′). In addition, since π₁ and π₂ are asymptotically distinct, by the definition of Asymptotic Distinctiveness we have that for any ε > 0 and γ > 0, |π₁^{γℓ} ∩ π₂^{γℓ}| < εγℓ holds asymptotically.

We will now show that the Proposition follows for the following choices of p′, γ, and ε: take p′ such that 3(1 − p′) = 1/(5s), γ = r, and ε = 1/(5 s r), where s = s(β, 3/4) and r = r(β, p′).

Now note that the following relation holds: |σ₁^ℓ ∩ σ₂^ℓ| ≤ 3(1 − p′)ℓ + εrℓ. This is because if σ₁ and σ₂ intersect, they must intersect either where π₁ and π₂ might intersect (which contributes less than εrℓ, by Asymptotic Distinctiveness), or where π₁ and σ₁ (or π₂ and σ₂) disagree (which contributes less than 3(1 − p′)ℓ). Now if we plug in our choice of the parameters, we get

|σ₁^ℓ ∩ σ₂^ℓ| ≤ 3(1 − p′)ℓ + εrℓ = (1/5)(ℓ/s) + (1/5)(ℓ/s) < (1/2)(ℓ/s).

Hence there will be no edge between σ₁ and σ₂ if they come from different underlying permutations. □
3.4.5 Proof of Theorem 10
Proof. It is sufficient to show that, for 0 ≤ α ≤ 1, {Π(n)}_{n∈ℤ₊} is asymptotically distinct with probability 1 − O(1/n). To that end, let π_i ~ MNL(I, α) and π_j ~ MNL(I, α), for 0 ≤ α < 1, be sampled independently. Then we want to show that for any ε > 0 and γ > 0, we have

P( |π_i^{γℓ} ∩ π_j^{γℓ}| < εγℓ ) = 1 − O(1/n).

First note that it is clearly worst case when π_i = I, so we shall assume π_i = I and show that with high probability the criterion above is satisfied. Consider the event B_k that the kth of the top γℓ items of π_j comes from the top γℓ elements of the underlying permutation, {1, . . . , γℓ}. Then what we want to show is that asymptotically Σ_k B_k < εγℓ for any ε > 0. Note that each B_k is (first-order) stochastically dominated by an independent Bernoulli random variable B̃_k with parameter Σ_{i≤γℓ} w_i / Σ_{i>γℓ} w_i, where the weights w_i = i^{−α} are the weights of the underlying permutation. Now note that, for 0 ≤ α < 1, we have

Σ_{i≤γℓ} w_i / Σ_{i>γℓ} w_i = Θ( (γℓ)^{1−α} / ( n^{1−α} − (γℓ)^{1−α} ) ) = Θ( (ℓ/n)^{1−α} ),

which goes to zero as n increases. Hence, for all n large enough, we get E[B̃_k] < ε/2, which in turn implies that E[Σ_k B̃_k] < εγℓ/2. Now we can use the Chernoff bound and get

P( Σ_k B̃_k ≥ εγℓ ) ≤ P( Σ_k B̃_k ≥ 2 E[Σ_k B̃_k] ) ≤ exp( −Θ(ℓ) ) = O(1/n).

The same proof works for the case α = 1, except that to show E[B̃_k] < ε/2 we use the fact that the sum of the w_i is of order log n; the same argument then follows. □
3.4.6 Proof of Theorem 9
Proof. In the algorithm, we estimate a transition matrix P̂ based on the samples, and then run the Markov chain in order to get a stationary distribution over the top ℓ/(s*r*) items of each underlying permutation. This proof contains two main steps: 1) we show that there is a matrix P whose unique stationary distribution is proportional to w; 2) we show that our matrix P̂ is not too far from P, and that the small error does not significantly disturb the stationary distribution.
Let us first show 1). Let P be the Markov chain defined by P_ij = (r*/ℓ) · w_j/(w_i + w_j) for i ≠ j, and P_ii = 1 − Σ_{j≠i} P_ij. Then, since w_i P_ij = (r*/ℓ) · w_i w_j/(w_i + w_j) = w_j P_ji, we see that P is reversible and that w is a stationary distribution. To see that w is the unique stationary distribution, it suffices to note that P is aperiodic and irreducible, since all transitions, including self-transitions, always have positive probability in our case.
Let us now show 2). As described in the algorithm, P̂_ij = (A_ij/N_ij)(r*/ℓ), where N_ij is the number of times that i and j are seen together in samples that are clustered as coming from π, and A_ij is the number of times that i is ranked higher than j among those N_ij. Intuitively, we expect P̂ to converge to P as N increases. For the remainder of this proof, we shall prove this precisely. Note also that all of the top ℓ/(s*r*) items of π are being considered among the top ℓ/s* most-seen items with high probability, due to an application of the Stretching Factor Lemma.
Lemma 5 shows that the transition matrix we created is indeed full with high probability, which in turn implies that the Markov chain specified by P̂ also has a unique stationary distribution. We will now use Lemma 4 (due to Negahban et al. [35]), which bounds how far the stationary distribution of P̂ is from that of P. After we have shown this, we will be done, since the stationary distribution of P determines the weights of the top items.
Lemma 4 specifies two parameters that we must control: the error matrix Δ = P̂ − P, and ρ = λ_max(P) + ‖Δ‖₂ √(π_max/π_min), which is governed by the spectral gap of P and by the error matrix. From Lemma 6 we get that, with probability 1 − O(log² N/N), we have ‖Δ‖₂ = O(√(log³ N / N)), and from Lemma 8 that 1 − λ_max(P) = Ω(1/polylog(n)). Also, √(π_max/π_min) = O(log^{β/2}(n)), since the weights of the ℓ/r* items follow a power law with exponent β and ℓ = Θ(log n). This gives us, with probability 1 − O(log² N/N),

1 − ρ = ( 1 − λ_max(P) ) − ‖Δ‖₂ √(π_max/π_min) ≥ Ω(1/polylog(n)) − O( √(log³ N / N) · log^{β/2}(n) ) = Ω(1/polylog(n)),

where the last equality uses N = Ω(n).
We wish to show that both terms on the right-hand side of the bound in Lemma 4 go to zero. For the second term, we see that

(1/(1 − ρ)) ‖Δ‖₂ √(π_max/π_min) = O( polylog(N) · √(log³ N / N) ) → 0,

as we wished, again using N = Ω(n). It suffices now to show that the first term is no greater than the second. That is, we must assert that

ρ^t ‖p₀/π̃ − 1‖ √(π_max/π_min) = O( (1/(1 − ρ)) ‖Δ‖₂ √(π_max/π_min) ).

Since ρ < 1, the left-hand side decays geometrically in t, so we can guarantee this by just taking t to be large enough.
Lemma 4 (Negahban et al. [35]). For any Markov chain P̂ = P + Δ with P a reversible Markov chain, let p_t be the distribution of the Markov chain P̂ when started with an initial distribution p₀. Then

‖p_t/π̃ − 1‖ ≤ ρ^t ‖p₀/π̃ − 1‖ √(π_max/π_min) + (1/(1 − ρ)) ‖Δ‖₂ √(π_max/π_min),

where π̃ is the stationary distribution of P, π_min = min_i π̃(i), π_max = max_i π̃(i), and ρ = λ_max(P) + ‖Δ‖₂ √(π_max/π_min).
Lemma 5. Let i and j be items among the top ℓ/r* items seen, and let N_ij be the number of times that i and j are seen together among the top ℓ in samples that are clustered as coming from an underlying permutation π. Then N_ij = Θ(N) with probability 1 − O(1/N).
Proof. Let i be one of the top ℓ items of one of the K underlying permutations π. There is an obvious cluster C in which the noiseless sample σ = π would have been clustered (and where, by Theorem 8, most of its samples are clustered). We shall denote the number of samples clustered in C by N_C.

By the Stretching Factor Lemma, i is present among the top ℓ elements in at least 0.99 N_C samples with probability 1 − O(1/n). Furthermore, let j be a different item from the top ℓ items of π. Then i and j are seen together among the top ℓ elements in at least 0.98 N_C samples with probability 1 − O(1/n).

Now it is only left to show that N_C = Θ(N). By an application of the Chernoff bound, for any 0 < δ < 1, at least (1 − δ)N/K samples will come from π with probability 1 − exp(−Θ(N)). By Theorem 8, with probability 1 − O(1/N), at most O(√(N log N)) samples from π will be misclustered outside C. Putting these observations together, we get that with probability (1 − exp(−Θ(N)))(1 − O(1/N)) = 1 − O(1/N), at least (1 − δ)N/K − O(√(N log N)) = Θ(N) samples from π will be clustered in C, as we wished. □
Lemma 6. Let Δ = P̂ − P be the error in the transition probabilities. Then for N = Ω(n) we get

‖Δ‖₂ = O( √(log³ N / N) )

with probability 1 − O(log² N / N).
Proof. Consider an (one of the K) underlying permutation r, and an arbitrary pair
-
of items (i, j) from the top l items of 7r. Then, by Lemma 5, with probability 1
exp(- (N)), we have that Nij> i = 0 (N).
There are two sources of contribution to $\Delta_{ij}$: (1) the misclustered samples: in Theorem 8 we learned that with probability $1 - O(1/N)$, the proportion of misclustered items is $O\big(\sqrt{\log N / N}\big)$; and (2) the natural noise from the properly clustered samples: from the Chernoff bound we know that a sample mean (based on $N_{ij}$ samples) of a Bernoulli random variable is at most $O\big(\sqrt{\log N / N_{ij}}\big)$ away from the true mean with probability $1 - O(1/N^3)$.

Putting the two together, we get that with probability $1 - O(1/N)$ the error $\Delta_{ij}$ satisfies
$$\Delta_{ij} = |\tilde{P}_{ij} - P_{ij}| \;\le\; O\Big(\sqrt{\tfrac{\log N}{N}}\Big) + O\Big(\sqrt{\tfrac{\log N}{N_{ij}}}\Big) \;=\; O\Big(\sqrt{\tfrac{\log N}{N}}\Big).$$

We can now use the union bound over all $O(l^2)$ pairs, and we get that with probability $1 - O(l^2/N)$, $|\Delta_{ij}| = O\big(\sqrt{\log N / N}\big)$ for each pair $i, j$ from the top $l$. This allows us to upper bound the spectral radius of $\Delta$ by its Frobenius norm, which gives us that, with probability $1 - O(l^2/N)$, we have
$$\|\Delta\|_2 \;\le\; \|\Delta\|_F \;=\; \Big(\sum_i \sum_j \Delta_{ij}^2\Big)^{1/2} \;=\; \Big(l^2 \cdot O\Big(\tfrac{\log N}{N}\Big)\Big)^{1/2}.$$
Since $N = \Omega(n)$ and $l = \Theta(n)$, we get
$$\|\Delta\|_2 = O\left(\sqrt{\frac{\log^2 N}{N}}\right)$$
with probability $1 - O(\log^2 N/N)$, as desired. $\square$
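The final step uses the generic inequality $\|A\|_2 \le \|A\|_F$, which holds for any matrix because the squared Frobenius norm sums all squared singular values while the spectral norm keeps only the largest. A quick generic illustration (arbitrary random matrices, not the specific $\Delta$ of the lemma):

import numpy as np

rng = np.random.default_rng(2)
for _ in range(5):
    A = rng.normal(size=(8, 8)) * rng.random()   # arbitrary error-like matrix
    spec = np.linalg.norm(A, 2)                  # spectral norm: largest singular value
    frob = np.linalg.norm(A, 'fro')              # Frobenius norm: all singular values
    assert spec <= frob + 1e-12
    print(f"||A||_2 = {spec:.4f} <= ||A||_F = {frob:.4f}")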
Lemma 7 (Comparison Lemma, Negahban et al. 2012). Let $Q, \mu$ and $P, \pi$ be reversible Markov chains on a finite set $[n]$ representing random walks on a graph $G = ([n], E)$, i.e. $P(i,j) = 0$ and $Q(i,j) = 0$ if $(i,j) \notin E$. For
$$\alpha \equiv \min_{(i,j) \in E} \big\{\pi(i)P_{ij}/\mu(i)Q_{ij}\big\} \quad \text{and} \quad \beta \equiv \max_i \big\{\pi(i)/\mu(i)\big\},$$
we have
$$1 - \lambda_{\max}(P) \;\ge\; \frac{\alpha}{\beta}\,\big(1 - \lambda_{\max}(Q)\big).$$
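Lemma 7 is the standard Dirichlet-form comparison bound for reversible chains, and it can be checked numerically. The sketch below is an illustration with two arbitrary reversible walks on the complete graph, not the specific chains of this thesis; it uses the signed second eigenvalue, which is the quantity the spectral-gap form of the lemma controls.

import numpy as np

rng = np.random.default_rng(3)
n = 8

def reversible_chain(weights):
    """Random walk on a weighted graph: reversible w.r.t. pi ~ row sums."""
    P = weights / weights.sum(axis=1, keepdims=True)
    pi = weights.sum(axis=1) / weights.sum()
    return P, pi

WP = rng.uniform(0.5, 1.5, (n, n)); WP = (WP + WP.T) / 2
WQ = rng.uniform(0.5, 1.5, (n, n)); WQ = (WQ + WQ.T) / 2
P, pi = reversible_chain(WP)
Q, mu = reversible_chain(WQ)

def gap(M):
    # Second-largest (signed) eigenvalue; reversible chains have real spectra.
    eig = np.sort(np.linalg.eigvals(M).real)
    return 1.0 - eig[-2]

i, j = np.meshgrid(range(n), range(n), indexing='ij')
off = i != j                                   # edges of the (complete) graph
alpha = (pi[:, None] * P / (mu[:, None] * Q))[off].min()
beta = (pi / mu).max()

print("1 - lmax(P)       :", gap(P))
print("(a/b)(1 - lmax(Q)):", alpha / beta * gap(Q))

The first printed quantity dominates the second, as the lemma requires.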
Lemma 8. Let $\lambda_{\max}(P)$ be the second largest eigenvalue (in absolute value) of $P$. Then
$$\frac{1}{1 - \lambda_{\max}(P)} = O(1).$$
Proof. Let us use the Comparison Lemma (Lemma 7 above), with $Q$ being the transition matrix of the random walk on the complete graph with $l/r^*$ items. That is, $Q_{ij} = 1/(l/r^*)$ and $\mu(i) = 1/(l/r^*)$ for all $i$. Then, for $b = \max_{i,j} w_i/w_j = \Theta(1)$, we get
$$\pi(i)\,P_{ij} \;\ge\; \frac{w_i}{\sum_k w_k}\cdot\frac{1}{l}\cdot\frac{w_j}{w_i + w_j} \;\ge\; \frac{1}{b\,l}\cdot\frac{1}{l}\cdot\frac{1}{1+b} \;=\; \Theta\!\Big(\frac{1}{l^2}\Big).$$
Since $\mu(i)\,Q_{ij} = \Theta(1/l^2)$ as well, this allows us to bound $\alpha$ as
$$\alpha = \min_{(i,j)} \frac{\pi(i)P_{ij}}{\mu(i)Q_{ij}} \;=\; \Theta(1).$$
Also, since $\pi(i) \le b/l$ and $\mu(i) = \Theta(1/l)$, we have $\beta = \max_i \pi(i)/\mu(i) = \Theta(1)$. Hence, since $1 - \lambda_{\max}(Q) = \Theta(1)$, we get
$$1 - \lambda_{\max}(P) \;\ge\; \frac{\alpha}{\beta}\,\big(1 - \lambda_{\max}(Q)\big) \;=\; \Theta(1),$$
and therefore $1/(1 - \lambda_{\max}(P)) = O(1)$, as desired. $\square$
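As a rough check of Lemma 8's conclusion, one can compute the spectral gap of a BTL-style random walk directly. The sketch below assumes transitions $P_{ij} = w_j / (l\,(w_i + w_j))$ for $i \neq j$, with self-loops absorbing the remainder; this is a plausible stand-in for the chain analyzed here, not necessarily the exact construction above. With the weight ratio $b$ bounded, the printed gap stays bounded away from zero as $l$ grows.

import numpy as np

rng = np.random.default_rng(4)

for l in [10, 50, 200]:
    w = rng.uniform(1.0, 2.0, l)            # bounded weights: b = max w_i/w_j <= 2
    # BTL-style walk: P(i, j) = w_j / (l (w_i + w_j)) for i != j.
    P = (w[None, :] / (w[:, None] + w[None, :])) / l
    np.fill_diagonal(P, 0)
    np.fill_diagonal(P, 1 - P.sum(axis=1))  # self-loops make rows sum to one
    eig = np.sort(np.abs(np.linalg.eigvals(P)))
    print(f"l={l:4d}  1 - lmax(P) = {1 - eig[-2]:.4f}")

The gap remains roughly constant (about 1/2 here) across the three sizes, consistent with $1/(1 - \lambda_{\max}(P)) = O(1)$.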
Chapter 4
Conclusion and Future Work
Taken together, the results provided in the previous three papers serve a dual purpose. First, they propose a model, or framework, for the efficient provision of personalized ranked recommendations; the hope is that the model and the analysis provide a convincing case for using the corresponding algorithms, or the broader framework. Second, these results contribute to the existing literature on learning choice models and learning mixtures. Naturally, this leaves a lot to be done in many directions. This chapter briefly highlights some potential directions for further research.
With respect to the model and the available data: since the different components of the mixtures considered so far correspond to different user types, it would be interesting to look at the 'hedonic pricing' setting, where the item weights are a function of some item attributes. It would also be interesting to see whether such attributes could add anything to the analysis.
With respect to the analysis, it would be interesting to look for a tighter characterization of the class of learnable and unlearnable models. First, it would be interesting to seek a set of conditions under which the number of available clusters is allowed to grow in some fashion. To that end, it might be useful to consider alternative algorithms that lend themselves to analysis when the number of clusters grows. Similarly, it would be interesting to reconsider these results when the mixing distribution is not uniform but still maintains a bound on the size of each cluster. For instance, it might be useful to consider the Bayesian setting where this distribution comes from a Dirichlet prior (with $\alpha > 1$?).
Finally, it would be interesting to apply some of the techniques included here, and in the broader literature, to more real data sets, to seek further understanding of the systems that underlie and produce this data. For example, it would be interesting to see whether groups of movie watchers, or restaurant goers, can consistently be given some interpretation.
Bibliography

[1] MIT150 celebrations. http://mit150.mit.edu.
[2] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research, 6(1):393, 2006.
[3] S. Agrawal, Z. Wang, and Y. Ye. Parimutuel betting on permutations. Internet and Network Economics, pages 126-137, 2008.
[4] Ammar Ammar and Devavrat Shah. Efficient rank aggregation using partial data. ACM SIGMETRICS Performance Evaluation Review, 40(1):355-366, 2012.
[5] K. J. Arrow. Social choice and individual values. Number 12. Yale University Press, 1963.
[6] M. Bayati, D. Shah, and M. Sharma. Max-product for maximum weight matching: Convergence, correctness, and LP duality. IEEE Transactions on Information Theory, 54(3):1241-1251, 2008.
[7] D. P. Bertsekas. The auction algorithm: A distributed relaxation method for the assignment problem. Annals of Operations Research, 14(1):105-123, 1988.
[8] V. S. Borkar. Stochastic approximation: A dynamical systems viewpoint. Cambridge University Press, 2008.
[9] J. Hayden Boyd and Robert E. Mellman. The effect of fuel economy standards on the US automotive market: A hedonic demand analysis. Transportation Research Part A: General, 14(5):367-378, 1980.
[10] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324-345, 1952.
[11] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.
[12] N. Scott Cardell and Frederick C. Dunbar. Measuring the societal impacts of automobile downsizing. Transportation Research Part A: General, 14(5):423-434, 1980.
[13] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001.
[14] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265-292, 2002.
[15] O. Dekel, C. Manning, and Y. Singer. Log-linear models for label ranking. Advances in Neural Information Processing Systems, 16, 2003.
[16] P. Diaconis. Group representations in probability and statistics, volume 11. Institute of Mathematical Statistics, 1988.
[17] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proceedings of the 10th International Conference on World Wide Web, pages 613-622. ACM, 2001.
[18] J. Edmonds and R. M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM (JACM), 19(2):248-264, 1972.
[19] V. Farias, S. Jagabathula, and D. Shah. A data-driven approach to modeling choice. Advances in Neural Information Processing Systems, 22:504-512, 2009.
[20] Vivek Farias, Srikanth Jagabathula, and Devavrat Shah. A data-driven approach to modeling choice. In Advances in Neural Information Processing Systems, pages 504-512, 2009.
[21] Vivek F. Farias, Srikanth Jagabathula, and Devavrat Shah. A nonparametric approach to modeling choice with limited data. arXiv preprint arXiv:0910.0063, 2009.
[22] Vivek F. Farias, Srikanth Jagabathula, and Devavrat Shah. Sparse choice models. In Information Sciences and Systems (CISS), 2012 46th Annual Conference on, pages 1-28. IEEE, 2012.
[23] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4:933-969, 2003.
[24] R. Herbrich, T. Minka, and T. Graepel. TrueSkill™: A Bayesian skill rating system. Advances in Neural Information Processing Systems, 20:569-576, 2007.
[25] J. Huang, C. Guestrin, and L. Guibas. Efficient inference for distributions on permutations. Advances in Neural Information Processing Systems, 20:697-704, 2008.
[26] S. Jagabathula and D. Shah. Inferring rankings under constrained sensing. Advances in Neural Information Processing Systems (NIPS), 2008.
[27] M. Jerrum, A. Sinclair, and E. Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. Journal of the ACM (JACM), 51(4):671-697, 2004.
[28] L. Jiang, D. Shah, J. Shin, and J. Walrand. Distributed random access algorithm: Scheduling and congestion control. IEEE Transactions on Information Theory, 56(12):6182-6207, 2010.
[29] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980-2998, 2010.
[30] R. Duncan Luce. Individual choice behavior: A theoretical analysis. Dover Publications, 2012.
[31] Jacob Marschak. Binary-choice constraints and random utility indicators. In Proceedings of a Symposium on Mathematical Methods in the Social Sciences, 1960.
[32] Daniel McFadden. Conditional logit analysis of qualitative choice behavior. 1973.
[33] Daniel McFadden and Kenneth Train. Mixed MNL models for discrete response. Journal of Applied Econometrics, 15(5):447-470, 2000.
[34] I. Mitliagkas, A. Gopalan, C. Caramanis, and S. Vishwanath. User rankings from comparisons: Learning permutations in high dimensions. In Proceedings of the Allerton Conference, 2011.
[35] Sahand Negahban, Sewoong Oh, and Devavrat Shah. Iterative ranking from pair-wise comparisons. arXiv preprint arXiv:1209.1688, 2012.
[36] Sahand Negahban and Martin J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069-1097, 2011.
[37] Robin L. Plackett. The analysis of permutations. Applied Statistics, pages 193-202, 1975.
[38] C. Rudin. The p-norm push: A simple convex ranking algorithm that concentrates at the top of the list. The Journal of Machine Learning Research, 10:2233-2271, 2009.
[39] C. Rudin and R. E. Schapire. Margin-based ranking and an equivalence between AdaBoost and RankBoost. The Journal of Machine Learning Research, 10:2193-2232, 2009.
[40] Paul A. Samuelson. A note on the pure theory of consumer's behaviour. Economica, 5(17):61-71, 1938.
[41] S. Shalev-Shwartz and Y. Singer. Efficient learning of label ranking by soft projections onto polyhedra. The Journal of Machine Learning Research, 7:1567-1599, 2006.
[42] Hossein Azari Soufiani, David C. Parkes, and Lirong Xia. Random utility theory for social choice. In NIPS, pages 126-134, 2012.
[43] L. L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273, 1927.
[44] N. Usunier, M. R. Amini, and P. Gallinari. A data-dependent generalisation error bound for the AUC. In Proceedings of the ICML 2005 Workshop on ROC Analysis in Machine Learning. Citeseer, 2005.
[45] L. G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2):189-201, 1979.
[46] Koen Verstrepen and Bart Goethals. Unifying nearest neighbors collaborative filtering. In Proceedings of the 8th ACM Conference on Recommender Systems, RecSys '14, 2014.
[47] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
[48] D. J. A. Welsh. Complexity: Knots, colourings and counting. Number 186. Cambridge University Press, 1993.